Abstract
Background
High-throughput functional genomics technologies generate large amounts of data, with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, in the order of tens or hundreds. This poses statistical challenges and calls for appropriate solutions for the analysis of this kind of data.
Results
Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) for high-dimensional data, was selected as an example of a base learner. The multiple versions of PCDA models from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets.
Conclusions
The aggregated PCDA learner can improve the prediction performance, provide more stable results, and give insight into the variability of the models. The disadvantages and limitations of aggregating are also discussed.
Background
The mining of high-dimensional data, in which the number of features is much larger than the number of samples, has become increasingly important, especially in genomics, proteomics, biomedical imaging and other areas of systems biology [1]. The availability of high-dimensional data, along with new scientific problems, has significantly challenged traditional statistical theory and reshaped statistical thinking [2].
The high dimensionality of functional genomics data sets poses problems for building classifiers. Because of the sparsity of data in high-dimensional spaces, many classical classification methods break down. For example, Fisher's discriminant rule is inapplicable because the within-class scatter matrix becomes singular when the number of variables is larger than the number of samples [3,4].
Another problem is caused by the small sample size. The number of samples is usually not adequate to be representative of the total population. Moreover, classifiers built on small sample sets are often not stable and may have a large variance in the number of misclassifications [5]. One common approach to this problem is to aggregate many classifiers instead of using a single one. There has been considerable recent interest in the application of aggregating methods to the classification of high-dimensional data [6-11]. The most well-known method in this class of techniques is perhaps bootstrap aggregating (bagging). Breiman found that gains in accuracy could be obtained by bagging when the base learner is not stable [6]. However, Vu and Braga-Neto argued that the use of bagging in the classification of small-sample data increases computational cost but is not likely to improve overall classification accuracy over other, simpler classification rules [10]. Moreover, if the sample size is small, the gains achieved via a bagged ensemble may not compensate for the decrease in accuracy of the individual models [11].
Cross-validation is probably the most widely used method for estimating prediction error. In small-sample, high-dimensional data modeling, k-fold cross-validation is often used [1]. The k-fold cross-validation estimate is a stochastic variable that depends on the partition of the data set. Full cross-validation, i.e., performing all possible ways of partitioning, would give an accurate estimate, but is computationally too expensive. Therefore, repeating k-fold cross-validation multiple times using different splits provides a good Monte Carlo estimate of the full cross-validation [12]. This repeated procedure results in many classifiers.
In this paper, we aggregated the classifiers obtained from principal component discriminant analysis (PCDA) with a double cross-validation scheme [13]. PCDA is an adaptation of Fisher's linear discriminant analysis (FLDA) for high-dimensional data. In PCDA, the dimensionality of the data is reduced by principal component analysis (PCA). In the reduced-dimensional space the within-class scatter matrix is non-singular and classical LDA can be performed [13-16]. A double cross-validation scheme was used to estimate both the number of principal components and the prediction error of the PCDA model [17]. The classifiers obtained from the different cross-validation loops were aggregated into a single classifier. This approach was tested on simulated data and on gene expression, proteomics and metabolomics data. The results of this research may provide insights into the use of aggregated learners for low-sample, high-dimensional biological data.
Methods
PCDA
Given a high-dimensional data set A of size m × n, where m is the number of samples and n is the number of features (m << n), classical FLDA [18] finds the discriminating direction d of size n × 1 that maximizes the ratio of the between-class scatter S_b against the within-class scatter S_w:

J(d) = \frac{d^T S_b d}{d^T S_w d}

with

S_b = \sum_{i=1}^{r} m_i (c_i - c)(c_i - c)^T

S_w = \sum_{i=1}^{r} \sum_{j \in M_i} (x_j - c_i)(x_j - c_i)^T

Here r is the number of classes, and each class i has m_i samples. M_i is the index set of the samples in class i; c_i and c are the class centroids and the global centroid, respectively.

The discriminating direction d is the eigenvector corresponding to the largest eigenvalue of the matrix S_w^{-1} S_b. Because the number of features n is larger than the number of samples m in high-dimensional data, the matrix S_w is singular. This means that S_w^{-1} does not exist and FLDA cannot be applied directly.
To overcome the difficulties imposed by the singular covariance matrix, the data can first be projected onto a low-dimensional PCA subspace, in which LDA is then applicable. The main goal of PCA is to reduce the dimensionality of the data whilst retaining as much as possible of the information present in the original data. This reduction is achieved by a linear transformation to a new set of variables, the principal component (PC) scores. The combination of LDA with PCA yields principal component discriminant analysis (PCDA).
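The PCA-then-LDA combination described above can be sketched as follows. This is a minimal two-class illustration with hypothetical helper names (`pcda_fit`, `pcda_predict`), not the authors' implementation: PCA is computed via SVD, Fisher LDA is then solved in the k-dimensional score space, and new samples are assigned to the class with the nearest projected centroid.

```python
import numpy as np

def pcda_fit(X, y, k):
    """Minimal PCDA sketch: project X onto k principal components,
    then run Fisher LDA in the reduced space (two-class case)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # PCA via SVD: rows of Vt are the loadings
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                      # n x k loading matrix
    T = Xc @ P                        # m x k PC scores
    c = T.mean(axis=0)                # global centroid in score space
    Sw = np.zeros((k, k))
    Sb = np.zeros((k, k))
    centroids = {}
    for g in np.unique(y):
        Tg = T[y == g]
        cg = Tg.mean(axis=0)
        centroids[g] = cg
        Sw += (Tg - cg).T @ (Tg - cg)        # within-class scatter
        d = (cg - c)[:, None]
        Sb += len(Tg) * (d @ d.T)            # between-class scatter
    # discriminant direction: leading eigenvector of Sw^-1 Sb
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    w = np.real(evecs[:, np.argmax(np.real(evals))])
    return mu, P, w, {g: cg @ w for g, cg in centroids.items()}

def pcda_predict(model, Xnew):
    """Assign each sample to the class with the nearest projected centroid."""
    mu, P, w, proj = model
    z = (Xnew - mu) @ P @ w
    labels = list(proj)
    return np.array([min(labels, key=lambda g: abs(zi - proj[g])) for zi in z])
```

In the reduced space Sw is a small k × k matrix and, provided k is well below the sample count, non-singular, which is exactly what makes the LDA step feasible again.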
Aggregating PCDA with double crossvalidation
The optimal number of reduced PCA dimensions is usually determined by cross-validation. The simplest form of cross-validation splits the data randomly into K mutually exclusive parts, builds a model on all but one part, and evaluates the model on the omitted part. This strategy allows for estimating the optimal model complexity; however, the resulting prediction performance estimate is often too optimistic, since the same samples used to find the best number of PCs are not completely independent of the evaluation. It is therefore recommended to use a double cross-validation approach [13,17,19,20]. As shown in Figure 1, the original data set is first divided into two parts, a training set and a test set. The test set is not used in the double cross-validation scheme; it is employed afterwards to evaluate how good the built classifier really is. The training set is partitioned into K parts. Of the K parts, a single part is retained as the outer validation set, and a (K-1)-fold cross-validation is performed on the remaining K-1 parts to find the best number of principal components. This is a nested validation scheme: the inner validation set is used to determine the optimal number of principal components, and the outer validation set is used to find the cross-validation error of the method. The double cross-validation with PCDA is summarized in the following pseudo code:
Figure 1. The partition of a data set for model selection and the estimation of the cross-validation error and prediction error. In the inner cross-validation loop, the inner training set and inner validation set are used to determine the number of principal components (PCs), and the model is fit on the inner training set. In the outer cross-validation loop, the model is built on the inner training set and the inner validation set, and the outer validation set is used to estimate the cross-validation error. In prediction, the model is built on the inner training set and the inner validation set, and the test set is used to obtain the prediction error.
Divide the training data set into K parts
For i = 1 to K
    For j = 1 to K-1
        Build PCDA models with different numbers of PCs
    End
    Find the optimal PC number
    Build a PCDA model with the optimal PC number
End
Obtain the cross-validation error
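The pseudo code above can be sketched in Python. Here `cv_error(train, val, npc)` is a hypothetical, caller-supplied stand-in for fitting a PCDA model with `npc` components on the training indices and scoring it on the validation indices; only the nested fold bookkeeping is shown.

```python
import random

def kfold_indices(n, k, seed=None):
    """Split range(n) into k roughly equal random folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def double_cv(n, k, cv_error, pc_grid, seed=None):
    """Skeleton of double cross-validation: the inner (K-1)-fold loop
    picks the number of PCs, the outer loop measures the error."""
    folds = kfold_indices(n, k, seed)
    outer_errors, chosen_pcs = [], []
    for i in range(k):
        outer_val = folds[i]
        inner = folds[:i] + folds[i + 1:]

        def inner_err(npc):
            # mean error of this PC number over the (K-1) inner splits
            errs = []
            for j in range(len(inner)):
                train = [s for jj, f in enumerate(inner) if jj != j for s in f]
                errs.append(cv_error(train, inner[j], npc))
            return sum(errs) / len(errs)

        best = min(pc_grid, key=inner_err)     # optimal PC number
        chosen_pcs.append(best)
        train = [s for f in inner for s in f]  # all inner folds together
        outer_errors.append(cv_error(train, outer_val, best))
    return sum(outer_errors) / len(outer_errors), chosen_pcs
```

Note that the test set from Figure 1 never enters this function; it is held out entirely and only used after the aggregated classifier has been built.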
Since the cross-validation error depends on the random assignment of samples to folds, a common practice is to stratify the folds [12]. In stratified K-fold cross-validation, the folds are created such that they contain approximately the same proportions of classes as the original data set. By repeating the double cross-validation scheme with randomly chosen partitions of the inner and outer validation sets, we can produce many PCDA classifiers. The multiple versions of the predictors are then aggregated by majority voting, i.e., the winning class is the one predicted by the largest number of predictors.
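Majority voting itself is a one-liner; a minimal sketch, where each inner list holds one classifier's predicted labels for all samples:

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate per-sample labels from many classifiers: for each
    sample, the winning class is the one predicted most often."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]
```

With an odd number of classifiers and two classes there are no ties; otherwise `Counter.most_common` breaks ties by first insertion order, which is one arbitrary but deterministic convention.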
Data
Simulation
The simulated data contain two classes. Each class L_i (i = 1, 2) consists of 100 objects with 590 features each, sampled from a multivariate normal distribution N(v_i, Ω). Here v_i is the mean of class L_i, and Ω represents the covariance of the simulated data. To make the simulation resemble real data more closely, we constructed the simulated data from the Gaucher proteomics data (see below). With the means and covariances of the two classes in the autoscaled Gaucher data represented by the vectors u_1 and u_2 and the matrices Ω_1 and Ω_2 respectively, the means v_1 and v_2 and the covariance matrix Ω of the simulated data were calculated by the following equations.
Equations 4 and 5 ensure the separability of the two classes, and equation 6 makes the two classes share the same common covariance matrix Ω.
Following the above procedure, we obtained a simulated data set of size 200 × 590. Before building the PCDA classification model by double cross-validation on the simulated data, we separated the data into a training set and a test set as shown in Figure 1. To form training sets of different sample sizes, we randomly selected 12, 30, 50, 75 or 100 objects from the 200 objects. For the test set, 100 objects were randomly selected without replacement from the data remaining after removal of the training set. The whole selection procedure was repeated 100 times. To make a fair comparison, we fixed the random seeds in each selection procedure. In single PCDA, a double cross-validation with ten folds in the outer loop and nine folds in the inner loop was used to obtain the optimal PC number and the cross-validation error. In aggregated PCDA, the PCDA procedure was repeated 51 times with different cross-validation splits to obtain an aggregated classifier, whose classification performance was then compared with that of the single PCDA model.
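A reduced-scale sketch of such a two-class simulation. The means and covariance here are assumed stand-ins for equations 4-6 (which derive them from the Gaucher data), and the feature count is cut from 590 to 59 to keep the sketch fast:

```python
import numpy as np

rng = np.random.default_rng(1)
n_feat = 59                      # reduced from 590 for speed
v1 = np.zeros(n_feat)            # assumed class means (stand-ins for eqs. 4-5)
v2 = np.full(n_feat, 1.5)
omega = np.eye(n_feat)           # assumed common covariance (stand-in for eq. 6)
class1 = rng.multivariate_normal(v1, omega, size=100)
class2 = rng.multivariate_normal(v2, omega, size=100)
X = np.vstack([class1, class2])  # 200 x 59 two-class data set
```

Training and test sets of the desired sizes can then be drawn from the rows of `X` exactly as described above.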
Leukemia gene expression data
Leukemia data from high-density Affymetrix oligonucleotide arrays were previously analyzed by Golub and Tibshirani [21,22], and are available at http://www.broad.mit.edu/cgibin/cancer/datasets.cgi. There are 7129 genes and 72 samples from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute myelogenous leukemia). Of these 72 samples, 38 (27 ALL and 11 AML) were set as training samples and 34 (20 ALL and 14 AML) as test samples. The data were mean-centered before classification. Note that pretreatment steps such as mean-centering and autoscaling were always performed on the training data; the test data were then pretreated with the mean and standard deviation obtained from the training set. Autoscaling means mean-centering the data and scaling each column by its standard deviation.
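A minimal sketch of this pretreatment rule, with hypothetical helper names: the scaling statistics are learned from the training set only and then applied unchanged to the test set, so no information leaks from test to training.

```python
import numpy as np

def autoscale_fit(train):
    """Learn autoscaling statistics (column means and standard
    deviations) from the training set only."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0            # guard against constant columns
    return mu, sd

def autoscale_apply(X, mu, sd):
    """Apply the training-set statistics unchanged to any data."""
    return (X - mu) / sd
```

For mean-centering alone, the same pattern applies with the division step dropped.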
Gaucher proteomics data
The data consist of serum protein profiles of 20 Gaucher patients and 20 controls [13]. Serum samples were surveyed for basic proteins by SELDI-TOF-MS, making use of the anionic surface of the CM10 ProteinChip. All preprocessing (spot-to-spot calibration, baseline subtraction, peak detection) of the SELDI-TOF-MS data was performed using Ciphergen software. The data set, of size 40 × 590, is available at http://www.bdagroup.nl/content/Downloads/datasets/datasets.php. One Gaucher sample (a female receiving enzyme replacement therapy) was detected as an outlier and removed. The spectral profiles were first normalized by dividing each profile by its median to arrive at comparable spectra. Subsequently, the data were autoscaled before classification.
Grape extract metabolomics data
The data set is from Unilever Food and Health Research, Vlaardingen, Netherlands. Thirty-five healthy males were recruited to investigate the effect of grape extract supplementation on vascular function and other vascular health markers. The study has a double-blind, placebo-controlled, randomized full crossover design with 3 treatments, a run-in period, 3 interventions and 2 washout periods. 1D 1H NMR spectra of plasma:D2O (1:1 v/v) samples were recorded on a Bruker Avance 600 MHz NMR spectrometer according to a standard operating procedure. All data were processed in Bruker XWINNMR software version 3.0 (Bruker BioSpin GmbH, Rheinstetten, Germany) and imported into the AMIX software from Bruker. Due to some missing data, the final NMR data of 276 plasma samples were bucketed in the spectral region 0-9 ppm using a bucket width of 0.02 ppm.
The data set, of size 276 × 412 with two classes, was divided into two subsets, 200 samples in the training set and 76 samples in the prediction set, using the Kennard-Stone method [23]. The Kennard-Stone method selects objects such that they are uniformly scattered over the experimental space. In both the training and test sets, the samples were assigned such that the ratio of class membership is similar to that of the original data. The data were autoscaled before classification.
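A minimal sketch of Kennard-Stone selection under its usual Euclidean-distance formulation (hypothetical helper name; the original algorithm is described in [23]): start with the two most distant objects, then repeatedly add the object whose minimal distance to the already-selected set is largest.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Pick n_select rows of X that are uniformly scattered over the
    experimental space (maximin distance criterion)."""
    # full pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(D), D.shape)
    chosen = [int(i), int(j)]                 # two most distant objects
    while len(chosen) < n_select:
        rest = [r for r in range(len(X)) if r not in chosen]
        # each candidate's distance to its nearest chosen object
        dmin = D[np.ix_(rest, chosen)].min(axis=1)
        chosen.append(rest[int(np.argmax(dmin))])
    return chosen
```

The full pairwise distance matrix makes this O(m^2) in memory, which is fine at the sample sizes discussed here (a few hundred objects).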
Results and Discussion
When aggregating works
Breiman [6] noticed that the efficiency of aggregating depends on the stability of the prediction or classification rule. Each cross-validation PCDA classifier is constructed on different samples, so some variance in the prediction error is to be expected. First, we applied aggregated PCDA to the simulated data sets. As shown in Figure 2, the classification performance of aggregated PCDA is usually better than that of PCDA itself. Here, single PCDA uses a ten-fold outer cross-validation to determine the cross-validation errors and a nine-fold inner cross-validation to determine the optimal number of principal components. The aggregated PCDA was constructed by repeating single PCDA 51 times. The simulation results in Figure 2 are themselves an argument in favor of aggregating. As the training and prediction sets follow the same distribution, the cross-validation error and prediction error in Figure 2 are quite similar. A closer look at Figure 2 also shows that, as the sample size increases, the classification rate improves and the variation of the prediction error is reduced.
Figure 2. Learning curves of aggregated PCDA and PCDA. As the sample size increases from 12 to 100, the classification performance of aggregated PCDA and PCDA improves significantly, and the variation of the classification models also tends to be reduced. The classification performance is represented by 1 minus the misclassification rate, and the variation of the classification performance is represented by an error bar. The upper error range for each point is obtained by adding the standard deviation of the mean classification performance, and the lower error range by subtracting it. The figure shows that aggregated PCDA usually gives a better classification performance than PCDA. The classification of the validation sets and test sets is quite similar since the two sets follow the same distribution.
We further applied PCDA and aggregated PCDA to three real data sets. Figures 3 and 4 illustrate the variation of the misclassification rate of the data sets in training and prediction.
Figure 3. Boxplot of cross-validation errors for three real data sets. Misclassification rates were obtained from 1000 repetitions of 10-fold double cross-validation. The ratios of samples to features in the training sets are 38/7129 (Leukemia), 39/590 (Gaucher) and 200/412 (Grape). The most stable case is the grape data, which has the lowest feature-to-sample ratio of the three data sets.
Figure 4. Boxplot of prediction errors for two real data sets. Misclassification rates were obtained from 1000 repetitions of 10-fold double cross-validation. The ratios of samples to features in the test sets are 34/7129 (Leukemia) and 76/412 (Grape). The model for the grape data is more stable since its feature-to-sample ratio is lower.
The aggregated PCDA was constructed by repeating PCDA 1000 times. In Figures 3 and 4, the median of the misclassification rate is indicated by the center line, and the first and third quartiles are the edges of the box area, known as the interquartile range. The extreme values (within 1.5 times the interquartile range from the upper or lower quartile) are the ends of the whiskers extending from the interquartile range. Points farther from the median than 1.5 times the interquartile range are plotted individually as plus signs. Due to the low sample size of the Gaucher data, a separate test set was not created, so only two data sets appear in the prediction results of Figure 4. Clearly, the variations in the error rate of the PCDA models are quite large, especially when the feature-to-sample ratio is high. The most stable case is the grape data, which has the lowest feature-to-sample ratio of the three data sets. Tables 1 and 2 also show that the aggregated PCDA model often outperforms a single PCDA model on the three real data sets. In Tables 1 and 2, the performance of a single PCDA is represented by the median of the misclassification rate.
Table 1. Crossvalidation errors evaluated by outer validation sets with PCDA
Table 2. Prediction errors evaluated by test sets with PCDA
Aggregated PCDA can make a good PCDA classifier better, since the variance of the misclassification rate is reduced [24-27]. A heuristic explanation is that, because majority voting is a form of model averaging, the variance of the prediction error of the aggregated classifier is equal to or smaller than that of the original classifier.
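This heuristic can be made slightly more precise with a standard variance decomposition from the bagging literature. For B identically distributed predictors f_b, each with variance σ² and pairwise correlation ρ (an idealized assumption; voted class labels behave only approximately like averaged scores):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
  \;\le\; \sigma^{2}
```

As B grows, the second term vanishes, so the aggregate is never more variable than a single predictor; the benefit is largest when the individual classifiers are weakly correlated, i.e., unstable.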
The dimension reduction step by PCA cannot be guaranteed to preserve all directions that contain discriminative information [28]. In an aggregated PCDA model, however, the discriminative information discarded by one PCDA model can be recovered by other PCDA models built on different cross-validation partitions of the training data. Hence, aggregated PCDA may contain more discriminative information than a single PCDA.
We also compared PCDA with the support vector machine (SVM) classifier [29]; the results are shown in Tables 1, 2, 3 and 4. We found that the single PCDA classifier yields results comparable to the single SVM classifier. However, aggregated PCDA achieves better results than the SVM, PCDA and aggregated SVM classifiers.
Table 3. Crossvalidation errors evaluated by outer validation sets with SVM
Table 4. Prediction errors evaluated by test sets with SVM
When aggregating does not work
Aggregating may increase the bias of a learner, since only part of the training data is sampled by cross-validation or bootstrapping for modeling. That is to say, the use of K-fold cross-validation may have a negative effect on the accuracy of the individual PCDA models. As shown in Figure 2, when the sample size is twelve, the performance of the PCDA classifier is relatively poor and unstable. Even after aggregating, the expected training and prediction performance was not reached, since in such cases more samples are simply needed to build a precise model. Another situation that does not favor aggregating is the case of very weak learners, i.e., learners that perform worse than random guessing. Aggregating such learners makes the prediction even worse, because averaging them yields a learner that gives a wrong prediction in all cases. For example, if an observation is correctly classified only about four times out of ten, the majority vote over many such classifiers will be wrong essentially every time.
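The weak-learner effect can be checked directly. Assuming b independent classifiers that are each correct with probability p (an idealization; real cross-validation classifiers are correlated), the probability that their majority vote is correct is a binomial tail sum:

```python
from math import comb

def majority_correct_prob(p, b):
    """Probability that a majority vote of b independent classifiers,
    each correct with probability p, gives the right answer (b odd)."""
    return sum(comb(b, k) * p ** k * (1 - p) ** (b - k)
               for k in range(b // 2 + 1, b + 1))
```

For p below 0.5 this probability falls toward 0 as b grows, so aggregating amplifies the weakness; for p above 0.5 the opposite holds, which is the usual wisdom-of-crowds argument for aggregating adequate base learners.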
Further notes
Although the efficiency of aggregating depends on the stability of the prediction, aggregating does not necessarily make the predictor stable; it only stabilizes it to a certain extent. As shown in Figure 5, samples 3 and 20 of the Gaucher proteomics data have small margins. The margin of a case is the difference between the fraction of times it is correctly classified and the fraction of times it is incorrectly classified [28]. Larger margins are desirable because they imply a more stable classification of that sample. As seen in Figure 5, some samples are always correctly predicted, while other samples (10 and 22) are always wrongly predicted. The small margins of samples 3 and 20 indicate that these two samples have almost equal chances of being correctly and incorrectly classified. These two unstable samples result in an aggregated classifier that is not stable. Figure 6 supports this finding, as the misclassification rates fluctuate greatly with different numbers of aggregations.
Figure 5. Margin plot of the thirty-nine samples in the Gaucher data. The margin is the difference between the fraction of times a case is correctly classified and the fraction of times it is incorrectly classified. Samples 3 and 20 have small margins, meaning these two samples have an almost equal chance of being correctly or incorrectly classified.
Figure 6. The cross-validation errors for different numbers of aggregations. In the leukemia data, the misclassification rate becomes stable when the number of aggregations exceeds 50; in the grape data, when it exceeds 100. In the Gaucher data, the aggregated model is not stable.
Another question about aggregated PCDA is how many resampling repetitions are enough. Figure 6 shows the training misclassification rate for an increasing number of aggregations, starting from 20 and increasing by 20 each time up to 1000. We observe in Figure 6 that the aggregated misclassification rate stabilizes after about 100 replicates in the leukemia and grape data. For the Gaucher data, 200 replicates also give a reasonable estimate. In our experience, 50-200 replicates are usually enough to obtain a stable value. The aggregated learner in this paper is obtained from cross-validation, which is resampling without replacement, whereas conventional bagging is based on bootstrapping, which is resampling with replacement. As stated by Buja and Stuetzle [30], there is an equivalence between bagging based on resampling with and without replacement, so in our opinion the conclusions obtained in this paper also hold for bagging approaches.
Another concern is whether aggregated PCDA can be applied to multi-class problems. Because the discrimination in PCDA is performed by LDA, the properties of LDA for multiple classes also hold. Since the decision boundaries in LDA are constructed in a pairwise manner [1], the conclusions drawn in this paper are in principle also valid for multi-class problems. However, many discriminative methods are most accurate and efficient when dealing with only two classes, and often show reduced accuracy and efficiency for multi-class problems [31]. The effects of aggregating multi-class classifiers still require further careful study.
In addition, an interpretable model is usually required, as it is important to identify which genes, proteins and metabolites contribute most to the classifier. The PCDA model has already been combined with rank products [13,16,32] to find important variables. In aggregated PCDA, we can apply the same strategy. For example, suppose we aggregate 100 PCDA learners. A single PCDA yields 10 discriminant vectors in a 10-fold cross-validation, so 100 runs give 1000 discriminant vectors in total. For each feature, the product of its 1000 ranks is then calculated. After sorting, the features with the lowest rank products are the ones with the largest discriminative power.
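A sketch of the rank-product step described above (hypothetical helper name; log-ranks are summed rather than multiplying raw ranks, which is equivalent for ordering but avoids overflow with 1000 vectors):

```python
import numpy as np

def rank_products(discriminant_vectors):
    """Rank features by rank product: within each discriminant vector
    the largest |weight| gets rank 1, and per-feature log-ranks are
    summed across all vectors. Returns feature indices, most
    discriminative first."""
    V = np.abs(np.asarray(discriminant_vectors, dtype=float))  # runs x features
    order = np.argsort(-V, axis=1)           # features in descending |weight|
    ranks = np.empty_like(order)
    rows = np.arange(V.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, V.shape[1] + 1)
    return np.argsort(np.log(ranks).sum(axis=0))
```

Feeding in the 1000 discriminant vectors from 100 aggregated 10-fold runs would then directly yield the feature ordering used for biomarker interpretation.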
Conclusions
The use of cross-validation to study the performance of a classifier is an established method. If performed properly, cross-validation provides roughly unbiased estimates of the prediction measures. However, the different partitions in cross-validation can give rise to high variability of the model predictions. In this paper we show a way to overcome this variability by building one aggregated classifier from all the classifiers built in the repeated cross-validations.
Aggregating learners can have several important benefits. Aggregating over a collection of fitted values can help compensate for overfitting: majority voting tends to cancel out results shaped by idiosyncratic features of the data. One can then obtain more stable and more honest assessments of how good the fit really is.
Aggregating learners also has some limitations. When the sample size is very small, the aggregated learner may have a large bias. It is therefore important to inspect the data to see whether aggregating will be helpful.
In conclusion, we recommend the use of aggregated learners in high-dimensional data analysis, combined with a careful look at the data structure and a comparison with the base learner's results.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors conceived the model and the structure of the paper. CJX performed the analysis and drafted the paper. All authors read and approved the final manuscript.
Acknowledgements
The authors thank Ewoud van Velzen, Unilever Food and Health Research Institute, Vlaardingen, Netherlands for supplying the Grape extracts data. This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI).
References

Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd edition. New York: Springer; 2009.

Fan JQ, Li RZ: Statistical challenges with high dimensionality: feature selection in knowledge discovery. In Proceedings of the International Congress of Mathematicians. Madrid, Spain: European Mathematical Society; 2006.

Fukunaga K: Introduction to Statistical Pattern Recognition. New York: Academic Press; 1990.

Chen LF, Liao HYM, Ko MT, Lin JC, Yu GJ: A new LDA-based face recognition system which can solve the small sample size problem.

Geurts P, Fillet M, de Seny D, Meuwis MA, Malaise M, Merville MP, Wehenkel L: Proteomic mass spectra classification using decision tree based ensemble methods.

Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarraybased cancer classification.

Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP: Prediction of clinical drug efficacy by classification of druginduced genomic expression profiles in vitro.

Vu TT, Braga-Neto UM: Is bagging effective in the classification of small-sample genomic and proteomic data? EURASIP Journal on Bioinformatics and Systems Biology 2009, 2009: Article ID 158368.

Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. [http://robotics.stanford.edu/users/ronnyk/] International Joint Conference on Artificial Intelligence (IJCAI) 1995.

Kotsiantis SB, Pintelas PE: Combining Bagging and Boosting.
International Journal of Computational Intelligence 2004, 2004(1):323-333.

Smit S, van Breemen MJ, Hoefsloot HCJ, Smilde AK, Aerts J, de Koster CG: Assessing the statistical validity of proteomics based biomarkers.

Hoogerbrugge R, Willig SJ, Kistemaker PG: Discriminantanalysis by double stage principal component analysis.

Belhumeur PN, Hespanha JP, Kriegman DJ: Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection.

Hoefsloot HCJ, Smit S, Smilde AK: A classification model for the Leiden proteomics competition.
Stat Appl Genet Mol Biol 2008., 7(2)
Article 8

Stone M: Cross-validatory choice and assessment of statistical predictions. J R Stat Soc B 1974, 36:111-147.

Vandeginste BGM, Massart DL, Buydens LMC, Jong SD, Lewi PJ, Smeyers-Verbeke J: Handbook of Chemometrics and Qualimetrics: Part B. Amsterdam: Elsevier; 1998.

Mertens BJA, De Noo ME, Tollenaar R, Deelder AM: Mass spectrometry proteomic diagnosis: Enacting the double crossvalidatory paradigm.

Filzmoser P, Liebmann B, Varmuza K: Repeated double cross validation.

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring.

Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression.

Friedman JH: On bias, variance, 0/1  Loss, and the curseofdimensionality.

Berk RA: Statistical Learning from a Regression Perspective. New York: SpringerVerlag; 2008.

Yang J, Yang JY: Why can LDA be performed in PCA transformed space?

Vapnik V: The Nature of Statistical Learning Theory. SpringerVerlag; 1995.

Ding CHQ, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks.

Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments.