A stable gene selection in microarray data analysis

Yang, Kun; Cai, Zhipeng; Li, Jianzhong; Lin, Guohui

doi:10.1186/1471-2105-7-228

Research article
Open access
Published: 27 April 2006

A stable gene selection in microarray data analysis

Kun Yang¹,
Zhipeng Cai²,
Jianzhong Li¹ &
…
Guohui Lin²

BMC Bioinformatics volume 7, Article number: 228 (2006) Cite this article

10k Accesses
146 Citations
Metrics details

Abstract

Background

Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Gene selection is to detect the most significantly differentially expressed genes under different conditions, and it has been a central research focus. In general, a better gene selection method can improve the performance of classification significantly. One of the difficulties in gene selection is that the numbers of samples under different conditions vary a lot.

Results

Two novel gene selection methods are proposed in this paper, which are not affected by the unbalanced sample class sizes and do not assume any explicit statistical model on the gene expression values. They were evaluated on eight publicly available microarray datasets, using leave-one-out cross-validation and 5-fold cross-validation. The performance is measured by the classification accuracies using the top ranked genes based on the training datasets.

Conclusion

The experimental results showed that the proposed gene selection methods are efficient, effective, and robust in identifying differentially expressed genes. Adopting the existing SVM-based and KNN-based classifiers, the selected genes by our proposed methods in general give more accurate classification results, typically when the sample class sizes in the training dataset are unbalanced.

Background

DNA microarray is a technology that can simultaneously measure the expression levels of thousands of genes in a single experiment. It is commonly used for comparing the gene expression levels in tissues under different conditions, such as wild-type versus mutant, or healthy versus diseased [1]. Some of the genes are expected to be differentially modulated in tissues under different conditions, with their expression levels increased or decreased to signify the experimental conditions. These discriminatory genes are very useful in clinical applications such as recognizing diseased profiles. However, due to high cost, the number of experiments that can be used for classification purpose is usually limited. This small number of experiments, compared to the large number of genes in an experiment, wakes up "the curse of dimensionality" and challenges the classification task and other data analysis in general. It is well-known that quite a number of genes are house-keeping genes and many others could be unrelated to the classification task [2]. Therefore, an important step to effective classification is to identify the discriminatory genes thus to reduce the number of genes used for classification purpose. This step of discriminatory gene identification is generally referred to as gene selection. Gene selection is a pre-requisite in many applications [3]. It should be noted that, often, the number of unrelated genes is much larger than the number of discriminatory genes.

There are a variety of gene selection methods proposed in the last a few years [2, 4, 5]. Among them, some methods assume explicit statistical models on the gene expression data. For example, Baldi and Long [4] developed a Gaussian gene-independent model on the gene expression data, and implemented a t-test combined with a full Bayesian treatment for gene selection. These methods assuming certain models are referred to as model-based gene selection methods. Other methods do not assume any specific distribution model on the gene expression data and they are referred to as model-free gene selection methods. For example, Xiong et al. [2] suggested a method to select genes through the space of feature subsets using classification errors. Guyon et al. [5] proposed a gene selection approach utilizing support vector machines based on recursive feature elimination. It has been reported that the results of model-free gene selection methods may be influenced by the classification methods chosen for scoring the genes [6]. Nonetheless, model-based gene selection methods lack of adaptability, because it is unlikely possible to construct a universal probabilistic analysis model that is suitable for all kinds of gene expression data, where noise and variance may vary dramatically across different gene expression data [6]. In this sense, model-free gene selection methods are more desirable than model-based ones.

Gene selection is to provide a subset of a small number of discriminatory, or the most relevant, genes that can effectively recognize the class to which a testing sample belongs. That is, it is to provide a classifier such that the classification error is minimized. The known dataset that is used for learning the classifier, or equivalently for gene selection, is referred to as the training dataset. In a training dataset, every sample is labeled with its known class. Notice that if one class is significantly larger than the others, then the trained classifier might be biased. Therefore, the desired gene selection methods are those that are not affected by the sizes of classes in the training dataset. A gene selection method is called stable if the selected genes are kept the same when duplicating all the samples in any class in the training dataset.

In this paper, we propose two novel gene scoring functions s₁(·) and s₂(·) to design two stable gene selection methods GS1 and GS2 [see Additional file 5], respectively, to be detailed in the Methods section. According to the classification scheme proposed in [6], our proposed gene selection methods are single gene scoring approaches. These two gene scoring functions non-trivially incorporate the means and the variations of the expression values of genes in the samples belonging to a common class, based on a very general assumption that discriminatory genes are those having different means across different classes, small intra-class variations and relatively large inter-class variations. This spherical data assumption does not involve any specific statistical model, and in this sense, the resultant gene selection methods GS1 and GS2 could be regarded as model-free. They are also shown to be stable theoretically.

Results and discussion

We have applied our gene selection methods GS1 and GS2 based on the gene scoring functions s₁(·) and s₂(·), respectively, to a total of 8 publicly available microarray datasets [7]: the leukemia (LEU) dataset [8], the small round blue cell tumors (SRBCT) dataset [9], the malignant glioma (GLIOMA) dataset [10], the human lung carcinomas (LUNG) dataset [11], the human carcinomas (CAR) dataset [12], the mixed-lineage leukemia (MLL) dataset [13], the prostate (PROSTATE) dataset [14], and the diffuse large B-cell lymphoma (DLBCL) dataset [15], the first two of which have been used for several similar testings of gene selection methods. On each dataset, one portion was used as the training dataset for our methods to score the genes and the other portion was used as the testing dataset. For each specified number x we reported the classification accuracy, on the testing dataset, of the classifier based on the top ranked x genes using the training dataset. The quality of these top ranked x genes is justified on two aspects: 1) the classification accuracy of the resultant classifier on the testing datasets, and 2) for the first two datasets LEU and SRBCT, the stability when the training datasets were partially changed. All the experiments were conducted in MATLAB [16] environment on a Pentium IV PC with a 2.4 GHz processor and a 512 MB RAM.

The datasets

The LEU dataset contains in total 72 samples in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), which contain 47 and 25 samples, respectively. Every sample contains 7,129 gene expression values. We adopted the pretreatment method proposed in [1] to remove about half the genes and subsequently every sample contains only 3, 571 gene expression values.

The SRBCT dataset contains in total 83 samples in four classes, the Ewing family of tumors (EWS), Burkitt lymphoma (BL), neuroblastoma (NB) and rhabdomyosarcoma (RMS) [9]. Every sample in this dataset contains only 2,308 gene expression values. Among the 83 samples, 29, 11, 18, and 25 samples belong to classes EWS, BL, NB and RMS, respectively.

The GLIOMA dataset [10] contains in total 50 samples in four classes, cancer glioblastomas (CG), non-cancer glioblastomas (NG), cancer oligodendrogliomas (CO) and non-cancer oligodendrogliomas (NO), which have 14,14, 7,15 samples, respectively. Each sample has 12625 genes. We adopted a standard filtering method [10], that is, genes with minimal variations across the samples were removed. For this dataset, intensity thresholds were set at 20 and 16,000 units. Genes whose expression levels varied < 100 units between samples, or varied < 3 fold between any two samples, were excluded. After preprocessing, we obtained a dataset with 50 samples and 4433 genes.

The LUNG dataset [11] contains in total 203 samples in five classes, adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, which have 139, 21, 20, 6,17 samples, respectively. Each sample has 12600 genes. The genes with standard deviations smaller than 50 expression units were removed and we obtained a dataset with 203 samples and 3312 genes [11].

The CAR dataset [12] contains in total 174 samples in eleven classes, prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas, and lung squamous cell carcinoma, which have 26, 8, 26, 23,12,11, 7, 27, 6,14,14 samples, respectively. Each sample contains 12533 genes. In our experiment, we preprocessed dataset as described in [12]. We included only those probe sets whose maximum hybridization intensity (AD) in at least one sample was 200, all AD values ≤ 20, including negative AD values, were raised to 20, and the data were log transformed. After preprocessing, we obtained a dataset with 174 samples and 9182 genes.

The MLL dataset [13] contains in total 72 samples in three classes, acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), and mixed-lineage leukemia gene (MLL), which have 24, 28, 20 samples, respectively. In our experiment, intensity thresholds were set at 100 – 16000 units. Then the relative variation of expressions for each gene was determined by dividing the maximum expression for the gene among all samples (max) over the minimum expression (min), i.e. max/min. We filtered out the genes with max/min ≤ 5 or (max - min) ≤ 500, that is, for max/min fold variation, we excluded genes with less than 5-fold variation and, for (max - min) absolute variation, we excluded genes with less than 500 units absolute. After preprocessing, we obtained a dataset with 72 samples and 8685 genes.

The PROSTATE dataset [14] contains in total 102 samples in two classes tumor and normal, which have 52 and 50 samples, respectively. The original dataset contains 12600 genes. In our experiment, intensity thresholds were set at 100 – 16000 units, the same as in the MLL dataset. Then we filtered out the genes with max/min ≤ 5 or (max - min) ≤ 50. After preprocessing, we obtained a dataset with 102 samples and 5966 genes.

The DLBCL dataset [15] contains in total 77 samples in two classes, diffuse large B-cell lymphomas (DLBCL) and follicular lymphoma (FL) which have 58 and 19 samples, respectively. The original dataset contains 7129 genes. We set intensity thresholds at 20 – 16000 units, then we filtered out genes with max/min ≤ 3 or (max - min) ≤ 100. After preprocessing, we obtained a dataset with 77 samples and 6285 genes.

Among the above 8 datasets [see Additional files 3 and 4], the first two, LEU and SRBCT, have been used frequently for testing gene selection methods and classifiers. For each of them, if we use the ratio of the largest class size divided by the smallest class size to represent the level of unbalance, then the fifth dataset CAR is the most unbalanced. In our experiments, we have run every gene selection method on each of the 8 datasets to rank the genes, and for every x ≤ 100, the classification accuracies of the classifier built using the top ranked x genes have been collected [see Additional file 1]. We chose to present part of the classification accuracies on datasets SRBCT and CAR in details (as plots) and to present only three values for x, 30, 60, and 100, for all eight datasets (as tables).

Classification accuracies

Using the top ranked genes selected by a gene selection method, together with their expression values in the training dataset, one can build a classifier that will decide for each testing sample the class it belongs to. Only the expression values for those selected genes in the testing sample are used for such a decision making. This is a standard way to test the quality of those selected genes, to examine how well the resulting classifier performs. Note that testing samples are not included in the training dataset. To this purpose, we define the classification accuracy to be the percentage of the correct decisions made by the classifier on the testing samples. We have compared our methods GS1 and GS2 with two other model-free gene selection methods Cho's [17] and F-test [1]. In our experiments, we have adopted two ways to build a classifier using the selected genes, one is through Support Vector Machines (SVMs) [5] and the other is through K- Nearest-Neighbor (KNN) search [1]. Essentially, SVMs compute a decision plane to separate the set of chips (in the training dataset) having different class memberships, and use this plane to predict the class memberships for testing chips. There are a number of kernels used in SVMs models for decision plane computing and we chose a linear kernel as described in [5]. A KNN classifier ascertains the class of a testing sample by analyzing its K nearest neighbors in the training dataset [1]. We chose the Euclidean distance in our KNN classifier with K = 5 and predicted the class by majority vote [1]. The SVM we used in MATLAB is from [18] and we coded the KNN by ourselves. For ease of presentation, the achieved classifiers are referred to as the SVM-classifier and the KNN-classifier, respectively.

Figures 1 and 2 plot the classification accuracies of the SVM-classifier based on four gene selection methods GS1, GS2, Cho's, and F-test, on the SRBCT and CAR datasets, respectively. These classification accuracies were obtained through Leave-One-Out (LOO) cross validation. In LOO cross validation, one sample was left out as a testing sample and the remaining were used as the training dataset, and this was done for every sample in the dataset. We have also conducted 5-Fold cross validation, in which each dataset was randomly partitioned into 5 parts of equal size and in every experiment four parts were used as the training dataset (the fifth part was used as the testing dataset). This was done for every four parts in the dataset and the process (that is, random partition, training, and testing) was repeated for 100 times. The average accuracy over all these 500 testing datasets was taken as the 5-Fold cross validation classification accuracy. All plots of the 5-Fold (and the other LOO) cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods GS1, GS2, Cho's, and F-test, on the eight datasets are included in Additional file 1. Especially, columns 2-4 (and 6–8) in Tables 1, 2, 3, 4, 5, 6, 7, 8 record these cross validation classification accuracies, for only three numbers of top ranked genes, that is, 30, 60, and 100. Column 5 (and column 9) records the highest cross validation classification accuracies on these eight datasets ever achieved by the SVM-classifier and the KNN-classifier, respectively, as well as the associated numbers of selected genes (no more than 100 genes were used).

Table 1 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the SRBCT dataset. Listed are the accuracies when the numbers of selected genes are 30, 60, and 100, respectively, together with their standard deviations for 5-fold cross validation, and the best accuracy together with the number of selected genes.

Full size table

Table 2 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the CAR dataset.

Full size table

Table 3 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the LEU dataset.

Full size table

Table 4 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the GLIOMA dataset.

Full size table

Table 5 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the LUNG dataset.

Full size table

Table 6 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the DLBCL dataset.

Full size table

Table 7 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the MLL dataset.

Full size table

Table 8 The leave-one-out and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods, GS1, GS2, Cho's, and F-test, on the PROSTATE dataset.

Full size table

Note that in the 5-fold cross validation, the classification accuracy is calculated as the average of 500 classification accuracies on 500 testing datasets. We have also collected their standard deviations [see Additional file 2]. For three numbers 30, 60, and 100, the standard deviations are included in Tables 1, 2, 3, 4, 5, 6, 7, 8. Essentially, all these four gene selection methods, GS1, GS2, Cho's, and F-test, have very close standard deviations, and these standard deviations seem to be independent of classifier and dataset. Consequently, looking at all the classification accuracies as shown in Figures 1, 2 and Tables 1, 2, 3, 4, 5, 6, 7, 8, one general conclusion is that our gene selection methods, GS1 and GS2, perform at least comparably well to F-test and Cho's, on all 8 datasets using both the SVM-classifier and the KNN-classifier. Typically, our methods outperform significantly the other two methods on datasets SRBCT, GLIOMA, LUNG, and CAR, which have unbalanced class sizes.

Stability of the gene selection methods

Given a training dataset (in this case, we take the whole dataset as the training dataset), to test the stability of a gene selection method we duplicated all the samples in one class to produce a duplicated dataset. Our gene selection methods GS1 and GS2 are shown to be stable theoretically (cf. Methods section) and therefore are not subjects to such a test. For each of Cho's and F-test, it was applied on the duplicated datasets to report the same numbers of genes as it was applied to the original training dataset, and then the percentages of the common genes were recorded. Note that the LEU dataset and the SRBCT dataset give 2 and 4 duplicated datasets, respectively. Table 9 collects these percentages.

Table 9 The percentages of genes that were re-selected by Cho's and F-test on duplicated datasets, of the whole LEU and the SRBCT datasets, respectively.

Full size table

We have also performed a similar experiment to test the stability when a small portion of the samples were removed from the training dataset. For each class in a training dataset, it was divided into three parts of equal size and every time one part was removed from the dataset to obtain a reduced dataset. Then again, the percentages of the common selected genes using the original dataset and the reduced datasets were recorded. We tried in total 1000 random divisions and the average of 3000 percentages was taken to be the stability for this class. Table 10 collects these stability results for GS1, GS2, Cho's, and F-test. From these results, we can see that GS1, GS2, and F-test had very close stabilities on the reduced datasets, while Cho's had the least over all classes.

Table 10 The percentages of genes that were re-selected by GS1, GS2, Cho's, and F-test on reduced datasets, of the whole LEU and the SRBCT datasets, respectively.

Full size table

Conclusion

In this paper, we proposed two stable gene selection methods GS1 and GS2, which could also be regarded as model-free. From the comparisons made to one most recent method Cho's and one most classic method F-test on eight publicly available datasets, GS1 and GS2 selected genes whose resultant classifiers achieved at least equally good and most of the time better classification accuracies. Both leave-one-out and 5-fold cross validations confirmed our conclusion. We haven't run any biological experiments to verify each of the top ranked genes by our methods yet inconsistent to other methods. Nonetheless, we believe that our methods would be good potential substitutes to the ones currently in use as our methods are model-free and stable.

Methods

Assume in the training dataset there are in total p genes in the microarray chips, and assume we have in total n chips/samples that have been grouped into L classes. Let a_ijdenote the expression value of gene j in sample i. The training dataset can thus be represented as a matrix

A_n×p= (a_ij)_n×p.

We will define two gene scoring functions using entry values in matrix A_n×p. These two scoring functions might be considered to better use the means and the variations of the gene expression values.

Let n_kdenote the number of samples in class C_k, for k = 1, 2,..., L (i.e., $\sum_{k = 1}^{L} n_{k} = n$ ). Let ${\bar{a}}_{k j} = \frac{1}{n_{k}} \sum_{i \in C_{k}} a_{i j}$ , which is the average expression value of gene j in class C_k, for k = 1, 2,..., L. The expression vector ( ${\bar{a}}_{k 1}, {\bar{a}}_{k 2}, \dots, {\bar{a}}_{k p}$ ) is the centroid of class C_k. Correspondingly, the centroid matrix is

{\bar{A}}_{L \times p} = {({\bar{a}}_{k j})}_{L \times p} .

The mean of these centroids is $\hat{A} = ({\hat{a}}_{1}, {\hat{a}}_{2}, \dots, {\hat{a}}_{p})$ , where ${\hat{a}}_{j} = \frac{1}{L} \sum_{k = 1}^{L} {\bar{a}}_{k j}$ .

For sample i belonging to class C_k, the difference between the expression value of gene j and the class mean is x_ij= |a_ij- ${\bar{a}}_{k j}$ |. The matrix

X_n×p= (x_ij)_n×p

is the deviation matrix of the training dataset. Let ${\bar{x}}_{k j} = \frac{1}{n_{k}} \sum_{i \in C_{k}} x_{i j}$ denote the average deviation for samples in class C_kwith respect to the centroid. The centroid deviation matrix is

{\bar{X}}_{L \times p} = {({\bar{x}}_{k j})}_{L \times p} .

Intuitively, gene j has a strong discriminating power if the means ${\bar{a}}_{k j}$ , k = 1, 2,..., L, differ significantly and ${\bar{x}}_{k j}$ , k = 1, 2,..., L, indicating the intra-class variations, are all small.

For example, suppose we have a microarray expression matrix A_12×4 as shown, in which 12 samples have been known in 3 classes:

A_{12 \times 4} = (\begin{array}{r} \underline{\begin{array}{r} 0.65 & 0.2 & 0.2 & 0.7 \\ 0.85 & 1.0 & 1.4 & 1.0 \\ 0.9 & 1.2 & 0.8 & 1.3 \end{array}} \\ \underline{\begin{matrix} 0.9 & 0.7 & 0.6 & 0.5 \\ 1.1 & 1.4 & 2.0 & 0.7 \\ 1.5 & 1.8 & 1.2 & 1.6 \\ 1.3 & 0.9 & 1.0 & 1.2 \end{matrix}} \\ \begin{matrix} 1.2 & 1.0 & 0.8 & 1.5 \\ 1.5 & 1.7 & 1.6 & 0.2 \end{matrix} \\ \begin{matrix} 1.7 & 2.1 & 2.3 & 1.8 \\ 2.0 & 2.0 & 1.9 & 0.3 \\ 1.6 & 1.2 & 1.4 & 1.2 \end{matrix} \end{array}) .

Then the centroid matrix ${\bar{A}}_{3 \times 4}$ is

{\bar{A}}_{3 \times 4} = (\begin{matrix} 0.8 & 0.8 & 0.8 & 1.0 \\ 1.2 & 1.2 & 1.2 & 1.0 \\ 1.6 & 1.6 & 1.6 & 1.0 \end{matrix}),

and the mean of the centroids is

\hat{A}

(1)

= (1.2, 1.2, 1.2, 1.0).

The deviation matrix X_12×4 is

X_{12 \times 4} = (\begin{array}{r} \underline{\begin{array}{r} 0.15 & 0.6 & 0.6 & 0.3 \\ 0.05 & 0.2 & 0.6 & 0.0 \\ 0.1 & 0.4 & 0.0 & 0.3 \end{array}} \\ \underline{\begin{array}{l} 0.3 & 0.5 & 0.6 & 0.5 \\ 0.1 & 0.2 & 0.8 & 0.3 \\ 0.3 & 0.6 & 0.0 & 0.6 \\ 0.1 & 0.3 & 0.2 & 0.2 \end{array}} \\ \begin{matrix} 0.4 & 0.6 & 0.8 & 0.5 \\ 0.1 & 0.1 & 0.0 & 0.8 \end{matrix} \\ \begin{matrix} 0.1 & 0.5 & 0.7 & 0.8 \\ 0.4 & 0.4 & 0.3 & 0.7 \\ 0.0 & 0.4 & 0.2 & 0.2 \end{matrix} \end{array}),

and the intra-class average deviations are

{\bar{X}}_{3 \times 4} = (\begin{matrix} 0.1 & 0.4 & 0.4 & 0.2 \\ 0.2 & 0.4 & 0.4 & 0.4 \\ 0.2 & 0.4 & 0.4 & 0.6 \end{matrix}) .

Figures 3, 4, 5, 6 illustrate the expression values of these four genes across all 12 samples, with the intra-class means and average deviations also shown. There are three key ideas in our design of gene scoring functions, which will be exemplified through these four genes. First of all, gene 1 has quite different mean expression values across three classes, compared to gene 4 that has the same means. Therefore, gene 1 is intuitively better than gene 4 in terms of discriminating power. Note that the goal of gene selection is to select genes that have significantly different means across different classes. For each gene j, the quantity ${\hat{a}}_{j}$ is the mean of all the centroids on gene j and it represents all the samples. ${\hat{a}}_{j}$ is stable, that is, it would not change when the samples in one class are duplicated (since the number of classes, L, and all the means, ${\bar{a}}_{k j}$ , for k = 1, 2,..., L, do not change). We define the scatter of gene j to capture the inter-class variations, which takes in ${\hat{a}}_{j}$ as a component:

s c a t t e r (j) = \sqrt{\frac{1}{L} \sum_{k = 1}^{L} {({\bar{a}}_{k j} - {\hat{a}}_{j})}^{2}} + \frac{1}{2} \min_{w \neq v} | {\bar{a}}_{w j} - {\hat{a}}_{v j} |,

in which the square root is the standard (estimated) deviation of all the centroids on gene j. Clearly seen, scatter(j) is a stable function. More discriminatory genes are expected to have bigger scatter-values. In the following, we prove an upper bound and a lower bound for scatter(j).

Lemma 1 Given n arbitrary nonnegative numbers a₁, a₂,...,a_n, the inequality $\frac{1}{n}$ (a₁ + a₂ + ... + a_n) ≤ $\sqrt{\frac{1}{n} (a_{1}^{2} + a_{2}^{2} + \dots + a_{n}^{2})}$ holds, and it becomes equality if and only if a₁ = a₂ = ... = a_n.

Lemma 1 can be proven by a mathematical induction.

Lemma 2 Given n arbitrary nonnegative numbers sorted in order a₁ ≤ a₂ ≤ ... ≤ a_n, define $\hat{a} = \frac{1}{n} \sum_{i = 1}^{n} a_{i}$ , $\tilde{a} = \frac{1}{2}$ min_1≤i≤n-1(a_i+1- a_i), and $S = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(a_{i} - \hat{a})}^{2}}$ . Then,

\tilde{a}

(2)

≤ S ≤ (a_n- a₁) - $\tilde{a}$ .

PROOF. Note that both S and $\tilde{a}$ are nonnegative. Therefore, if $\tilde{a}$ = 0, then S ≥ $\tilde{a}$ holds trivially. In the other case, we have a₁ <a₂ < ... <a_n. Assume without loss of generality that the minimum is achieved at i = k, that is, $\tilde{a} = \frac{1}{2}$ (a_k+1- a_k). If $\hat{a}$ ∈ [a_k, a_k+1], from Lemma 1, we have

\frac{1}{2} ({(a_{k} - \hat{a})}^{2} + {(a_{k + 1} - \hat{a})}^{2}) \geq {(\frac{1}{2} ((\hat{a} - a_{k}) + (a_{k + 1} - \hat{a}))}^{2} = {\tilde{a}}^{2} .

For i ≠ k, k + 1, (a_i≥ $\hat{a}$ )² ≥ $\tilde{a}$ ². Therefore, $n S^{2} = \sum_{i = 1}^{n} {(a_{i} - \hat{a})}^{2} \geq n {\tilde{a}}^{2}$ . if $\hat{a}$ ∈ [a_p, a_p+1] but p ≠ k, similarly we will have $\frac{1}{2}$ ((a_p- $\hat{a}$ )² + (a_p+1- $\hat{a}$ )²) ≥ (a_p+1- a_p)² ≥ $\tilde{a}$ ² and for i ≠ p, p + 1, (a_i- $\hat{a}$ )² ≥ $\tilde{a}$ ². This proves that $\tilde{a}$ ≤ S.

Inequality S + $\tilde{a}$ ≤ a_n- a₁ holds again if $\tilde{a}$ = 0, since (a_i- $\hat{a}$ )² ≤ (a_n- a₁)² for every i. Therefore, we may assume that a₁ <a₂ < ... <a_n. A similar enlarging process gives S ≤ max{a_n- $\hat{a}$ , $\hat{a}$ - a₁}. Since

a_{n} - \hat{a} + \tilde{a} \leq a_{n} - \frac{1}{2} (a_{2} + a_{1}) + \frac{1}{2} (a_{2} - a_{1}) = a_{n} - a_{1}

and

\hat{a} - a_{1} + \tilde{a} \leq \frac{1}{2} (a_{n} + a_{n - 1}) - a_{1} + \frac{1}{2} (a_{n} - a_{n - 1}) = a_{n} - a_{1},

we conclude that S + $\tilde{a}$ ≤ a_n- a₁. □

According to Lemma 2, the following theorem on the bounds on scatter(j) holds.

Theorem 3 For gene j, define max(j) = max_w≠v $| {\bar{a}}_{w j} - {\bar{a}}_{v j} |$ and min(j) = min_w≠v $| {\bar{a}}_{w j} - {\bar{a}}_{v j} |$ . We have min(j) ≤ scatter (j) ≤ max(j).

A differentially expressed gene is expected to have not only large inter-class variations, which can be represented by its scatter-value, but also small intra-class variations. Secondly, we define a function based on the deviation matrix X_n×pand the centroid deviation matrix $\bar{X}$ _L×p:

μ ({\bar{X}}_{L \times p}, j) = {\hat{x}}_{j} = \frac{1}{L} \sum_{k = 1}^{L} {\bar{x}}_{k j} = \frac{1}{L} \sum_{k = 1}^{L} (\frac{1}{n_{k}} \sum_{i \in C_{k}} x_{i j}),

which is stable. Intuitively, discriminatory genes are expected to have smaller μ-values. In the example dataset, genes 1 and 2 have the same mean expression values across all three classes, that is, they have the equal scatter values. Nonetheless, μ( $\bar{X}$ _3×4, 1) = 0.167 and μ( $\bar{X}$ _3×4, 2) = 0.4, and thus gene 1 is better than gene 2 in this sense.

In the same example, we have μ( $\bar{X}$ _3×4, 3) = μ( $\bar{X}$ _3×4, 4) = 0.4. However, for gene 3, the centroids of three intra-class average deviations are the same, that is, $\bar{x}$ k₃ = 0.4 for k = 1, 2, 3; for gene 4, the scenario is totally different, $\bar{x}$ x₄ = 0.2, 0.4, 0.6 for k = 1, 2, 3. This raises a question of, basing on μ( $\bar{X}$ _L×p, j), what we can tell about the quality of gene j. The contradictory fact is that gene 3 has a smaller maximum intra-class average deviation and a larger minimum intra-class average deviation. To further differentiate the genes, thirdly, we define function d₁(j):

d_{1} (j) = \sqrt{\frac{1}{L} \sum_{k = 1}^{L} {\bar{x}}_{k j}^{2}} .

From Lemma 1, d₁(j) ≥ μ( $\bar{X}$ _L×p, j). d₁(j) is also stable, and in the above example we have d₁(3) <d₁(4), which indicates that function d₁(j) could be more sensitive and conservative than function μ( $\bar{X}$ _L×p, j) on judgment ability. Another stable function can be defined based on μ( $\bar{X}$ _L×p,j) is

d_{2} (j) = \sqrt{\frac{1}{L} \sum_{k = 1}^{L} (\frac{1}{n_{k}} \sum_{i \in C_{k}} x_{i j}^{2})} .

Intuitively, d₂(j) includes more details in its calculation than d₁(j) does. In the above example, gene 2 and gene 3 have the same mean expression values across all three classes: $\bar{x}$ _1j= $\bar{x}$ _2j= $\bar{x}$ _3j. Therefore, we have d₁(2) = d₁(3) but d₂(2) <d₂(3). Since intuitively gene 2 has a stronger separability than gene 3, d₂(j) could be even more sensitive than d₁(j).

The above two functions d₁(·) and d₂(·) basically consider the means of intra-class variations. The following two functions δ₁(j) and δ₂(j) are introduced to capture the variations of intra-class deviations, corresponding to d₁(·) and d₂(·), respectively:

\begin{matrix} δ_{1} (j) = \sqrt{\frac{1}{L} \sum_{k = 1}^{L} {({\bar{x}}_{k j} - μ ({\bar{X}}_{L \times p}, j))}^{2}}, & and & δ_{2} (j) = \sqrt{\frac{1}{L} \sum_{k = 1}^{L} \frac{1}{n_{k}} \sum_{i \in C_{k}} {({\bar{x}}_{i j} - μ ({\bar{X}}_{L \times p}, j))}^{2}} \end{matrix} .

Theorem 4 $\begin{matrix} δ_{1} (j) = \sqrt{d_{1} {(j)}^{2} - μ {({\bar{X}}_{L \times p}, j)}^{2}} & a n d & δ_{2} (j) = \sqrt{d_{2} {(j)}^{2} - μ {({\bar{X}}_{L \times p}, j)}^{2}} \end{matrix} .$

PROOF. The proof is easily done by simplifying the definition formulae for δ₁(j) and δ₂(j). □

Similar to functions d₁(j) and d₂(j), for an ideal differentially expressed gene j, both δ₁(j) and δ₂(j) are expected to have small values. Moreover, similar to the relation between d₁(j) and d₂(j), δ₂(j) is considered more sensitive than δ₁(j). We define function compact_k(j) = d_k(j) + δ_k(j), for k = 1, 2, to evaluate the intra-class variations for gene j. And we define the gene scoring function s_k(j) = compact_k(j)/scatter(j) to rank the genes according to their differentiability. Note that a smaller value of s_k(j) indicates a higher differentiability.

We denote the gene selection method using compact₁(j) = d₁(j) + δ₁(j) as GS1, and the other using compact₂(j) = d₂(j) + δ₂(j) as GS2. Both GS1 and GS2 are model-free and stable. In each of them, the scores for all genes are calculated and genes are sorted in non-decreasing order of their scores. Since the number of genes, p, is typically much larger than the number of samples, n, the overall running time to compute this order is O(p log p). In practice, there are several ways to select the informative genes using this order. For example, one may select the top ranked x genes for further analysis, or the top ranked x% genes, or all the genes with score no larger than some constant, among others.

F-test method

F-test method [1, 19] is also a single gene scoring approach. Besides the notations used in our methods, it uses $σ_{k}^{2}$ to denote the variance of expression value of gene j in the k-th class:

σ_{k}^{2} = \frac{\sum_{i \in C_{k}} {(a_{i j} - {\bar{a}}_{k j})}^{2}}{n_{k} - 1},

and $σ^{2} = \frac{\sum_{k = 1}^{L} n_{k} (n_{k} - 1) σ_{k}^{2}}{n - L}$ to denote the variance in the whole dataset. Gene j has a score defined to be:

F (j) = \frac{\sum_{k = 1}^{L} n_{k} ({\bar{a}}_{k j} - {\hat{a}}_{j}) / (L - 1)}{σ^{2}}

Cho's method

Using the same notations used as in the above, Cho's method [17] defines a weight factor w_ifor sample i, which is $\frac{1}{n_{k}}$ if sample i belongs to class k. Let $W = \sum_{i = 1}^{n} w_{i}$ . The weighted mean(j) for gene j is defined as

m e a n (j) = \sum_{i = 1}^{n} \frac{w_{i}}{W} x_{i j} .

The weighted standard deviation is defined as

s t d (j) = \sqrt{\frac{\sum_{i = 1}^{n} {(x_{i j} - m e a n (j))}^{2}}{(n - 1 / n) \sum_{i = 1}^{n} w_{i}}} .

Then the score of gene j is calculated as

C (j) = \frac{m e a n (j) \times s t d (j)}{s t d ({\bar{a}}_{j})},

where std( ${\bar{a}}_{j}$ ) is the standard deviation of centroid expression values ( ${\bar{a}}_{i j}, {\bar{a}}_{2 j}, \dots, {\bar{a}}_{L j}$ ).

References

Dudoit S, Fridlyand J, Speed TP: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 2002, 97: 77–87.
Article CAS Google Scholar
Xiong M, Fang X, Zhao J: Biomarker Identification by Feature Wrappers. Genome Research 2001, 11: 1878–1887.
PubMed Central CAS PubMed Google Scholar
Mukherjee S, Roberts SJ: A Theoretical Analysis of Gene Selection. Proceedings of IEEE Computer Society Bioinformatics Conference (CSB 2004) 2004, 131–141.
Google Scholar
Baldi P, Long AD: A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-test and Statistical Inferences of Gene Changes. Bioinformatics 2001, 17: 509–519.
Article CAS PubMed Google Scholar
Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 2002, 46: 389–422.
Article Google Scholar
Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB: Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002, 18: 1454–1461.
Article CAS PubMed Google Scholar
Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21: 631–643.
Article CAS PubMed Google Scholar
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286: 531–537.
Article CAS PubMed Google Scholar
Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001, 7: 673–679.
Article PubMed Central CAS PubMed Google Scholar
Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN: Gene Expression-based Classification of Malignant Gliomas Correlates Better with Survival than Histological Classification. Cancer Research 2003, 63: 1602–1607.
CAS PubMed Google Scholar
Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of USA 2001, 98: 13790–13795.
Article CAS Google Scholar
Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM, Moskaluk CA, Frierson HF Jr, Hampton GM: Molecular Classification of Human Carcinomas by Use of Gene Expression Signatures. Cancer Research 2001, 61: 7388–7393.
CAS PubMed Google Scholar
Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2002, 30: 41–47.
Article CAS PubMed Google Scholar
Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1: 203–209.
Article CAS PubMed Google Scholar
Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Amgel M, Reich M, Pinkus GS, Ray TS, Kovall MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nature Medicine 2002, 8: 68–74.
Article CAS PubMed Google Scholar
The MathWorks[http://www.mathworks.com/]
Cho JH, Lee D, Park JH, Lee IB: New gene selection for classification of cancer subtype considering within-class variation. FEBS Letters 2003, 551: 3–7.
Article CAS PubMed Google Scholar
MATLAB Support Vector Machine Toolbox[http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/]
Ding C, Peng H: Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Proceedings of IEEE Computer Society Bioinformatics Conference (CSB'03) 2003, 523–530.
Google Scholar

Download references

Acknowledgements

The authors are grateful to the support from CFI and NSERC, and to two referees for their many helpful comments and suggestions that improve the presentation.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, China
Kun Yang & Jianzhong Li
Department of Computing Science, University of Alberta. Edmonton, Alberta, T6G 2E8, Canada
Zhipeng Cai & Guohui Lin

Authors

Kun Yang
View author publications
You can also search for this author in PubMed Google Scholar
Zhipeng Cai
View author publications
You can also search for this author in PubMed Google Scholar
Jianzhong Li
View author publications
You can also search for this author in PubMed Google Scholar
Guohui Lin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Guohui Lin.

Additional information

Authors' contributions

KY and ZC contributed equally to this work. They both participated in the design of the scoring functions and their implementations. ZC, JL, and GL drafted the manuscript. GL designed the framework, supervised the whole work, and finalized the manuscript. All authors read and approved the final manuscript.

Kun Yang, Zhipeng Cai contributed equally to this work.

Electronic supplementary material

12859_2005_967_MOESM1_ESM.TGZ

Additional File 1: Plots of the LOO and 5-fold cross validation classification accuracies of the SVM-classifier and the KNN-classifier built on the top ranked x genes, for x ≤ 100, on all eight datasets LEU, SRBCT, GLIOMA, CAR, LUNG, MLL, PROSTATE, and DLBCL. One plot corresponds to a type of cross validation classification accuracies of one classifier combined with each of the four gene selection methods on one dataset. There are in total 32 plots. (TGZ 73 KB)

12859_2005_967_MOESM2_ESM.TGZ

Additional File 2: Plots of standard deviations for the 5-fold cross validation classification accuracies. One plot corresponds to the two classifiers combined with a gene selection method. There are in total 4 plots. (TGZ 36 KB)

12859_2005_967_MOESM3_ESM.tgz

Additional File 3: Four microarray datasets (CAR, DLBCL, GLIOMA, and LEU) in MATLAB format, which are used in this work. (TGZ 9 MB)

12859_2005_967_MOESM4_ESM.tgz

Additional File 4: Four microarray datasets (LUNG, MLL, PROSTATE, and SRBCT) in MATLAB format, which are used in this work. (TGZ 7 MB)

Additional File 5: MATLAB codes for gene selection methods GS1 and GS2. (TGZ 2 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Authors’ original file for figure 9

Authors’ original file for figure 10

Authors’ original file for figure 11

Authors’ original file for figure 12

Authors’ original file for figure 13

Authors’ original file for figure 14

Authors’ original file for figure 15

Authors’ original file for figure 16

Authors’ original file for figure 17

Authors’ original file for figure 18

Authors’ original file for figure 19

Authors’ original file for figure 20

Authors’ original file for figure 21

Authors’ original file for figure 22

Authors’ original file for figure 23

Authors’ original file for figure 24

Authors’ original file for figure 25

Authors’ original file for figure 26

Authors’ original file for figure 27

Authors’ original file for figure 28

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Yang, K., Cai, Z., Li, J. et al. A stable gene selection in microarray data analysis. BMC Bioinformatics 7, 228 (2006). https://doi.org/10.1186/1471-2105-7-228

Download citation

Received: 29 October 2005
Accepted: 27 April 2006
Published: 27 April 2006
DOI: https://doi.org/10.1186/1471-2105-7-228

A stable gene selection in microarray data analysis

Abstract

Background

Results

Conclusion

Background

Results and discussion

The datasets

Classification accuracies

Stability of the gene selection methods

Conclusion

Methods

F-test method

Cho's method

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us