This article is part of the supplement: Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)

Open Access Proceedings

Gene network modular-based classification of microarray samples

Pingzhao Hu1*, Shelley B Bull2 and Hui Jiang1

Author Affiliations

1 Department of Computer Science and Engineering, York University, Toronto, M3J 1P3, Canada

2 Prosserman Center for Health Research, Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, M5G 1X5, Canada

BMC Bioinformatics 2012, 13(Suppl 10):S17  doi:10.1186/1471-2105-13-S10-S17


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/13/S10/S17


Published: 25 June 2012

© 2012 Hu et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

A molecular predictor is a new tool for disease diagnosis that uses gene expression to classify the diagnostic category of a patient. The statistical challenge in constructing such a predictor is that thousands of genes are available to predict the disease categories, but only a small number of samples.

Results

We propose a gene network modular-based linear discriminant analysis approach that integrates the 'essential' correlation structure among genes into the predictor, so that the modules or cluster structures of genes related to the diagnostic classes of interest can have a potential biological interpretation. We compared the performance of the new method with that of other established classification methods using three real data sets.

Conclusions

Our results show that the new approach has the advantage of computational simplicity and efficiency, with relatively lower classification error rates than the compared methods in many cases. The modular-based linear discriminant analysis approach introduced in the study has the potential to increase the power of discriminant analysis in microarray studies where sample sizes are small and the number of genes is large.

Background

With the development of microarray technology, more and more statistical methods have been developed and applied to disease classification using microarray gene expression data. For example, Golub et al. developed a "weighted voting method" to classify two types of human acute leukemias [1]. Radmacher et al. constructed a 'compound covariate predictor' to predict the BRCA1 and BRCA2 mutation status of breast cancer [2]. The family of linear discriminant analysis (LDA) methods has been widely applied to such high-dimensional data [3-6]. LDA computes the optimal transformation, which minimizes the within-class distance and maximizes the between-class distance simultaneously, thus achieving maximum discrimination. Many other works have also extended the LDA framework to handle the large p (number of genes) and small n (sample size) problem. For example, Shen et al. developed an eigengene based linear discriminant model by using a modified rotated spectral decomposition approach to select 'hub' genes [5]. Pang et al. proposed an improved diagonal discriminant method through shrinkage and regularization of variances, which borrows information across genes to improve the estimation of gene-specific variances [6].

Studies have shown that, given the same set of selected genes, different classification methods often perform quite similarly, and simple methods like diagonal linear discriminant analysis (DLDA) and k nearest neighbor (kNN) normally work remarkably well [3]. However, the data points in microarray data sets often come from a very high-dimensional space, and in general the sample size does not exceed this dimension, which presents unique challenges to feature selection and predictive modeling. Thus, finding the most informative genes is a crucial task in building predictive models from microarray gene expression data to handle the large p (number of genes) and small n (sample size) problem. To tackle this issue, different clustering-based classification approaches have been proposed to reduce the data dimensions.

Li et al. developed cluster-Rasch models, in which a model-based clustering approach was first used to cluster genes and then the discretized gene expression values were input into a Rasch model to estimate a latent factor associated with disease classes for each gene cluster [7]. The estimated latent factors were finally used in a regression analysis for disease classification. They demonstrated that their results were comparable to those previously obtained, but the discretization of continuous gene expression levels usually results in a loss of information. Hastie et al. proposed a tree harvesting procedure for finding additive and interaction structure among gene clusters in their relation to an outcome measure [8]. They found that the advantage of the method could not be demonstrated because of the lack of rich samples. Dettling et al. presented an algorithm to search for gene clusters in a supervised way. The average expression profile of each cluster was considered as a predictor for traditional supervised classification methods [9]. A similar idea was further explored by Park et al. [10]. They took a two-step procedure: 1) hierarchical clustering and 2) Lasso. In the first step, they defined super-genes by averaging the genes within the clusters; in the second step, they used the super-gene expression profiles to fit regression models. However, using simple averages discards information about the relative prediction strength of different genes in the same gene cluster [9]. Yu also compared different approaches to forming gene clusters, and the resulting information was used to provide sets of genes as predictors in regression [11]. However, clustering approaches are often subjective and usually neglect the detailed relationships among genes.

Recently, gene co-expression networks have become an increasingly active research area [12-15]. A gene co-expression network is essentially a graph where nodes correspond to genes and edges between genes represent their co-expression relationship. The gene neighbor relations (such as topology) in the networks are usually neglected in traditional cluster analysis [14]. One of the major applications of gene co-expression networks has been the identification of functional modules in an unsupervised way [12,13]; such modules may not readily distinguish members of different sample classes. Recent studies have shown that prognostic signatures that can be used to classify the gene expression profiles of individual patients can be identified from network modules in a supervised way [15].

In this study, we propose a network modular-based LDA (named MLDA) method for improving the prediction performance of DLDA, DQDA and other methods. The major difference between our method and other LDA-based methods is that MLDA incorporates the gene network modules into LDA in a supervised way. We build the MLDA prediction model using modular-specific features. As a comparison, we also implement a variant of super-gene based regression models [10]. We first define super-genes by extracting the first principal component (PC) within the network modules. We then use the super-gene expression profiles to fit a logistic regression (LR) model. We name this method MPCLR.

Materials and methods

Data sets

Three real microarray data sets are used in evaluating the performance of our proposed algorithm and other established classification methods. The detailed description of these data sets is shown in Table 1. We obtained the preprocessed colon cancer microarray expression data from http://genomics-pubs.princeton.edu/. For the prostate cancer and lung cancer microarray data sets, we downloaded the raw data from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) and preprocessed them using the robust multi-array average (RMA) algorithm [16].

Table 1. Descriptive characteristics of data sets used for classification

Seed-based network-module identification

To identify gene modules in a gene co-expression network, we modify the correlation-sharing method developed by Tibshirani and Wasserman [20], which was originally proposed to detect differential gene expression. Specifically, we first use a seed-based approach to identify correlation-shared gene modules from the gene network. Each of these modules includes a gene that is differentially expressed between sample classes, which is treated as a seed, and a set of other genes highly co-expressed with the seed gene. The revised approach works in the following steps:

1: Build a co-expression network using Pearson correlation coefficient (r) [21].

2: Compute the test statistic Ti (i = 1,2,...,p) for each gene i in the co-expression network using the standard t-statistic or a modified t-statistic, such as that of significance analysis of microarrays (SAM) [22].

3: Rank the absolute test statistic values from the largest one to the smallest one and select the top m genes as seed genes.

4: Find the module membership s for each selected seed gene i* in the co-expression network. The module assignments can be characterized by a many-to-one mapping. That is, one seeks a particular encoder Cr(i*) that maximizes

$$\operatorname{ave}_{s \in C_r(i^*)} |T_s| \qquad (1)$$

where $C_r(i^*) = \{s : \operatorname{abs}(\operatorname{corr}(x_{i^*}, x_s)) \ge r\}$. The set of genes s for each seed gene i* is an adaptively chosen module, which maximizes the average (ave) differential expression signal around gene i*. The set of identified genes s should have absolute (abs) correlation (corr) with i* larger than or equal to r.
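The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`t_statistic`, `pearson`, `seed_modules`) are hypothetical, a Welch two-sample t-statistic stands in for the "standard t-statistic", the seed is always kept in its own module, and the default cutoff grid is the one reported later in the Results (0.5 to 0.95 in steps of 0.05). Genes with zero variance are not handled.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Welch two-sample t-statistic for one gene (an assumed choice of test)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def seed_modules(expr, labels, m, cutoffs=None):
    """expr: one list of expression values per gene; labels: 'A' or 'B' per sample.

    Returns {seed gene index: list of member gene indices}.
    """
    if cutoffs is None:
        cutoffs = [0.5 + 0.05 * k for k in range(10)]  # 0.50, 0.55, ..., 0.95
    idx_a = [j for j, g in enumerate(labels) if g == "A"]
    idx_b = [j for j, g in enumerate(labels) if g == "B"]
    # Step 2: one test statistic per gene.
    t = [t_statistic([row[j] for j in idx_a], [row[j] for j in idx_b])
         for row in expr]
    # Step 3: the m genes with the largest |T| become seeds.
    seeds = sorted(range(len(expr)), key=lambda i: abs(t[i]), reverse=True)[:m]
    modules = {}
    for s in seeds:
        # Step 1, restricted to the seed: |r| between the seed and every gene.
        r = [abs(pearson(expr[s], row)) for row in expr]
        best_score, best_members = -1.0, [s]
        # Step 4: keep the cutoff whose module has the largest average |T|.
        for cut in cutoffs:
            members = [i for i in range(len(expr)) if i == s or r[i] >= cut]
            score = mean(abs(t[i]) for i in members)
            if score > best_score:
                best_score, best_members = score, members
        modules[s] = best_members
    return modules
```

Ties between cutoffs are resolved here in favour of the smallest cutoff, which yields the largest module; the text does not say how ties are broken, so this is a design choice of the sketch.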

MLDA algorithm

We propose a new formulation of traditional linear discriminant analysis. Specifically, we first use the seed-based approach to identify gene network modules. Then we perform LDA in each module. The linear predictors in all the identified modules are then summed up. The new modular-based classification approach returns signature components of tight co-expression with good predictive performance.

Assume there are two sample groups, A and B (such as disease and normal groups), which have nA and nB samples, respectively. The data for each sample j consist of a gene expression profile xj = (x1j,x2j,...,xpj), where xij is the log ratio expression measurement for gene i = 1,2,...,p and sample j = 1,2,...,n, with n = nA+nB. We assume that expression profiles x from group k (k ∈ {A,B}) are distributed as N(μk,∑k), the multivariate normal distribution with mean vector μk and covariance matrix ∑k.

In a simplified way, we assume that ∑ = ∑A = ∑B = {σii'}, i,i' = 1,2,...,p, where [equation M3], σii' = σi'i and σii' is the pooled covariance estimate of gene i and gene i' for sample groups A and B. Therefore, when $\hat{\Sigma}$ has a block-diagonal structure, we have

$$\hat{\Sigma} = \operatorname{diag}(\hat{\Sigma}_1, \hat{\Sigma}_2, \ldots, \hat{\Sigma}_C)$$

where C is the number of blocks (gene modules) and $\hat{\Sigma}_c$ is the estimated covariance matrix for block c (c = 1,2,...,C).

The linear predictor (LP) with block-diagonal covariance structure is given by

[equation M7]    (2)

where [equation M8] is the expression measurements of the genes in module c for a new sample to be predicted and [equation M9] is the mean vector of the genes in module c. Linear discriminant analysis (LDA) and diagonal linear discriminant analysis (DLDA) [3] are special cases of MLDA. That is, when C = 1, [equation M10], where xT is the expression measurements of p genes for a new sample to be predicted, so MLDA reduces to LDA; when C = p (that is, each module has only one gene), [equation M11], where xi is the expression measurement of gene i for a new sample to be predicted, so MLDA reduces to DLDA.

We estimate the mean vector [equation M12] of the genes in module c as [equation M13] and use the pooled estimate of the common covariance matrix in each module c

[equation M14]    (3)

where [equation M15], i,i' = 1,2,...,pc and pc is the number of genes in module c. [equation M16] is estimated as

[equation M17]    (4)

where [equation M18], i,i' = 1,2,...,pc and i ≠ i', and [equation M19] is the correlation estimate between gene i and gene i' in module c of sample group k.
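For orientation, the textbook forms of the two quantities described here are given below. They are an assumption about the general shape of equations (3) and (4), not a transcription of them: the hat notation and the superscript (ck) for "module c, group k" are chosen for this sketch, and the normalization in the original equations may differ.

```latex
% Covariance of genes i and i' rebuilt from their correlation and variances
\hat{\sigma}^{(ck)}_{ii'} = \hat{r}^{(ck)}_{ii'}
    \sqrt{\hat{\sigma}^{(ck)}_{ii}\,\hat{\sigma}^{(ck)}_{i'i'}}

% Pooled within-group covariance of module c
\hat{\Sigma}_c = \frac{(n_A - 1)\,\hat{\Sigma}_{cA} + (n_B - 1)\,\hat{\Sigma}_{cB}}
                      {n_A + n_B - 2}
```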

∑c is invertible when n ≥ pc, that is,

[equation M20]

However, in some modules (say module c), it is possible that n < pc. In this case, ∑c is not invertible. We apply the singular value decomposition (SVD) technique [23] to solve the problem. Assume ∑c is a pc × pc covariance matrix, which can be decomposed as ∑c = UDVT, where U and V are orthogonal, and [equation M21] with [equation M22]. If ∑c is a pc × pc nonsingular matrix (iff σi > 0 for all i (i = 1,2,...,pc)), then its inverse is given by [equation M23], where [equation M24].
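A common way to use the SVD when the covariance matrix is singular is the Moore-Penrose pseudo-inverse, which inverts only the nonzero singular values. The block below states that standard construction; that the singular case is handled in exactly this way here is an assumption of this sketch, and the superscript + notation is not taken from the original equations.

```latex
\Sigma_c = U D V^{T}, \qquad D = \operatorname{diag}(\sigma_1, \ldots, \sigma_{p_c}),
\qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{p_c} \ge 0

\Sigma_c^{+} = V D^{+} U^{T}, \qquad
D^{+}_{ii} =
\begin{cases}
  1/\sigma_i & \text{if } \sigma_i > 0 \\
  0          & \text{if } \sigma_i = 0
\end{cases}
```

When every σi is positive, D+ equals the ordinary inverse of D and the pseudo-inverse coincides with the ordinary inverse of ∑c.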

The rule to assign a new sample j to group k is thus based on [equation M25]: when this condition holds, sample j is assigned to group A; otherwise, it is assigned to group B.
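The module-wise procedure described in this section (LDA within each module, linear predictors summed over modules, assignment to group A or B) can be illustrated with a small pure-Python sketch. It assumes the standard two-class LDA score with equal priors and a decision threshold at zero, which may differ in detail from equation (2) and the assignment rule above. All function names are hypothetical, and the sketch covers only modules whose pooled covariance is invertible.

```python
def mean_vec(rows):
    """Column means of a list of samples (each a list of gene values)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, mu):
    """Sum of outer products of the centred samples (p x p nested list)."""
    p = len(mu)
    s = [[0.0] * p for _ in range(p)]
    for x in rows:
        d = [xi - mi for xi, mi in zip(x, mu)]
        for i in range(p):
            for k in range(p):
                s[i][k] += d[i] * d[k]
    return s


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z


def fit_mlda(xa, xb, modules):
    """xa, xb: training samples of groups A and B; modules: lists of gene indices."""
    model = []
    for genes in modules:
        a = [[x[g] for g in genes] for x in xa]
        b = [[x[g] for g in genes] for x in xb]
        mu_a, mu_b = mean_vec(a), mean_vec(b)
        sa, sb = scatter(a, mu_a), scatter(b, mu_b)
        dof = len(a) + len(b) - 2
        # Pooled covariance of the module (assumed n_A + n_B - 2 denominator).
        pooled = [[(u + v) / dof for u, v in zip(ra, rb)] for ra, rb in zip(sa, sb)]
        # Weights w = pooled^{-1} (mu_A - mu_B), obtained without forming the inverse.
        w = solve(pooled, [p - q for p, q in zip(mu_a, mu_b)])
        mid = [(p + q) / 2 for p, q in zip(mu_a, mu_b)]
        model.append((genes, w, mid))
    return model


def linear_predictor(model, x):
    """Sum of per-module LDA scores; positive values favour group A."""
    return sum(
        sum(wi * (x[g] - mi) for g, wi, mi in zip(genes, w, mid))
        for genes, w, mid in model
    )


def predict(model, x):
    """Assign a new sample to 'A' when the summed score is positive, else 'B'."""
    return "A" if linear_predictor(model, x) > 0 else "B"
```

Passing a single module that contains all genes gives ordinary LDA under these assumptions, and passing one single-gene module per gene gives the diagonal rule, which mirrors the two special cases noted in the text.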

MPCLR algorithm

In order to compare MLDA with other super-gene based classification approaches, we also implement a variant of super-gene based regression models [10]. The MPCLR classification algorithm includes three stages: 1) construct correlation-sharing based gene network modules; 2) extract meta-gene expression profiles from the constructed modules using principal component analysis (PCA); 3) classify samples using a PCA-based logistic regression model. Here we briefly describe each of the three stages:

Stage 1: Construct seed-based gene network modules. This can be done using the same approach as used in MLDA algorithm described above.

Stage 2: Principal component analysis of correlation-shared expression profiles. For each of the seed-based gene network modules, we perform principal component analysis. Specifically, for a given gene module with pc genes, let [equation M26] be the expression indices of the pc genes in the jth sample, and let ∑ be the covariance matrix of x, with dimension pc × pc. All positive eigenvalues of ∑ are denoted by [equation M27]. The first PC score of the jth sample is given by [equation M28], where e1 is the eigenvector associated with λ1. Therefore, we can define the super-gene expression profile for n samples in a seed-based gene module as [equation M29]. The estimated values of the coefficients [equation M30] (eigenvector) of the first PC can be computed using singular value decomposition (SVD) [23]. Briefly, let E be an n × pc matrix of normalized gene expression values of the pc genes in a given module. The SVD of E can be expressed as E = UDA^T, where U = {u1,u2,...,ud} is an n × d matrix (d = rank(E)), [equation M31] is a d × d diagonal matrix where dk is the kth singular value of E (the square root of the kth eigenvalue of E^T E), and A = {e1,e2,...,ed} is a pc × d matrix where ek is the eigenvector associated with λk and holds the coefficients for defining PC scores. The magnitude of the loadings for the first principal component score can be viewed as an estimate of the amount of contribution from the module genes.
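As a small illustration of Stage 2, the first PC loadings and scores of one module can be obtained without a linear algebra library by power iteration on the sample covariance matrix. This is a sketch that swaps power iteration in for the SVD used in the text; the function name `first_pc_scores` and the fixed iteration count are assumptions, the sign of the loading vector is arbitrary, and the sketch assumes the module has nonzero variance and a dominant first eigenvalue.

```python
import math


def first_pc_scores(e, iters=500):
    """e: n x p list of samples for one module. Returns (loadings, scores)."""
    n, p = len(e), len(e[0])
    means = [sum(col) / n for col in zip(*e)]
    # Centre each gene (column) before computing the covariance matrix.
    x = [[v - m for v, m in zip(row, means)] for row in e]
    cov = [[sum(r[i] * r[k] for r in x) / (n - 1) for k in range(p)]
           for i in range(p)]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][k] * v[k] for k in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # First PC score of each sample: projection of the centred profile on v.
    scores = [sum(a * b for a, b in zip(row, v)) for row in x]
    return v, scores
```

The returned scores play the role of the super-gene expression profile of the module, one value per sample.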

Stage 3: Classification using a PCA-based logistic regression model. Assume Y is a categorical variable indicating the disease status (such as cancer or no cancer). Here we only focus on binary classification and suppose that Y = 1 denotes the presence and Y = 0 the absence of the disease. We then have the following supervised PCA-based logistic regression model:

[equation M32]    (5)

where [equation M33]. PC1i*j is the first principal component score estimated from the seed gene module i* for sample j and represents the latent variable for the underlying biological process associated with this group of genes. The model was fitted using the glm function in the stats R package.
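For reference, a logistic regression on the first PC scores of m seed gene modules has the general form below. The coefficient names β0 and βi* and the number of modules m are notation assumed for this sketch and need not match equation (5).

```latex
\operatorname{logit} P(Y_j = 1)
  = \log \frac{P(Y_j = 1)}{1 - P(Y_j = 1)}
  = \beta_0 + \sum_{i^* = 1}^{m} \beta_{i^*}\, \mathrm{PC1}_{i^* j}
```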

Comparisons of different supervised classification methods

We compared the prediction performance of MLDA with that of other established supervised classification methods, which include diagonal quadratic discriminant analysis (DQDA), DLDA, the one nearest neighbor method (1NN), support vector machines (SVM) with a linear kernel, and recursive partitioning and regression trees (Trees). We used the implementations of these methods in R packages (http://cran.r-project.org/): sma for DQDA and DLDA, class for 1NN, e1071 for SVM and rpart for Trees. Default parameters in e1071 and rpart were used for SVM and Trees, respectively. For the other methods (DQDA, DLDA, 1NN, MPCLR and MLDA), there are no tuning parameters to be selected. In the comparisons, seed genes were selected using the t-test and SAM, respectively. We evaluated the performance of DQDA, DLDA, 1NN, SVM and Trees for different numbers of selected seed genes, and that of MPCLR and MLDA for different numbers of gene modules, which were built on the selected seed genes.

Cross-validation

We performed 10-fold cross-validation to evaluate the performance of these classification methods. The basic principle is that we split all samples in a study into 10 subsets of (approximately) equal size, set aside one of the subsets from training, and carried out seed gene selection, gene module construction and classifier fitting using the remaining 9 subsets. We then predicted the class labels of the samples in the omitted subset based on the constructed classification rule. We repeated this process 10 times so that each sample was predicted exactly once. We determined the classification error rate as the ratio of the number of incorrectly predicted samples to the total number of samples in a given study. This 10-fold cross-validation procedure was repeated 10 times and the averaged error rate was reported.
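The evaluation loop described above can be sketched as follows. The point it illustrates is that seed gene selection, module construction and classifier fitting all happen inside `fit`, on the training folds only, so the held-out fold never influences feature selection. The function names and the callback signatures `fit(train_x, train_y)` and `predict(model, x)` are hypothetical, and the way samples are dealt into folds is one simple choice among many.

```python
import random


def cv_error(samples, labels, fit, predict, k=10, seed=0):
    """Error rate of one k-fold cross-validation run.

    fit(train_x, train_y) must do seed-gene selection, module construction
    and classifier fitting using the training folds only.
    """
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k folds of near-equal size
    wrong = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        # Each sample is predicted exactly once, by a model that never saw it.
        wrong += sum(predict(model, samples[i]) != labels[i] for i in held_out)
    return wrong / len(samples)


def repeated_cv_error(samples, labels, fit, predict, k=10, repeats=10):
    """Average the k-fold error rate over several random partitions."""
    runs = [cv_error(samples, labels, fit, predict, k, seed=r)
            for r in range(repeats)]
    return sum(runs) / repeats
```

With `k=10` and `repeats=10` this mirrors the 10 times repeated 10-fold scheme in the text; the random seeds are arbitrary.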

Results

Tables 2, 3 and 4 list the prediction performance of the different classification methods applied to microarray gene expression data sets for colon, prostate and lung cancers, respectively. Here, different numbers of top seed genes (5, 10, 15, 20, 30, 40, 50) were selected by t-test. Because it is generally time-consuming to search for genes that are not only correlated with a given seed gene but also maximize their averaged test statistic value (Formula 1), we only tested 10 cutoffs of correlation r, from 0.5 to 0.95 with interval 0.05. We observed that the averaged correlation of genes in the identified modules is usually between 0.65 and 0.85, with the number of genes in the modules ranging from 2 to 56, suggesting that the genes in the modules are highly co-expressed.

Table 2. Mean error rates of classification methods applied to colon cancer data set

Table 3. Mean error rates of classification methods applied to prostate cancer data set

Table 4. Mean error rates of classification methods applied to lung cancer data set

As can be seen, the proposed MLDA has better or comparable classification performance relative to the other classification methods in the three data sets. The performance of MPCLR is not consistent across the three data sets, likely because the amount of variation in the data captured by the first PC differs between data sets. Other methods with good classification performance are DLDA and SVM. In general, all these methods except Trees work well for both the colon and lung cancer data sets. The performance of these methods on the prostate cancer data is slightly worse than on the colon and lung cancer data sets, which may be due to clinical heterogeneity among samples.

We also used SAM to select seed genes and evaluated the prediction performance using the same procedure as described above. The prediction results are similar to those shown in Table 4. Overall, MLDA has a slightly lower error rate than the other classification methods compared (Table 5).

Table 5. Mean error rates of classification methods applied to lung cancer data set

In many cases, we found that the simple method DLDA works well. Its performance is comparable with that of advanced methods such as SVM. We also observed that predictors with more genes do not necessarily perform better than predictors with fewer genes. For example, when the t-test was used to select the seed genes, the best performance was obtained with only 5 genes for the MPCLR and MLDA predictors in the colon cancer data set (Table 2), 10 genes for the SVM predictor in the prostate cancer data set (Table 3) and 30 genes for the MLDA predictor in the lung cancer data set (Table 4). When SAM was used to select the seed genes, the best performance was also obtained with 30 genes for the SVM, MPCLR and MLDA predictors in the lung cancer data set (Table 5).

Discussion and conclusions

In this study we developed a network modular-based approach for disease classification using microarray gene expression data. The core idea of the method is to incorporate the 'essential' correlation structure among genes into a supervised classification procedure; this structure has been neglected or inefficiently applied in many benchmark classifiers. Our method takes into account the fact that genes act in networks, and the modules identified from the networks act as the features in constructing a classifier. The rationale is that we usually expect tightly co-expressed genes to have a meaningful biological explanation. For example, a high correlation between gene A and gene B sometimes hints that the two genes belong to the same pathway or functional module. The advantage of the method over other methods has been demonstrated on three real data sets. Our results show that the MLDA algorithm works well for small sample size classification. It performs relatively better than DLDA, 1NN, SVM and other classifiers in many situations. The modular LDA approach introduced in the study has the potential to increase the power of discriminant analysis in microarray studies where sample sizes are small and the number of genes is large.

Our results are consistent with previous findings: simple methods have comparable or better classification results than more advanced or complicated methods [3]. This is likely because more parameters must be estimated in the advanced methods than in the simple methods, while our data sets usually have far fewer samples than features/genes. We also tried using more top genes (up to 100) in the classification models and observed result patterns similar to those in Tables 2, 3, 4 and 5 (results not shown). Although some previous studies showed that better results can be obtained when the number of top genes used in the prediction models is much larger than the number of samples, the improved performance may be due to overfitting. Moreover, for clinical purposes, it is better to include fewer genes rather than more in the prediction models because of cost.

Previous studies have shown that the topological structure of a node (gene product) in a protein network is informative for functional module inference [21,24,25]. Moreover, some useful approaches have been developed to measure the topological similarity of pairs of nodes in weighted networks [21]. It will be interesting to explore a network topology-sharing based method, rather than the correlation-sharing approach, to identify seed-based gene network modules and place them into our network-based classification framework. The MLDA framework can be further extended in many ways. For example, it is possible to directly incorporate the modular-specific features in other advanced discriminant learning approaches (such as SVM). In the future we will explore these ideas in detail.

List of abbreviations

DLDA: diagonal linear discriminant analysis; DQDA: diagonal quadratic discriminant analysis; kNN: k nearest neighbor; LDA: linear discriminant analysis; LR: logistic regression; MLDA: modular-based linear discriminant analysis; MPCLR: modular-principal component based logistic regression; PC: principal component; PCA: principal component analysis; RMA: robust multi-array average; SAM: significance analysis of microarrays; SVD: singular value decomposition; SVM: support vector machines.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PH designed and performed the analysis and wrote the manuscript. PH, SB and HJ designed the algorithms.

Acknowledgements

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 10, 2012: "Selected articles from the 7th International Symposium on Bioinformatics Research and Applications (ISBRA'11)". The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S10

The authors thank Dr. W He and S Colby for their helpful discussions and comments.

References

  1. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.

    Science 1999, 286:531-536. PubMed Abstract | Publisher Full Text OpenURL

  2. Radmacher MD, McShane LM, Simon R: A paradigm for class prediction using gene expression profiles.

    J Comput Biol 2002, 9:505-512. PubMed Abstract | Publisher Full Text OpenURL

  3. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data.

    J Am Stat Assoc 2002, 97:77-87.

  4. Guo Y, Hastie T, Tibshirani R: Regularized linear discriminant analysis and its application in microarrays.

    Biostatistics 2007, 8:86-100.

  5. Shen R, Ghosh D, Chinnaiyan AM, Meng Z: Eigengene based linear discriminant model for gene expression data analysis.

    Bioinformatics 2006, 22:2635-2642.

  6. Pang H, Tong T, Zhao H: Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data.

    Biometrics 2009, 65:1021-1029.

  7. Li H, Hong F: Cluster-Rasch models for microarray gene expression data.

    Genome Biol 2001, 2:RESEARCH0031.

  8. Hastie T, Tibshirani R, Botstein D, Brown P: Supervised harvesting of expression trees.

    Genome Biol 2001, 2:RESEARCH0003.

  9. Dettling M, Bühlmann P: Supervised clustering of genes.

    Genome Biol 2002, 3:RESEARCH0069.

  10. Park MY, Hastie T, Tibshirani R: Averaged gene expressions for regression.

    Biostatistics 2007, 8:212-227.

  11. Yu X: Regression methods for microarray data. PhD thesis. Stanford University; 2005.

  12. Elo L, Jarvenpaa H, Oresic M, Lahesmaa R, Aittokallio T: Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process.

    Bioinformatics 2007, 23:2096-2103.

  13. Presson A, Sobel E, Papp J, Suarez C, Whistler T, Rajeevan M, et al.: Integrated weighted gene co-expression network analysis with an application to chronic fatigue syndrome.

    BMC Syst Biol 2008, 2:95.

  14. Horvath S, Dong J: Geometric interpretation of gene coexpression network analysis.

    PLoS Comput Biol 2008, 4:e1000117.

  15. Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, et al.: Dynamic modularity in protein interaction networks predicts breast cancer outcome.

    Nat Biotechnol 2009, 27:199-204.

  16. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data.

    Nucleic Acids Res 2003, 31:e15.

  17. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.

    Proc Natl Acad Sci USA 1999, 96:6745-6750.

  18. Stuart RO, Wachsman W, Berry CC, Wang-Rodriguez J, Wasserman L, Klacansky I, et al.: In silico dissection of cell-type-associated patterns of gene expression in prostate cancer.

    Proc Natl Acad Sci USA 2004, 101:615-620.

  19. Spira A, Beane JE, Shah V, Steiling K, Liu G, Schembri F, et al.: Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer.

    Nat Med 2007, 13:361-366.

  20. Tibshirani R, Wasserman L: Correlation-sharing for detection of differential gene expression.

    2006. arXiv:math.ST/0608061

  21. Zhang B, Horvath S: A general framework for weighted gene co-expression network analysis.

    Stat Appl Genet Mol Biol 2005, 4:Article17.

  22. Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.

    Proc Natl Acad Sci USA 2001, 98:5116-5121.

  23. Jolliffe IT: Principal Component Analysis. New York: Springer; 2002.

  24. Lubovac Z, Gamalielsson J, Olsson B: Combining functional and topological properties to identify core modules in protein interaction networks.

    Proteins 2006, 64:948-959.

  25. Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions.

    Bioinformatics 2006, 22:1623-1630.