Abstract
Background
Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories.
Results
We propose a factor analysis model (FAM) for functional genomics and give a twostep algorithm, using genomewide expression data for yeast and a subset of GeneOntology Biological Process functional annotations. We show that the predictive performance of our method is comparable to the current best approach while our total computation time was faster by a factor of 4000. We discuss the unique challenges in performance evaluation of algorithms used for genomewide functions genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance.
Conclusion
Our factor analysis model is a computationally efficient technique for functional genomics and provides a clear and unified statistical framework with potential for incorporating important gene ontology information to improve predictions.
Background
Functional genomics is often described as one of the most important challenges in the postgenomics era. Now that many genomes have been sequenced, the next step is to understand the functions of all the geneproducts. One of the most accessible genomewide modalities that has proven useful for this task is expression microarray experiment e.g., [13]. The general idea is to associate genes to functional categories by comparing their expression profiles to genes with known functional associations. This is usually supported by the biological argument that genes in the same or similar functional categories are often part of a functional pathway and therefore likely to be coregulated by the same set of transcription factors.
In a typical functional genomic experiment relying on expression data, a few dozen to a few hundred samples from an organism are processed and hybridized. The samples may be obtained from different tissues and organs, e.g., the mouse experiment of [3], or from different experimental conditions [1]. The expression data for p genes are then assembled in a p × n matrix where each row presents an expression profile of the corresponding gene across n samples.
A few distinct ways have been proposed to predict gene functions from microarray data. The authors of [1] run a multitude of clustering methods which generate thousands of (overlapping) clusters. For each functional category, every cluster is assigned an enrichment pvalue based ona hypergeometric distribution. Those clusters with pvalues below a certain threshold are used for prediction: an uncharacterized gene is localized among all enriched clusters and predicted to be involved in all functions these clusters entail. This is biologically sensible since a gene is usually involved in multiple functions. The authors of [2] use genegene correlations to build graphical models of gene expression data which are then used, among other things, for functional prediction. The predictions are done by first forming a set of shortest paths between genes, thresholded at some length to retain significance, and predicting with a constrained majority rule using the genes on the path.
One modelbased approach, perhaps more intuitive for statisticians, is to use one of many classification methods. In such an approach, one starts with the expression matrix, where genes correspond to observations and their expression profiles correspond to features. A set of K functional categories is specified which span all interesting and relevant biological functions for a given genome. Since any given gene can belong to multiple functional categories, the simplest approach is to build K separate binary classifiers, one for each function, and use them to predict whether each of the uncharacterized genes belongs to one of the K functional categories.
This general paradigm permits one to use many known classification models and algorithms. The authors of [3] apply the Support Vector Machine (SVM) classifier, popular in the machine learning literature, to a set of 55 microarrays of different mouse tissues to predict functions of over 12,000 mouse genes which currently have unknown status. They use a set of 992 functional categories defined by Gene Ontology – Biological Process (GOBP; see [4]), which is a hierarchical categorization of functions much finer in scale than the set of 42 YPD Cellular Role functions, largely independent of each other (in the sense that most genes belong to at most two functional categories), which was used by [1]. More details about GOBP are given below.
It is important to acknowledge that functions of geneproducts cannot, in many cases, be fully resolved from coexpression data alone: coexpression may be a result of other factors besides coregulation, e.g., separate pathways that happened to be activated in the same experiment, or simply chance. Similarly, transregulation factors may be acting more globally inducing coexpression of genes that are weakly connected or completely disconnected on a functional level. Some of these concerns can be alleviated by using expression data that represent a wide array of conditions, so that significant observed coexpression between genes is more likely to have resulted from close functional relationship of respective geneproducts than from other factors. Studies with such design, as those already cited [1,3], have been successful in that regard. As with many other genomewide studies that use highthroughput technologies, this proposal can potentially be highly useful as an initial hypothesisgenerating step.
In this paper, we propose a new approach to predicting gene functions using expression data and GOBP annotations. It is worth noting that our framework can, in principle, be extended to utilize further sources of information, besides expression data (see "Discussion" section). Such approaches, sometimes dubbed "genomic data fusion" models, have been proposed in the literature and show some promise of improving prediction accuracy. In [5], the authors utilize phylogenetic profiles and fused domain methods in addition to experimental data that include mRNA expression to derive a large number of candidate proteinprotein links and hence predict function in yeast Saccharomyces cerevisiae. A method which is similar in spirit but more refined in terms of an approach to assigning interactions from various genomic sources, has been proposed by [6], who use phenotype data from gene knockout experiments, cellular localization data, and information from proteinprotein interaction databases in addition to expression data. Each data source contributes a bit matrix that codes all the proteinprotein relationships. Depending on the manner of inferring the relationships (e.g., various pvalue thresholds of Pearson correlation coefficient for expression data), and on the manner in which various bit matrices are combined, one can control the contribution of each data source and arrive at the final association matrix and hence predict gene functions. A moremodeloriented approach to data fusion is described in [7] and [8], which propose the Bolztmann machine and SVMs, respectively, for geneproduct function prediction using various genomic data sources.
Gene Ontology – biological processes
We use the GOBP annotations to provide us with a set of functional categories and the associations of characterized genes to these categories. Gene Ontology [9] is a popular information resource for genetic research. The Biological Process part (GOBP), which is relevant for the expression data, has currently almost 10,000 processes arranged in a Directed Acyclic Graph (DAG) structure. The top node in this DAG (Biological Process, accession number GO:0008150) has six children (Development, Behaviour, Physiological Process, Viral Life Cycle, Cellular Process, Regulation of Biological Process) which constitute major groupings of biological processes in all organisms. Each of these is subdivided into more specialized characterizations. The seventh top child is a BiologicalProcess Unknown node, which has no further descendants; it is applied to currently uncharacterized genes. The arcs in the GOBP DAG usually denote subdivisions of a process characterization (commonly called "is_a" relationships), although sometimes they also denote partitions of a process into smaller steps (commonly called "part_of" relationships). The GOBP is a crossorganism ontology; for any specific organism only a part of its structure will usually be relevant. For example, for simple eukaryotes such as yeast studied in [1], the whole subgraphs under Viral Life Cycle and Behaviour are not relevant and thus have no genes associated to them.
In addition to providing a continually evolving ontology structure, the GO consortium also maintains and curates the association data. For most model organisms, one can obtain a table where each known gene or Open Reading Frame (ORF) is associated with a set of biological processes. These associations are built and maintained based on reliable evidence such as major publications or directly confirmable laboratory experiments, although there is a range of "credibility" scores. These scores are called Evidence Codes and are maintained alongside the association data. They could potentially be used in the analysis to weigh different associations according to their credibility, but we have chosen not to make this distinction in this research.
Motivation and summary of our study
A problem with the typical classification approach is its need to train very large numbers of classifiers on thousands of genes. This is further complicated by the need to use multiple crossvalidation experiments for each classifier to fine tune its control parameters. These requirements are computationally prohibitive for most laboratories. The authors of [3], for example, had access to a powerful computer cluster to perform extensive computations, an atypical resource in many biological labs.
We propose a factor analysis model (FAM) to predict functions for uncharacterized genes using microarray expression data and a subset of the GOBP functional categories. We show that our factor analysis approach provides significant computational advantages without compromising prediction accuracy. It is worthwhile to notice that FAM directly models the expressioninduced correlation structure among genes. This correlation structure coupled with small signaltonoise ratios of microarray data, can negatively impact the statistical efficiency of some classifiers that treat genes as independent observations [10,11]. Finally, we show that our model has the potential for incorporating important gene ontology information and deal with a number of major challenges in using statistical classification techniques for functional genomics (see "Discussion" section).
Results
This section gives an overview of our prediction model, experiments (including descriptions of our data, preprocessing steps, benchmarks and performance evaluation metrics) and computational results.
The model
Our model, FAM, applies factor analysis to predict gene functions using microarray data.
The typical approach
We first describe the typical computational approach, e.g., [3], more formally. Let Z be the p × n matrix of gene expression data; the (i,j)^{th }element of that matrix, z_{ij}, is the expression level of gene i from experiment (or sample) j. Without loss of generality, we shall assume that the GOBP data contain the functional annotations of the first p_{1 }<p genes. In particular, with a total of K different functional categories, this information can be stored in a binary p_{1 }× K matrix, which we call Y_{1}. Its (i, k)^{th }element y_{ik }equals 1 if gene i is associated with function k, and equals 0 otherwise. We use Y_{2 }to denote the unknown functional annotation matrix for the remaining p_{2 }= p  p_{1 }genes; thus Y_{2 }is the p_{2 }× K matrix that needs to be predicted.
Using the matrix Y_{1 }and the first p_{1 }rows of Z, we can build predictive models to associate gene functions (Y) from the microarray data (Z). Typically, K different models {F_{k }: k = 1, 2, ..., K} are constructed independently, one for each functional category, i.e., each column of Y. Since the elements of Y_{1 }are binary, each model F_{k }is simply a classifier. In theory, any classification algorithm can be used here, the most popular being SVMs, e.g., [3]. Each model F_{k }is then applied to the remaining p_{2 }rows of Z to predict whether or not these genes are associated with function k.
The factor analysis model (FAM)
Let be the expression level of p different genes from the j^{th }microarray experiment. Using a factor analysis model [[12], Chapter 9], we model the expression level as
where are latent variables that affect the expression level. Others have applied latent variable models to analyze microarray data as well [13,14], although most of these studies use latent variable models in unsupervised ways and not directly to predict gene function.
In the absence of experimental covariates describing each experiment, we may treat each experiment as equivalent with both random noise and experimental differences modeled by the error part, . If experiments are controlled and experimental covariates are available, the above model may incorporate them by making λ_{il }a function of covariates. This is one of the extensions to this model being currently studied by our group.
In matrix form, Eq. (1) can be written as:
g^{(j) }= Λf^{(j) }+ ε^{(j)}, (2)
or
Suppose Var(g) = S. In factor analysis, it is typically assumed that Var(f) = I is the identity matrix; Var(ε) = E = diag{} is a diagonal matrix; and that ε and f are uncorrelated. We then obtain the following equation from (2):
Var(g) = ΛVar(f)Λ^{T }+ Var(ε)
or
S = ΛΛ^{T }+ E. (3)
The matrix S can be estimated directly from the microarray data. Following general conventions in factor analysis, we take S to be the correlation matrix rather than the covariance matrix. This is equivalent to standardizing the data prior to the analysis and is customary in factor analysis [12].
A twostep algorithm
Our strategy for using this model to predict the gene functions is as follows: We first estimate Λ with using principal factor analysis. The number of latent variables, L, can be determined empirically by evaluating the goodness of ^{T }as an approximation to S [[12], p. 262–267], but we simply retain all factors with nonzero singular values.
Let _{1 }be the top p_{1 }rows of corresponding to the characterized genes and let _{2 }be the bottom p_{2 }rows of corresponding to the uncharacterized genes. We then derive a mapping from _{1 }to Y_{1}, say F, and apply F to _{2}. We can picture the entire process as follows:
In this modeling paradigm, ℓ(·,·) denotes a loss function between the known associations Y_{1 }and the predicted associations F(·;θ) based on the estimated loadings _{1}. The mapping F(·;θ) is assumed to be completely specified by a collection of parameters which we simply write as θ, where θ is estimated by minimizing the loss function.
Details on the implementation of this twostep algorithm is given in the "Method" section below.
Experiments
In this section, we discuss the setup of our computational experiments, including descriptions of our data set, our preprocessing steps, and how performance is evaluated. We also briefly describe a benchmark SVM algorithm that we use for the purpose of comparison.
Data
We applied our model to the microarray data for yeast described in [1], but with functional annotations from the December 2004 release of the GOBP table and the Saccharomyces Genome Database (SGD; also part of the GO consortium). The authors of [1] used a much simpler YPD classification scheme, which is not publicly available. Their microarray data contains 300 experiments from the Rosetta Compendium using genetic variants of yeast, as well as an additional publicly available data of 124 arrays from a set of three different experiments from other labs. Thus in total, there are 424 arrays with a common set of 6064 genes and ORFs. The full description of the data can be found in [1] and references therein.
Data preprocessing
We obtained the data from the authors of [1], where the data were already normalized and combined into one data matrix. Our further preprocessing steps included the following:
1. We deleted ORFs which are not listed in the newest (December 2004) SGD association table maintained by GO. Most of those had a "Dubious ORF" characterization in the SGD database.
2. We ignored the uncharacterized genes (and ORFs) – genes annotated to the node "Biological Process Unknown" in SGD.
3. We imputed missing values in the expression data using an adaptation of the knearest neighbor imputing algorithm for expression data described in [15]. We modified it so that the putative nearest neighbour gene profile, considered for imputing a given missing value, did not need to have a full expression profile as long as: (a) it had nonmissing data in the column under consideration and (b) it had at least 80% expression values that were not missing. We used k = 10 nearest neighbours.
4. We restricted the set of GOBP functional categories to those with at least 15 but no more than 100 genes associated to it. Since the goal is to produce functional predictions for genes to enable biologists to conduct more guided confirmatory experiments, we chose not to consider categories which had too few or too many genes associated to them. On the one hand, categories with too many genes annotated to them are less informative and hence not as interesting. On the other hand, categories with too few genes annotated to them cannot be credibly predicted.
5. Restricting GO categories produced a small list of "stranded" genes – genes which were otherwise known but whose annotations happened to be exclusively in the categories deleted from the list. We removed these "stranded" genes from our data set.
This resulted in 3224 genes and 369 functional categories. Specifically, we formed a 3224 × 424 expression data matrix Z (see [1]) and a 3224 × 369 association (response) matrix Y [see 1].
Additional File 1. GOBP annotations Tabdelimited file with 3225 lines each with 370 columns, including gene names (first column) and GO accession numbers (first line), for 3224 genes and 369 GOBP categories used in the text. Contains binary annotation entries.
Format: TXT Size: 2.3MB Download file
Benchmark
To compare the performance of our method, we also applied the popular approach in functional genomics: functionwise binary classification using Support Vector Machines (SVMs). SVM is a wellknown method in machine learning (see [16] for links to much of the literature, tutorials and software) and is used by [3] to classify mouse genes.
We chose Gist 2.1.1 [17] as our SVM software due to its tailored interface for microarray data and because of its use in [3]. We tested several different values via crossvalidation for the following tuning parameters in SVM: the type of kernel (linear, polynomial, and radial basis functions), the misclassification penalty (diagfactor in Gist), and the width factor (widthfactor in Gist) that controls the bandwidth of the radial basis kernel, which is calculated as (width) × (width factor). The "width" is taken to be the median distance from a data point to its nearest neighbor of the opposite class; the "width factor" is specified by the user. Further details on the parameter selection process are elaborated in the "Methods" section.
Performance evaluation
Although classifiers are typically evaluated by the misclassification rate in the statistics and machine learning literature, it is not a useful criterion for functional genomics from a biologist's point of view. Since the most interesting functional categories are those that are fairly rare, the vast majority of genes would not be associated with such a category. This means that even poor classifiers can achieve very low misclassification rates by simply predicting that no genes are associated with that category (i.e. by classifying everything to the majority class). Instead, suppose the prediction algorithm ranks genes according to predicted likelihood of association to that category (e.g., FAM can rank genes using estimated posterior probabilities via naïve Bayes – see the "Method" section). Biologists are most interested in using a computational algorithm to generate a list of candidate genes which can be further investigated in the lab. Since such investigations are often labourious and costly, a highly relevant measure of success should determine whether the algorithm ranks the truly associated genes ahead of the unassociated ones.
In the biomedical literature, prediction accuracy is often evaluated by sensitivity and specificity [18]. Suppose we focus on m <p topranked genes. Sensitivity is the probability that a gene appears in this topranked list given that it is truly associated, whereas specificity is the probability that a gene does not appear in this list given that it is not associated. They both depend on the number m, the size of the topranked list. The 2 × 2 Table 1 shows how these quantities can be calculated. Hence, we have:
The number m is typically determined by external factors such as how much funding there is to perform confirmatory experiments. Full confirmation or rejection of a putative association predicted from a genomewide scan will typically involve classical experiments from molecular biology or biochemistry over a longer time span. A given lab may decide to pursue a few of these candidate genes, taking into account their ranking from prediction tools as well as additional information such as specific reagents needed and lab's area of expertise. However an intermediate screening step which we have in mind, such as quantitative RTPCR, may be performed on a larger subset of candidates.
Table 1. Illustration of performance metrics. A hypothetical table illustrating 4 possibilities in predicting one GO category.
In order to compare two prediction algorithms, one needs to decide on the threshold m. Objectively derived statistical thresholds are hard to estimate and do not always reflect the reality of the biologist. We decided on a fixed threshold of m such that the topranked list consists of 10% of the genes, which we felt was a realistic compromise. For example, yeast has about 2000 uncharacterized genes, and attempting to furtherscreen the top 200 of them by lowerthroughput methods such as quantitative RTPCR would be feasible in a typical lab.
For predicting rare functional categories, sensitivity is more important than specificity. Again, refer to Table 1. By definition, s is small compared to p for rare functional categories. Clearly, b <s. Therefore,
which is roughly constant for fixed m. This means all algorithms will have very similar specificity values.
In practice, the choice of m must reflect an acceptable tradeoff between falsely detecting an unassociated gene and failing to detect an associated gene. Given that our choice of m is somewhat arbitrary, it is important for us to consider a more global performance measure. According to [18], the receiver operating characteristic (ROC) curve is currently the bestdeveloped statistical tool for evaluating ranking algorithms; it "describes the [entire] range of tradeoffs that can be achieved" by a given algorithm.
The area under the ROC curve – often referred to simply as AUC in the biomedical literature – is a widely used summary measure of the ROC curve. Let g and be a randomly selected pair of associated and unassociated genes, and r_{g }and be their respective rankings. It is wellknown [18,19] that AUC is equal to the probability that g and are correctly ordered by the ranking algorithm, i.e., AUC = P(r_{g }<). Therefore, AUC is a good (and thresholdfree) performance measure for determining whether an algorithm can rank truly associated genes ahead of the unassociated ones. The use of AUC as a performance criterion is also becoming more popular in the machine learning community [20]. As the averaged results later illustrate, SVM and FAM tend to perform comparably on the AUC scale.
Computational results
We validated our method with a fourfold crossvalidation (CV) experiment. We used four folds (rather than ten folds, for example) because of the need to retain a reasonable number of associated genes in each fold for each functional category, which we verified post hoc. It is important to note that the CV folds were obtained by an unconstrained randomization process.
More specifically, the 3224 genes were randomly divided into 4 groups of 806 each. In each CV run, one of the four folds is designated as the test set and the corresponding part of the association matrix, Y, removed. The remaining three quarters of the data were designated as the training set. Then both FAM and SVM were fitted to the training data and then applied to the test data to obtain predictions. For FAM, the prediction was based on the estimated posterior probability (see Method section for more details); for SVM, the prediction was based on the signed distance to the separating hyperplane. The test genes would then be ranked by each method for each function separately and the three performance metrics (sensitivity and specificity for the 10% topranked genes, and area underneath the ROC curve) calculated.
Computational details
FAM is implemented in R [21]. Executing the fourfold CV experiment, run serially across the 369 functional categories, took less than 5 CPU minutes on a dualCPU (3 GHz Pentium) Linux desktop with 2 GB of RAM. Gist is implemented in ANSI C. Executing the fourfold CV experiment, run in parallel across the 369 functional categories, took at least 12 hours on a 32 CPU (l.3 GHz Itanium II each) SGI Altix server with 64 GB of RAM (where each of the processors were fully loaded for the 12 hours). This corresponds to approximately 23040 CPU minutes (12 hours × 60 × 32 CPUs). Thus, compared to FAM's running time of 5 CPU minutes, training the SVMs took approximately 4000 times longer.
Prediction results
Tables 2 and 3 show the main results from our fourfold CV experiments. Table 2 compares the performance of FAM and SVM, averaged across all 369 functional categories. Table 3 makes pairwise comparisons of the predictions made by FAM and SVM for the 369 GO functions in the four CV experiments. For each of the 369 GO functions, the performance of FAM and SVM are compared, and Table 3 reports the number of GO functions for which FAM's predictions are better than, tied with, or worse than that of SVM. The last row, denoted by "ACV," compares FAM and SVM using averaged performance metrics over the four CV folds, which give rise to much fewer ties.
Table 2. crossvalidated performance of FAM and SVM, averaged over all 369 GO functions. For each of the four CV folds, the other three folds are used to train FAM and SVM. The performance metrics listed are averaged over all 369 GO functions.
Table 3. pairwise comparison of predictions made by FAM and SVM for 369 GO functions in 4 crossvalidation experiments. For each of the 369 GO functions, the performance of FAM and SVM are compared. "Win," "Tie" and "Lose" refer to the number of GO functions for which FAM's predictions are better than, tied with, or worse than that of SVM, respectively. The last row, denoted by "ACV," compares FAM and SVM using performance metrics which are averaged over the 4 crossvalidation folds. As expected, this gives rise to fewer ties.
The general conclusion is that the two methods are very similar in terms of their predictive performances (as measured by the three performance metrics). FAM is slightly better than SVM based on the ACV pairwise comparison (Table 3). These differences are not statistically significant, but we are greatly encouraged by these results given the tremendous computational advantages of FAM.
Tables 4 and 5 are similar to Tables 2 and 3, except the 369 functional categories are partitioned into 4 groups based on how informative they are. We say that a GO function is more informative if fewer genes are associatedwith it. The first group (H) consists of categories with only 15 to 20 associated genes (out of a total of 3224) – these functional categories are highly informative; there are 95 such categories. The second and third group consist, respectively, of categories with 21 to 40, and 41 to 60 associated genes, which we label as Medium High (MH) and Medium Low (ML) on the informativeness scale. There are 144 and 71 such functional categories in these two groups, respectively. The least informative group (L) contains functions with over 60 associated genes; there are 59 such GO categories.
Table 4. averaged crossvalidation performance of FAM and SVM for 369 GO functions classified into 4 groups based on how informative they are. A GO function is said to be more informative if fewer numbers of genes are associated with it. N = number of GO functions in each group. H, MH, ML, L denote the 4 groups with high, mediumhigh, mediumlow, and low levels of informativeness. For each GO function, the averaged crossvalidation (CV) performance metrics of FAM and SVM are used. These (averaged) CV performance metrics are then averaged over all GO functions within each category. Notice that the last row is identical to that in Table 1.
Table 5. pairwise comparison of predictions made by FAM and SVM for 369 GO functions classified into 4 groups based on how informative they are. A GO function is said to be more informative if fewer numbers of genes are associated with it. N = number of GO functions in each group. H, MH, ML, L denote the 4 groups with high, medium high, medium low, and low levels of informativeness. For each GO function, the performance of FAM and SVM are compared using averaged performance metrics over the 4 crossvalidation folds. "Win," "Tie" and "Lose" refer to the number of GO functions for which FAM's predictions are better than, tied with, or worse than that of SVM, respectively. Notice that the last row is identical to that in Table 2.
To illustrate, we randomly choose one GO function from each of the four groups described above (H, MH, ML, and L) and display the corresponding sample ROC curves of the two methods in Figure 1. The curves produced by FAM and SVM are quite similar; naturally, we would expect them to have similar areas underneath as well. We also observe that the ROC curves of FAM tend to slightly dominate in the early part, where specificity is high. This explains why FAM tends to perform slightly better in terms of sensitivity at low thresholds (see Table 2), although the differences here are quite small. Full prediction results for FAM and SVM are in [2] and [3], respectively.
Figure 1. four sample ROC curves. We group 369 GO functions into four groups based on how informative they are. We say a GO function is more informative if fewer genes are associated with it. H, MH, ML, L denote the 4 groups with high, mediumhigh, mediumlow, and low levels of informativeness. Here, we show sample ROC curves for four randomly selected GO functions, one from each of the four groups. G0:0000002 = "mitochondrial genome maintenance" (from group H); G0:0007568 = "aging" (from group MH); G0:0006626 = "protein targeting to mitochondrion" (from group ML); and G0:0006399 = "tRNA metabolism" (from group L). Solid line = FAM; dashed line = SVM.
Additional File 2. FAM predictions Tabdelimited file with 3225 lines each with 370 columns, including gene names (first column) and GO accession numbers (first line), for 3224 genes and 369 GOBP categories used in the text. Contains FAM prediction scores realized when a gene was part of the validation set.
Format: GZ Size: 8.7MB Download file
Additional File 3. FAM predictions Tabdelimited file with 3225 lines each with 370 columns, including gene names (first column) and GO accession numbers (first line), for 3224 genes and 369 GOBP categories used in the text. Contains SVM prediction scores realized when a gene was part of the validation set.
Format: GZ Size: 3.7MB Download file
Discussion
In terms of using microarray expression data to identify uncharacterized genes that may be associated with various biological functions, FAM appearsto be as effective as SVM, a powerful machine learning tool. However, our FAMbased algorithm is faster than the SVM by a factor of 4000.
More importantly, the FAM framework has a great potential for incorporating important gene ontology information and dealing with a number of major challenges in using statistical classification techniques for functional genomics.
FAM as a functional randomeffects model
The hierarchical nature of how the functional categories are organized in GOBP imply that some of these functional categories are highly correlated. A very promising and important extension of FAM is the possibility of taking this correlation structure into account. More specifically, with a total of K different functional categories, we can model the expression level explicitly as being dependent on the K functional categories, i.e.,
where is now a (nonlatent) random effect for function k. Thus, there are two random components in this model: the functiondependent random factors, , realized independently for each microarray experiment j, and the error term, , which is realized independently for each microarray (j) and each gene (i).
As stated above, these (functional) random effects can no longer be assumed to be independent. Let Φ ≡ Var(f) ≠ I be the covariance matrix of these random effects. Eq. (3) then becomes
S = ΛΦΛ^{T }+ E. (6)
When properly estimated, the number λ_{ik }in this model can be interpreted as the "coefficient of association" between gene i and function k and used directly to predict gene functions.
Hence, FAM has the potential to explicitly account for the fact that both the gene expression profiles and the functional categories are correlated. By contrast, applying a classifier such as SVM separately for each functional category implicitly assumes that both are independent.
The challenge lies in the estimation of the covariance matrix Φ based on the DAG structure of GOBP. This is a nontrivial but interesting challenge. So far, we have experimented with three different methods to obtain a similarity matrix: Lin similarity [22], which has previously been applied in microarray data [7]; Resnik similarity [23], which has also been applied to microarray data [24]; and a flexible Jiang and Conrath similarity [25], which has been extended to specifically apply to estimate genegene similarity using GOBP by [26]. Such similarity matrices then need to be transformed to a valid (i.e., positivedefinite) covariance matrix, which can be used as an estimate of Φ. After experimenting with these different techniques, we obtained very different estimates of Φ. At the current state, it is difficult to evaluate which estimate is better than others. We will report more details in our future work.
FAM with functional constraints
If it is known that there is no association between gene i and function k, we can set λ_{ik }= 0 in Eq. (6) and estimate the remaining entries of Λ from data. This is closely related to confirmatory factor analysis [27]. Using the notation above, we wish to find values for the unknown entries of the matrix Λ that optimally "fits" Eq. (6). For example, we can use the Frobenious norm as the loss function and minimize the function:
S  ΛΦΛ^{T } E_{F}. (7)
If λ_{ik }are restricted to be 0 or 1, this problem would be a nonlinear binary optimization problem, which is computationally taxing but potentially solvable due to recent breakthroughs in integer programming [28]. The continuous relaxation (when the restrictions λ_{ik }= 0 or 1 are removed) is still a nontrivial problem since the objective function is nonconvex with respect to Λ [29]. Nevertheless, we explored various techniques to tackle nonconvex integer programming problems, including approximating the solution by a semidefinite programming formulation (SDP) – a technique that has been successful in many nonconvex optimization problems [30]. Although the Lagrangian relaxation of minimizing (7) can be modeled as an SDP problem [30], the objective value is often nonpositive. Thus, this relaxation did not lead to a usable solution.
There has been some recent advancements in the area of nonconvex integer programming problems, where researchers from chemical engineering and optimization have joined forces to develop a practical software [31]. In the future, improvements in computation time of these algorithms may allow us to solve our problem to global optimality.
Other extensions
Finally, even by considering just the basic FAM framework (3) alone leads to a number of different algorithms. For example, we are currently exploring alternatives to principal factor analysis such as maximum likelihood or Bayesian factor analysis. As mentioned earlier, our FAMbased algorithm has a big computational advantage. Alternatives mentioned such as maximum likelihood and Bayesian factor analysis are computationally more demanding but we expect that they would retain their computational advantage over training hundreds of SVMs to a data set comprised of thousands of gene expression vectors.
Conclusion
We now summarize the main contributions of our work. First, we proposed a model and an algorithm based on factor analysis (FAM) for predicting gene functions using microarray expression data. We evaluated the performance of our algorithm with data previously analyzed by [1]. By running a fourfold crossvalidation experiment using only characterized genes, we showed that our method is not only orders of magnitude faster, but performs comparably against a sophisticated, stateoftheart classifier that is being widely used in functional genomics. Second, we showed that our FAM framework will allow us to incorporate important gene ontology information and hence deal with a number of major challenges in using statistical classification techniques for functional genomics. Most importantly, it will allow us to directly account for the fact that the functional categories are inherently correlated and exploit this correlation structure for prediction.
Method
This section elaborates on the details of our implementation both for FAM and SVM.
Algorithm implementation
Here, we give more details for how we implemented the twostep algorithm based on our factor analysis model.
Step 1: Principal factor analysis
To estimate Λ, we choose and that minimize some loss criterion ℓ(S, ΛΛ + E). For example, one may choose ℓ(A, B) to be the Frobenious norm of the difference A  B and choose and that solve:
In our procedure, we use the most popular algorithm for fitting a factor analysis model known as Principal Factor Analysis [[12], p. 261], but one could also use maximum likelihood or Bayesian factor analysis to estimate these quantities (see the "Discussion" section). Principal factor analysis produces an approximate solution, , to Eq. (8), where the columns of are uncorrelated.
Step 2: Naïve Bayes
The fact that the columns of are uncorrelated allows us to use very simple procedures to construct the mapping F, e.g., the naïve Bayes method [[32], Section 6.6.3], which fully exploits the uncorrelated nature of the inputs. In order to apply naïve Bayes, we break the mapping F into L components, F = {F_{l}; l = 1, 2, ..., L}; each F_{l }uses the naïve Bayes method to map to the l^{th }column of Y.
The Naïve Bayes method
Let y ∈ {0,1} be the class label and x ∈ ℝ^{L}, the vector of predictors. Suppose p_{0}(x) and p_{1}(x) are the density functions for the two classes. The posterior probability that an observation belongs to class 1 is simply
where π_{1 }and π_{0 }are prior probabilities. The naive Bayes method simply assumes the density functions p_{1 }and p_{0 }to be of a product form:
In other words, the elements of the input vector x are assumed to be independent. The marginal density functions p_{0,l }and p_{1,l }can be estimated with standard (univariate) density estimators.
For any given functional category, often only a few genes are associated with it, meaning that class 1 is relatively rare. Under such circumstances, it is often necessary to impose additional parametric assumptions on the density functions before they can be estimated. We assume (as is also commonly done) that both p_{0,l }and p_{1,l }are Gaussian density functions with the same variance. This means uncorrelated inputs are the same as independent inputs and that p_{0,l }and p_{1,l }are completely specified by three parameters: μ_{0,l}, μ_{1,l}, and . Under such assumptions, naïve Bayes becomes equivalent to Diagonal Linear Discriminant Analysis [[33], DLDA].
Other than using a penalized variance estimate for (see below), we use the formulae derived by [33]. DLDA classifies an observation x to class 1 when
where
The lefthand side of (10) is monotonically equivalent to the posterior probability (9).
To estimate , we use a penalized estimate, , which is similar in spirit to variance estimates used in [34]. In particular,
where s_{0 }= 0.01 is a small positive constant and is the standard pooled variance estimate. As noted by [34], the resulting estimate is numerically more stable than .
SVM parameter selection
Due to the heavy computational burden of finetuning SVM parameters for each of the 369 functional categories, we tuned the parameters to a subset of the categories. Namely, we applied the same parameter values to functional categories with similar levels of informativeness. To select the parameters for SVM, we relied on the more global performance measure, AUC. As stated in the "Performance Evaluation" section, AUC does not depend on the threshold m and it summarizes the entire range of tradeoffs that can be achieved by a ranking algorithm.
We trained SVMs on ten functional categories chosen as follows: We sorted all 369 functional categories according to their informativeness or gene association levels (again, we refer to a functional category as more informative if fewer genes are associated with it). From that sorted list, we split the categories into ten groups based on the deciles. For example, the first group is comprised of the most informative functions and the tenth group is comprised of the least informative functions. To construct the "training set", we chose one function from each of these groups, and chose the SVM parameter values corresponding to the highest AUC value for each of them.
We tested all three kernels (linear, radial basis function, and polynomial) with diagfactor = {0.01, 0.02, 0.05, 0.1, 0.5, 1, 2, 5, 10}. For radial basis function (RBF) kernels, we tested widthfactor = {0.1, 0.2, 0.25, 0.5, 0.75, 1, 1.5, 2} and for polynomial kernels, we tested power = {2, 3} which corresponds to the degree of the polynomial kernel. For each of the ten training set categories, the AUC values were plotted in contour plots to determine the highest value. If this optimal parameter set was on the boundary of the tested parameter range, the range of the parameter values were expanded. This process was repeated until we found a range of parameters in which the optimal parameter set did not lie on the boundary of our contour plots. Additional values tested include diagfactor = {0.001, 0.005, 12, 15, 20} and widthfactor = {3, 5, 7}.
To test whether the finetuned parameters generalizes for other functional categories, we measured their performance on a separate "test set". From each of the ten groups described above, we selected five additional functional categories as the "test set". The AUC values of the test set areshown in Table 6. Two different sets of parameters were applied to the SVM:"default", which corresponds to Gist's default parameters (diagfactor = 0.1 and widthfactor = 1) with RBF kernels (we chose this kernel since it appear to perform well across all functions we tested); and "tuned", which corresponds to the parameter values finetuned to the training set.
Table 6. AUC of SVM on the testing set. The testing set is comprised of 50 functional categories classified into 10 groups according to how informative they are. A GO function is said to be more informative if fewer numbers of genes are associated with it. An SVM was trained for each of these categories using two sets of parameters. In this table, Group 1 consists of the most informative functions whereas Group 10 consists of the least informative functions. The "default" column reports to the prediction performance on the AUC scale using default control parameters in our SVM software Gist. The "tuned" column reports the prediction performance on the AUC scale using control parameters optimized on the corresponding the training set. The AUC values listed are averaged across the four CV folds. See the substraction "SVM" Parameter Selection" in the "Method" section for further detail.
Table 6 illustrates that there is little difference between using the default and the finetuned parameters with respect to AUC values. Using default parameters resulted in higher AUC values than using the tuned parameters in 5 out of 10 groups, and the mean AUC across all groups were the same. Since the Gist default parameters appear to be robust across different functional categories for this particular data set, we decided to use the default parameters to train the SVMs for all 369 categories in our experiments.
FAM parameter selection
FAM formally has one parameter: a penalty on diagonal variances, s_{0}, mentioned in section The Naïve Bayes Method. We have not performed an extensive optimization of this parameter and settled on s_{0 }= 0.01, since it seems that FAM is not very sensitive to its value.
Authors' contributions
RK was responsible for defining the problem and preparing the data. All authors contributed equally to formulating the general modeling paradigm, investigating different implementations, and writing the paper. For specific tasks, RK worked on the implementation of FAM; RS worked on SVM experiments; and MZ worked on performance evaluation.
Acknowledgements
All authors are partially supported by the Natural Science and Engineering Research Council (NSERC) of Canada. RK was also supported by the Canadian Institutes of Health and Research (CIHR) – New Emerging Team grant. We thank Tim Hughes, for providing us with the data, and Tim Hughes, Quaid Morris, Nikolaos Mitsakakis and Adam Zagdanski for helpful discussions. We would also like to thank the three anonymous referees for their helpful comments which have greatly enhanced the clarity of the paper.
References

Wu L, Hughes T, Davierwala A, Robinson M, Stoughton R, Altschuler S: Large Scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters.
Nature Genetics 2002, 31:255260. PubMed Abstract  Publisher Full Text

Zhou X, Kao MC, Wong W: Transitive functional annotation by shortestpath analysis of geneexpression data.
Proceedings of the National Academy of Sciences 2002, 99(20):1278388. Publisher Full Text

Zhang W, Morris Q, Chang R, Shai O, Bakowski M, Mitsakakis N, Mohammad N, Robinson M, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng W, Krogan N, Greenblatt J, Fehlings M, van derKooy D, Aubin J, Bruneau B, Rossant J, Blencowe B, Frey B, Hughes T: The functional landscape of mouse gene expression.
Journal of Biology 2004., 3(21) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

the Gene Ontology [http://www.geneontology.org] webcite

Marcotte E, Pellegrini M, Thompson M, Yeates T, Eisenberg D: A combined algorithm for genomewide prediction of protein function.
Nature 1999, 402:8386. PubMed Abstract  Publisher Full Text

Kemmeren P, Kockelkorn T, Bijma T, Donders R, Holstege F: Predicting gene function through systematic analysis and quality assessment of highthroughput data.
Bioinformatics 2005, 21(8):16441652. PubMed Abstract  Publisher Full Text

Chen Y, Xu D: Global protein function annotation through mining genomescale data in yeast Saccharomyces cerevisiae.
Nucleic Acids Research 2004, 32(21):64146424. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lanckriet G, De Brie T, Cristianini N, Jordan M, Noble W: A statistical framework for genomic data fusion.
Bioinformatics 2004, 20(16):26262635. PubMed Abstract  Publisher Full Text

Harris M, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, Richter J, Rubin G, Blake J, Bult C, Dolan M, Drabkin H, Eppig J, Hill D, Ni L, Ringwald M, Balakrishnan R, Cherry J, Christie K, Costanzo M, Dwight S, Engel S, Fisk D, Hirschman J, Hong E, Nash R, Sethuraman A, Theesfeld C, Botstein D, Dolinski K, Feierbach B, Berardini T, Mundodi S, Rhee S, Apweiler R, Barrell D, Camon E, Dimmer E, Lee V, Chisholm R, Gaudet P, Kibbe W, Kishore R, Schwarz E, Sternberg P, Gwinn M, Hannick L, Wortman J, Berriman M, Wood V, de la Cruz N, Tonellato P, Jaiswal P, Seigfried T, White R: The Gene Ontology (GO) database and informatics resource.

McLachlan G: Further results on the effect of intraclass correlation among training samples in discriminant analysis.
Pattern Recognition 1976, 8:273275. Publisher Full Text

Tubbs J: Effect of autocorrelated training samples on Bayes's probabilities of misclassification.
Pattern Recognition 1980, 12:351354. Publisher Full Text

Mardia KV, Kent JT, Bibby JM: Multiviariate Analysis. London, Great Britain: Academic Press; 1979.

Rogers S, Girolami M, Campbell C, Breitling R: The Latent Process Decomposition of cDNA Microarray Data Sets.
ACM/IEEE Transactions on Computational Biology and Bioinformatics 2005, 2(2):143156. Publisher Full Text

Frey B, Mohammad N, Morris Q, Zhan W, Robinson M, Mnaimneh S, Chang R, Pan Q, Sat E, Rossant J, Bruneau B, Aubin J, Blencowe B, Hughes T: Genomewide analysis of mouse transcript using exon microarrays and factor graphs.
Nature Genetics 2005, 37(9):991997. PubMed Abstract  Publisher Full Text

Troyanskaya O, Cantor M, Sherlock G, Eisen M, Brown P, Botstein D: Imputing Missing Data for Gene Expression Arrays.
Bioinformatics 2001, 17(6):52025. PubMed Abstract  Publisher Full Text

Kernel Machines [http://www.kernelmachines.org] webcite

Noble WS, Pavlidis P: Gist: Support Vector Machine and Kernel Principal Components Analysis Software Toolkit. [http://microarray.genomecenter.columbia.edu/gist/] webcite
Columbia University; 2002.

Pepe MS: The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003.

Hanley JA, McNeil BJ: The Meaning and Use of the Area Under an ROC curve.
Radiology 1982, 143:2936. PubMed Abstract

Cortes C, Mohri M: AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems 16. Edited by Thrun S, Saul L, Schölkopf B. Cambridge, MA: MIT Press; 2004.

R Development Core Team: R: A language and environment for statistical computing. [http://www.rproject.org] webcite
R Foundation for Statistical Computing, Vienna, Austria 2004.
[ISBN 3900051070]

Lin D: An informationtheoretic definition of similarity. In Proceedings of 15th International Conference on Machine Learning, San Francisco. Morgan Kaufmann; 1998:296304.

Resnik P: Using information content to evaluate semantic similarity in a taxonomy.
Proceedings of the 14th International Joint Conference on Artificial Intelligence 1995, 448453.

Bolshakova N, Azuaje F: Cluster validation techniques forgenome expression data.
Signal Process 2003, 83(4):825833. Publisher Full Text

Jiang J, Conrath D: Semantic similarity based on corpus statistics and lexical taxonomy.
Proceedings of the International Conference on Research in Computational Linguistics, Taiwan 1998.

Couto F, Silva M, Coutinho P: Implementation of a Functional Semantic Similarity Measure between GeneProducts. [http://www.di.fc.ul.pt/techreports] webcite
Tech Rep DI/FCUL TR 0329 Department of Informatics, University of Lisbon; 2003.

Jöreskog K: A General Approach to confirmatory maximum likelihood factor analysis.
Psychometrika 1969, 34(2):183202. Publisher Full Text

Bertsimas D, Weismantel R: Optimization Over Integers. Belmont, MA: Dynamic Ideas; 2005.

Bazaraa M, Sherali HD, Shetty CM: Nonlinear Programming: Theory and Algorithms. New York: John Wiley and Sons; 1993.

Wolkowicz H, Saigal R, Vandenberghe L: Handbook of Semidefinite Programming. Norwell, MA: Kluwer Academic Press; 2000.

Computational INfrastructure for Operations Research [http://www.coinor.org] webcite

Hastie TJ, Tibshirani RJ, Friedman JH: The Elements of Statistical Learning: DataMining, Inference and Prediction. SpringerVerlag; 2001.

Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for the classification of tumors using gene expression data.
Journal of the American Statistical Association 2002, 97(457):7788. Publisher Full Text

Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiarion response.
Proceedings of the National Academy of Sciences 2001, 98(9):51165121. Publisher Full Text