<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-358</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements</p>
         </title>
         <aug>
            <au id="A1" ce="yes">
               <snm>Lan</snm>
               <fnm>Hui</fnm>
               <insr iid="I1"/>
               <email>lanhui@cs.toronto.edu</email>
            </au>
            <au id="A2" ce="yes">
               <snm>Carson</snm>
               <fnm>Rachel</fnm>
               <insr iid="I2"/>
               <email>r.carson@utoronto.ca</email>
            </au>
            <au id="A3">
               <snm>Provart</snm>
               <mi>J</mi>
               <fnm>Nicholas</fnm>
               <insr iid="I2"/>
               <email>nicholas.provart@utoronto.ca</email>
            </au>
            <au id="A4" ca="yes">
               <snm>Bonner</snm>
               <mi>J</mi>
               <fnm>Anthony</fnm>
               <insr iid="I1"/>
               <email>bonner@cs.toronto.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science, University of Toronto, 40 St George St, Toronto, ON M5S 2E4, Canada</p>
            </ins>
            <ins id="I2">
               <p>Department of Cell and Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Wilcocks St, Toronto, ON M5S 3B2, Canada</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>358</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/358</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17888165</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-358</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>06</day>
               <month>10</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>21</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>21</day>
               <month>9</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Lan et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p><it>Arabidopsis thaliana </it>is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Using in-house and publicly available data, we assembled a large set of gene expression measurements for <it>A. thaliana</it>. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC<sub>50 </sub>and precision curves derived through cross validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. Our method of combining classifiers can improve the accuracy of such predictions &#8211; in this case, predictions of genes involved in stress response in plants &#8211; and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in <it>A. thaliana </it>that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Assigning functions to unannotated genes, identified by genome sequencing and other methods, is the goal of functional genomics. Many approaches have been proposed for large-scale prediction of gene function <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. These approaches are largely based on physical association, genetic interaction, sequence relationships and patterns of gene expression. Predicting gene functions based on large-scale gene expression measurements is an attractive strategy since many pathways display coordinated transcriptional regulation <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B7">7</abbr></abbrgrp>. Although previous studies show that supervised learning methods can be used to predict gene function based on gene expression in microorganisms such as the yeast <it>Saccharomyces cerevisiae </it>and in mammals such as mice <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>, it remains unknown to what extent this is true in plants.</p>
         <p>With the <it>A. thaliana </it>genome completely sequenced <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, functional annotation of the genes remains a key challenge for biologists. Currently, approximately 50% of the 28,000 genes have not been assigned any function <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Thus, the extent to which supervised learning methods can be used to infer gene function in <it>A. thaliana </it>is a timely and important question. Little work has been done in this area, two exceptions being <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>.</p>
         <p>In <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, a method is developed to infer gene function from microarray data and predicted protein-protein interactions. The method is similar to Nearest Neighbor algorithms <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> in that the predicted function(s) of a gene are based on the function(s) of nearby genes. Here, the "nearness" of one gene to another is based on a normalized Pearson correlation of their expression profiles and on putative interactions of their protein products. In addition, the method is extended to the discovery of biological pathways, and is applied to predicting the signaling pathway of phosphatidic acid as a second messenger in <it>A. thaliana</it>.</p>
         <p>In <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, a decision tree algorithm is applied to the problem of predicting the function of protein sequences in <it>A. thaliana</it>. Six sources of data were used: sequence, expression, SCOP, secondary structure, InterPro and sequence similarity. One conclusion of the study is that the decision tree algorithm was unable to extract much information from the expression data. The authors suggest that this is because the expression data came from unrelated and highly-specific experiments with just a few readings per gene each. They also suggest that because many more expression data sets are now available for <it>A. thaliana</it>, results may improve when using this type of data in the future.</p>
         <p>The present study aims to identify unannotated genes in <it>A. thaliana </it>that are potentially involved in plant response to stress. In the context of plants, a stress (biotic or abiotic) causes a decrease in plant growth or yield. We investigated the prediction of gene function in <it>A. thaliana </it>based solely on gene expression data using a variety of basic supervised learning methods, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes (NB) and K-Nearest Neighbors (KNN). We also investigated the effect on the learning methods of preprocessing the expression data using Principal Component Analysis (PCA). Finally, we improved the performance of the basic learning methods by combining them using a weighted voting (WV) scheme. This work has enabled our collaborators, biologists in the Department of Cell and Systems Biology at the University of Toronto, to carry out directed biological experiments for determining gene function. In addition to these biological results, the paper illustrates how various machine-learning methods have had to be adapted to fit this bioinformatics application.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Microarray data and the Gene Ontology</p>
            </st>
            <p>In this study, we used two microarray data sets: one from the Botany Array Resource at the University of Toronto <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, and the other from the AtGenExpress Consortium <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>, archived at NASCArrays <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. These data sets include over 1000 expression-level experiments for <it>Arabidopsis</it>, and using all of them would give a data set with dimensionality over 1000. Since the performance of statistical and machine-learning methods tends to decrease with dimensionality, we chose only those experiments that are specifically stress-related. Even so, the covariance matrix of the resulting data set is singular, which is a problem for many of the machine-learning methods. The singularity is probably due to dependencies between the expression levels under control conditions, since removing the controls from the data sets solved the problem. To compensate, we tried applying the learning algorithms to expression-level ratios (<it>i.e</it>., ratios of experimental to control conditions). However, we found that the results were better when ratios were not used (data not shown). This is probably because the classifiers look for genes that respond similarly to the known stress-associated genes, so it is not so important to include the controls. In addition, since many of the features are time-courses, there is still a "time zero" control included for the values, providing a baseline measurement. The results reported in this article are therefore based on absolute expression levels without controls.</p>
            <p>From the Toronto data set, we selected 54 features corresponding to experiments conducted primarily to study plant environmental and stress physiology, plant physiology, plant-microbe and plant-insect interactions. From the AtGenExpress data set, we selected 236 features, including various abiotic stresses (e.g., osmotic stress, heat stress, cold stress, salt stress, drought stress, UV-B stress, wounding stress, water-deprivation stress and oxidative stress). We combined the selected features into a single data set. The resulting data set consists of gene expression levels for 22,746 genes under 54 + 236 = 290 different experimental conditions.</p>
            <p>We used terms from the Gene Ontology for Biological Processes (GOBP) to represent gene function. For example, the GOBP term <it>GO:0006950 [response to stress] </it>refers to genes that respond to stress. In general, the Gene Ontology (GO) provides a dynamic controlled vocabulary for describing genes and gene products in any organism <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. "Biological Process" is one of three broad GO categories (the other two being "Molecular Function" and "Cellular Component"). GOBP terms are organized into a directed acyclic graph (DAG) to reflect the hierarchical relationships between the terms. Parent GOBP terms are subdivided into increasingly specific child GOBP terms.</p>
            <p>Since our study focussed on stress, we were concerned with gene functions at or below the term <it>GO:0006950 [response to stress] </it>in the GOBP hierarchy. This GOBP term has 19 child terms, such as <it>GO:0009409 [response to cold]</it>, <it>GO:0009408 [response to heat]</it>, and <it>GO:0009414 [response to water deprivation]</it>. Since gene function becomes more and more specific as we move down the GOBP hierarchy, fewer and fewer genes have any given annotation. The result is that for specific types of stress, our data set contains many negatives and few positives. In the best case, for the term <it>GO:0009613 [response to pest, pathogen or parasite]</it>, over 97% of the training data consists of negatives. The typical case is even worse. In fact, looking at all 19 types of stress, 5 types have no positives at all, and of the remaining 14 types, the median proportion of negatives is 99.2% of the training data. This highly unbalanced data made accurate prediction of gene function difficult. For this reason, we narrowed our study to the top stress term, <it>GO:0006950 [response to stress]</it>. To get positive training samples for this term, we propagated all genes in its offspring upward to it in the hierarchy. After up-propagation, the top stress term has 1,031 genes, or almost 9% of the total genes in the training data. The training data therefore contains 9% positives and 91% negatives.</p>
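            <p>The up-propagation step can be sketched as a traversal of the GOBP graph that collects every gene annotated to the top stress term or to any of its descendants. This is a minimal illustration, not the code used in the study: the function name <it>propagate_up</it>, the dictionary layout and the gene identifiers are assumptions made for the example.</p>

```python
from collections import defaultdict

def propagate_up(parents, direct_genes, root):
    """Collect every gene annotated to `root` or to any of its descendants."""
    # Invert the child -> parents map so we can walk downward from the root.
    children = defaultdict(set)
    for child, parent_terms in parents.items():
        for parent in parent_terms:
            children[parent].add(child)

    genes, stack, seen = set(), [root], set()
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # a DAG node can be reached by more than one path
        seen.add(term)
        genes |= direct_genes.get(term, set())
        stack.extend(children[term])
    return genes

# Toy fragment of the hierarchy; the gene identifiers are made up.
parents = {
    "GO:0009409": {"GO:0006950"},  # response to cold
    "GO:0009408": {"GO:0006950"},  # response to heat
    "GO:0009414": {"GO:0006950"},  # response to water deprivation
}
direct_genes = {
    "GO:0006950": {"geneA"},
    "GO:0009409": {"geneB", "geneC"},
    "GO:0009408": {"geneC"},
    "GO:0009414": set(),
}
stress_genes = propagate_up(parents, direct_genes, "GO:0006950")
```

            <p>Because the result is a set, a gene annotated to several child terms (geneC in the toy data) is counted once as a positive for the top term.</p>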
            <p>Using GOBP terms to annotate all genes in <it>A. thaliana </it>is an ongoing project started in 2002 by TAIR <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>. The gene annotations (updated weekly) can be downloaded from TAIR <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. The predictions reported in this paper are based on the version for March 10, 2007. Using these annotations, we categorized the genes into <it>annotated </it>genes and <it>unannotated </it>genes. The annotated genes are those which have at least one GOBP annotation; the unannotated genes are those which have no GOBP annotations. In addition, a gene was treated as unannotated if its only annotation is the top GOBP category, <it>GO:0008150 [biological process]</it>, since the function of such a gene is unknown. The result was 11,553 annotated genes and 11,193 unannotated genes in our data set.</p>
            <p>The annotated genes formed the training data, in which a gene was called positive if it is annotated as a stress gene, and negative otherwise. The unannotated genes formed the prediction data. It should be noted that this approach probably introduces some false negatives into the training data, because genes not known to have a particular function are considered to be negative, even though future experiments could reveal them to have that function. That is to say, "unknown" is treated as "negative". However, the number of such false negatives should be small, since only a small number of genes participate in any given biological process. That is, most negatives are true negatives.</p>
         </sec>
         <sec>
            <st>
               <p>Predicting gene function using basic learning methods</p>
            </st>
            <p>Using a variety of basic learning methods, we trained a number of classifiers to distinguish between genes that do and do not respond to stress, based on their patterns of gene expression in the training data. We then applied each classifier to the prediction data to estimate the function of the unannotated genes. In addition, we used cross validation to evaluate the performance of each classifier and to estimate the precision of each prediction.</p>
            <p>We used five supervised learning methods: Logistic Regression (LR), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes (NB) and K-Nearest Neighbors (KNN) <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> (see Methods). These methods were chosen because they are representative of the most basic supervised learning methods, the goal being to explore simple methods first. These methods are widely understood, take little computation time, and the results provide a benchmark against which more sophisticated methods can be compared. Moreover, as we show below, the results provided by these methods are good enough to enable biologists to conduct targeted laboratory experiments.</p>
            <p>Each of the five methods is discriminative. That is, the classifiers learned by the methods assign a real number (called a discriminant value) to each gene, reflecting the classifier's certainty that the gene responds to stress. Formally, a discriminative classifier is a function, <inline-formula><m:math name="1471-2105-8-358-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaaaa@2E11@</m:annotation></m:semantics></m:math></inline-formula>, from genes to discriminant values. In our case, each gene is represented as a 290-dimensional vector, <b>x</b>, whose components are the expression levels of the gene under the 290 experimental conditions. Thus, if <b>x </b>is a vector representing a gene, then <it>dv </it>= <inline-formula><m:math name="1471-2105-8-358-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaaaa@2E11@</m:annotation></m:semantics></m:math></inline-formula>(<b>x</b>) is the discriminant value assigned to the gene by the classifier. Finally, a decision threshold, <it>&#964;</it>, is chosen, and the gene is predicted to respond to stress if and only if <it>dv </it>> <it>&#964;</it>.</p>
            <sec>
               <st>
                  <p>Unsupervised, semi-supervised and transductive learning</p>
               </st>
               <p>In addition to these supervised learning methods, we preprocessed the gene expression data using Principal Components Analysis (PCA), a form of unsupervised learning, to reduce the dimensionality of the data (see Methods). For this purpose, we combined the expression-level measurements for all genes (both annotated and unannotated) into one large data set, and applied PCA to the entire set. We are therefore doing a form of semi-supervised learning <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>, in which unsupervised learning uses the entire data set (ignoring annotations), and then supervised learning uses the annotated data. This increases the effectiveness of learning by increasing the amount of training data used in the unsupervised phase <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>. In our case, the unannotated data is also the prediction data, which means that information about the prediction data is used during (unsupervised) training. This is possible because we know all the prediction data in advance. That is, we know the expression levels for all the genes in <it>Arabidopsis </it>whether they are annotated or not. We are therefore doing a form of transductive learning <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B31">31</abbr></abbrgrp>, in which the entire prediction set is known during training and is exploited to predict its annotations. This has the added computational advantage of simplifying the way PCA is done during cross validation (see Methods).</p>
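               <p>The two-phase scheme described above can be illustrated with a short sketch in which PCA is fitted on all genes and only the annotated rows are passed on to supervised training. The data here are random stand-ins, and the helper <it>pca_reduce</it> and the array names are assumptions for the example; the procedure actually used is described in Methods.</p>

```python
import numpy as np

def pca_reduce(X, p):
    """Project the rows of X onto the top p principal components."""
    Xc = X - X.mean(axis=0)                # centre each feature
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:p].T

rng = np.random.default_rng(0)
X_annotated = rng.normal(size=(60, 12))    # stand-in for annotated genes
X_unannotated = rng.normal(size=(40, 12))  # stand-in for unannotated genes

# Unsupervised step: PCA sees every gene, with or without annotations.
X_all = np.vstack([X_annotated, X_unannotated])
Z_all = pca_reduce(X_all, p=5)

# Supervised step: only the annotated rows go on to train a classifier.
Z_train = Z_all[:len(X_annotated)]
Z_predict = Z_all[len(X_annotated):]
```

               <p>Since the projection does not depend on the annotations, it can be computed once and reused across every cross-validation fold, which is the computational simplification mentioned above.</p>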
            </sec>
         </sec>
         <sec>
            <st>
               <p>Estimating classifier performance</p>
            </st>
            <p>To evaluate the performance of discriminative classifiers, it is common to use receiver operating characteristic (ROC) curves <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. A ROC curve plots the true positive rate (TP) of a classifier against the false positive rate (FP) for various decision thresholds. It therefore shows the quality of a classifier not at one threshold, but at many, and provides more information than a simple misclassification rate (as in <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> for example). In practice, however, biologists are not usually interested in having more than a few dozen false positives, especially in unbalanced data such as ours, in which the number of false positives can rapidly overwhelm the number of true positives. We therefore use so-called ROC<sub>50 </sub>curves <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, a variant of ROC curves in which the horizontal axis only goes up to 50 false positives. The area under a ROC<sub>50 </sub>curve is the ROC<sub>50 </sub>score <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, and is a measure of the overall usefulness of a classifier.</p>
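            <p>As an illustration of how such a score can be computed, the sketch below ranks genes by discriminant value and accumulates the number of true positives seen each time a false positive is encountered, stopping at the 50th. The step-wise integration rule, the function name <it>roc_n_score</it> and the toy data are assumptions; the text does not specify how the area is integrated or how ties are handled.</p>

```python
def roc_n_score(labels, scores, n=50):
    """Area under the count-based ROC curve up to n false positives."""
    # Highest discriminant value first; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = area = 0
    for _, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
            area += tp   # one unit-wide column of height tp
            if fp == n:
                break
    return area

# Toy ranking: 1 marks a stress gene, 0 a non-stress gene.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_score = roc_n_score(labels, scores, n=2)
```

            <p>Under this convention a perfect classifier scores n times the number of positives, so the score is measured in counts rather than as a fraction between 0 and 1.</p>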
            <p>To estimate ROC<sub>50 </sub>curves for our classifiers, we used 20-fold cross-validation (see Methods). Because cross-validation relies on a random split of the training data into folds (20 folds in our case), there is a certain randomness to the estimated ROC<sub>50 </sub>curve. To provide more accurate results, we performed cross-validation ten times, each time with a different (randomly selected) 20-fold split of the data (see Methods). Each 20-fold split results in a slightly different ROC<sub>50 </sub>curve. In some cases, we plot all ten of these curves, to give an idea of the uncertainty in classifier performance (Figure <figr fid="F1">1</figr>). In cases where this would result in overly cluttered graphs, we simply present the average of the ten ROC<sub>50 </sub>curves (Figures <figr fid="F2">2</figr> to <figr fid="F7">7</figr>, each of which shows several average ROC<sub>50 </sub>curves).</p>
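            <p>The repeated splitting can be sketched as follows. The generator <it>repeated_kfold</it>, its seed and the sample count are hypothetical, and the split is a plain random permutation; whether the folds in the study were stratified by class is not stated here.</p>

```python
import numpy as np

def repeated_kfold(n_samples, n_folds=20, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = np.random.default_rng(seed)
    for repeat in range(n_repeats):
        # A fresh permutation gives a different random split on each repeat.
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for fold, test_idx in enumerate(folds):
            train_idx = np.concatenate(folds[:fold] + folds[fold + 1:])
            yield repeat, fold, train_idx, test_idx

splits = list(repeated_kfold(n_samples=200))
```

            <p>Each repeat yields one ROC<sub>50 </sub>curve from the pooled held-out predictions of its 20 folds, and the ten curves can then be plotted individually or averaged.</p>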
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>ROC<sub>50 </sub>curves</p>
               </caption>
               <text>
                  <p><b>ROC<sub>50 </sub>curves</b>. Estimated ROC<sub>50 </sub>curves of the combined classifier (WV), showing ten different estimates (dashed curves) and their average (solid curve).</p>
               </text>
               <graphic file="1471-2105-8-358-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Logistic Regression (LR)</p>
               </caption>
               <text>
                  <p><b>Logistic Regression (LR)</b>. Seven ROC<sub>50 </sub>curves for Logistic Regression with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC<sub>50 </sub>score.</p>
               </text>
               <graphic file="1471-2105-8-358-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Linear Discriminant Analysis (LDA)</p>
               </caption>
               <text>
                  <p><b>Linear Discriminant Analysis (LDA)</b>. Seven ROC<sub>50 </sub>curves for Linear Discriminant Analysis with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC<sub>50 </sub>score.</p>
               </text>
               <graphic file="1471-2105-8-358-3"/>
            </fig>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Quadratic Discriminant Analysis (QDA)</p>
               </caption>
               <text>
                  <p><b>Quadratic Discriminant Analysis (QDA)</b>. Seven ROC<sub>50 </sub>curves for Quadratic Discriminant Analysis with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC<sub>50 </sub>score.</p>
               </text>
               <graphic file="1471-2105-8-358-4"/>
            </fig>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Naive Bayes (NB)</p>
               </caption>
               <text>
                  <p><b>Naive Bayes (NB)</b>. Seven ROC<sub>50 </sub>curves for Naive Bayes with varying amounts of dimensionality reduction using PCA. In the legend, p is the PCA-reduced dimension, and s is the ROC<sub>50 </sub>score.</p>
               </text>
               <graphic file="1471-2105-8-358-5"/>
            </fig>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>K-Nearest Neighbors (KNN)</p>
               </caption>
               <text>
                  <p><b>K-Nearest Neighbors (KNN)</b>. Five ROC<sub>50 </sub>curves for K-Nearest Neighbors for various values of K. The legend gives the ROC<sub>50 </sub>score, s, for each value of K.</p>
               </text>
               <graphic file="1471-2105-8-358-6"/>
            </fig>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Comparison of methods</p>
               </caption>
               <text>
                  <p><b>Comparison of methods</b>. The ROC<sub>50 </sub>curve (purple) for the combined classifier using weighted voting (WV), and the best ROC<sub>50 </sub>curves from each of Figures 2 to 6. In the legend, p is the PCA-reduced dimension of the data, and s is the ROC<sub>50 </sub>score.</p>
               </text>
               <graphic file="1471-2105-8-358-7"/>
            </fig>
            <p>We generated ROC<sub>50 </sub>curves for each supervised learning method combined with various amounts of dimensionality reduction. Using PCA, we reduced the original 290 dimensions to 5, 10, 15, 20, 40 and 100 dimensions, respectively. In this way, the original data set was transformed into six separate data sets of various dimensions. Each basic learning method (except KNN) was applied to the original data set and to each of the six reduced data sets. Thus, for each basic learning method (except KNN), we trained and tested seven different classifiers. In the case of KNN, we used only the original, unreduced data, but with five different values of K. Altogether, we trained and tested a total of 4 &#215; 7 + 5 = 33 different classifiers. Figures <figr fid="F2">2</figr> to <figr fid="F6">6</figr> show the estimated performance of these basic classifiers. Each figure shows a number of ROC<sub>50 </sub>curves, each derived using cross-validation averaged over a number of random splits of the data, as described above. Unlike traditional ROC curves, the axes of these curves give the number of true and false positives, instead of the proportion. The red dash-dot line near the bottom of each figure shows the expected performance of a random classifier (<it>i.e</it>., a classifier that ignores the expression data and guesses whether or not a gene responds to stress by essentially flipping a coin). The ROC<sub>50 </sub>scores for the curves are shown in the legend of each figure.</p>
            <p>As the figures show, in some cases the classifiers perform not much better than random, but in most cases they perform significantly better. The figures also show that the performance of each classification method depends heavily on the amount of dimensionality reduction used. Notice in particular that in some cases, the classifier trained on the reduced data has a much higher ROC<sub>50 </sub>score than the classifier trained on the original, unreduced data. This is especially true for NB and QDA. For instance, the classifiers trained on the original data have low ROC<sub>50 </sub>scores of 182.3 for NB and 115.2 for QDA. This is comparable to the random classifier, whose ROC<sub>50 </sub>score is 122.5. However, reducing the dimensionality of the data to 15 increases their ROC<sub>50 </sub>scores to 1373.1 and 1651.0, respectively. This shows the importance of dimensionality reduction. In contrast, KNN performs well for all the values of K that we used.</p>
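            <p>The random-classifier score quoted above can be reproduced from the class counts, under the assumption that the score is the area beneath the expected curve in count units. With P = 1,031 positives and N = 11,553 - 1,031 = 10,522 negatives in the training data, a random ranking is expected to have accumulated about (P/N)n true positives by the time it reaches n false positives, so that:</p>

```latex
\mathrm{ROC}_{50}^{\mathrm{random}}
  \approx \int_{0}^{50} \frac{P}{N}\, n \,\mathrm{d}n
  = \frac{P}{N}\cdot\frac{50^{2}}{2}
  = \frac{1031}{10522}\times 1250
  \approx 122.5
```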
            <p>Figure <figr fid="F7">7</figr> compares the basic classification methods by plotting the best performance of each. That is, for each of the basic classification methods, the ROC<sub>50 </sub>curve with the highest ROC<sub>50 </sub>score is reproduced in Figure <figr fid="F7">7</figr>. In addition, the figure shows the performance of a classification method that uses a weighted voting scheme (WV) to combine the 33 basic classifiers into a single, composite classifier. Notice that this composite classifier performs best of all. The next section describes how this composite classifier is constructed.</p>
         </sec>
         <sec>
            <st>
               <p>Improving prediction accuracy by combining classifiers</p>
            </st>
            <p>Combining different classifiers in prediction can be thought of as combining different opinions in decision making. The advantage is that a group opinion is better than a single opinion if the single opinions are correctly weighted and combined. In machine-learning systems, classifiers are often combined by weighted voting, in which the discriminant value of the combined classifier is a linear combination of the discriminant values of the individual classifiers. Formally, given a set of basic classifiers, <inline-formula><m:math name="1471-2105-8-358-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:mn>1</m:mn></m:msub><m:mo>,</m:mo><m:mo>&#8230;</m:mo><m:mo>,</m:mo><m:msub><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:mi>M</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcamaaBaaaleaacqaIXaqmaeqaaOGaeiilaWIaeS47IWKaeiilaWIafmOzayMbaKaadaWgaaWcbaGaemyta0eabeaaaaa@3599@</m:annotation></m:semantics></m:math></inline-formula>, and a set of weights, <it>w</it><sub>1</sub>, &#8230;, <it>w</it><sub><it>M</it></sub>, the combined classifier, <inline-formula><m:math name="1471-2105-8-358-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaaaa@2E11@</m:annotation></m:semantics></m:math></inline-formula>, is defined by the equation <inline-formula><m:math name="1471-2105-8-358-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:mo stretchy="false">(</m:mo><m:mi>x</m:mi><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mstyle displaystyle="true"><m:msub><m:mo>&#8721;</m:mo><m:mi>m</m:mi></m:msub><m:mrow><m:msub><m:mi>w</m:mi><m:mi>m</m:mi></m:msub><m:msub><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:mi>m</m:mi></m:msub><m:mo stretchy="false">(</m:mo><m:mi>x</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaiabcIcaOGqabiab=Hha4jabcMcaPiabg2da9maaqababaGaem4DaC3aaSbaaSqaaiabd2gaTbqabaGccuWGMbGzgaqcamaaBaaaleaacqWGTbqBaeqaaOGaeiikaGIae8hEaGNaeiykaKcaleaacqWGTbqBaeqaniabggHiLdaaaa@3EC3@</m:annotation></m:semantics></m:math></inline-formula>. In our case, <it>M </it>= 33, as described above.</p>
            <p>By judiciously choosing the weights, <it>w</it><sub>1</sub>, &#8230;, <it>w</it><sub><it>M</it></sub>, the performance of the combined classifier can be maximized. Various methods are available for doing this, such as model averaging and stacking <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Using these methods on our data sets, we found that the ROC curve of the combined classifier was usually better than the ROC curves of the basic classifiers, as expected. Unfortunately, we also found that the ROC<sub>50 </sub>curve of the combined classifier was usually worse (data not shown). We hypothesized that this is because our data sets are highly unbalanced. Intuitively, model averaging and stacking try to choose weights so as to correctly classify as much data as possible. In our case, this means trying to correctly classify the vast number of negative samples in our data sets, even if this means misclassifying the small number of positives. In other words, these methods try to minimize the total number of false positives, even though we only care about the first fifty.</p>
            <p>To choose appropriate weights for our combined classifier, we used the heuristic that classifiers that perform well should be given more weight than classifiers that perform poorly. In our case, since we want to maximize the ROC<sub>50 </sub>score of the combined classifier, we want to give high weight to classifiers with high ROC<sub>50 </sub>scores. There are many ways to do this, but we found that it was sufficient to estimate and normalize the ROC<sub>50 </sub>score of each basic classifier, and use this as its weight. That is, we used <inline-formula><m:math name="1471-2105-8-358-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>w</m:mi><m:mi>m</m:mi></m:msub><m:mo>=</m:mo><m:msub><m:mover accent="true"><m:mi>s</m:mi><m:mo>^</m:mo></m:mover><m:mi>m</m:mi></m:msub><m:mo>/</m:mo><m:mstyle displaystyle="true"><m:msub><m:mo>&#8721;</m:mo><m:mi>m</m:mi></m:msub><m:mrow><m:msub><m:mover accent="true"><m:mi>s</m:mi><m:mo>^</m:mo></m:mover><m:mi>m</m:mi></m:msub></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemyBa0gabeaakiabg2da9iqbdohaZzaajaWaaSbaaSqaaiabd2gaTbqabaGccqGGVaWldaaeqaqaaiqbdohaZzaajaWaaSbaaSqaaiabd2gaTbqabaaabaGaemyBa0gabeqdcqGHris5aaaa@3B09@</m:annotation></m:semantics></m:math></inline-formula>, where <inline-formula><m:math name="1471-2105-8-358-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>s</m:mi><m:mo>^</m:mo></m:mover><m:mi>m</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGZbWCgaqcamaaBaaaleaacqWGTbqBaeqaaaaa@2FBA@</m:annotation></m:semantics></m:math></inline-formula> is an estimate of the ROC<sub>50 </sub>score of classifier <it>f</it><sub><it>m</it></sub>. Note that with these weights, if each <inline-formula><m:math name="1471-2105-8-358-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:mi>m</m:mi></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcamaaBaaaleaacqWGTbqBaeqaaaaa@2FA0@</m:annotation></m:semantics></m:math></inline-formula>(<b>x</b>) is a number between 0 and 1 (as with our classifiers), then so is <inline-formula><m:math name="1471-2105-8-358-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>f</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGMbGzgaqcaaaa@2E11@</m:annotation></m:semantics></m:math></inline-formula>(<b>x</b>). Also, this method automatically gives low weight to classifiers that use an inappropriate amount of dimensionality reduction, since such classifiers have low ROC<sub>50 </sub>scores. In this way, the combined classifier incorporates not only the best combination of supervised learning methods, but also the best amounts of dimensionality reduction for each method.</p>
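            <p>As a minimal sketch of this weighting scheme (not the authors' implementation), the following Python fragment normalizes a list of estimated ROC<sub>50 </sub>scores into weights and forms the weighted vote for a single sample. The function names are hypothetical, the three scores are borrowed from the values quoted earlier purely for illustration, and the discriminant values are invented.</p>

```python
def normalized_weights(scores):
    """Return w_m = s_m / sum(s) for a list of estimated ROC50 scores s_m."""
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


def combined_discriminant(weights, discriminants):
    """Return the weighted vote sum_m w_m * f_m(x) for one sample x."""
    return sum(w * f for w, f in zip(weights, discriminants))


# Illustrative ROC50 scores for three basic classifiers (assumed inputs).
scores = [1373.1, 1651.0, 182.3]
weights = normalized_weights(scores)

# Invented discriminant values f_m(x), each between 0 and 1, for one gene x.
f_values = [0.9, 0.8, 0.1]
f_combined = combined_discriminant(weights, f_values)
print(weights, f_combined)
```

            <p>Because the weights are non-negative and sum to 1, the combined value is a convex combination of the individual discriminant values and therefore stays between 0 and 1. The low-scoring third classifier contributes little to the vote.</p>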
            <p>To train and evaluate the combined classifier, we used <it>two </it>sets of validation data. After the basic classifiers were trained, one validation set was used to estimate their ROC<sub>50 </sub>scores. The combined classifier was then constructed using these scores, and the second validation set was used to estimate its ROC<sub>50 </sub>curve. Thus, the validation data for the basic classifiers is part of the training data for the combined classifier. To do this in a cross-validation setting, we used what amounts to nested cross-validation (see Methods). As shown in Figure <figr fid="F7">7</figr>, the resulting combined classifier has a higher ROC<sub>50 </sub>score than any of the basic classifiers from which it is made.</p>
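            <p>One plausible way to arrange two validation sets inside cross-validation is sketched below in Python. This is an assumption made for illustration; the Methods section describes the procedure actually used. The function name and the interleaved fold assignment are hypothetical. For every held-out evaluation fold, each remaining fold takes a turn as the set used to estimate the weights, and the rest are used to train the basic classifiers.</p>

```python
def three_way_splits(num_samples, num_folds):
    """Yield (train, weight_val, eval_val) index lists for nested validation.

    eval_val   : outer fold, used only to evaluate the combined classifier
    weight_val : inner fold, used to estimate the basic classifiers' scores
    train      : all remaining samples, used to train the basic classifiers
    """
    folds = [list(range(i, num_samples, num_folds)) for i in range(num_folds)]
    for eval_id in range(num_folds):
        for weight_id in range(num_folds):
            if weight_id == eval_id:
                continue
            train = [
                idx
                for fold_id, fold in enumerate(folds)
                if fold_id not in (eval_id, weight_id)
                for idx in fold
            ]
            yield train, folds[weight_id], folds[eval_id]


splits = list(three_way_splits(num_samples=12, num_folds=4))
print(len(splits), splits[0])
```

            <p>In each split the three index sets are disjoint, so the data used to evaluate the combined classifier never influences either the basic classifiers or the weights.</p>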
            <p>Figure <figr fid="F1">1</figr> gives another view of the performance of the combined classifier. Here, the thin dashed lines are a superposition of ten different curves, where each one is a different estimate of the combined classifier's true ROC<sub>50 </sub>curve. As described earlier, each estimate of a classifier's ROC<sub>50 </sub>curve includes some randomness, due to the random choice of folds during cross-validation. The ten dashed curves in Figure <figr fid="F1">1</figr> are derived from ten different cross-validations, each one using a different set of folds. The thick solid line in the figure is the average of the other ten curves. Because averaging reduces variance, the average curve is a more accurate estimate of the true ROC<sub>50 </sub>curve (i.e., has lower variance) than any of the other ten curves. The diagonal dash-dot line near the bottom of the plot shows the expected performance of a random classifier.</p>
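            <p>The averaging step can be sketched as follows, assuming that every estimated curve is sampled at the same false-positive counts so that the curves can be averaged point by point. The text does not state how the curves were aligned, so this alignment is an assumption. The function name and the numbers are invented for illustration.</p>

```python
from statistics import mean


def average_curve(curves):
    """Average several curves point by point (vertical averaging)."""
    if len({len(curve) for curve in curves}) != 1:
        raise ValueError("all curves must have the same number of points")
    return [mean(points) for points in zip(*curves)]


# Invented true-positive counts at four fixed false-positive counts,
# one list per cross-validation run.
runs = [
    [0.0, 12.0, 20.0, 26.0],
    [0.0, 10.0, 22.0, 27.0],
    [0.0, 14.0, 21.0, 25.0],
]
averaged = average_curve(runs)
print(averaged)
```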
            <p>ROC and ROC<sub>50 </sub>curves plot the number of true positives against the number of false positives. However, in applications such as ours, the <it>precision </it>is also of interest. Precision is the proportion of true positives (TP) among the predicted positives (PP), <it>i.e</it>., TP/PP. (Equivalently, it is the complement of the false discovery rate, 1 - FDR <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>.) Precision is important since each prediction is a potential experiment, and as a matter of economics, a biologist needs an estimate of how many of the experiments will succeed. This is especially important in situations, such as ours, where the number of real negatives is much greater than the number of real positives, and so there is a real possibility of having a huge number of failed experiments.</p>
            <p>Figure <figr fid="F8">8</figr> plots estimated precision against the number of predictions for the first hundred predictions. Notice that as the number of predictions increases (<it>i.e</it>., as the classifier's decision threshold is lowered), the precision decreases, meaning that a smaller fraction of the predictions is expected to be true. As in Figure <figr fid="F1">1</figr>, the thin dashed lines are a superposition of ten different curves, each one an estimate of the true precision curve, and the thick solid line is their average. Also, the horizontal dash-dot line near the bottom of the plot is the expected precision of a random classifier, and its height is equal to the ratio of the number of positives (<it>i.e</it>., stress genes) to the total number of samples (<it>i.e</it>., genes) in the training data. Since all the estimated precision curves are well above the horizontal dash-dot line, the performance of the combined classifier for the first hundred predictions is significantly better than random. Also, since Figures <figr fid="F1">1</figr> and <figr fid="F8">8</figr> show small variance, and since the variance of the average curves will be even smaller, the combined classifier should have stable prediction performance.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Precision curves</p>
               </caption>
               <text>
                  <p><b>Precision curves</b>. Estimated precision curves of the combined classifier (WV), showing ten different estimates (dashed curves) and their average (solid curve).</p>
               </text>
               <graphic file="1471-2105-8-358-8"/>
            </fig>
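            <p>A precision curve of the kind discussed in this section can be computed from a ranked list of validation labels, as in the hedged Python sketch below. The function names and the toy labels are hypothetical. The input is assumed to be sorted by decreasing discriminant value, with 1 marking a true positive and 0 a false positive.</p>

```python
def precision_curve(labels_ranked):
    """Return the precision TP/PP after each of the ranked predictions."""
    curve = []
    true_positives = 0
    for num_predicted, label in enumerate(labels_ranked, start=1):
        true_positives += label
        curve.append(true_positives / num_predicted)
    return curve


def random_precision(num_positives, num_samples):
    """Expected precision of a random classifier: the fraction of positives."""
    return num_positives / num_samples


# Invented labels for eight ranked predictions (1 = stress gene, 0 = not).
ranked = [1, 1, 0, 1, 0, 0, 1, 0]
curve = precision_curve(ranked)
false_discovery = [1.0 - p for p in curve]  # FDR is the complement of precision
baseline = random_precision(sum(ranked), len(ranked))
print(curve, baseline)
```

            <p>Note that an estimated precision curve need not decrease at every step: each additional true positive raises it slightly, which is also visible in the precision column of Table <tblr tid="T1">1</tblr>.</p>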
         </sec>
         <sec>
            <st>
               <p>Stress-response predictions</p>
            </st>
            <p>We trained the combined classifier on our Arabidopsis data set, using all 22,746 genes for Principal Components Analysis, and the 11,553 annotated genes for supervised learning, as described above. We then applied the classifier to the 11,193 unannotated genes to obtain a set of 11,193 predictions (see Methods). Table <tblr tid="T1">1</tblr> shows the top fifty predictions. Each row in the table is a prediction: the first (leftmost) entry is the rank of the prediction (1 being the top prediction); the second entry identifies a gene; the third entry is a discriminant value (measuring the likelihood that the gene responds to stress); and the fourth entry is the estimated precision of the prediction and all predictions above it (<it>i.e</it>., the fraction of these predictions expected to be true). As an example, consider the 23rd row of the table, the row for gene At1g09950. Since the estimated precision in this row is given as 0.7044, we expect that about 70% of the top 23 genes respond to stress, <it>i.e</it>., about 16 genes (23 &#215; 0.7044 &#8776; 16.2).</p>
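            <p>The arithmetic behind this reading of the table can be sketched in a few lines of Python. The function name is hypothetical; the three rows are copied from Table <tblr tid="T1">1</tblr>.</p>

```python
def expected_true_positives(rank, estimated_precision):
    """Expected number of correct predictions among the top `rank` genes."""
    return rank * estimated_precision


# (rank, gene, discriminant value, estimated precision), copied from Table 1.
rows = [
    (1, "At1g61340", 0.7879, 0.8491),
    (10, "At4g18280", 0.6673, 0.7945),
    (23, "At1g09950", 0.5942, 0.7044),
]

for rank, gene, _dv, pr in rows:
    hits = expected_true_positives(rank, pr)
    print(f"top {rank:2d} ({gene}): about {hits:.1f} expected stress genes")
```

            <p>For the 23rd row this gives about 16.2, in agreement with the roughly 16 genes quoted in the text.</p>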
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>The top 50 predictions of the combined classifier ordered by discriminant value</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="center">
                        <p>No.</p>
                     </c>
                     <c ca="center">
                        <p>Gene name</p>
                     </c>
                     <c ca="center">
                        <p>Discriminant value (Dv)</p>
                     </c>
                     <c ca="center">
                        <p>Estimated precision (Pr)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>At1g61340</p>
                     </c>
                     <c ca="center">
                        <p>0.7879</p>
                     </c>
                     <c ca="center">
                        <p>0.8491</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>At1g72660</p>
                     </c>
                     <c ca="center">
                        <p>0.7315</p>
                     </c>
                     <c ca="center">
                        <p>0.8423</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>At5g04340</p>
                     </c>
                     <c ca="center">
                        <p>0.7269</p>
                     </c>
                     <c ca="center">
                        <p>0.8405</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>At1g19180</p>
                     </c>
                     <c ca="center">
                        <p>0.7219</p>
                     </c>
                     <c ca="center">
                        <p>0.8448</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>At2g01520</p>
                     </c>
                     <c ca="center">
                        <p>0.7017</p>
                     </c>
                     <c ca="center">
                        <p>0.8311</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>At2g36220</p>
                     </c>
                     <c ca="center">
                        <p>0.6987</p>
                     </c>
                     <c ca="center">
                        <p>0.8293</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>At5g10695</p>
                     </c>
                     <c ca="center">
                        <p>0.6912</p>
                     </c>
                     <c ca="center">
                        <p>0.8138</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>8</p>
                     </c>
                     <c ca="center">
                        <p>At3g10020</p>
                     </c>
                     <c ca="center">
                        <p>0.6850</p>
                     </c>
                     <c ca="center">
                        <p>0.8030</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>At3g16050</p>
                     </c>
                     <c ca="center">
                        <p>0.6778</p>
                     </c>
                     <c ca="center">
                        <p>0.8000</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>At4g18280</p>
                     </c>
                     <c ca="center">
                        <p>0.6673</p>
                     </c>
                     <c ca="center">
                        <p>0.7945</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>At1g11210</p>
                     </c>
                     <c ca="center">
                        <p>0.6636</p>
                     </c>
                     <c ca="center">
                        <p>0.7955</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>12</p>
                     </c>
                     <c ca="center">
                        <p>At5g64510</p>
                     </c>
                     <c ca="center">
                        <p>0.6514</p>
                     </c>
                     <c ca="center">
                        <p>0.7900</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>At3g09350</p>
                     </c>
                     <c ca="center">
                        <p>0.6412</p>
                     </c>
                     <c ca="center">
                        <p>0.7807</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>At5g42380</p>
                     </c>
                     <c ca="center">
                        <p>0.6357</p>
                     </c>
                     <c ca="center">
                        <p>0.7718</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>At3g44860</p>
                     </c>
                     <c ca="center">
                        <p>0.6278</p>
                     </c>
                     <c ca="center">
                        <p>0.7623</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>At1g73260</p>
                     </c>
                     <c ca="center">
                        <p>0.6252</p>
                     </c>
                     <c ca="center">
                        <p>0.7583</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>At1g16850</p>
                     </c>
                     <c ca="center">
                        <p>0.6186</p>
                     </c>
                     <c ca="center">
                        <p>0.7452</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>18</p>
                     </c>
                     <c ca="center">
                        <p>At1g78070</p>
                     </c>
                     <c ca="center">
                        <p>0.6185</p>
                     </c>
                     <c ca="center">
                        <p>0.7439</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>At3g01830</p>
                     </c>
                     <c ca="center">
                        <p>0.6098</p>
                     </c>
                     <c ca="center">
                        <p>0.7398</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>At5g19875</p>
                     </c>
                     <c ca="center">
                        <p>0.6094</p>
                     </c>
                     <c ca="center">
                        <p>0.7402</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>21</p>
                     </c>
                     <c ca="center">
                        <p>At3g62260</p>
                     </c>
                     <c ca="center">
                        <p>0.6040</p>
                     </c>
                     <c ca="center">
                        <p>0.7213</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>22</p>
                     </c>
                     <c ca="center">
                        <p>At1g03070</p>
                     </c>
                     <c ca="center">
                        <p>0.5961</p>
                     </c>
                     <c ca="center">
                        <p>0.7106</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>23</p>
                     </c>
                     <c ca="center">
                        <p>At1g09950</p>
                     </c>
                     <c ca="center">
                        <p>0.5942</p>
                     </c>
                     <c ca="center">
                        <p>0.7044</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>At1g19020</p>
                     </c>
                     <c ca="center">
                        <p>0.5867</p>
                     </c>
                     <c ca="center">
                        <p>0.6928</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>At1g07430</p>
                     </c>
                     <c ca="center">
                        <p>0.5866</p>
                     </c>
                     <c ca="center">
                        <p>0.6919</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>26</p>
                     </c>
                     <c ca="center">
                        <p>At1g76960</p>
                     </c>
                     <c ca="center">
                        <p>0.5860</p>
                     </c>
                     <c ca="center">
                        <p>0.6901</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>27</p>
                     </c>
                     <c ca="center">
                        <p>At1g30070</p>
                     </c>
                     <c ca="center">
                        <p>0.5838</p>
                     </c>
                     <c ca="center">
                        <p>0.6819</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>28</p>
                     </c>
                     <c ca="center">
                        <p>At2g05510</p>
                     </c>
                     <c ca="center">
                        <p>0.5799</p>
                     </c>
                     <c ca="center">
                        <p>0.6726</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>29</p>
                     </c>
                     <c ca="center">
                        <p>At3g50930</p>
                     </c>
                     <c ca="center">
                        <p>0.5796</p>
                     </c>
                     <c ca="center">
                        <p>0.6726</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>30</p>
                     </c>
                     <c ca="center">
                        <p>At1g67360</p>
                     </c>
                     <c ca="center">
                        <p>0.5767</p>
                     </c>
                     <c ca="center">
                        <p>0.6691</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>31</p>
                     </c>
                     <c ca="center">
                        <p>At5g09530</p>
                     </c>
                     <c ca="center">
                        <p>0.5758</p>
                     </c>
                     <c ca="center">
                        <p>0.6703</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>32</p>
                     </c>
                     <c ca="center">
                        <p>At3g53230</p>
                     </c>
                     <c ca="center">
                        <p>0.5737</p>
                     </c>
                     <c ca="center">
                        <p>0.6663</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>33</p>
                     </c>
                     <c ca="center">
                        <p>At3g55970</p>
                     </c>
                     <c ca="center">
                        <p>0.5694</p>
                     </c>
                     <c ca="center">
                        <p>0.6586</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>34</p>
                     </c>
                     <c ca="center">
                        <p>At4g27657</p>
                     </c>
                     <c ca="center">
                        <p>0.5676</p>
                     </c>
                     <c ca="center">
                        <p>0.6549</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>35</p>
                     </c>
                     <c ca="center">
                        <p>At4g38080</p>
                     </c>
                     <c ca="center">
                        <p>0.5658</p>
                     </c>
                     <c ca="center">
                        <p>0.6458</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>36</p>
                     </c>
                     <c ca="center">
                        <p>At1g17380</p>
                     </c>
                     <c ca="center">
                        <p>0.5651</p>
                     </c>
                     <c ca="center">
                        <p>0.6448</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>37</p>
                     </c>
                     <c ca="center">
                        <p>At4g27652</p>
                     </c>
                     <c ca="center">
                        <p>0.5647</p>
                     </c>
                     <c ca="center">
                        <p>0.6445</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>At1g68500</p>
                     </c>
                     <c ca="center">
                        <p>0.5588</p>
                     </c>
                     <c ca="center">
                        <p>0.6204</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>39</p>
                     </c>
                     <c ca="center">
                        <p>At1g76650</p>
                     </c>
                     <c ca="center">
                        <p>0.5573</p>
                     </c>
                     <c ca="center">
                        <p>0.6146</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>40</p>
                     </c>
                     <c ca="center">
                        <p>At2g15960</p>
                     </c>
                     <c ca="center">
                        <p>0.5549</p>
                     </c>
                     <c ca="center">
                        <p>0.6074</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>41</p>
                     </c>
                     <c ca="center">
                        <p>At1g14870</p>
                     </c>
                     <c ca="center">
                        <p>0.5520</p>
                     </c>
                     <c ca="center">
                        <p>0.6017</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>42</p>
                     </c>
                     <c ca="center">
                        <p>At1g49450</p>
                     </c>
                     <c ca="center">
                        <p>0.5497</p>
                     </c>
                     <c ca="center">
                        <p>0.5991</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>43</p>
                     </c>
                     <c ca="center">
                        <p>At1g13930</p>
                     </c>
                     <c ca="center">
                        <p>0.5467</p>
                     </c>
                     <c ca="center">
                        <p>0.5942</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>44</p>
                     </c>
                     <c ca="center">
                        <p>At2g32190</p>
                     </c>
                     <c ca="center">
                        <p>0.5453</p>
                     </c>
                     <c ca="center">
                        <p>0.5914</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>45</p>
                     </c>
                     <c ca="center">
                        <p>At4g23493</p>
                     </c>
                     <c ca="center">
                        <p>0.5429</p>
                     </c>
                     <c ca="center">
                        <p>0.5879</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>46</p>
                     </c>
                     <c ca="center">
                        <p>At2g28400</p>
                     </c>
                     <c ca="center">
                        <p>0.5418</p>
                     </c>
                     <c ca="center">
                        <p>0.5842</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>47</p>
                     </c>
                     <c ca="center">
                        <p>At1g48720</p>
                     </c>
                     <c ca="center">
                        <p>0.5399</p>
                     </c>
                     <c ca="center">
                        <p>0.5780</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>48</p>
                     </c>
                     <c ca="center">
                        <p>At3g02480</p>
                     </c>
                     <c ca="center">
                        <p>0.5384</p>
                     </c>
                     <c ca="center">
                        <p>0.5721</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="center">
                        <p>At2g43620</p>
                     </c>
                     <c ca="center">
                        <p>0.5376</p>
                     </c>
                     <c ca="center">
                        <p>0.5677</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>50</p>
                     </c>
                     <c ca="center">
                        <p>At4g14270</p>
                     </c>
                     <c ca="center">
                        <p>0.5373</p>
                     </c>
                     <c ca="center">
                        <p>0.5676</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Pr, estimated precision; Dv, discriminant value.</p>
               </tblfn>
            </tbl>
            <p>Figures <figr fid="F9">9</figr> and <figr fid="F10">10</figr> provide visual evidence supporting these predictions. Each figure shows a heat map. These maps, known as "electronic Northerns" (or e-Northerns), were generated using the Expression Browser tool of the Botany Array Resource (BAR) and the AtGenExpress Stress Series (shoot) data set <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. The program contains expression data for more than 22,000 genes across more than 1000 samples collected from NASCArrays, the AtGenExpress Consortium, and the Department of Botany at the University of Toronto <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B36">36</abbr></abbrgrp>. Each row in an e-Northern is a gene, and each column is an experiment. The colour at a point represents the relative expression level of the gene during the experiment. More specifically, the colour represents the log<sub>2 </sub>of the ratio of the average of replicate treatments relative to the average of corresponding controls. Yellow means that under the experimental conditions, the gene had the same expression level as the control. (The wide, yellow vertical stripes are the controls.) Red means that the gene had a higher expression level than the control (up-regulation), and blue means it had a lower expression level (down-regulation). A gene that shows significant up-regulation (or down-regulation) under stress conditions is likely to be involved in response to stress. Thus, unlike cross validation, electronic Northerns provide a means of evaluating the quality of predictions based on the prediction data, not just the training data. The e-Northerns of Figures <figr fid="F9">9</figr> and <figr fid="F10">10</figr>, for instance, are based entirely on prediction data. In these e-Northerns, the experiments exposed the plant to various stress conditions, such as heat, cold, drought and UV-B radiation.
Figure <figr fid="F9">9</figr> is the e-Northern for the top-50 predictions of our combined classifier, <it>i.e</it>., for the 50 genes predicted to be most likely to respond to stress. For comparison, Figure <figr fid="F10">10</figr> is the e-Northern for 50 genes chosen at random from the prediction set. Note that there is much more colour in Figure <figr fid="F9">9</figr> than in Figure <figr fid="F10">10</figr>, especially red. This suggests that our combined classifier has indeed extracted meaningful gene expression patterns for genes that respond to stress.</p>
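            <p>As a rough illustration of the colour scale described above, the value plotted for one gene under one treatment could be computed as in the following Python sketch. The function name <it>log2_ratio</it> and the sample expression levels are hypothetical and are not taken from the Expression Browser; the sketch assumes only that the plotted value is the log<sub>2 </sub>of the mean of the replicate treatments divided by the mean of the corresponding controls.</p>
            <p><![CDATA[
```python
import math
from statistics import mean

def log2_ratio(treatment_replicates, control_replicates):
    """Return log2(mean(treatment) / mean(control)) for one gene.

    Zero means no change relative to the control (yellow in an e-Northern),
    positive means up-regulation (red), negative means down-regulation (blue).
    """
    return math.log2(mean(treatment_replicates) / mean(control_replicates))

# Hypothetical expression levels for a single gene (illustrative values only).
up = log2_ratio([400.0, 440.0], [100.0, 110.0])     # treatment is 4x control
same = log2_ratio([100.0, 110.0], [100.0, 110.0])   # treatment equals control
down = log2_ratio([50.0, 55.0], [100.0, 110.0])     # treatment is half of control
print(up, same, down)
```
]]></p>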
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Electronic Northern analysis</p>
               </caption>
               <text>
                  <p><b>Electronic Northern analysis</b>. E-Northern of the top 50 predictions.</p>
               </text>
               <graphic file="1471-2105-8-358-9"/>
            </fig>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>Electronic Northern analysis</p>
               </caption>
               <text>
                  <p><b>Electronic Northern analysis</b>. E-Northern of 50 randomly selected genes.</p>
               </text>
               <graphic file="1471-2105-8-358-10"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Gene knockout experiments</p>
            </st>
            <p>From the predictions of the combined classifier, three genes were chosen for biological analysis using gene knockout experiments. Here, we present the results for one of these genes, At1g16850, which show it to be necessary for the normal response to temperature and NaCl. Our results also confirm that the other two genes, At1g11210 and At4g39675, are involved in a variety of stress responses (data not shown).</p>
            <p>The criteria used to choose candidate genes for subsequent biological analysis were: 1) the gene must be expressed in either root or shoot, 2) gene expression should be strongly increased in response to abiotic stress, such as cold, drought, osmotic and salt stresses, 3) T-DNA knockout lines &#8211; in which a given gene's expression has been eliminated &#8211; should be available from the Salk Institute <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>, and 4) the gene should not have an annotated function nor be present in any patent database. Further bioinformatics analysis was performed using Athena for promoter motif prediction <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>, Expression Angler for co-expressed gene analysis <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and eFP browser for electronic representation of gene expression patterns <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>.</p>
            <sec>
               <st>
                  <p>Stress response</p>
               </st>
               <p>The increased anthocyanin levels in plants lacking a functional copy of the At1g16850 gene during cold stress at 4 &#176;C indicate that this gene is involved in the cold stress response (Figure <figr fid="F11">11</figr>). The same effect is seen at 30 &#176;C, indicating that this gene is also associated with the response to heat stress (Figure <figr fid="F11">11</figr>). Interestingly, At1g16850 is normally expressed during the later stages of seed maturation, towards seed desiccation, and hence may play a role in seed dormancy. This sort of bifunctionality is seen with other stress response genes, which have documented roles in the cold, heat and salt stress pathways, e.g. RD29A (Response to Desiccation) and LEA (Late Embryogenesis Abundant) protein <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>. These proteins have also been found to accumulate during seed maturation <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp> and are in fact co-expressed with At1g16850 under stress conditions and during seed maturation, as determined using the Expression Angler algorithm <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
               <fig id="F11">
                  <title>
                     <p>Figure 11</p>
                  </title>
                  <caption>
                     <p>Gene knockout experiments</p>
                  </caption>
                  <text>
                     <p><b>Gene knockout experiments</b>. 10-day-old wild-type and mutant plants after exposure for 7 days at 14 &#176;C. (a) The mutant cotyledons appear darker than wild-type due to increased anthocyanin levels. (b) Mutant and wild-type seeds 24 h after sowing on agar plates. Mutant seeds appear lighter in colour than wild-type seeds. (c) Quantification of anthocyanin levels by measuring A535. Bars indicate standard error of 5 replicate measurements. * indicates significantly different at <it>p </it>&lt; 0.05.</p>
                  </text>
                  <graphic file="1471-2105-8-358-11"/>
               </fig>
               <p>In addition to showing an altered response to temperature, plants lacking a functional At1g16850 exhibit a defective root growth phenotype under increasing salt concentrations (Figure <figr fid="F12">12</figr>). This phenotype, combined with previous microarray studies <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, which found At1g16850 induction at 250 mM NaCl, clearly indicates that At1g16850 is also part of the salt stress response pathway.</p>
               <fig id="F12">
                  <title>
                     <p>Figure 12</p>
                  </title>
                  <caption>
                     <p>Gene knockout experiments</p>
                  </caption>
                  <text>
                     <p><b>Gene knockout experiments</b>. Root growth on 50 mM NaCl, relative to growth on 0 mM NaCl, for 10-day-old wild-type and mutant plants transferred to 50 mM NaCl medium. Error bars indicate the standard error of 5 replicates. <it>n </it>= 25 measurements per treatment and genotype. * indicates significantly different at <it>p </it>&lt; 0.001.</p>
                  </text>
                  <graphic file="1471-2105-8-358-12"/>
               </fig>
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>In this study, we evaluated and compared five basic supervised learning methods (LR, LDA, QDA, NB and KNN) for gene function prediction in <it>A. thaliana </it>based solely on gene expression data. The major advantage of supervised methods over unsupervised methods is that by including prior knowledge of class information, supervised methods can ignore uninformative features and select informative features that are useful for separating classes. In this study, we focussed on finding genes that respond to stress, as represented by the term <it>GO:0006950 [response to stress] </it>in the GOBP hierarchy. Using a training set of genes of known function, we used the basic learning methods to predict the stress response of genes of unknown function. We estimated the accuracy of the predictions using ROC<sub>50 </sub>scores derived through cross validation. We found, for instance, that KNN performs well for various values of K. For the other learning methods, the performance depends greatly on whether the data is preprocessed using PCA, and on how much its dimensionality is reduced. Using various values of K and various amounts of dimensionality reduction, we trained and tested a total of 33 basic classifiers.</p>
         <p>We also investigated combining the basic classifiers using weighted voting. Our method of constructing the combined classifier chooses not only the best combination of supervised learning methods, but also the best amount of dimensionality reduction for each method. Our results show that the combined classifier outperforms all the basic classifiers in predicting whether a gene responds to stress. This can be attributed to the relative robustness of methods for combining classifiers. Intuitively, any single learning method represents a single view of the data, while a combination method represents multiple views strategically combined. The proper choice of combining method is important to the success of a combined classifier. For example, model averaging and stacking are well-known methods for combining classifiers <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>; however, we found that while they did improve on the overall ROC curves of the basic classifiers, the ROC<sub>50 </sub>curve was often worse (data not shown). In contrast, our weighted voting method using ROC<sub>50 </sub>scores as weights is simple and provides improved accuracy in predicting stress response in <it>A. thaliana</it>; we would expect it to provide improved accuracy in other organisms and for other gene functions as well.</p>
         <p>Using electronic Northern analysis, we observed significant up-regulation and down-regulation of many of our predictions. The strong up- and down-regulation are also present among the stress-response genes in the training data (data not shown). In contrast, randomly selected genes show much less up- and down-regulation. This visually confirms that the combined classifier is able to distinguish between stress and non-stress genes. Moreover, unlike cross-validation, this confirmation is based on the prediction data, not the training data.</p>
         <p>Using gene knockout experiments &#8211; in which a given gene's expression is eliminated &#8211; we tested three of our predictions. We presented the results for one of these genes, At1g16850, which show it to be involved in the stress response pathways to cold (4 &#176;C), chill (14 &#176;C) and NaCl. We have also confirmed the biological stress-responsive roles of the other two genes, At1g11210 and At4g39675 (data not shown). Further biological studies will determine the pattern of expression in specific cell and tissue types of the plant and the exact physiological role of these genes.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Preprocessing of raw gene expression data</p>
            </st>
            <p>The gene expression data from the Botany Array Resource at the University of Toronto contain <it>detection calls</it>: P (present), M (marginal) and A (absent). The detection call determines whether a transcript is reliably detected (present), partially detected (marginal), or not detected (absent). The following is an example for the gene <it>At3g24440 </it>under three selected conditions:</p>
            <p>AT3G24440 : 243.10 P : 120.90 A : 109.40 M</p>
            <p>We simply removed these detection calls (P, A, and M) in this study. In addition, gene expression levels were log transformed. The transformed data have approximately normal distributions while the raw data have approximately exponential distributions (data not shown). Many of the learning methods used in this study were designed with normal data in mind.</p>
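            <p>The preprocessing described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout is taken from the At3g24440 example (fields separated by colons, each expression level followed by its detection call), the function name <it>parse_record</it> is hypothetical, and the base of the logarithm is not stated in the text, so base 2 is assumed.</p>
            <p><![CDATA[
```python
import math

def parse_record(line):
    """Split 'GENE : value CALL : value CALL ...' into (gene, log2 values).

    The detection calls (P, M, A) are discarded, as in the text. The base of
    the logarithm is not stated in the source; base 2 is assumed here.
    """
    fields = [f.strip() for f in line.split(":")]
    gene = fields[0]
    values = []
    for field in fields[1:]:
        level, _call = field.split()   # e.g. "243.10 P" -> "243.10", "P"
        values.append(math.log2(float(level)))
    return gene, values

gene, logged = parse_record("AT3G24440 : 243.10 P : 120.90 A : 109.40 M")
print(gene, [round(v, 3) for v in logged])
```
]]></p>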
         </sec>
         <sec>
            <st>
               <p>Basic supervised learning methods</p>
            </st>
            <p>Each of the learning methods described below trains a discriminative classifier. We used the methods to train binary classifiers in which the two classes correspond to genes that respond to stress (Class 1) and genes that do not (Class 0). Given a vector, <b>x</b>, of gene expression measurements, each classifier returns a discriminant value, <it>dv</it>(<b>x</b>), reflecting the classifier's confidence that the gene belongs to Class 1. The gene is assigned to Class 1 if and only if <it>dv</it>(<b>x</b>) > <it>&#964;</it>, where <it>&#964; </it>is a decision threshold. For the classifiers LR, LDA, QDA and NB, the discriminant value is an estimate of <it>p</it>(<it>k </it>= 1|<b>x</b>), the posterior probability that the gene is in Class 1. For KNN, the discriminant value is simply a number between 0 and 1.</p>
            <sec>
               <st>
                  <p>LR (Logistic Regression)</p>
               </st>
               <p>Given a set of classes, LR models the log of the ratio of posterior probabilities (the log-odds) for any pair of classes as a linear function of the test vector, <b>x</b>, and thus defines linear decision boundaries between the classes. In the case of just two classes, the model has the simple form</p>
               <p>
                  <display-formula id="M1">
                     <m:math name="1471-2105-8-358-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>log</m:mi>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                    <m:mo>|</m:mo>
                                    <m:mi>x</m:mi>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:mi>k</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>0</m:mn>
                                    <m:mo>|</m:mo>
                                    <m:mi>x</m:mi>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mfrac>
                              <m:mo>=</m:mo>
                              <m:msub>
                                 <m:mi>&#946;</m:mi>
                                 <m:mn>0</m:mn>
                              </m:msub>
                              <m:mo>+</m:mo>
                              <m:msubsup>
                                 <m:mi>&#946;</m:mi>
                                 <m:mn>1</m:mn>
                                 <m:mi>T</m:mi>
                              </m:msubsup>
                              <m:mi>x</m:mi>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaieGacqWFSbaBcqWFVbWBcqWFNbWzdaWcaaqaaiabdchaWjabcIcaOiabdUgaRjabg2da9iabigdaXiabcYha8Hqabiab+Hha4jabcMcaPaqaaiabdchaWjabcIcaOiabdUgaRjabg2da9iabicdaWiabcYha8jab+Hha4jabcMcaPaaacqGH9aqpiiGacqqFYoGydaWgaaWcbaGaeGimaadabeaakiabgUcaRiab9j7aInaaDaaaleaacqaIXaqmaeaacqWGubavaaGccqGF4baEaaa@4DC2@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>and hence,</p>
               <p>
                  <display-formula id="M2">
                     <m:math name="1471-2105-8-358-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>p</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>k</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>1</m:mn>
                              <m:mo>|</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msup>
                                       <m:mi>e</m:mi>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>&#946;</m:mi>
                                             <m:mn>0</m:mn>
                                          </m:msub>
                                          <m:mo>+</m:mo>
                                          <m:msubsup>
                                             <m:mi>&#946;</m:mi>
                                             <m:mn>1</m:mn>
                                             <m:mi>T</m:mi>
                                          </m:msubsup>
                                          <m:mi>x</m:mi>
                                       </m:mrow>
                                    </m:msup>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mn>1</m:mn>
                                    <m:mo>+</m:mo>
                                    <m:msup>
                                       <m:mi>e</m:mi>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>&#946;</m:mi>
                                             <m:mn>0</m:mn>
                                          </m:msub>
                                          <m:mo>+</m:mo>
                                          <m:msubsup>
                                             <m:mi>&#946;</m:mi>
                                             <m:mn>1</m:mn>
                                             <m:mi>T</m:mi>
                                          </m:msubsup>
                                          <m:mi>x</m:mi>
                                       </m:mrow>
                                    </m:msup>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWGRbWAcqGH9aqpcqaIXaqmcqGG8baFieqacqWF4baEcqGGPaqkcqGH9aqpdaWcaaqaaiabdwgaLnaaCaaaleqabaacciGae4NSdi2aaSbaaWqaaiabicdaWaqabaWccqGHRaWkcqGFYoGydaqhaaadbaGaeGymaedabaGaemivaqfaaSGae8hEaGhaaaGcbaGaeGymaeJaey4kaSIaemyzau2aaWbaaSqabeaacqGFYoGydaWgaaadbaGaeGimaadabeaaliabgUcaRiab+j7aInaaDaaameaacqaIXaqmaeaacqWGubavaaWccqWF4baEaaaaaaaa@4E33@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>
                  <display-formula id="M3">
                     <m:math name="1471-2105-8-358-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>p</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>k</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>0</m:mn>
                              <m:mo>|</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mn>1</m:mn>
                                 <m:mrow>
                                    <m:mn>1</m:mn>
                                    <m:mo>+</m:mo>
                                    <m:msup>
                                       <m:mi>e</m:mi>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>&#946;</m:mi>
                                             <m:mn>0</m:mn>
                                          </m:msub>
                                          <m:mo>+</m:mo>
                                          <m:msubsup>
                                             <m:mi>&#946;</m:mi>
                                             <m:mn>1</m:mn>
                                             <m:mi>T</m:mi>
                                          </m:msubsup>
                                          <m:mi>x</m:mi>
                                       </m:mrow>
                                    </m:msup>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWGRbWAcqGH9aqpcqaIWaamcqGG8baFieqacqWF4baEcqGGPaqkcqGH9aqpdaWcaaqaaiabigdaXaqaaiabigdaXiabgUcaRiabdwgaLnaaCaaaleqabaacciGae4NSdi2aaSbaaWqaaiabicdaWaqabaWccqGHRaWkcqGFYoGydaWgaaadbaacbaGae0xmaedabeaaliab=Hha4baaaaaaaa@4356@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>and <it>p</it>(<it>k </it>= 1|<b>x</b>) + <it>p</it>(<it>k </it>= 0|<b>x</b>) = 1. The parameters <it>&#946;</it><sub>0 </sub>and <it>&#946;</it><sub>1 </sub>are fitted to the training data using maximum likelihood <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
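               <p>For illustration, the two-class posterior above can be evaluated from already fitted parameters as in the following Python sketch. The function names, the parameter values and the default threshold <it>&#964; </it>= 0.5 are hypothetical; the maximum likelihood fitting step itself is not shown.</p>
               <p><![CDATA[
```python
import math

def lr_posterior(x, beta0, beta1):
    """Return p(k = 1 | x) = e^z / (1 + e^z), where z = beta0 + beta1 . x."""
    z = beta0 + sum(b * xi for b, xi in zip(beta1, x))
    # Equivalent form 1 / (1 + e^(-z)); branch on the sign of z so that
    # math.exp never overflows for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def assign_class(x, beta0, beta1, tau=0.5):
    """Assign Class 1 if and only if the discriminant value exceeds tau.

    The default tau = 0.5 is an assumption made for this example.
    """
    return 1 if lr_posterior(x, beta0, beta1) > tau else 0

# Hypothetical parameters and expression vector (illustrative values only).
beta0, beta1 = -1.0, [0.8, -0.3]
x = [2.0, 1.0]
p1 = lr_posterior(x, beta0, beta1)
print(p1, 1.0 - p1, assign_class(x, beta0, beta1))
```
]]></p>
               <p>Because <it>p</it>(<it>k </it>= 0|<b>x</b>) = 1 - <it>p</it>(<it>k </it>= 1|<b>x</b>), only one of the two posteriors needs to be computed.</p>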
            </sec>
            <sec>
               <st>
                  <p>LDA (Linear Discriminant Analysis)</p>
               </st>
               <p>LDA models the classes as multivariate Gaussians, where each class is assumed to have the same covariance matrix. The density function for class <it>k </it>is therefore given by</p>
               <p>
                  <display-formula id="M4">
                     <m:math name="1471-2105-8-358-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>g</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mn>1</m:mn>
                                 <m:mrow>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:mn>2</m:mn>
                                          <m:mi>&#960;</m:mi>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mi>p</m:mi>
                                          <m:mo>/</m:mo>
                                          <m:mn>2</m:mn>
                                       </m:mrow>
                                    </m:msup>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mrow>
                                             <m:mo>|</m:mo>
                                             <m:mi>&#931;</m:mi>
                                             <m:mo>|</m:mo>
                                          </m:mrow>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:mn>1</m:mn>
                                          <m:mo>/</m:mo>
                                          <m:mn>2</m:mn>
                                       </m:mrow>
                                    </m:msup>
                                 </m:mrow>
                              </m:mfrac>
                              <m:msup>
                                 <m:mi>e</m:mi>
                                 <m:mrow>
                                    <m:mo>&#8722;</m:mo>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:mi>x</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:msub>
                                             <m:mi>&#956;</m:mi>
                                             <m:mi>k</m:mi>
                                          </m:msub>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                       <m:mi>T</m:mi>
                                    </m:msup>
                                    <m:msup>
                                       <m:mi>&#931;</m:mi>
                                       <m:mrow>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mn>1</m:mn>
                                       </m:mrow>
                                    </m:msup>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:mi>x</m:mi>
                                    <m:mo>&#8722;</m:mo>
                                    <m:msub>
                                       <m:mi>&#956;</m:mi>
                                       <m:mi>k</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                    <m:mo>/</m:mo>
                                    <m:mn>2</m:mn>
                                 </m:mrow>
                              </m:msup>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGNbWzdaWgaaWcbaGaem4AaSgabeaakiabcIcaOGqabiab=Hha4jabcMcaPiabg2da9maalaaabaGaeGymaedabaGaeiikaGIaeGOmaidcciGae4hWdaNaeiykaKYaaWbaaSqabeaacqWGWbaCcqGGVaWlcqaIYaGmaaGcdaabdaqaaiabfo6atbGaay5bSlaawIa7amaaCaaaleqabaGaeGymaeJaei4la8IaeGOmaidaaaaakiabdwgaLnaaCaaaleqabaGaeyOeI0IaeiikaGIae8hEaGNaeyOeI0Iae4hVd02aaSbaaWqaaiab=TgaRbqabaWccqGGPaqkdaahaaadbeqaaiabdsfaubaaliabfo6atnaaCaaameqabaGaeyOeI0IaeGymaedaaSGaeiikaGIae8hEaGNaeyOeI0Iae4hVd02aaSbaaWqaaiab=TgaRbqabaWccqGGPaqkcqGGVaWlcqaIYaGmaaaaaa@5C4A@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>&#956;</it><sub><it>k </it></sub>is the mean vector for class <it>k</it>, &#931; is the common covariance matrix, and <it>p </it>is the dimensionality of <b>x</b>. It can be shown <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> that the discriminant function for class <it>k </it>is equivalent to the following function:</p>
               <p>
                  <display-formula id="M5">
                     <m:math name="1471-2105-8-358-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>&#948;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:msup>
                                 <m:mi>x</m:mi>
                                 <m:mi>T</m:mi>
                              </m:msup>
                              <m:msup>
                                 <m:mi>&#931;</m:mi>
                                 <m:mrow>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                              </m:msup>
                              <m:msub>
                                 <m:mi>&#956;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo>&#8722;</m:mo>
                              <m:mfrac>
                                 <m:mn>1</m:mn>
                                 <m:mn>2</m:mn>
                              </m:mfrac>
                              <m:msubsup>
                                 <m:mi>&#956;</m:mi>
                                 <m:mi>k</m:mi>
                                 <m:mi>T</m:mi>
                              </m:msubsup>
                              <m:msup>
                                 <m:mi>&#931;</m:mi>
                                 <m:mrow>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                              </m:msup>
                              <m:msub>
                                 <m:mi>&#956;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo>+</m:mo>
                              <m:mi>l</m:mi>
                              <m:mi>o</m:mi>
                              <m:mi>g</m:mi>
                              <m:msub>
                                 <m:mi>&#960;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaacciGae8hTdq2aaSbaaSqaaiabdUgaRbqabaGccqGGOaakimqacaGF4bGaeiykaKIaeyypa0Jaa4hEamaaCaaaleqabaGaemivaqfaaOGaeu4Odm1aaWbaaSqabeaacqGHsislcqaIXaqmaaGccqWF8oqBdaWgaaWcbaGaem4AaSgabeaakiabgkHiTmaalaaabaGaeGymaedabaGaeGOmaidaaiab=X7aTnaaDaaaleaacqWGRbWAaeaacqWGubavaaGccqqHJoWudaahaaWcbeqaaiabgkHiTiabigdaXaaakiab=X7aTnaaBaaaleaacqWGRbWAaeqaaOGaey4kaSccdiGaa0hBaiaa99gacaqFNbGae8hWda3aaSbaaSqaaiabdUgaRbqabaaaaa@624C@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>&#960;</it><sub><it>k </it></sub>is the prior probability of class <it>k</it>. The decision boundaries are therefore linear. The parameters <it>&#960;</it><sub><it>k</it></sub>, <it>&#956;</it><sub><it>k </it></sub>and &#931; are estimated by applying maximum likelihood to the training data <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, giving</p>
               <p>
                  <display-formula id="M6">
                     <m:math name="1471-2105-8-358-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>&#960;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mi>n</m:mi>
                                       <m:mi>k</m:mi>
                                    </m:msub>
                                 </m:mrow>
                                 <m:mi>n</m:mi>
                              </m:mfrac>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFapaCdaWgaaWcbaGaem4AaSgabeaakiabg2da9maalaaabaGaemOBa42aaSbaaSqaaiabdUgaRbqabaaakeaacqWGUbGBaaaaaa@357A@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>
                  <display-formula id="M7">
                     <m:math name="1471-2105-8-358-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>&#956;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>x</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                       <m:mo>&#8712;</m:mo>
                                       <m:mi>k</m:mi>
                                    </m:mrow>
                                 </m:munder>
                                 <m:mrow>
                                    <m:mfrac>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>x</m:mi>
                                             <m:mi>i</m:mi>
                                          </m:msub>
                                       </m:mrow>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>n</m:mi>
                                             <m:mi>k</m:mi>
                                          </m:msub>
                                       </m:mrow>
                                    </m:mfrac>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF8oqBdaWgaaWcbaGaem4AaSgabeaakiabg2da9maaqafabaWaaSaaaeaaieqacqGF4baEdaWgaaWcbaGaemyAaKgabeaaaOqaaiabd6gaUnaaBaaaleaacqWGRbWAaeqaaaaaaeaacqWGPbqAaeqaniabggHiLdaaaa@3A86@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>
                  <display-formula id="M8">
                     <m:math name="1471-2105-8-358-i14" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>&#931;</m:mi>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munder>
                                    <m:mo>&#8721;</m:mo>
                                    <m:mi>k</m:mi>
                                 </m:munder>
                                 <m:mrow>
                                    <m:mstyle displaystyle="true">
                                       <m:munder>
                                          <m:mo>&#8721;</m:mo>
                                          <m:mrow>
                                             <m:msub>
                                                <m:mi>x</m:mi>
                                                <m:mi>i</m:mi>
                                             </m:msub>
                                             <m:mo>&#8712;</m:mo>
                                             <m:mi>k</m:mi>
                                          </m:mrow>
                                       </m:munder>
                                       <m:mrow>
                                          <m:mfrac>
                                             <m:mrow>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:msub>
                                                   <m:mi>x</m:mi>
                                                   <m:mi>i</m:mi>
                                                </m:msub>
                                                <m:mo>&#8722;</m:mo>
                                                <m:msub>
                                                   <m:mi>&#956;</m:mi>
                                                   <m:mi>k</m:mi>
                                                </m:msub>
                                                <m:mo stretchy="false">)</m:mo>
                                                <m:msup>
                                                   <m:mrow>
                                                      <m:mo stretchy="false">(</m:mo>
                                                      <m:msub>
                                                         <m:mi>x</m:mi>
                                                         <m:mi>i</m:mi>
                                                      </m:msub>
                                                      <m:mo>&#8722;</m:mo>
                                                      <m:msub>
                                                         <m:mi>&#956;</m:mi>
                                                         <m:mi>k</m:mi>
                                                      </m:msub>
                                                      <m:mo stretchy="false">)</m:mo>
                                                   </m:mrow>
                                                   <m:mi>T</m:mi>
                                                </m:msup>
                                             </m:mrow>
                                             <m:mrow>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>n</m:mi>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mi>K</m:mi>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                          </m:mfrac>
                                       </m:mrow>
                                    </m:mstyle>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqqHJoWucqGH9aqpdaaeqbqaamaaqafabaWaaSaaaeaacqGGOaakieqacqWF4baEdaWgaaWcbaGaemyAaKgabeaakiabgkHiTGGaciab+X7aTnaaBaaaleaacqWGRbWAaeqaaOGaeiykaKIaeiikaGIae8hEaG3aaSbaaSqaaiabdMgaPbqabaGccqGHsislcqGF8oqBdaWgaaWcbaGaem4AaSgabeaakiabcMcaPmaaCaaaleqabaGaemivaqfaaaGcbaGaeiikaGIaemOBa4MaeyOeI0Iaem4saSKaeiykaKcaaaWcbaGaem4zaC2aaSbaaWqaaiabdMgaPbqabaWccqGHiiIZcqWGRbWAaeqaniabggHiLdaaleaacqWGRbWAaeqaniabggHiLdaaaa@532D@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>n </it>is the total number of training samples, <it>n</it><sub><it>k </it></sub>is the number of training samples in class <it>k</it>, and <it>K </it>is the number of classes. In this study, <it>K </it>= 2.</p>
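               <p>As an illustration only, the estimates and the discriminant function above can be sketched in Python. This is a minimal sketch, not the implementation used in this study: the function names (lda_fit, inverse_2x2, lda_discriminant, lda_predict) are hypothetical, and the feature vectors are assumed to be two-dimensional so that the covariance matrix can be inverted in closed form.</p>
               <p>

```python
import math

def lda_fit(samples, labels):
    """Estimate priors, class means and the pooled covariance (2-D features only)."""
    classes = sorted(set(labels))
    n, K = len(samples), len(classes)
    priors, means = {}, {}
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for k in classes:
        xs = [x for x, y in zip(samples, labels) if y == k]
        n_k = len(xs)
        priors[k] = n_k / n                          # pi_k = n_k / n
        means[k] = [sum(x[j] for x in xs) / n_k for j in range(2)]
        for x in xs:                                 # add (x - mu_k)(x - mu_k)^T / (n - K)
            d = [x[j] - means[k][j] for j in range(2)]
            for a in range(2):
                for b in range(2):
                    pooled[a][b] += d[a] * d[b] / (n - K)
    return priors, means, pooled

def inverse_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix (assumes a non-zero determinant)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def lda_discriminant(x, prior, mean, cov_inv):
    """delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k."""
    s_mu = [sum(cov_inv[a][b] * mean[b] for b in range(2)) for a in range(2)]
    x_term = sum(x[a] * s_mu[a] for a in range(2))
    mu_term = sum(mean[a] * s_mu[a] for a in range(2))
    return x_term - 0.5 * mu_term + math.log(prior)

def lda_predict(x, priors, means, pooled):
    """Return the class whose discriminant value is largest."""
    cov_inv = inverse_2x2(pooled)
    return max(priors, key=lambda k: lda_discriminant(x, priors[k], means[k], cov_inv))
```

               </p>
               <p>Because every class shares the same covariance matrix, the quadratic term in <b>x</b> cancels when two discriminant values are compared, which is why only terms linear in <b>x</b> appear in lda_discriminant.</p>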
            </sec>
            <sec>
               <st>
                  <p>QDA (Quadratic Discriminant Analysis)</p>
               </st>
               <p>QDA is a generalization of LDA in which each class has its own covariance matrix, &#931;<sub><it>k</it></sub>. In this case, it can be shown <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> that the discriminant function for class <it>k </it>is equivalent to the following function:</p>
               <p>
                  <display-formula id="M9">
                     <m:math name="1471-2105-8-358-i15" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:msub>
                                 <m:mi>&#948;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mo>&#8722;</m:mo>
                              <m:mfrac>
                                 <m:mn>1</m:mn>
                                 <m:mn>2</m:mn>
                              </m:mfrac>
                              <m:mi>l</m:mi>
                              <m:mi>o</m:mi>
                              <m:mi>g</m:mi>
                              <m:mrow>
                                 <m:mo>(</m:mo>
                                 <m:mrow>
                                    <m:mrow>
                                       <m:mo>|</m:mo>
                                       <m:mrow>
                                          <m:msub>
                                             <m:mi>&#931;</m:mi>
                                             <m:mi>k</m:mi>
                                          </m:msub>
                                       </m:mrow>
                                       <m:mo>|</m:mo>
                                    </m:mrow>
                                 </m:mrow>
                                 <m:mo>)</m:mo>
                              </m:mrow>
                              <m:mo>&#8722;</m:mo>
                              <m:mfrac>
                                 <m:mn>1</m:mn>
                                 <m:mn>2</m:mn>
                              </m:mfrac>
                              <m:msup>
                                 <m:mrow>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:mi>x</m:mi>
                                    <m:mo>&#8722;</m:mo>
                                    <m:msub>
                                       <m:mi>&#956;</m:mi>
                                       <m:mi>k</m:mi>
                                    </m:msub>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                                 <m:mi>T</m:mi>
                              </m:msup>
                              <m:msubsup>
                                 <m:mi>&#931;</m:mi>
                                 <m:mi>k</m:mi>
                                 <m:mrow>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                              </m:msubsup>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo>&#8722;</m:mo>
                              <m:msub>
                                 <m:mi>&#956;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>+</m:mo>
                              <m:mi>l</m:mi>
                              <m:mi>o</m:mi>
                              <m:mi>g</m:mi>
                              <m:msub>
                                 <m:mi>&#960;</m:mi>
                                 <m:mi>k</m:mi>
                              </m:msub>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF0oazdaWgaaWcbaGaem4AaSgabeaakiabcIcaOGqabiab+Hha4jabcMcaPiabg2da9iabgkHiTmaalaaabaGaeGymaedabaGaeGOmaidaaGqaciab9XgaSjab99gaVjab9DgaNnaabmaabaWaaqWaaeaacqqHJoWudaWgaaWcbaGaem4AaSgabeaaaOGaay5bSlaawIa7aaGaayjkaiaawMcaaiabgkHiTmaalaaabaGaeGymaedabaGaeGOmaidaaiabcIcaOiab+Hha4jabgkHiTiab=X7aTnaaBaaaleaacqWGRbWAaeqaaOGaeiykaKYaaWbaaSqabeaacqWGubavaaGccqqHJoWudaqhaaWcbaGaem4AaSgabaGaeyOeI0IaeGymaedaaOGaeiikaGIae4hEaGNaeyOeI0Iae8hVd02aaSbaaSqaaiabdUgaRbqabaGccqGGPaqkcqGHRaWkcqqFSbaBcqqFVbWBcqqFNbWzcqWFapaCdaWgaaWcbaGaem4AaSgabeaaaaa@6300@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>The decision boundaries are therefore quadratic. Again, the parameters are estimated by applying maximum likelihood to the training data <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
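               <p>The QDA discriminant function can be sketched in the same way. The following is a hedged illustration rather than the code used in this study: the function name qda_discriminant is hypothetical, the features are assumed to be two-dimensional, and each class covariance matrix is assumed to be invertible.</p>
               <p>

```python
import math

def qda_discriminant(x, prior, mean, cov):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x - mu_k)^T S_k^-1 (x - mu_k) + log pi_k.

    Sketch for 2-D feature vectors; cov is the 2 x 2 covariance matrix of class k.
    """
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mean[0], x[1] - mean[1]]
    # Quadratic form (x - mu_k)^T S_k^-1 (x - mu_k)
    quad = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    return -0.5 * math.log(det) - 0.5 * quad + math.log(prior)
```

               </p>
               <p>For example, two classes with the same mean but different covariance matrices cannot be separated by LDA, whereas comparing the two values of qda_discriminant assigns points near the common mean to the class with the smaller covariance and distant points to the class with the larger one.</p>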
            </sec>
            <sec>
               <st>
                  <p>NB (Naive Bayes)</p>
               </st>
               <p>NB is based on the independent variable assumption: for each class, the variables in the feature vector <b>x </b>are assumed to be independent. This assumption allows the class-conditional density <it>p</it>(<it>x</it><sub><it>i</it></sub>|<it>k</it>) to be estimated separately for each variable, <it>x</it><sub><it>i</it></sub>. In essence, NB reduces the problem of multi-dimensional density estimation to that of one-dimensional density estimation. Given a class, <it>k</it>, the variables in the feature vector <b>x </b>= (<it>x</it><sub>1</sub>, <it>x</it><sub>2</sub>, ..., <it>x</it><sub><it>p</it></sub>)<sup><it>T </it></sup>are mutually independent, so</p>
               <p>
                  <display-formula id="M10">
                     <m:math name="1471-2105-8-358-i16" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>p</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo>|</m:mo>
                              <m:mi>k</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>=</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8719;</m:mo>
                                    <m:mi>i</m:mi>
                                    <m:mi>p</m:mi>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>x</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mo>|</m:mo>
                                    <m:mi>k</m:mi>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakieqacqWF4baEcqGG8baFcqWGRbWAcqGGPaqkcqGH9aqpdaqeWbqaaiabdchaWjabcIcaOiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaem4AaSMaeiykaKcaleaacqWGPbqAaeaacqWGWbaCa0Gaey4dIunaaaa@4324@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>Using Bayes' rule, we obtain</p>
               <p>
                  <display-formula id="M11">
                     <m:math name="1471-2105-8-358-i17" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>p</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>k</m:mi>
                              <m:mo>|</m:mo>
                              <m:mi>x</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mo>&#8733;</m:mo>
                              <m:mi>p</m:mi>
                              <m:mo stretchy="false">(</m:mo>
                              <m:mi>k</m:mi>
                              <m:mo stretchy="false">)</m:mo>
                              <m:mstyle displaystyle="true">
                                 <m:munderover>
                                    <m:mo>&#8719;</m:mo>
                                    <m:mi>i</m:mi>
                                    <m:mi>p</m:mi>
                                 </m:munderover>
                                 <m:mrow>
                                    <m:mi>p</m:mi>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:msub>
                                       <m:mi>x</m:mi>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mo>|</m:mo>
                                    <m:mi>k</m:mi>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                              </m:mstyle>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqGGOaakcqWGRbWAcqGG8baFieqacqWF4baEcqGGPaqkcqGHDisTcqWGWbaCcqGGOaakcqWGRbWAcqGGPaqkdaqeWbqaaiabdchaWjabcIcaOiabdIha4naaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaem4AaSMaeiykaKcaleaacqWGPbqAaeaacqWGWbaCa0Gaey4dIunaaaa@4818@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>p</it>(<it>k</it>) is the prior probability of class <it>k</it>, estimated as the ratio of the number of the training samples in class <it>k </it>to the total number of training samples. In this paper, we model each variable as a univariate Gaussian, so <it>p</it>(<it>x</it><sub><it>i</it></sub>|<it>k</it>) = <it>N</it>(<inline-formula><m:math name="1471-2105-8-358-i18" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#956;</m:mi><m:mi>i</m:mi><m:mi>k</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF8oqBdaqhaaWcbaGaemyAaKgabaGaem4AaSgaaaaa@3150@</m:annotation></m:semantics></m:math></inline-formula>, <inline-formula><m:math name="1471-2105-8-358-i19" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#963;</m:mi><m:mi>i</m:mi><m:mi>k</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCdaqhaaWcbaGaemyAaKgabaGaem4AaSgaaaaa@315D@</m:annotation></m:semantics></m:math></inline-formula>), where the parameters <inline-formula><m:math name="1471-2105-8-358-i18" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#956;</m:mi><m:mi>i</m:mi><m:mi>k</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWF8oqBdaqhaaWcbaGaemyAaKgabaGaem4AaSgaaaaa@3150@</m:annotation></m:semantics></m:math></inline-formula> and <inline-formula><m:math name="1471-2105-8-358-i19" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>&#963;</m:mi><m:mi>i</m:mi><m:mi>k</m:mi></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFdpWCdaqhaaWcbaGaemyAaKgabaGaem4AaSgaaaaa@315D@</m:annotation></m:semantics></m:math></inline-formula> are estimated by applying maximum likelihood to the training data <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Note that NB has far fewer parameters to estimate than either LDA or QDA, and for this reason, it often performs surprisingly well in practice, despite the unrealistic assumption of independent variables <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
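               <p>A Gaussian NB classifier of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in this study: the function names (nb_fit, nb_log_posterior, nb_predict) are hypothetical, every variable is assumed to have non-zero variance within each class, and the posterior is computed in log space (up to an additive constant) to avoid numerical underflow when <it>p </it>is large.</p>
               <p>

```python
import math

def nb_fit(samples, labels):
    """Estimate, per class, the prior and a univariate Gaussian for each variable."""
    n = len(samples)
    params = {}
    for k in sorted(set(labels)):
        xs = [x for x, y in zip(samples, labels) if y == k]
        n_k, p = len(xs), len(xs[0])
        means = [sum(x[i] for x in xs) / n_k for i in range(p)]
        # Maximum-likelihood variance: divide by n_k, not n_k - 1
        variances = [sum((x[i] - means[i]) ** 2 for x in xs) / n_k for i in range(p)]
        params[k] = (n_k / n, means, variances)
    return params

def nb_log_posterior(x, prior, means, variances):
    """log p(k) + sum_i log N(x_i; mu_i^k, var_i^k), i.e. log p(k|x) up to a constant."""
    total = math.log(prior)
    for x_i, mu, var in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * var) - (x_i - mu) ** 2 / (2.0 * var)
    return total

def nb_predict(x, params):
    """Return the class with the largest (unnormalized) log posterior."""
    return max(params, key=lambda k: nb_log_posterior(x, *params[k]))
```

               </p>
               <p>For each class, nb_fit stores only <it>p </it>means and <it>p </it>variances, whereas a full covariance matrix has <it>p</it>(<it>p </it>+ 1)/2 distinct entries.</p>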
            </sec>
            <sec>
               <st>
                  <p>KNN (K-Nearest Neighbors)</p>
               </st>
               <p>KNN is a nonparametric method, since it does not require the estimation of any parameters. Instead, to classify a test vector, KNN finds the vector's <it>K </it>nearest neighbors in the training data. If <it>K</it><sub>1 </sub>is the number of these neighbors in Class 1, then <it>K</it><sub>1</sub><it>/K </it>is returned as the discriminant value. The test vector is therefore assigned to Class 1 if and only if <it>K</it><sub>1</sub><it>/K </it>> <it>&#964;</it>, where <it>&#964; </it>is the decision threshold.</p>
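               <p>The KNN discriminant value and decision rule can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function names (pearson_distance, knn_discriminant, knn_predict) are hypothetical, class labels are assumed to be coded as 0 and 1, and the distance is taken to be one minus the Pearson correlation coefficient, which is undefined for a constant vector.</p>
               <p>

```python
import math

def pearson_distance(x, y):
    """Return 1 - rho, where rho is the Pearson correlation of vectors x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1.0 - num / den

def knn_discriminant(test, train, labels, K, distance=pearson_distance):
    """Return K1 / K, the fraction of the K nearest training vectors in Class 1."""
    order = sorted(range(len(train)), key=lambda i: distance(test, train[i]))
    K1 = sum(1 for i in order[:K] if labels[i] == 1)
    return K1 / K

def knn_predict(test, train, labels, K, tau):
    """Assign the test vector to Class 1 if and only if K1 / K exceeds tau."""
    return 1 if knn_discriminant(test, train, labels, K) > tau else 0
```

               </p>
               <p>Raising the decision threshold in knn_predict makes the classifier more conservative about assigning a vector to Class 1, since a larger share of its neighbors must then belong to that class.</p>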
               <p>A variety of different distance measures can be used with KNN to measure the nearness of one vector to another. In this paper, we use 1 - <it>&#961;</it>, where <it>&#961; </it>is the Pearson correlation coefficient of the two vectors. That is, if the two vectors are <b>x </b>and <b>y</b>, then</p>
               <p>
                  <display-formula>
                     <m:math name="1471-2105-8-358-i20" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>&#961;</m:mi>
                              <m:mo>=</m:mo>
                              <m:mfrac>
                                 <m:mrow>
                                    <m:msup>
                                       <m:mrow>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:mi>x</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mover accent="true">
                                             <m:mi>x</m:mi>
                                             <m:mo>&#175;</m:mo>
                                          </m:mover>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                       <m:mi>T</m:mi>
                                    </m:msup>
                                    <m:mo stretchy="false">(</m:mo>
                                    <m:mi>y</m:mi>
                                    <m:mo>&#8722;</m:mo>
                                    <m:mover accent="true">
                                       <m:mi>y</m:mi>
                                       <m:mo>&#175;</m:mo>
                                    </m:mover>
                                    <m:mo stretchy="false">)</m:mo>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:msqrt>
                                       <m:mrow>
                                          <m:msup>
                                             <m:mrow>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>x</m:mi>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mover accent="true">
                                                   <m:mi>x</m:mi>
                                                   <m:mo>&#175;</m:mo>
                                                </m:mover>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                             <m:mi>T</m:mi>
                                          </m:msup>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:mi>x</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mover accent="true">
                                             <m:mi>x</m:mi>
                                             <m:mo>&#175;</m:mo>
                                          </m:mover>
                                          <m:mo stretchy="false">)</m:mo>
                                          <m:msup>
                                             <m:mrow>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>y</m:mi>
                                                <m:mo>&#8722;</m:mo>
                                                <m:mover accent="true">
                                                   <m:mi>y</m:mi>
                                                   <m:mo>&#175;</m:mo>
                                                </m:mover>
                                                <m:mo stretchy="false">)</m:mo>
                                             </m:mrow>
                                             <m:mi>T</m:mi>
                                          </m:msup>
                                          <m:mo stretchy="false">(</m:mo>
                                          <m:mi>y</m:mi>
                                          <m:mo>&#8722;</m:mo>
                                          <m:mover accent="true">
                                             <m:mi>y</m:mi>
                                             <m:mo>&#175;</m:mo>
                                          </m:mover>
                                          <m:mo stretchy="false">)</m:mo>
                                       </m:mrow>
                                    </m:msqrt>
                                 </m:mrow>
                              </m:mfrac>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacqWFbpGCcqGH9aqpdaWcaaqaaiabcIcaOGqabiab+Hha4jabgkHiTiqb+Hha4zaaraGaeiykaKYaaWbaaSqabeaacqWGubavaaGccqGGOaakcqGF5bqEcqGHsislcuGF5bqEgaqeaiabcMcaPaqaamaakaaabaGaeiikaGIae4hEaGNaeyOeI0Iaf4hEaGNbaebacqGGPaqkdaahaaWcbeqaaiabdsfaubaakiabcIcaOiab+Hha4jabgkHiTiqb+Hha4zaaraGaeiykaKIaeiikaGIae4xEaKNaeyOeI0Iaf4xEaKNbaebacqGGPaqkdaahaaWcbeqaaiabdsfaubaakiabcIcaOiab+Lha5jabgkHiTiqb+Lha5zaaraGaeiykaKcaleqaaaaaaaa@55AC@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>In terms of gene expression measurements, two genes are highly correlated if their expression levels tend to rise and fall together (even though their absolute expression levels may be quite different). For this reason, Pearson correlation is often used to detect coregulation among genes <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
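               <p>As a minimal illustration of the KNN rule with the 1 - <it>&#961; </it>distance, the following Python sketch computes the discriminant value <it>K</it><sub>1</sub><it>/K </it>for a single test vector. The function names, the toy vectors, the treatment of constant vectors and the threshold of 0.5 are assumptions made for the example; they do not describe the software used in this study.</p>
               <p><![CDATA[
```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if den == 0.0:
        return 0.0  # assumption: a constant vector is treated as uncorrelated
    return num / den


def knn_discriminant(test_vec, train_vecs, train_labels, k):
    """Return K1/K, the fraction of the k nearest neighbours that are in Class 1."""
    dists = [(1.0 - pearson(test_vec, v), label)
             for v, label in zip(train_vecs, train_labels)]
    dists.sort(key=lambda pair: pair[0])  # smallest distance first
    k1 = sum(1 for _, label in dists[:k] if label == 1)
    return k1 / k


# Toy expression profiles: the first two rise together, the last two fall.
train = [[1, 2, 3, 4], [10, 20, 30, 41], [4, 3, 2, 1], [8, 6, 5, 1]]
labels = [1, 1, 0, 0]
dv = knn_discriminant([2, 4, 6, 9], train, labels, k=3)
predicted_class_1 = dv > 0.5  # tau = 0.5 is an arbitrary illustrative threshold
```
]]></p>
               <p>In this toy example the second training vector is about ten times larger than the test vector in absolute terms, yet it is one of the nearest neighbors, because the distance depends only on whether the profiles rise and fall together.</p>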
            </sec>
         </sec>
         <sec>
            <st>
               <p>Principal components analysis</p>
            </st>
            <p>Hidden dependencies and noise among experiments may confound the classification problem. In particular, experiments that are biologically different may actually be similar in terms of gene expression. Principal components analysis (PCA) helps to identify independent information in the data by transforming it to a data set of reduced dimension. The attributes of the reduced data set, called principal components, explain most of the variance in the original data and are mutually uncorrelated and orthogonal <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. In addition, by reducing the dimension of the data, PCA reduces the number of parameters that must be estimated during supervised learning, thus permitting more efficient use of the data.</p>
            <p>One can think of PCA as having a learning phase and a prediction phase. During learning, PCA is given a data set, from which it generates (learns) a linear transformation. This transformation maps high-dimensional vectors to low-dimensional vectors, and is applied to the given data set to reduce its dimensionality. During prediction, the transformation is applied to other data.</p>
            <p>We used PCA to reduce the dimensionality of the gene expression data from its original 290 dimensions to <it>p </it>dimensions, for <it>p </it>= 5, 10, 15, 20, 40, 100. During learning, we gave PCA our entire data set of 22,746 genes, <it>i.e</it>., the 11,553 annotated genes and the 11,193 unannotated genes. This is possible because PCA is a form of unsupervised learning, so it uses only the gene expression measurements (which are known), and not the gene annotations (which are to be learned). This increases the effectiveness of PCA by roughly doubling the amount of data that it uses during learning. That is, using a larger data set decreases the variance of the principal components learned by PCA, thus increasing their statistical significance and reducing the number of anomalous components.</p>
            <p>It is worth noting that this use of PCA is different from that of many traditional applications of machine learning. This is because we apply PCA to the entire data set during learning, including the prediction data (<it>i.e</it>., the unannotated data). This is not possible in traditional applications simply because the prediction data is not known during learning. In such applications, a learning procedure is first trained and tested on one data set, and then applied to prediction data as it becomes available. This is not the situation for genome-wide expression experiments, since all the genes (and their expression levels) are known in advance, including the genes in the prediction set. PCA can therefore use both the prediction data and the training data during learning. This is a form of transductive inference <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B31">31</abbr></abbrgrp>, in which the prediction data is known and exploited during learning.</p>
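            <p>The following Python sketch illustrates this transductive use of PCA on a toy data set: the transformation is learned from the annotated and unannotated rows together and is then applied to each portion separately. The functions <it>fit_pca </it>and <it>transform</it>, the use of power iteration with deflation, and the toy data are assumptions made for the example; they are not the implementation used in this study, and a production analysis would normally rely on a numerical library.</p>
            <p><![CDATA[
```python
import math
import random


def fit_pca(rows, p, iters=200, seed=0):
    """Learn the mean and the top-p principal directions of the rows.

    Uses power iteration with deflation on the sample covariance matrix.
    This is only adequate for small examples with well separated eigenvalues.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(p):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        start = math.sqrt(sum(x * x for x in v))
        v = [x / start for x in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left to explain
            v = [x / norm for x in w]
        lam = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return mean, components


def transform(rows, mean, components):
    """Project rows onto the learned principal directions."""
    d = len(mean)
    return [[sum((r[j] - mean[j]) * c[j] for j in range(d)) for c in components]
            for r in rows]


annotated = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
unannotated = [[4.0, 8.0], [5.0, 10.0]]

# Learning phase: PCA sees every gene, annotated or not (labels are never used).
mean, comps = fit_pca(annotated + unannotated, p=1)

# Prediction phase: the same transformation is applied to both portions.
annotated_reduced = transform(annotated, mean, comps)
unannotated_reduced = transform(unannotated, mean, comps)
```
]]></p>
            <p>Only the expression values enter <it>fit_pca</it>, which is why the unannotated rows can be included without leaking any annotation into the learned transformation.</p>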
         </sec>
         <sec>
            <st>
               <p>PCA and classifier evaluation</p>
            </st>
            <p>After PCA is performed on the entire data set, supervised learning is performed on the annotated portion of the dimensionally-reduced data. (As described earlier, this is a form of semi-supervised learning <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr></abbrgrp>.) The result is a set of classifiers, one for each supervised learning method. The classifiers are then applied to the unannotated portion of the dimensionally-reduced data to predict the missing annotations. Cross validation is used to estimate the accuracy of these predictions.</p>
            <p>Before discussing our use of cross validation, we consider the simpler setting in which the annotated data is divided into two parts, training data and validation data <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. This will clarify our handling of PCA during validation. In this setting, classifier evaluation proceeds as follows. First, PCA is applied to the entire data set (training, validation and prediction data) to produce a dimensionally-reduced data set. Then, a supervised learner uses the (dimensionally-reduced) training data to produce a classifier. Finally, the accuracy of the classifier is estimated using the (dimensionally-reduced) validation data. Note that this process treats the validation and prediction data equally. That is, they are <it>both </it>used during unsupervised learning, and <it>neither </it>is used during supervised learning. In this way, the validation data is representative of the prediction data, as it should be. Also note that PCA is now effectively a preprocessing phase prior to supervised learning.</p>
            <p>Because PCA is applied to the entire data set, this validation process estimates the accuracy of the classifier on prediction data that is known and used during learning. In particular, the estimate does <it>not </it>apply to new prediction data that might arrive in the future (<it>e.g</it>., if new genes were discovered). In fact, it would likely be an overestimate of classifier accuracy on such data. However, this is not an issue in our application, since most if not all of the genes in <it>Arabidopsis </it>are already known. Moreover, even if some new genes were to be discovered in the future, we could simply add them to our prediction data and retrain the classifier on the enlarged data set.</p>
            <p>The above ideas are easily extended to cross validation. First, PCA is applied to the entire data set. Then, a supervised learner uses the annotated portion of the dimensionally-reduced data to produce a classifier. Finally, this classifier is evaluated by cross validation in the normal way, as described below. Note that this approach has the added computational advantage that PCA is applied only once, to the entire data set, rather than repeatedly during the many training phases of cross validation. The discussions below assume that the entire data set has been preprocessed using PCA, so that all references to data refer to the dimensionally-reduced data. Also, all references to generalization performance refer to the accuracy of the classifier on the given set of prediction data.</p>
         </sec>
         <sec>
            <st>
               <p>Cross validation</p>
            </st>
            <p>We used 20-fold cross-validation to assess the generalization performance of each classifier as well as to estimate the precision of its predictions. We randomly divided the annotated data into 20 non-overlapping, equal-sized parts, called folds. The classifier was trained on 19 of these folds, and tested on the remaining fold; <it>i.e</it>., the trained classifier was used to generate a discriminant value for each gene in the remaining fold. This was done in all 20 possible ways, using a different testing fold each time. In this way, a discriminant value, <it>dv</it>, was generated for every gene in the training set. Each gene in the training set was then predicted to be positive (<it>i.e</it>., to respond to stress) if and only if <it>dv </it>> <it>&#964;</it>, where <it>&#964; </it>is a decision threshold. From these predictions, true and false positives were computed, from which a point on the ROC<sub>50 </sub>curve was plotted. Using a large number of different decision thresholds, we plotted a large number of points on the ROC<sub>50 </sub>curve, effectively generating the entire curve. The area under this curve is the ROC<sub>50 </sub>score. To get an idea of how stable the estimated performance of the classifier is, we repeated the entire cross-validation and curve-generation procedure 10 times, each time using a different, random, 20-fold split of the training data.</p>
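            <p>The procedure can be sketched in Python as follows. The sketch assigns every gene a discriminant value from a classifier that was trained without that gene, and then counts true and false positives at a chosen threshold; sweeping the threshold yields the points of the curve. The names <it>out_of_fold_values</it>, <it>confusion_counts </it>and <it>train_midpoint</it>, the one-dimensional toy learner and the toy data are hypothetical and are used only for illustration; the computation of the ROC<sub>50 </sub>score itself is omitted.</p>
            <p><![CDATA[
```python
import random


def out_of_fold_values(genes, labels, train_fn, n_folds=20, seed=0):
    """Give every gene a discriminant value from a model that never saw it."""
    idx = list(range(len(genes)))
    random.Random(seed).shuffle(idx)
    # Slicing with a stride yields non-overlapping folds of near-equal size.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    dv = [None] * len(genes)
    for test_fold in folds:
        held_out = set(test_fold)
        train_idx = [i for i in idx if i not in held_out]
        score = train_fn([genes[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_fold:
            dv[i] = score(genes[i])
    return dv


def confusion_counts(dv, labels, tau):
    """Count true and false positives for the rule 'positive iff dv > tau'."""
    tp = sum(1 for d, y in zip(dv, labels) if d > tau and y == 1)
    fp = sum(1 for d, y in zip(dv, labels) if d > tau and y == 0)
    return tp, fp


def train_midpoint(xs, ys):
    """Hypothetical 1-D learner: signed distance from the class midpoint."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x - mid


# Toy data: 20 negative and 20 positive "genes" with one feature each.
genes = [float(i) for i in range(20)] + [float(i) for i in range(100, 120)]
labels = [0] * 20 + [1] * 20
dv = out_of_fold_values(genes, labels, train_midpoint)
tp, fp = confusion_counts(dv, labels, tau=0.0)
```
]]></p>
            <p>Repeating the call with different values of <it>seed </it>corresponds to repeating the whole procedure on different random splits.</p>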
            <p>The above procedure was applied to all the basic classifiers, but assessing the combined classifier involved an additional subtlety. Recall that the combined classifier is a linear combination of the basic classifiers, where the weight given to a basic classifier is proportional to its estimated ROC<sub>50 </sub>score. The subtlety is in computing that score. A naive approach would be to simply use the above procedure to compute a ROC<sub>50 </sub>score for each basic classifier. However, this would mean that during cross validation, 19 of the 20 folds are used to train the basic classifiers, while the 20<sup><it>th </it></sup>fold is used to compute the ROC<sub>50 </sub>scores. The result is that <it>all 20 folds </it>are involved in computing the weights. Thus, all 20 folds are involved in constructing (<it>i.e</it>., training) the combined classifier, so no folds are left for testing it. If cross validation were used anyway to assess the combined classifier, it would amount to using training data as testing data, and the results would tend to overestimate the classifier's performance.</p>
            <p>As described earlier, we surmount this problem by using two sets of validation data. Loosely speaking, 18 of the 20 folds are used to train the basic classifiers, a 19<sup><it>th </it></sup>fold is used to compute their ROC<sub>50 </sub>scores, and the 20<sup><it>th </it></sup>fold is used to test the combined classifier. This results in what might be called <it>nested </it>cross validation. To start, the training data are divided randomly into 20 folds. Picking one of these as a testing fold, the other 19 are used to train the combined classifier. This in turn involves 19-fold cross validation to train and test the basic classifiers (and compute their ROC<sub>50 </sub>scores). Thus, each time the combined classifier is trained once, the basic classifiers are trained 19 times. Since the combined classifier is trained 20 times, each basic classifier is trained a total of 20 &#215; 19 = 380 times. A similar form of nested cross validation is involved in Stacking <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
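            <p>The bookkeeping of this nested scheme can be checked with a short Python sketch. It enumerates, for each outer testing fold, the inner validation fold and the 18 folds that remain for training the basic classifiers, and it shows one way of forming a score-weighted combination. The function names are hypothetical, and normalizing the weights to sum to one is an assumption of the example; the text above states only that the weights are proportional to the ROC<sub>50 </sub>scores.</p>
            <p><![CDATA[
```python
def nested_cv_plan(n_folds=20):
    """List (outer test fold, inner validation fold, inner training folds)."""
    plan = []
    for outer in range(n_folds):
        # The outer fold is set aside for testing the combined classifier.
        inner_folds = [f for f in range(n_folds) if f != outer]
        for inner in inner_folds:
            # The inner fold scores the basic classifiers; the rest train them.
            train = [f for f in inner_folds if f != inner]
            plan.append((outer, inner, train))
    return plan


def combine(discriminants, roc50_scores):
    """Linear combination with weights proportional to the ROC50 scores."""
    total = sum(roc50_scores)
    weights = [s / total for s in roc50_scores]  # assumption: weights sum to 1
    return sum(w * d for w, d in zip(weights, discriminants))


plan = nested_cv_plan(20)
trainings_per_basic_classifier = len(plan)  # 20 outer folds x 19 inner folds
```
]]></p>
            <p>In every entry of <it>plan</it>, the outer testing fold is absent from both the inner validation fold and the inner training folds, so the weights of the combined classifier are never computed from the data on which it is tested.</p>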
         </sec>
         <sec>
            <st>
               <p>Predicting gene function and estimating precision</p>
            </st>
            <p>To predict which genes respond to stress, we first train a combined classifier using the 11,553 annotated genes in the training data. The classifier is then applied to the 11,193 unannotated genes in the prediction data. After this step, each unannotated gene has a discriminant value, <it>dv</it>. The unannotated genes are then sorted in descending order by discriminant value, as illustrated in Table <tblr tid="T1">1</tblr>. To make actual predictions, a gene in the sorted list is chosen as a decision point. This gene and every gene above it in the sorted list are then predicted to respond to stress. In other words, suppose <it>dv </it>is the discriminant value of the chosen gene. An unannotated gene is then predicted to respond to stress if and only if its discriminant value is at least <it>dv</it>. The fraction of these predictions that are true is the <it>precision </it>of the predictions. We estimate this precision using the training data. Recall that each gene in the training set has a discriminant value assigned to it during cross validation. We also know which of these genes respond to stress. To estimate the precision of our predictions, we look at those genes in the training set whose discriminant value is at least <it>dv</it>. The fraction of them that respond to stress is an estimate of precision.</p>
            <p>Using this idea we actually get ten precision estimates, not one. This is because we do cross validation ten times, using ten different random splits of the data. The result is that each gene in the training set receives ten discriminant values, and for each one we get a different precision estimate. We could simply use the average of these ten precision estimates; however, to reduce the variance of the estimate, we use a weighted average. Specifically, let us number the cross validation runs <it>i </it>= 1, &#8230;, 10. Then, given a discriminant value, <it>dv</it>, let <it>PP</it><sub><it>i </it></sub>be the number of genes in the training set whose discriminant value is at least <it>dv </it>in the <it>i</it><sup><it>th </it></sup>run of cross validation. (These are the predicted positives.) Let <it>TP</it><sub><it>i </it></sub>be the number of these genes that respond to stress (the true positives). Using only this cross validation run, the estimated precision would be <it>TP</it><sub><it>i</it></sub><it>/PP</it><sub><it>i</it></sub>. One problem with this estimate is that if <it>dv </it>is high, then <it>PP</it><sub><it>i </it></sub>(and hence <it>TP</it><sub><it>i</it></sub>) could be 0, so the precision estimate would be undefined, something we observed frequently in practice. More generally, if <it>PP</it><sub><it>i </it></sub>(and hence <it>TP</it><sub><it>i</it></sub>) is low, then the precision estimate will have high variance, since it is supported by very little data. To circumvent these problems, we estimate the precision using the formula</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-8-358-i21" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mi>r</m:mi>
                           <m:mi>e</m:mi>
                           <m:mi>c</m:mi>
                           <m:mi>i</m:mi>
                           <m:mi>s</m:mi>
                           <m:mi>i</m:mi>
                           <m:mi>o</m:mi>
                           <m:mi>n</m:mi>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msub>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mrow>
                                             <m:mtext>TP</m:mtext>
                                          </m:mrow>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                              <m:mrow>
                                 <m:mstyle displaystyle="true">
                                    <m:msub>
                                       <m:mo>&#8721;</m:mo>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mrow>
                                             <m:mtext>PP</m:mtext>
                                          </m:mrow>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mstyle>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mi>i</m:mi>
                              </m:munder>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>w</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>&#215;</m:mo>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mrow>
                                             <m:mtext>TP</m:mtext>
                                          </m:mrow>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mrow>
                                             <m:mtext>PP</m:mtext>
                                          </m:mrow>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCcqWGYbGCcqWGLbqzcqWGJbWycqWGPbqAcqWGZbWCcqWGPbqAcqWGVbWBcqWGUbGBcqGH9aqpdaWcaaqaamaaqababaGaeeivaqLaeeiuaa1aaSbaaSqaaiabdMgaPbqabaaabaGaemyAaKgabeqdcqGHris5aaGcbaWaaabeaeaacqqGqbaucqqGqbaudaWgaaWcbaGaemyAaKgabeaaaeaacqWGPbqAaeqaniabggHiLdaaaOGaeyypa0ZaaabuaeaacqWG3bWDdaWgaaWcbaGaemyAaKgabeaakiabgEna0oaalaaabaGaeeivaqLaeeiuaa1aaSbaaSqaaiabdMgaPbqabaaakeaacqqGqbaucqqGqbaudaWgaaWcbaGaemyAaKgabeaaaaaabaGaemyAaKgabeqdcqGHris5aaaa@59BB@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <inline-formula><m:math name="1471-2105-8-358-i22" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mi>w</m:mi><m:mi>i</m:mi></m:msub><m:mo>=</m:mo><m:msub><m:mrow><m:mtext>PP</m:mtext></m:mrow><m:mi>i</m:mi></m:msub><m:mo>/</m:mo><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8721;</m:mo><m:mi>j</m:mi><m:mi>N</m:mi></m:msubsup><m:mrow><m:msub><m:mrow><m:mtext>PP</m:mtext></m:mrow><m:mi>j</m:mi></m:msub></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWG3bWDdaWgaaWcbaGaemyAaKgabeaakiabg2da9iabbcfaqjabbcfaqnaaBaaaleaacqWGPbqAaeqaaOGaei4la8YaaabmaeaacqqGqbaucqqGqbaudaWgaaWcbaGaemOAaOgabeaaaeaacqWGQbGAaeaacqWGobGta0GaeyyeIuoaaaa@3DCF@</m:annotation></m:semantics></m:math></inline-formula> and <it>N </it>= 10 is the number of cross validation runs. The right-hand formula is a weighted average of individual precision estimates, <it>TP</it><sub><it>i</it></sub><it>/PP</it><sub><it>i</it></sub>. It gives more weight to precision estimates that are based on more data, <it>i.e</it>., for which <it>PP</it><sub><it>i </it></sub>is higher. In addition, by using the left-hand formula, we rarely end up dividing by zero, since the denominator is a sum of (random) non-negative numbers; <it>i.e</it>., &#931;<sub><it>i</it></sub><it>PP</it><sub><it>i </it></sub>is much less likely to be zero than is any individual <it>PP</it><sub><it>i</it></sub>.</p>
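            <p>A small numerical example makes the equivalence of the two forms concrete. The Python sketch below computes both the pooled (left-hand) and the weighted-average (right-hand) estimates for three hypothetical cross validation runs, one of which has no predicted positives. The function names and the counts are invented for illustration, and returning <it>None </it>when every run has zero predicted positives is a choice made for this example.</p>
            <p><![CDATA[
```python
def pooled_precision(tp, pp):
    """Left-hand form: total true positives over total predicted positives."""
    total_pp = sum(pp)
    if total_pp == 0:
        return None  # undefined only if every run has zero predicted positives
    return sum(tp) / total_pp


def weighted_precision(tp, pp):
    """Right-hand form: per-run precisions weighted by w_i = PP_i / sum_j PP_j."""
    total_pp = sum(pp)
    if total_pp == 0:
        return None
    # Runs with PP_i = 0 have weight zero, so they are skipped rather than
    # evaluated as the undefined ratio 0/0.
    return sum((p / total_pp) * (t / p) for t, p in zip(tp, pp) if p > 0)


tp = [3, 0, 5]   # hypothetical true positives per run
pp = [4, 0, 10]  # hypothetical predicted positives per run
pooled = pooled_precision(tp, pp)
weighted = weighted_precision(tp, pp)
```
]]></p>
            <p>Here the unweighted mean of the per-run precisions is undefined because the second run has no predicted positives, whereas both forms above give 8/14.</p>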
         </sec>
         <sec>
            <st>
               <p>Biological experiments</p>
            </st>
            <p>Wild-type and homozygous mutant seeds were plated on 0.5X MS media. They were stratified for 3 days and then germinated at 25 &#176;C for 7 days. The abiotic temperature stresses consisted of 7 days of exposure to 30 &#176;C, 14 &#176;C or 4 &#176;C. Anthocyanin levels were quantified as a measure of plant stress response. Anthocyanin was extracted using methanol-HCl <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. To measure the response to salt stress, plants were germinated for 3 days on 0.5X MS media and then transferred either to medium containing 50 mM NaCl or to control plates. New root growth was measured 7 days after the transfer.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>HL and AJB did the machine learning, with HL doing the actual programming. HL developed the idea of using ROC<sub>50 </sub>scores to combine classifiers. RC performed the gene knockout experiments under the supervision of NJP. HL and AJB wrote the bioinformatics sections of the manuscript, with HL providing the first draft. RC wrote the biological sections. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>AJB is supported by a grant from NSERC. HL is funded by the Department of Computer Science at the University of Toronto. NJP is supported by grants from NSERC. RC is funded in part by a University of Toronto fellowship. The Botany Beowulf Cluster was funded by a Genome Canada grant administered through the Ontario Genomics Institute.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Knowledge-based analysis of microarray gene expression data by using support vector machines</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Grundy</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Cristianini</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Sugnet</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Furey</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ares</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences of the United States of America</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <issue>1</issue>
            <fpage>262</fpage>
            <lpage>267</lpage>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Cluster analysis and display of genome-wide expression patterns</p>
            </title>
            <aug>
               <au>
                  <snm>Eisen</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences of the United States of America</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <issue>25</issue>
            <fpage>14863</fpage>
            <lpage>14868</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>A literature network of human genes for high-throughput analysis of gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Jenssen</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>L&#230;greid</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Komorowski</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hovig</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>2001</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>21</fpage>
            <lpage>28</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11326270</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Assessment of prediction accuracy of protein function from protein-protein interaction data</p>
            </title>
            <aug>
               <au>
                  <snm>Hishigaki</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Nakai</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ono</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Tanigami</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Takagi</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Yeast</source>
            <pubdate>2001</pubdate>
            <volume>18</volume>
            <issue>6</issue>
            <fpage>523</fpage>
            <lpage>531</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11284008</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Synexpression groups in eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Niehrs</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Pollet</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1999</pubdate>
            <volume>402</volume>
            <issue>6761</issue>
            <fpage>483</fpage>
            <lpage>487</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10591207</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Genes, themes and microarrays: Using information retrieval for large-scale gene analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Shatkay</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Wilbur</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Boguski</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the International Conference on Intelligent Systems for Molecular Biology</source>
            <pubdate>2000</pubdate>
            <volume>8</volume>
            <fpage>317</fpage>
            <lpage>328</lpage>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Functional discovery via a compendium of expression profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Hughes</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Marton</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Stoughton</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Armour</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bennett</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Coffey</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Dai</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kidd</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>King</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Slade</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lum</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Stepaniants</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Shoemaker</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Gachotte</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Chakraburtty</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bard</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Friend</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2000</pubdate>
            <volume>102</volume>
            <issue>1</issue>
            <fpage>109</fpage>
            <lpage>126</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10929718</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Predicting gene function from gene expressions and ontologies</p>
            </title>
            <aug>
               <au>
                  <snm>Hvidsten</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Komorowski</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sandvik</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>L&#230;greid</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Pacific Symposium on Biocomputing</source>
            <pubdate>2001</pubdate>
            <fpage>299</fpage>
            <lpage>310</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11262949</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Predicting gene ontology biological process from temporal gene expression patterns</p>
            </title>
            <aug>
               <au>
                  <snm>L&#230;greid</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hvidsten</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Midelfart</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Komorowski</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sandvik</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <issue>5</issue>
            <fpage>965</fpage>
            <lpage>979</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430886</pubid>
                  <pubid idtype="pmpid" link="fulltext">12695321</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Gene classification using expression profiles: A feasibility study</p>
            </title>
            <aug>
               <au>
                  <snm>Kuramochi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Karypis</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>2nd IEEE International Symposium on Bioinformatics and Bioengineering</source>
            <pubdate>2001</pubdate>
            <fpage>191</fpage>
            <lpage>200</lpage>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Gene functional classification by semi-supervised learning from heterogeneous data</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Ogihara</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Proceedings of the 2003 ACM Symposium on Applied Computing</source>
            <pubdate>2003</pubdate>
            <fpage>78</fpage>
            <lpage>82</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Exploration of essential gene functions via titratable promoter alleles</p>
            </title>
            <aug>
               <au>
                  <snm>Mnaimneh</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Davierwala</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Haynes</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Moffat</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Peng</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Pootoolal</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Chua</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Trochesset</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Morse</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Krogan</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hiley</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Morris</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Grigull</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mitsakakis</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Greenblatt</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Boone</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kaiser</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Andrews</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hughes</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2004</pubdate>
            <volume>118</volume>
            <issue>1</issue>
            <fpage>31</fpage>
            <lpage>44</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15242642</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Gene functional classification from heterogeneous data</p>
            </title>
            <aug>
               <au>
                  <snm>Pavlidis</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Weston</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Cai</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Grundy</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Proceedings of the 5th International Conference on Computational Molecular Biology</source>
            <pubdate>2001</pubdate>
            <fpage>242</fpage>
            <lpage>248</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons</p>
            </title>
            <aug>
               <au>
                  <snm>Mateos</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Dopazo</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Jansen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Tu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Stolovitzky</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <issue>11</issue>
            <fpage>1703</fpage>
            <lpage>1715</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">187551</pubid>
                  <pubid idtype="pmpid" link="fulltext">12421757</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Clustering Labeled Data and Cross-Validation for Classification with Few Positives in Yeast</p>
            </title>
            <aug>
               <au>
                  <snm>Trochesset</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bonner</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Proceedings of the 4th ACM SIGKDD Workshop on Data Mining in Bioinformatics (BioKDD)</source>
            <pubdate>2004</pubdate>
         </bibl>
         <bibl id="B16">
            <title>
               <p>The functional landscape of mouse gene expression</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Morris</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Chang</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Shai</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Bakowski</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mitsakakis</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Mohammad</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Robinson</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Zirngibl</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Somogyi</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Laurin</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Eftekharpour</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Sat</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Grigull</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Pan</snm>
                  <fnm>Q</fnm>
               </au>
               <au>
                  <snm>Peng</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Krogan</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Greenblatt</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Fehlings</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>van der Kooy</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Aubin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bruneau</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Rossant</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Blencowe</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Frey</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hughes</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Journal of Biology</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <issue>5</issue>
            <fpage>21</fpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A green chapter in the book of life</p>
            </title>
            <aug>
               <au>
                  <snm>Walbot</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>408</volume>
            <fpage>794</fpage>
            <lpage>795</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11130710</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Bioinformatic resources, challenges, and opportunities using Arabidopsis as a model organism in a post-genomic era</p>
            </title>
            <aug>
               <au>
                  <snm>Rhee</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Plant Physiology</source>
            <pubdate>2000</pubdate>
            <volume>124</volume>
            <issue>4</issue>
            <fpage>1460</fpage>
            <lpage>1464</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1539296</pubid>
                  <pubid idtype="pmpid" link="fulltext">11115859</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Cellular function prediction and biological pathway discovery in Arabidopsis thaliana using microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Joshi</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Alexandrov</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proceedings of the 26th Annual International Conference of the IEEE EMBS</source>
            <publisher>San Francisco, CA</publisher>
            <pubdate>2004</pubdate>
            <fpage>2881</fpage>
            <lpage>2884</lpage>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Functional Bioinformatics for Arabidopsis thaliana</p>
            </title>
            <aug>
               <au>
                  <snm>Clare</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Karwath</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ougham</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>King</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>9</issue>
            <fpage>1130</fpage>
            <lpage>1136</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">16481336</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <aug>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Friedman</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>The Elements of Statistical Learning: Data Mining, Inference and Prediction</source>
            <publisher>Springer-Verlag, New York</publisher>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B22">
            <title>
               <p>The Botany Array Resource: e-Northerns, Expression Angling, and promoter analyses</p>
            </title>
            <aug>
               <au>
                  <snm>Toufighi</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Brady</snm>
                  <fnm>SM</fnm>
               </au>
               <au>
                  <snm>Austin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Ly</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Provart</snm>
                  <fnm>NJ</fnm>
               </au>
            </aug>
            <source>The Plant Journal</source>
            <pubdate>2005</pubdate>
            <volume>43</volume>
            <fpage>153</fpage>
            <lpage>163</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15960624</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses</p>
            </title>
            <aug>
               <au>
                  <snm>Kilian</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Whitehead</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Horak</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wanke</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Weinl</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Batistic</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>D'Angelo</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bornberg-Bauer</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Kudla</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Harter</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>The Plant Journal</source>
            <pubdate>2007</pubdate>
            <volume>50</volume>
            <issue>2</issue>
            <fpage>347</fpage>
            <lpage>363</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">17376166</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>NASCArrays: a repository for microarray data generated by NASC's transcriptomics service</p>
            </title>
            <aug>
               <au>
                  <snm>Craigon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>James</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Okyere</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Higgins</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Jotham</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>May</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <issue>32 Database</issue>
            <fpage>575</fpage>
            <lpage>577</lpage>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Nottingham Arabidopsis Stock Centre (NASC)</p>
            </title>
            <url>http://arabidopsis.info</url>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Gene Ontology: Tool for the unification of biology</p>
            </title>
            <aug>
               <au>
                  <snm>Consortium</snm>
                  <fnm>TGO</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <issue>1</issue>
            <fpage>25</fpage>
            <lpage>29</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10802651</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>The Arabidopsis Information Resource (TAIR)</p>
            </title>
            <url>http://www.arabidopsis.org</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Functional annotation of the Arabidopsis genome using controlled vocabularies</p>
            </title>
            <aug>
               <au>
                  <snm>Berardini</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Mundodi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Reiser</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Huala</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Garcia-Hernandez</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Mueller</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Yoon</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Doyle</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Moseyko</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Yoo</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Xu</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Zoeckler</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Montoya</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Weems</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Rhee</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Plant Physiology</source>
            <pubdate>2004</pubdate>
            <volume>135</volume>
            <issue>2</issue>
            <fpage>1</fpage>
            <lpage>11</lpage>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Semi-supervised Learning on Riemannian Manifolds</p>
            </title>
            <aug>
               <au>
                  <snm>Belkin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Niyogi</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Machine Learning</source>
            <pubdate>2004</pubdate>
            <volume>56</volume>
            <fpage>209</fpage>
            <lpage>239</lpage>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Splitting the Unsupervised and Supervised Components of Semi-Supervised Learning</p>
            </title>
            <aug>
               <au>
                  <snm>Oliveira</snm>
                  <fnm>CS</fnm>
               </au>
               <au>
                  <snm>Cozman</snm>
                  <fnm>FG</fnm>
               </au>
            </aug>
            <source>Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, Bonn, Germany</source>
            <pubdate>2005</pubdate>
            <fpage>67</fpage>
            <lpage>74</lpage>
         </bibl>
         <bibl id="B31">
            <aug>
               <au>
                  <snm>Vapnik</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Statistical Learning Theory</source>
            <publisher>Wiley-Interscience</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B32">
            <title>
               <p>ROC Graphs: Notes and practical considerations for researchers</p>
            </title>
            <aug>
               <au>
                  <snm>Fawcett</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Tech Rep HPL-2003-4, HP Laboratories, Palo Alto, CA</source>
            <pubdate>2003</pubdate>
         </bibl>
         <bibl id="B33">
            <title>
               <p>On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes</p>
            </title>
            <aug>
               <au>
                  <snm>Ng</snm>
                  <fnm>AY</fnm>
               </au>
               <au>
                  <snm>Jordan</snm>
                  <fnm>MI</fnm>
               </au>
            </aug>
            <source>Advances in Neural Information Processing Systems 14</source>
            <publisher>Cambridge, MA: MIT Press</publisher>
            <editor>Dietterich TG, Becker S, Ghahramani Z</editor>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Use of Receiver Operating Characteristic (ROC) analysis to evaluate sequence matching</p>
            </title>
            <aug>
               <au>
                  <snm>Gribskov</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Robinson</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Computers and Chemistry</source>
            <pubdate>1996</pubdate>
            <fpage>25</fpage>
            <lpage>33</lpage>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Controlling the false discovery rate: a practical and powerful approach to multiple testing</p>
            </title>
            <aug>
               <au>
                  <snm>Benjamini</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Hochberg</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Journal of the Royal Statistical Society: Series B</source>
            <pubdate>1995</pubdate>
            <volume>57</volume>
            <fpage>289</fpage>
            <lpage>300</lpage>
         </bibl>
         <bibl id="B36">
            <title>
               <p>A gene expression map of Arabidopsis thaliana development</p>
            </title>
            <aug>
               <au>
                  <snm>Schmid</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Davison</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Henz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Pape</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Demar</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Vingron</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Schölkopf</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Weigel</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lohmann</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>2005</pubdate>
            <volume>37</volume>
            <fpage>501</fpage>
            <lpage>506</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">15806101</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Genome-wide insertional mutagenesis of Arabidopsis thaliana</p>
            </title>
            <aug>
               <au>
                  <snm>Alonso</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Stepanova</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Leisse</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Shinn</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Stevenson</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Zimmerman</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Barajas</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Cheuk</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gadrinab</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Heller</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Jeske</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Koesema</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Meyers</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Parker</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Prednis</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ansari</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Choy</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Deen</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Geralt</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hazari</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hom</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Karnes</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mulholland</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ndubaku</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Schmidt</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Guzman</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Aguilar-Henonin</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Schmid</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Weigel</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Carter</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Marchand</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Risseeuw</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Brogden</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Zeko</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Crosby</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Berry</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ecker</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2003</pubdate>
            <volume>301</volume>
            <fpage>653</fpage>
            <lpage>657</lpage>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Athena: a resource for rapid visualization and systematic analysis of Arabidopsis promoter sequences</p>
            </title>
            <aug>
               <au>
                  <snm>O'Connor</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Dyreson</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wyrick</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>4411</fpage>
            <lpage>4413</lpage>
         </bibl>
         <bibl id="B39">
            <title>
               <p>An 'electronic fluorescent pictograph' browser for exploring Arabidopsis Microarray Data</p>
            </title>
            <aug>
               <au>
                  <snm>Winter</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Vinegar</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wilson</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Provart</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>in prep</source>
            <pubdate>2006</pubdate>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Arabidopsis transcriptome profiling indicates that multiple regulatory pathways are activated during cold acclimation in addition to the CBF cold response pathway</p>
            </title>
            <aug>
               <au>
                  <snm>Fowler</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Thomashow</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Plant Cell</source>
            <pubdate>2002</pubdate>
            <volume>14</volume>
            <fpage>1675</fpage>
            <lpage>1690</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">151458</pubid>
                  <pubid idtype="pmpid" link="fulltext">12172015</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>DREB takes the stress out of growing up</p>
            </title>
            <aug>
               <au>
                  <snm>Smirnoff</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nature Biotechnology</source>
            <pubdate>1999</pubdate>
            <volume>17</volume>
            <fpage>229</fpage>
            <lpage>230</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10096286</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Comparative genomics in salt tolerance between Arabidopsis and Arabidopsis-related halophyte salt cress using Arabidopsis microarray</p>
            </title>
            <aug>
               <au>
                  <snm>Taji</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Seki</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Satou</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Sakurai</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Kobayashi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ishiyama</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Narusaka</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Narusaka</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Zhu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Shinozaki</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Plant Physiology</source>
            <pubdate>2004</pubdate>
            <volume>135</volume>
            <fpage>1697</fpage>
            <lpage>1709</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">519083</pubid>
                  <pubid idtype="pmpid" link="fulltext">15247402</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Sucrose-specific induction of the anthocyanin biosynthetic pathway in Arabidopsis</p>
            </title>
            <aug>
               <au>
                  <snm>Solfanelli</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Poggi</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Loreti</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Alpi</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Perata</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Plant Physiology</source>
            <pubdate>2006</pubdate>
            <volume>140</volume>
            <fpage>637</fpage>
            <lpage>646</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1361330</pubid>
                  <pubid idtype="pmpid" link="fulltext">16384906</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
