<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-86</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Lin</snm>
               <mi>PY</mi>
               <fnm>Frank</fnm>
               <insr iid="I1"/>
               <email>frank.lin@student.unsw.edu.au</email>
            </au>
            <au id="A2">
               <snm>Coiera</snm>
               <fnm>Enrico</fnm>
               <insr iid="I1"/>
               <email>e.coiera@unsw.edu.au</email>
            </au>
            <au id="A3">
               <snm>Lan</snm>
               <fnm>Ruiting</fnm>
               <insr iid="I2"/>
               <email>r.lan@unsw.edu.au</email>
            </au>
            <au id="A4">
               <snm>Sintchenko</snm>
               <fnm>Vitali</fnm>
               <insr iid="I1"/>
               <insr iid="I3"/>
               <email>Vitali.Sintchenko@swahs.health.nsw.gov.au</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Centre for Health Informatics, University of New South Wales, Sydney, Australia</p>
            </ins>
            <ins id="I2">
               <p>School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia</p>
            </ins>
            <ins id="I3">
               <p>Centre for Infectious Diseases and Microbiology, Western Clinical School, University of Sydney, Sydney, Australia</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>1</issue>
         <fpage>86</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/86</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19292914</pubid>
               <pubid idtype="doi">10.1186/1471-2105-10-86</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>27</day>
               <month>10</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>17</day>
               <month>3</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>17</day>
               <month>3</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Lin et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p><it>In silico </it>candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency against phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the <it>Escherichia coli </it>K-12 (EC-K12) genome and 0.978 in the <it>Streptococcus agalactiae </it>2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Identifying gene functions is an important task in biology. The exponential growth of genome sequences has placed greater importance on the use of computational approaches for sequence analysis and annotation. With the development of high-throughput technology, methods of comparative genomics are increasingly used to assist with the identification of gene functions <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, as conventional methods of gene screening using transgenic organisms are resource intensive and time consuming. In practice, bench-side researchers frequently encounter extensive lists of genes that require further pruning and experimental validation. Accurate prioritisation of candidate genes, therefore, constitutes a key step in accelerating the discovery of gene functions.</p>
         <p><it>In silico </it>candidate gene prioritisation (CGP) ranks genes based upon the features associated with genes and the function of interest. A variety of <it>gene features </it>have been suggested for the prioritisation of causal genes in human diseases, including the co-occurrence of gene name and disease terminology in biomedical texts <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>, sharing of terms in annotation or gene ontology databases <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B4">4</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, gene expression in different tissues <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B4">4</abbr><abbr bid="B6">6</abbr></abbrgrp>, protein-protein interactions <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, similarity of gene or protein sequences <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, presence of genes within a phenotype or disease database <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, phylogenetic relationships <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, or a combination of the above <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B4">4</abbr></abbrgrp>. However, to construct a CGP system for prokaryotes, different forms of gene features are needed, as current CGP algorithms are skewed towards eukaryotic genomes and the systematic curation of annotation and genotype-phenotype databases is less complete for prokaryotes than for eukaryotes. Hundreds of whole genome sequences of bacteria and thousands of partial genome sequences are available in public databases, yet prokaryotic genomes display a higher proportion of genes with unknown function than eukaryotes <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.
In contrast, several methods for computational protein function discovery have been studied, including the chromosomal proximity method, domain fusion analysis, analysis of gene expression patterns, and phylogenetic profiles <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. In particular, the phylogenetic profile method exploits knowledge of gene occurrences across a range of sequenced genomes and postulates that genes involved in the same metabolic pathway are frequently co-inherited. Phylogenetic profiles have been applied to unsupervised clustering of proteins to discover their functional linkages <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> and to discover conserved gene clusters in microbes (with probabilistic phylogenetic tree models) <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Supervised approaches based on phylogenetic profiles have also been applied to infer protein networks (with canonical correlation analysis <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>), to predict protein functional class in <it>Saccharomyces cerevisiae </it>(with tree-based kernels <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>), to discover protein localisation in eukaryotes <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, and to annotate gene functions (by correlation enrichments <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>). These studies suggested that the concept of phylogenetic profiles provides a valuable tool for predicting gene-function linkage. It was thus hypothesised that this concept could also be exploited as <it>gene features </it>for prioritising genes contributing to a particular phenotypic trait of interest, thus providing a practical and generalisable tool to guide microbiologists in gene selection.</p>
         <p>This paper examines the practical application of the phylogenetic profile method for gene prioritisation to investigate its generalisability and applicability on both simple and complex traits in prokaryotes.</p>
         <p>Phylogenetic profiles form an indirect connection between gene and function in two conceptual steps. The first step establishes the gene-genome relationship, by examining the occurrence (presence or absence) of a candidate gene (or its homolog) in a given genome. The second step groups genomes according to their known phenotypes. We investigate two scenarios in which CGP can be useful in assisting with functional discovery of uncharacterised genes in prokaryotes. The method of <it>statistical </it>CGP is used when the occurrence profile can be directly inferred from the study phenotype, whereas <it>inductive </it>CGP is used when the profile is obscure but a small number of genes known to contribute to the study phenotype are available. Candidate genes are then prioritised by either statistical scoring functions or supervised machine learning algorithms.</p>
         <p>In addition, at present there are no clear benchmarks to allow comparison between these different approaches to gene prioritisation, and the extent to which such algorithms are capable of identifying target genes in bacteria remains unexplored. This paper takes advantage of selected metabolic processes with a well-understood genetic basis to craft gold standard prioritisation tasks. The two CGP approaches are evaluated by rediscovering genes participating in well-characterised biochemical pathways &#8211; the metabolism of peptidoglycan, fermentation in anaerobes, and selected metabolic pathways curated in Kyoto Encyclopaedia of Genes and Genomes (KEGG) <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. We ultimately aim to develop metrics that will provide an indication of the likelihood that highly prioritised genes are strong biological candidates, and the degree to which all potential candidates have been identified for tasks such as the selection of biomarkers, the discovery of virulence genes, and the formulation of new hypotheses about uncharacterised genes.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Determination of genomic occurrences of candidate genes</p>
            </st>
            <p>To evaluate the performance of CGP methods, three case studies were selected for rediscovery experiments using well-known pathway genes as gold standards. For each case study, the polypeptide sequences of <it>n </it>candidate genes were compared with all open reading frames (<it>orf</it>) of the <it>k </it>genome sequences from the National Center for Biotechnology Information (NCBI, accessed April 2007) using the protein Basic Local Alignment Search Tool (BLASTP) <url>http://blast.ncbi.nlm.nih.gov/Blast.cgi</url>. If a candidate gene reached the critical E-value of &lt; 10<sup>-5 </sup>in a given genome, the gene or its homolog was defined as present in that genome. If a gene did not reach the critical E-value in a genome, the gene was recorded as absent from that genome. The binary states of gene occurrence were recorded in an <it>n </it>&#215; <it>k </it>homolog matrix.</p>
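            <p>The thresholding step above can be illustrated with a minimal Python sketch. It assumes that the best (lowest) BLASTP E-value of each candidate gene against each genome has already been extracted; the function name homolog_matrix and the toy E-values are hypothetical and are not taken from the study.</p>
            <p><![CDATA[
```python
E_VALUE_CUTOFF = 1e-5  # critical E-value stated in the text


def homolog_matrix(best_evalues, cutoff=E_VALUE_CUTOFF):
    """Return the binary n x k occurrence matrix as a list of rows.

    best_evalues[i][j] is assumed to hold the lowest BLASTP E-value of
    candidate gene i against the ORFs of genome j, or float("inf") when
    BLASTP reported no hit at all.
    """
    return [[1 if e < cutoff else 0 for e in row] for row in best_evalues]


# Toy input: 3 candidate genes x 4 genomes (values invented for illustration).
toy_evalues = [
    [1e-30, 1e-12, float("inf"), 2e-6],
    [0.5, 1e-4, 1e-50, float("inf")],
    [1e-5, 1e-8, 1e-9, 1e-7],
]
occurrence = homolog_matrix(toy_evalues)
for row in occurrence:
    print(row)
```
]]></p>
            <p>Because the criterion is a strict inequality, a hit with an E-value exactly equal to the cutoff is recorded as absent in this sketch.</p>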
         </sec>
         <sec>
            <st>
               <p>Statistical CGP</p>
            </st>
            <p>From the <it>k </it>genomes, <it>k</it><sub><it>p </it></sub>genomes known to display the phenotype of interest (<it>p</it>) were selected as positive genome examples, and <it>k</it><sub><it>n </it></sub>genomes not displaying <it>p </it>were chosen as negative genome examples. For each of the <it>n </it>candidate genes, the numbers of co-presences (homologs present in positive genome examples) and co-absences (homologs absent in negative genome examples) were counted and entered into a 2 &#215; 2 contingency table, from which a number of statistical scoring functions were calculated. The scoring functions included: a) sensitivity (<it>sens</it>, the proportion of positive genome examples in which the gene was present), b) specificity (<it>spec</it>, the proportion of negative genome examples in which the gene was absent), c) positive and negative predictive values (<it>ppv/npv</it>, the proportion of genomes that were positive/negative among those in which the gene was present/absent), d) arithmetic (<it>amss</it>) and harmonic (<it>hmss</it>) means of sensitivity and specificity, e) odds ratio (<it>OR</it>, the odds of a gene being present in the positive examples versus the odds of the gene being absent in the negative examples), f) chi-square scoring function (<it>chisq</it>, the deviation of the observed frequency from the expected proportion), g) directional chi-square function (<it>bchisq</it>, the <it>chisq </it>function with genes that displayed inverse associations moved to the bottom of the rank), and h) F-measure (<it>F</it>, the harmonic mean of sensitivity and precision). The mathematical definitions of these scoring functions are listed in Additional file <supplr sid="S1">1</supplr>.</p>
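            <p>As an illustration only, the following Python sketch computes several of these scores for a single gene from its occurrence profile. It uses the textbook definitions of sensitivity, specificity, predictive values and the F-measure, which are assumed to match those in Additional file 1; the odds ratio and chi-square functions are omitted because their exact forms are given only in that file. The function names and the rule of returning 0.0 for an empty denominator are assumptions of this sketch.</p>
            <p><![CDATA[
```python
def contingency(occurrence, is_positive):
    """Count the 2 x 2 table for one gene.

    occurrence[j] is 1 if the gene (or a homolog) is present in genome j;
    is_positive[j] is True if genome j displays the phenotype of interest.
    """
    pairs = list(zip(occurrence, is_positive))
    a = sum(1 for o, p in pairs if o and p)          # present, positive genome
    b = sum(1 for o, p in pairs if o and not p)      # present, negative genome
    c = sum(1 for o, p in pairs if not o and p)      # absent, positive genome
    d = sum(1 for o, p in pairs if not o and not p)  # absent, negative genome
    return a, b, c, d


def ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (a sketch-level choice)."""
    return numerator / denominator if denominator else 0.0


def scores(occurrence, is_positive):
    a, b, c, d = contingency(occurrence, is_positive)
    sens = ratio(a, a + c)
    spec = ratio(d, b + d)
    ppv = ratio(a, a + b)
    npv = ratio(d, c + d)
    return {
        "sens": sens,
        "spec": spec,
        "ppv": ppv,
        "npv": npv,
        "amss": (sens + spec) / 2,
        "hmss": ratio(2 * sens * spec, sens + spec),
        "F": ratio(2 * sens * ppv, sens + ppv),
    }


# Toy profile: 4 positive and 2 negative genome examples (invented data).
profile = [1, 1, 1, 0, 0, 1]
labels = [True, True, True, True, False, False]
result = scores(profile, labels)
print(result)
```
]]></p>
            <p>Candidate genes would then be ranked by sorting on the chosen score, highest first.</p>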
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <caption>
                  <p>The statistical CGP scoring functions evaluated in Case studies 1 and 2. </p>
               </caption>
               <text>
                  <p>
                     <b>This file lists the mathematical definitions of the statistical scoring functions evaluated in Case studies 1 and 2.</b>
                  </p>
               </text>
               <file name="1471-2105-10-86-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Inductive CGP</p>
            </st>
            <p>Inductive CGP ranks genes by finding genes with similar occurrence patterns across a number of bacterial genomes using supervised machine learning. A number of genes known to display a target phenotype or function <it>p </it>were selected as positive examples for the training set. Similarly, genes that did not contribute to <it>p </it>were selected as negative gene examples. The occurrences of genes in <it>k </it>genome examples were used as features for model training. Candidate genes were ranked by the score or posterior probability from the output of the machine learning classifiers. The machine learning classifiers included na&#239;ve Bayes (<it>NB</it>), logistic regression (<it>LR</it>; ridge = 10<sup>-5</sup>), <it>J48 </it>decision tree (<it>J48</it>, pruning confidence = 0.25), nearest neighbour classifier (<it>IBk</it>, with inverse distance weighting; <it>k </it>was determined by leave-one-out cross-validation), alternating decision tree (<it>ADTree</it>; boosting iteration = 10), support vector machines (<it>SVM</it>) with polynomial (<it>SVM/Poly</it>; linear kernel trained by sequential minimal optimisation algorithm, SMO) and radial basis function (<it>SVM/RBF</it>; trained by SMO; <it>&#947; </it>= 0.01) kernels. The Waikato Environment for Knowledge Analysis (WEKA) 3.5.6 was used for classifier training <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. For the purpose of benchmarking, the generalisation performance of inductive CGP was evaluated by stratified 10-fold cross-validation: of the <it>n </it>genes used as candidate genes for prioritisation, the <it>n</it><sub>+ </sub>genes in the validation set and the remaining <it>n</it><sub>- </sub>genes not in the validation set were each randomly divided into 10 subsets.
One-tenth of the genes from each group (<inline-formula><m:math name="1471-2105-10-86-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mfrac><m:mn>1</m:mn><m:mrow><m:mn>10</m:mn></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
MathType@MTEF@1@1@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqcfa4aaSaaaeaacqaIXaqmaeaacqaIXaqmcqaIWaamaaaaaa@2F41@</m:annotation></m:semantics></m:math></inline-formula> of <it>n</it><sub>+ </sub>and <it>n</it><sub>- </sub>genes) were sequentially selected as test set, whereas the rest of the genes were selected as training set to train inductive models. The performance of each inductive CGP algorithm was obtained by averaging areas under receiver operating characteristic curves (AUC) over the 10 runs.</p>
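            <p>The stratified split described above can be sketched in Python as follows. This is not the WEKA procedure used in the study; it only shows how positive and negative genes might each be divided into 10 subsets so that every fold keeps roughly the same class proportions. The function name, the fixed random seed and the toy gene identifiers are assumptions, and classifier training and AUC averaging are left out.</p>
            <p><![CDATA[
```python
import random


def stratified_folds(positives, negatives, n_folds=10, seed=0):
    """Return a list of (training genes, test genes) pairs, one per fold.

    Positive and negative genes are shuffled separately and dealt out
    round-robin, so each test set holds about one-tenth of each group.
    """
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = []
    for i in range(n_folds):
        test = pos[i::n_folds] + neg[i::n_folds]
        held_out = set(test)
        train = [gene for gene in pos + neg if gene not in held_out]
        folds.append((train, test))
    return folds


# Toy example: 20 positive and 80 negative gene identifiers (invented).
positive_genes = ["pos%d" % i for i in range(20)]
negative_genes = ["neg%d" % i for i in range(80)]
folds = stratified_folds(positive_genes, negative_genes)
for train, test in folds:
    print(len(train), len(test))
```
]]></p>
            <p>In a full evaluation, a classifier would be trained on each training set, the genes in the matching test set would be scored, and the 10 resulting AUC values would be averaged.</p>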
         </sec>
         <sec>
            <st>
               <p>Evaluation of CGP performance</p>
            </st>
            <p>The performance of different CGP methods was evaluated by rediscovery experiments. The relative position of the ranked candidate gene was measured by percentiles from the top of the rank (<it>pct</it>). The AUCs were estimated non-parametrically by the trapezoidal rule. We adopted probability enrichment (the relative enrichment ratio) described by Turner et al. <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> to compare the performance of different statistical CGP scoring functions [see Additional file <supplr sid="S1">1</supplr>]. The average and maximum probability enrichments, defined as the <it>n</it>-fold improvement in precision above a certain score threshold <it>&#964;</it>, were calculated from the partial precision (<it>pppv</it>), such that:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-10-86-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable columnalign="left">
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mtext>Partial&#160;prec</m:mtext>
                                       <m:mo>.</m:mo>
                                       <m:mtext>&#160;at&#160;</m:mtext>
                                       <m:mi>&#964;</m:mi>
                                    </m:mrow>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mo>=</m:mo>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mi>p</m:mi>
                                       <m:mi>p</m:mi>
                                       <m:mi>p</m:mi>
                                       <m:mi>v</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>&#964;</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow/>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mo>=</m:mo>
                                 </m:mtd>
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mtext>Num</m:mtext>
                                             <m:mo>.</m:mo>
                                             <m:mtext>&#160;correct&#160;genes</m:mtext>
                                             <m:mo>></m:mo>
                                             <m:mi>&#964;</m:mi>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mtext>Num</m:mtext>
                                             <m:mo>.</m:mo>
                                             <m:mtext>&#160;genes</m:mtext>
                                             <m:mo>></m:mo>
                                             <m:mi>&#964;</m:mi>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
MathType@MTEF@1@1@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeaabiWaaaqaaiabbcfaqjabbggaHjabbkhaYjabbsha0jabbMgaPjabbggaHjabbYgaSjabbccaGiabbchaWjabbkhaYjabbwgaLjabbogaJjabc6caUiabbccaGiabbggaHjabbsha0jabbccaGiabes8a0bqaaiabg2da9aqaaiabdchaWjabdchaWjabdchaWjabdAha2jabcIcaOiabes8a0jabcMcaPaqaaaqaaiabg2da9aqcfayaamaalaaabaGaeeOta4KaeeyDauNaeeyBa0MaeiOla4IaeeiiaaIaee4yamMaee4Ba8MaeeOCaiNaeeOCaiNaeeyzauMaee4yamMaeeiDaqNaeeiiaaIaee4zaCMaeeyzauMaeeOBa4MaeeyzauMaee4CamNaeyOpa4JaeqiXdqhabaGaeeOta4KaeeyDauNaeeyBa0MaeiOla4IaeeiiaaIaee4zaCMaeeyzauMaeeOBa4MaeeyzauMaee4CamNaeyOpa4JaeqiXdqhaaaaaaaa@773B@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>and the average (<inline-formula><m:math name="1471-2105-10-86-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>&#951;</m:mi><m:mo>&#175;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
MathType@MTEF@1@1@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4TdGMbaebaaaa@2D99@</m:annotation></m:semantics></m:math></inline-formula>) and the maximum (<it>&#951;</it><sub><it>max</it></sub>) probability enrichments were defined as:</p>
            <p>
               <display-formula>
                  <m:math name="1471-2105-10-86-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mi>&#951;</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>t</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mo>=</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mi>p</m:mi>
                                             <m:mi>p</m:mi>
                                             <m:mi>p</m:mi>
                                             <m:msub>
                                                <m:mi>v</m:mi>
                                                <m:mi>n</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mi>t</m:mi>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mi>p</m:mi>
                                             <m:mi>p</m:mi>
                                             <m:mi>v</m:mi>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:mover accent="true">
                                          <m:mi>&#951;</m:mi>
                                          <m:mo>&#175;</m:mo>
                                       </m:mover>
                                       <m:mo>=</m:mo>
                                       <m:mstyle displaystyle="true">
                                          <m:mrow>
                                             <m:msubsup>
                                                <m:mo>&#8747;</m:mo>
                                                <m:mn>0</m:mn>
                                                <m:mn>1</m:mn>
                                             </m:msubsup>
                                             <m:mrow>
                                                <m:mi>&#951;</m:mi>
                                                <m:mo stretchy="false">(</m:mo>
                                                <m:mi>t</m:mi>
                                                <m:mo stretchy="false">)</m:mo>
                                                <m:mi>d</m:mi>
                                                <m:mi>t</m:mi>
                                             </m:mrow>
                                          </m:mrow>
                                       </m:mstyle>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr>
                                 <m:mtd>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>&#951;</m:mi>
                                          <m:mrow>
                                             <m:mi>m</m:mi>
                                             <m:mi>a</m:mi>
                                             <m:mi>x</m:mi>
                                          </m:mrow>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:mi>&#951;</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msup>
                                          <m:mi>t</m:mi>
                                          <m:mo>&#8727;</m:mo>
                                       </m:msup>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                           </m:mtable>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
MathType@MTEF@1@1@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabmqaaaqaaiabeE7aOjabcIcaOiabdsha0jabcMcaPiabg2da9KqbaoaalaaabaGaemiCaaNaemiCaaNaemiCaaNaemODay3aaSbaaeaacqWGUbGBaeqaaiabcIcaOiabdsha0jabcMcaPaqaaiabdchaWjabdchaWjabdAha2baaaOqaaiqbeE7aOzaaraGaeyypa0Zaa8qmaeaacqaH3oaAcqGGOaakcqWG0baDcqGGPaqkcqWGKbazcqWG0baDaSqaaiabicdaWaqaaiabigdaXaqdcqGHRiI8aaGcbaGaeq4TdG2aaSbaaSqaaiabd2gaTjabdggaHjabdIha4bqabaGccqGH9aqpcqaH3oaAcqGGOaakcqWG0baDdaahaaWcbeqaaiabgEHiQaaakiabcMcaPaaaaaa@5CC1@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>t </it>was the rank fraction at threshold <it>&#964;</it>, <it>ppv </it>was the overall precision, <it>pppv</it><sub><it>n</it></sub>(<it>t</it>) was the partial precision at rank fraction <it>t</it>, and <it>&#951;</it>(<it>t</it>) was at its maximum at <it>t</it>*. Both AUC and <inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula> measure the overall performance of a CGP task. The rank fraction <it>t</it>* indicates the point above which correct genes are <it>&#951;</it><sub><it>max </it></sub>times more likely to be found than in a random gene list. Evaluation with <it>&#951;</it><sub><it>max </it></sub>is useful to identify cases where a small proportion of genes is ranked highly but the overall performance is poor.</p>
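            <p>A small worked example may help to make these quantities concrete. The Python sketch below evaluates the enrichment at every rank position of a ranked gene list, approximates the average enrichment by the mean over those positions (a simple Riemann sum standing in for the integral, which is an assumption of this sketch), and reports the maximum enrichment with its rank fraction. Ties in score are ignored, and the function names and toy ranking are hypothetical.</p>
            <p><![CDATA[
```python
def enrichment_curve(ranked_labels):
    """Return (rank fraction t, enrichment at t) for each rank position.

    ranked_labels lists the candidate genes from best to worst score,
    with 1 for a correct (gold-standard) gene and 0 otherwise.
    """
    n = len(ranked_labels)
    n_correct = sum(ranked_labels)
    if n_correct == 0:
        raise ValueError("at least one correct gene is required")
    overall_precision = n_correct / n  # ppv over the whole list
    curve = []
    hits = 0
    for i, label in enumerate(ranked_labels, start=1):
        hits += label
        partial_precision = hits / i  # precision within the top i genes
        curve.append((i / n, partial_precision / overall_precision))
    return curve


def summarise(curve):
    """Return (average enrichment, maximum enrichment, t at the maximum)."""
    mean_eta = sum(eta for _, eta in curve) / len(curve)
    t_star, eta_max = max(curve, key=lambda point: point[1])
    return mean_eta, eta_max, t_star


# Toy ranking of 8 candidate genes, 3 of which are correct (invented data).
ranking = [1, 1, 0, 1, 0, 0, 0, 0]
curve = enrichment_curve(ranking)
mean_eta, eta_max, t_star = summarise(curve)
print(mean_eta, eta_max, t_star)
```
]]></p>
            <p>In this toy ranking the overall precision is 3/8, so the top-ranked gene alone gives an enrichment of 8/3, and the enrichment falls to 1 once the whole list is included.</p>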
         </sec>
         <sec>
            <st>
               <p>Effect of number of genome examples on CGP performance</p>
            </st>
            <p>Two simulation experiments were performed to investigate the effect of the number of genome examples on statistical CGP performance (Case study 1, see below). For the first simulation, the <it>amss </it>scoring function was repeatedly applied to randomly selected subsets of the 417 positive and negative genome examples, using genes from set <it>M </it>(Case study 1). The numbers of positive genome examples (<it>N</it><sub><it>p</it></sub>) and negative genome examples (<it>N</it><sub><it>n</it></sub>) were gradually increased in each subset. For each combination of <it>N</it><sub><it>p </it></sub>and <it>N</it><sub><it>n</it></sub>, 25 runs were performed and the median AUC was obtained. A second simulation was performed to determine the variability of performance. Here the ratio of positive to negative genome examples was kept constant (400:17), and the median and range of AUC were obtained over 1000 runs for each <it>N</it><sub><it>p </it></sub>and <it>N</it><sub><it>n</it></sub>.</p>
            <p>A similar simulation was performed to determine the effect of the number of genome examples on inductive CGP performance. Twenty-five subsets of <it>N </it>genomes (from the 417 genomes) were randomly selected as features, with <it>N </it>increased from 1 to 417. For each <it>N</it>, stratified 10-fold cross-validation was performed with <it>SVM/Poly </it>using all genes from the SA-2603 genome as candidates. The median AUC over the 25 random subsets of <it>N </it>genomes was obtained.</p>
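            <p>The subsampling procedure for statistical CGP can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: each gene is represented by a 0/1 occurrence profile over the genomes, sensitivity is taken as the fraction of sampled positive genome examples containing the gene, specificity as the fraction of sampled negative genome examples lacking it, and AUC is computed by pairwise comparison of scores. All function and parameter names are hypothetical.</p>

```python
import random
from statistics import median


def amss(profile, pos_idx, neg_idx):
    """Arithmetic mean of sensitivity and specificity for one gene profile."""
    sens = sum(profile[i] for i in pos_idx) / len(pos_idx)
    spec = sum(1 - profile[i] for i in neg_idx) / len(neg_idx)
    return (sens + spec) / 2


def auc(scores, labels):
    """Chance that a validation-set gene outscores another gene (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def median_auc(profiles, labels, positives, negatives, n_p, n_n, runs=25, seed=0):
    """Median AUC over repeated random draws of n_p positive and n_n negative genomes.

    positives / negatives: lists of column indices of the genome examples.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(runs):
        pos_idx = rng.sample(positives, n_p)  # sampled without replacement
        neg_idx = rng.sample(negatives, n_n)
        scores = [amss(p, pos_idx, neg_idx) for p in profiles]
        aucs.append(auc(scores, labels))
    return median(aucs)
```

            <p>Calling median_auc over a grid of n_p and n_n values would give one median AUC per combination, which corresponds to the layout of the first simulation.</p>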
         </sec>
         <sec>
            <st>
               <p>The case studies</p>
            </st>
            <sec>
               <st>
                  <p>Case study 1: Identification of genes involved in bacterial cell wall synthesis</p>
               </st>
               <p>Well-characterised genes responsible for peptidoglycan biosynthesis and metabolism in bacteria were used for testing and were grouped into three nested validation sets [see Additional file <supplr sid="S2">2</supplr>]. The <it>C </it>(core) validation set consisted of genes responsible for the synthesis of <it>N</it>-acetylmuramate-pentapeptide from UDP-<it>N</it>-acetylglucosamine (<it>mur</it>A to <it>mur</it>G, and <it>mra</it>Y). The <it>B </it>(biosynthesis) validation set extended the <it>C </it>set with genes involved in precursor pathways including <it>N</it>-acetyl-D-glucosamine, <it>meso</it>-diaminopimelate and D-alanyl-D-alanine, as well as genes responsible for undecaprenyl phosphate biosynthesis and recycling. The <it>M </it>(metabolism) validation set further extended the <it>B </it>set by including genes responsible for the modification, recycling, and cross-linking of peptidoglycan, such as penicillin-binding proteins and <it>N</it>-acetylmuramoyl-L-alanine amidases [see Additional file <supplr sid="S3">3</supplr>]. Genome examples were selected from the NCBI bacterial genomes catalogue file <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and manually verified by one of the authors (RL). Genes in the validation sets were identified using KEGG <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> and EcoCyc <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. Genomes of one Gram-positive bacterium (<it>S. agalactiae </it>2603 V/R, SA-2603, 2124 genes, GenBank ID: <ext-link ext-link-type="gen" ext-link-id="AE009948">AE009948</ext-link>) and one Gram-negative bacterium (<it>E. coli </it>K-12, EC-K12, 4134 genes, GenBank ID: <ext-link ext-link-type="gen" ext-link-id="U00096">U00096</ext-link>) were selected for prioritisation.</p>
               <suppl id="S2">
                  <title>
                     <p>Additional file 2</p>
                  </title>
                  <caption>
                     <p>The validation gene sets in peptidoglycan metabolism (case study 1). </p>
                  </caption>
                  <text>
                     <p><b>The <it>C </it>validation set (shaded area) includes genes responsible for synthesis of the peptidoglycan backbone.</b> The <it>B </it>validation set includes various accessory pathways (UDP-NAG synthesis, D-Glu and D-Ala synthesis, <it>meso</it>-DAP synthesis, and und-PP synthesis and recycling). The <it>M </it>validation set further includes genes responsible for transpeptidation, transglycosylation, and other aspects of peptidoglycan metabolism. Abbreviations: UDP: uridine diphosphate; NAG: <it>N</it>-acetylglucosamine; NAG-1P: <it>N</it>-acetylglucosamine-1-phosphate; NAM: <it>N</it>-acetylmuramate; NAG-EP: <it>N</it>-acetylglucosamine-enolpyruvate; Ala: alanine; Glu: glutamate; (D-Ala)<sub>2</sub>: D-alanyl-D-alanine; m-DAP: <it>meso</it>-diaminopimelate; Und-PP: undecaprenyl diphosphate; Und-P: undecaprenyl phosphate; F6P: fructose-6-phosphate; D-Glc: D-glucosamine; D-Glc-6P: D-glucosamine-6-phosphate; D-Glc-1P: D-glucosamine-1-phosphate; L-Asp: L-aspartate; L-Asp-4P: L-aspartate-4-phosphate; ASA: aspartate semialdehyde; DHDP: L-2,3-dihydrodipicolinate; THDP: tetrahydrodipicolinate; NS-AKP: <it>N</it>-succinyl-2-amino-6-ketopimelate; NS-DAP: <it>N</it>-succinyl-L,L-2,6-diaminopimelate; L,L-DAP: L,L-diaminopimelate.</p>
                  </text>
                  <file name="1471-2105-10-86-S2.eps">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <suppl id="S3">
                  <title>
                     <p>Additional file 3</p>
                  </title>
                  <caption>
                     <p>Peptidoglycan-related genes. </p>
                  </caption>
                  <text>
                     <p>
                        <b>The genes and the validation sets of peptidoglycan-related genes used in Case study 1.</b>
                     </p>
                  </text>
                  <file name="1471-2105-10-86-S3.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <p>For statistical CGP, 400 genomes of bacteria known to produce peptidoglycan were selected as positive examples. Genomes of 17 bacterial species lacking a cell wall, including <it>Mycoplasma </it>spp., <it>Ureaplasma </it>spp., <it>Anaplasma </it>spp., and <it>Phytoplasma </it>spp., were selected as negative examples [see Additional file <supplr sid="S4">4</supplr>]. For inductive CGP, the occurrence of the candidate genes in the same 417 genomes was used as features for machine learning training. Candidate genes were labelled according to whether they belonged to the <it>C</it>, <it>B</it>, or <it>M </it>validation set. To assess the effectiveness of statistical CGP, the relative positions of the peptidoglycan genes were compared with those of genes from an unrelated metabolic pathway (glycolysis), which acted as the control validation set [see Additional file <supplr sid="S5">5</supplr>].</p>
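               <p>To make the statistical set-up concrete, the sketch below scores each candidate gene from a two-by-two table of gene occurrence against genome example class and then ranks the candidates. It is an illustration only: the mapping of occurrence counts to true and false positives, the zero returned for undefined ratios, and all names (contingency, gene_scores, prioritise) are assumptions rather than details taken from the study.</p>

```python
def contingency(profile, is_positive):
    """Count (tp, fn, fp, tn) for one gene across the genome examples.

    profile[i] is 1 if the gene occurs in genome i, else 0;
    is_positive[i] is True when genome i is a positive example.
    """
    tp = fn = fp = tn = 0
    for present, positive in zip(profile, is_positive):
        if positive and present:
            tp += 1  # gene found in a positive genome example
        elif positive:
            fn += 1  # gene missing from a positive genome example
        elif present:
            fp += 1  # gene found in a negative genome example
        else:
            tn += 1  # gene missing from a negative genome example
    return tp, fn, fp, tn


def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero (an assumed convention)."""
    return a / b if b else 0.0


def gene_scores(profile, is_positive):
    """Return several scoring functions for one gene profile."""
    tp, fn, fp, tn = contingency(profile, is_positive)
    sens = safe_div(tp, tp + fn)
    spec = safe_div(tn, tn + fp)
    return {
        "sens": sens,
        "spec": spec,
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
        "amss": (sens + spec) / 2,
        "hmss": safe_div(2 * sens * spec, sens + spec),
    }


def prioritise(profiles, is_positive, key="amss"):
    """Return gene names ordered from the highest to the lowest score."""
    return sorted(
        profiles,
        key=lambda gene: gene_scores(profiles[gene], is_positive)[key],
        reverse=True,
    )
```

               <p>In this toy formulation a gene present in every genome has perfect sensitivity but zero specificity, so it falls below a gene that is confined to the positive genome examples.</p>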
               <suppl id="S4">
                  <title>
                     <p>Additional file 4</p>
                  </title>
                  <caption>
                     <p>Positive and negative genome examples used in the statistical CGP of peptidoglycan-related genes. </p>
                  </caption>
                  <text>
                     <p>
                        <b>This file lists the 400 positive and 17 negative genome examples used in statistical CGP of peptidoglycan-related genes.</b>
                     </p>
                  </text>
                  <file name="1471-2105-10-86-S4.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
               <suppl id="S5">
                  <title>
                     <p>Additional file 5</p>
                  </title>
                  <caption>
                     <p>The prioritised genes of the control validation set (glycolysis) in Case study 1. </p>
                  </caption>
                  <text>
                     <p>
                        <b>This file lists the positions of glycolysis genes in the ranks produced by statistical CGP of peptidoglycan genes.</b>
                     </p>
                  </text>
                  <file name="1471-2105-10-86-S5.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
            </sec>
            <sec>
               <st>
                  <p>Case study 2: Anaerobic mixed acid fermentation genes</p>
               </st>
               <p>Enzymes responsible for anaerobic respiration and fermentation were identified from pathway databases <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B23">23</abbr></abbrgrp> and literature searches <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. Statistical and inductive CGP methods were used to derive the occurrence matrix and to rank candidate genes for anaerobic mixed-acid fermentation in EC-K12. All genes in EC-K12 were used as candidates for prioritisation. For statistical CGP, 200 bacterial genomes of known obligate and facultative anaerobes capable of performing anaerobic metabolism were selected as positive genome examples, and 142 genomes of obligate aerobes that do not perform anaerobic respiration were used as negative examples [see Additional file <supplr sid="S6">6</supplr>]. Methods for genome example selection were identical to those in Case study 1. For inductive CGP, the occurrence patterns of 4134 candidate genes in 342 genomes were obtained by the methods described above.</p>
               <suppl id="S6">
                  <title>
                     <p>Additional file 6</p>
                  </title>
                  <caption>
                     <p>The positive and negative genome examples used in statistical CGP of anaerobic mixed-acid fermentation genes. </p>
                  </caption>
                  <text>
                     <p>
                        <b>This file lists the 200 positive and 142 negative genome examples used in statistical CGP of anaerobic mixed-acid fermentation genes.</b>
                     </p>
                  </text>
                  <file name="1471-2105-10-86-S6.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
            </sec>
            <sec>
               <st>
                  <p>Case study 3: KEGG Pathways</p>
               </st>
               <p>To evaluate the generalisability of inductive CGP, a large-scale rediscovery experiment based on the curated KEGG metabolic pathways was performed <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Of the 81 known pathways available for the SA-2603 genome in KEGG, the 31 metabolic pathways containing at least 10 genes each were selected for evaluation. All seven inductive CGP algorithms were tested, and their generalisation performance was evaluated by stratified 10-fold cross-validation.</p>
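               <p>Stratified 10-fold cross-validation, as used here, can be outlined with the following sketch. It is a generic illustration rather than the procedure of any particular machine learning toolkit: the round-robin fold assignment, the pooling of held-out scores, and the fit_and_score callback (which stands in for training and applying any one of the classifiers) are assumptions, and all names are hypothetical.</p>

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign item indices to k folds so that class proportions stay similar."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of each class round-robin into the folds.
        for j, index in enumerate(members):
            folds[(offset + j) % k].append(index)
        offset += len(members)
    return folds


def cross_validate(labels, fit_and_score, k=10, seed=0):
    """Return one held-out score per candidate, pooled over the k folds.

    fit_and_score(train_idx, test_idx) must train on train_idx and return
    one score for each index in test_idx, in the same order.
    """
    held_out = [None] * len(labels)
    for test_idx in stratified_folds(labels, k, seed):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in test_set]
        for i, score in zip(test_idx, fit_and_score(train_idx, test_idx)):
            held_out[i] = score
    return held_out
```

               <p>The pooled held-out scores can then be compared with the pathway labels to obtain a single AUC per pathway and algorithm.</p>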
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Case study 1: Peptidoglycan-related genes</p>
            </st>
            <p>The best scoring functions for rediscovering metabolic genes (<it>M </it>set) were <it>amss</it>, <it>hmss</it>, and <it>npv </it>(AUC &#8805; 0.969), using the whole genome of SA-2603 (2124 genes) as candidate genes. Of the 25 known peptidoglycan-related genes, all except one were identified within the top 13% (median: top 1.0 <it>pct</it>) in SA-2603 (Table <tblr tid="T1">1</tblr> and Figure <figr fid="F1">1</figr>). The top-scored genes in the SA-2603 genome are listed in the supplementary material [see Additional file <supplr sid="S7">7</supplr>]. Encouraging results were also achieved in prioritising the EC-K12 genes in all three validation sets (Table <tblr tid="T2">2</tblr>); for example, for the <it>M </it>set genes in EC-K12 (51 known genes out of 4134 genes in the bacterial genome), an AUC of 0.911 was achieved by <it>amss</it>, and the median rank of the rediscovered genes was within the top 3.2 <it>pct</it>. In contrast, performance was poor when the control validation set (glycolysis) was matched against the same <it>amss</it>-prioritised ranks (SA-2603: 0.398; EC-K12: 0.341; Figure <figr fid="F1">1</figr>).</p>
            <suppl id="S7">
               <title>
                  <p>Additional file 7</p>
               </title>
               <caption>
                  <p>Prioritised rank (statistical CGP, <it>amss </it>scoring function) of SA-2603 genes. </p>
               </caption>
               <text>
                  <p>
                     <b>The rank positions, rank fractions (in <it>pct</it>), cluster of orthologous groups (COG), and the positions of candidate genes in the reference genome (SA-2603) ranked by <it>amss </it>scoring function.</b>
                  </p>
               </text>
               <file name="1471-2105-10-86-S7.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>CGP performance on peptidoglycan-related genes (<it>Streptococcus agalactiae </it>2603 V/R, 2124 genes). </p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>Validation sets</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Methods</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>C (9 genes)</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>B (18 genes)</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>M (25 genes)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="right">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="right">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="right">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7" ca="left">
                        <p>Statistical CGP (scoring functions)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>sens</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.858</p>
                     </c>
                     <c ca="right">
                        <p>(2.0/5.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.853</p>
                     </c>
                     <c ca="right">
                        <p>(1.9/4.3)</p>
                     </c>
                     <c ca="center">
                        <p>0.830</p>
                     </c>
                     <c ca="right">
                        <p>(1.8/3.8)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>spec</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.396</p>
                     </c>
                     <c ca="right">
                        <p>(0.5/1.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.427</p>
                     </c>
                     <c ca="right">
                        <p>(0.7/2.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.506</p>
                     </c>
                     <c ca="right">
                        <p>(1.1/5.2)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>ppv</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.420</p>
                     </c>
                     <c ca="right">
                        <p>(0.6/1.57)</p>
                     </c>
                     <c ca="center">
                        <p>0.504</p>
                     </c>
                     <c ca="right">
                        <p>(1.3/29.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.590</p>
                     </c>
                     <c ca="right">
                        <p>(2.1/85.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>npv</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.966</p>
                     </c>
                     <c ca="right">
                        <p>(3.6/30.1)</p>
                     </c>
                     <c ca="center">
                        <p>0.964</p>
                     </c>
                     <c ca="right">
                        <p>(3.5/21.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.978</p>
                     </c>
                     <c ca="right">
                        <p>(3.2/17.3)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>amss</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.985</p>
                     </c>
                     <c ca="right">
                        <p>(4.6/88.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.980</p>
                     </c>
                     <c ca="right">
                        <p>(4.4/59.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.970</p>
                     </c>
                     <c ca="right">
                        <p>(4.4/85.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>hmss</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.986</p>
                     </c>
                     <c ca="right">
                        <p>(4.8/88.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.980</p>
                     </c>
                     <c ca="right">
                        <p>(4.5/64)</p>
                     </c>
                     <c ca="center">
                        <p>0.969</p>
                     </c>
                     <c ca="right">
                        <p>(4.5/85.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>OR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.415</p>
                     </c>
                     <c ca="right">
                        <p>(0.5/1.57)</p>
                     </c>
                     <c ca="center">
                        <p>0.509</p>
                     </c>
                     <c ca="right">
                        <p>(1.3/29.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.592</p>
                     </c>
                     <c ca="right">
                        <p>(2.1/85.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>chisq</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.978</p>
                     </c>
                     <c ca="right">
                        <p>(4.2/59.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.975</p>
                     </c>
                     <c ca="right">
                        <p>(3.9/34.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.959</p>
                     </c>
                     <c ca="right">
                        <p>(3.7/28.3)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>bchisq</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.978</p>
                     </c>
                     <c ca="right">
                        <p>(4.2/59.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.975</p>
                     </c>
                     <c ca="right">
                        <p>(3.9/34.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.960</p>
                     </c>
                     <c ca="right">
                        <p>(3.7/28.3)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>F</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.932</p>
                     </c>
                     <c ca="right">
                        <p>(3.3/32.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.915</p>
                     </c>
                     <c ca="right">
                        <p>(3.1/23.1)</p>
                     </c>
                     <c ca="center">
                        <p>0.881</p>
                     </c>
                     <c ca="right">
                        <p>(2.8/18.5)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7" ca="left">
                        <p>Inductive CGP (machine learning algorithms)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>NB</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.901</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.879</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.843</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>LR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.980</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.905</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.887</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>ADTree</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.996</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.944</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.975</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>IBk</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.948</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.950</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.974</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>J48</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.885</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.832</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.752</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>SMO/Poly</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.999</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.948</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.879</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>SMO/RBF</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.998</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.991</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.909</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table lists the performance of statistical and inductive CGPs in prioritising peptidoglycan-related genes in <it>Streptococcus agalactiae </it>2603 V/R. Abbreviations: <it>sens</it>: sensitivity; <it>spec</it>: specificity; <it>ppv</it>: positive predictive value; <it>npv</it>: negative predictive value; <it>amss</it>: arithmetic mean of sensitivity and specificity; <it>hmss</it>: harmonic mean of sensitivity and specificity; <it>OR</it>: odds ratio; <it>chisq</it>: chi-square; <it>bchisq</it>: signed chi-square; <it>F</it>: F-measure; <it>NB</it>: na&#239;ve Bayes classifier; <it>LR</it>: logistic regression; <it>ADTree</it>: alternating decision tree; <it>IBk</it>: k-nearest neighbour classifier; <it>J48</it>: <it>J48 </it>decision tree; <it>SMO</it>: support vector machine trained by sequential minimal optimisation algorithm; <it>Poly</it>: polynomial kernel; <it>RBF</it>: radial basis function kernel.</p>
               </tblfn>
            </tbl>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>CGP performance on peptidoglycan-related genes (<it>Escherichia coli </it>K-12, 4134 genes). </p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>Validation sets</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Methods</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>C (8 genes)</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>B (28 genes)</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>M (51 genes)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="right">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="right">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="right">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7" ca="left">
                        <p>Statistical CGP (scoring functions)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>sens</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.913</p>
                     </c>
                     <c ca="right">
                        <p>(2.5/10.6)</p>
                     </c>
                     <c ca="center">
                        <p>0.891</p>
                     </c>
                     <c ca="right">
                        <p>(2.3/6.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.818</p>
                     </c>
                     <c ca="right">
                        <p>(1.9/4.2)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>spec</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.321</p>
                     </c>
                     <c ca="right">
                        <p>(0.4/1.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.310</p>
                     </c>
                     <c ca="right">
                        <p>(0.4/1.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.418</p>
                     </c>
                     <c ca="right">
                        <p>(0.8/2.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>ppv</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.405</p>
                     </c>
                     <c ca="right">
                        <p>(0.8/5.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.423</p>
                     </c>
                     <c ca="right">
                        <p>(1.2/18.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.553</p>
                     </c>
                     <c ca="right">
                        <p>(1.7/28.6)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>npv</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.974</p>
                     </c>
                     <c ca="right">
                        <p>(3.9/42.0)</p>
                     </c>
                     <c ca="center">
                        <p>0.956</p>
                     </c>
                     <c ca="right">
                        <p>(3.5/20.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.891</p>
                     </c>
                     <c ca="right">
                        <p>(2.8/13.2)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>amss</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.989</p>
                     </c>
                     <c ca="right">
                        <p>(4.8/110.)</p>
                     </c>
                     <c ca="center">
                        <p>0.966</p>
                     </c>
                     <c ca="right">
                        <p>(4.1/53.7)</p>
                     </c>
                     <c ca="center">
                        <p>0.911</p>
                     </c>
                     <c ca="right">
                        <p>(3.5/44.7)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>hmss</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.989</p>
                     </c>
                     <c ca="right">
                        <p>(4.9/113.)</p>
                     </c>
                     <c ca="center">
                        <p>0.969</p>
                     </c>
                     <c ca="right">
                        <p>(4.2/55.3)</p>
                     </c>
                     <c ca="center">
                        <p>0.909</p>
                     </c>
                     <c ca="right">
                        <p>(3.5/45.6)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>OR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.403</p>
                     </c>
                     <c ca="right">
                        <p>(0.8/5.2)</p>
                     </c>
                     <c ca="center">
                        <p>0.424</p>
                     </c>
                     <c ca="right">
                        <p>(1.2/18.4)</p>
                     </c>
                     <c ca="center">
                        <p>0.552</p>
                     </c>
                     <c ca="right">
                        <p>(1.7/28.6)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>chisq</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.984</p>
                     </c>
                     <c ca="right">
                        <p>(4.7/73.8)</p>
                     </c>
                     <c ca="center">
                        <p>0.963</p>
                     </c>
                     <c ca="right">
                        <p>(3.9/35.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.902</p>
                     </c>
                     <c ca="right">
                        <p>(3.2/27.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>bchisq</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.984</p>
                     </c>
                     <c ca="right">
                        <p>(4.7/73.8)</p>
                     </c>
                     <c ca="center">
                        <p>0.963</p>
                     </c>
                     <c ca="right">
                        <p>(3.9/35.9)</p>
                     </c>
                     <c ca="center">
                        <p>0.903</p>
                     </c>
                     <c ca="right">
                        <p>(3.2/27.0)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>F</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.965</p>
                     </c>
                     <c ca="right">
                        <p>(4.0/45.8)</p>
                     </c>
                     <c ca="center">
                        <p>0.921</p>
                     </c>
                     <c ca="right">
                        <p>(3.2/22.5)</p>
                     </c>
                     <c ca="center">
                        <p>0.838</p>
                     </c>
                     <c ca="right">
                        <p>(2.5/15.1)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7" ca="left">
                        <p>Inductive CGP (machine learning algorithms)</p>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>NB</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.930</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.889</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.820</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>LR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.882</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.935</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.828</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>ADTree</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.976</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.981</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.925</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>IBk</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.998</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.929</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.946</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>J48</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.935</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.828</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.752</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>SMO/Poly</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.997</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.876</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.933</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c indent="1" ca="left">
                        <p>
                           <it>SMO/RBF</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.963</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.932</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>0.964</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table lists the performance of statistical and inductive CGP in prioritising peptidoglycan-related genes in <it>Escherichia coli </it>K-12. Abbreviations: <it>sens</it>: sensitivity; <it>spec</it>: specificity; <it>ppv</it>: positive predictive value; <it>npv</it>: negative predictive value; <it>amss</it>: arithmetic mean of sensitivity and specificity; <it>hmss</it>: harmonic mean of sensitivity and specificity; <it>OR</it>: odds ratio; <it>chisq</it>: chi-square; <it>bchisq</it>: signed chi-square; <it>F</it>: F-measure; <it>NB</it>: na&#239;ve Bayes classifier; <it>LR</it>: logistic regression; <it>ADTree</it>: alternating decision tree; <it>IBk</it>: k-nearest neighbour classifier; <it>J48</it>: <it>J48 </it>decision tree; <it>SMO</it>: support vector machine trained by sequential minimal optimisation algorithm; <it>Poly</it>: polynomial kernel; <it>RBF</it>: radial basis function kernel.</p>
               </tblfn>
            </tbl>
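            <p>The sensitivity-based scoring functions abbreviated in the table footnotes can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each candidate gene is summarised by a 2-by-2 table counting the genomes in which the gene is present or absent within a positive and a negative genome group. The names Counts, ratio and scores are hypothetical and are not taken from the software used in this study; the odds ratio and chi-square scores are omitted for brevity.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """2-by-2 table for one candidate gene (assumed layout, not from the article)."""
    tp: int  # positive-group genomes in which the gene is present
    fn: int  # positive-group genomes in which the gene is absent
    fp: int  # negative-group genomes in which the gene is present
    tn: int  # negative-group genomes in which the gene is absent


def ratio(num: float, den: float) -> float:
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def scores(c: Counts) -> dict:
    """Compute the sensitivity-based scores named in the table footnotes."""
    sens = ratio(c.tp, c.tp + c.fn)
    spec = ratio(c.tn, c.tn + c.fp)
    ppv = ratio(c.tp, c.tp + c.fp)
    npv = ratio(c.tn, c.tn + c.fn)
    return {
        "sens": sens,
        "spec": spec,
        "ppv": ppv,
        "npv": npv,
        "amss": (sens + spec) / 2,                    # arithmetic mean
        "hmss": ratio(2 * sens * spec, sens + spec),  # harmonic mean
        "F": ratio(2 * ppv * sens, ppv + sens),       # F-measure
    }
```

            <p>Under these assumptions, a gene present in 8 of 10 positive-group genomes and absent from 9 of 10 negative-group genomes would be scored with scores(Counts(tp=8, fn=2, fp=1, tn=9)), and genes would then be ranked by the chosen score.</p>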
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>The performance of statistical CGP in rediscovering peptidoglycan-related genes (amss scoring function)</p>
               </caption>
               <text>
                  <p><b>The performance of statistical CGP in rediscovering peptidoglycan-related genes (amss scoring function)</b>. The box-and-whisker plot shows the result of statistical CGP (<it>amss </it>scoring function) in rediscovering peptidoglycan-related genes in the two study genomes (<it>Streptococcus agalactiae </it>2603 and <it>Escherichia coli </it>K-12). The horizontal bars indicate medians of the prioritised ranks and the boxes indicate the upper and lower quartiles. Groups <it>C</it>, <it>B</it>, and <it>M </it>indicate 3 sets of validation genes used. Genes from the glycolytic pathways were used as controls for comparison.</p>
               </text>
               <graphic file="1471-2105-10-86-1"/>
            </fig>
            <p>The performance of statistical CGP was also measured by the fold increase in precision (probability enrichment) relative to the non-prioritised rank. With the <it>chisq </it>scoring function, the ranked gene list achieved an average enrichment of 3.65-fold (maximum 28.3-fold) for SA-2603 and 3.16-fold (maximum 27-fold) for EC-K12. The probability enrichments of other validation sets are listed in Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr>. High AUC values were obtained from the stratified cross-validations of inductive CGP experiments. In particular, <it>SVM </it>achieved near-perfect AUCs in both SA-2603 <it>C </it>and <it>B </it>validation sets, whereas <it>ADTree </it>had the best AUC of 0.975 in <it>M </it>set genes (the trained <it>ADTree </it>model is shown in Additional file <supplr sid="S8">8</supplr>). For EC-K12, the best AUC in the <it>M </it>validation set was achieved by <it>SVM/RBF </it>(0.964), and the best AUCs in the rediscovery of <it>C </it>and <it>B </it>set genes were 0.998 and 0.981 (by <it>IBk </it>and <it>ADTree </it>respectively).</p>
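            <p>The fold increase in precision can be sketched in Python as follows. The sketch assumes that the enrichment at a rank cutoff k is the precision among the top k genes divided by the precision expected without prioritisation (the proportion of validation genes among all genes). Summarising by the mean and maximum over all cutoffs is one plausible reading of the average and maximum enrichment values, not a definition confirmed here. The function names and the toy ranked list are hypothetical.</p>

```python
def fold_enrichment(ranked_labels, k):
    """Precision in the top k ranks divided by the non-prioritised baseline.

    Assumes the list contains at least one validation gene (label 1).
    """
    baseline = sum(ranked_labels) / len(ranked_labels)
    top_precision = sum(ranked_labels[:k]) / k
    return top_precision / baseline


def enrichment_summary(ranked_labels):
    """Mean and maximum fold enrichment over every rank cutoff (assumed summary)."""
    folds = [fold_enrichment(ranked_labels, k)
             for k in range(1, len(ranked_labels) + 1)]
    return sum(folds) / len(folds), max(folds)


# Toy ranked list (1 = validation gene, 0 = other gene), best rank first.
toy_ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
mean_fold, max_fold = enrichment_summary(toy_ranking)
```

            <p>In this toy list the baseline precision is 0.3, so the top five ranks, which hold three validation genes, give a two-fold enrichment.</p>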
            <suppl id="S8">
               <title>
                  <p>Additional file 8</p>
               </title>
               <caption>
                  <p>The alternating decision tree model trained by using SA-2603 M-validation set. </p>
               </caption>
               <text>
                  <p><b>This figure shows the alternating decision tree (<it>ADTree</it>) model induced by M-validation set of SA-2603 genome.</b> This model predicts whether a gene is related to peptidoglycan metabolism by summing the scores of all preceding nodes from root (Start). A higher score would rank the candidate gene higher. The model shown in this figure achieved an AUC of 0.975 as estimated by using stratified 10-fold cross-validation. Abbreviations of genome names: Nit. europ.: <it>Nitrosomonas europaea </it>(GenBank accession: <ext-link ext-link-type="gen" ext-link-id="AL954747">AL954747</ext-link>); Wig. brevipalpis.: <it>Wigglesworthia brevipalpis </it>(<ext-link ext-link-type="gen" ext-link-id="AB063523">AB063523</ext-link>, <ext-link ext-link-type="gen" ext-link-id="BA000021">BA000021</ext-link>); Oen. Oeni PSU-1: <it>Oenococcus oeni </it>PSU-1 (<ext-link ext-link-type="gen" ext-link-id="CP000411">CP000411</ext-link>); Clos. tetan. E88: <it>Clostridium tetani </it>E88 (<ext-link ext-link-type="gen" ext-link-id="AE015927">AE015927</ext-link>, <ext-link ext-link-type="gen" ext-link-id="AF528097">AF528097</ext-link>); Myc. mycoides.: <it>Mycoplasma mycoides </it>(<ext-link ext-link-type="gen" ext-link-id="BX293980">BX293980</ext-link>); Ehr. ruminantium str.: <it>Ehrlichia ruminantium str</it>. Welgevonden (<ext-link ext-link-type="gen" ext-link-id="CR925678">CR925678</ext-link>); Buc. aphidicol. Cc Cinara cedri.: <it>Buchnera aphidicola </it>Cc Cinara cedri (<ext-link ext-link-type="gen" ext-link-id="CP000263">CP000263</ext-link>); Hah. chejuensis: <it>Hahella chejuensis </it>KCTC 2396 (<ext-link ext-link-type="gen" ext-link-id="CP000155">CP000155</ext-link>); Ric. felis URRWXCal2: <it>Rickettsia felis </it>URRWXCal2 (<ext-link ext-link-type="gen" ext-link-id="CP000053">CP000053</ext-link>&#8211;<ext-link ext-link-type="gen" ext-link-id="CP000055">CP000055</ext-link>); Por. gingivalis. 
W83: <it>Porphyromonas gingivalis </it>W83 (<ext-link ext-link-type="gen" ext-link-id="AE015924">AE015924</ext-link>)</p>
               </text>
               <file name="1471-2105-10-86-S8.eps">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>Simulations were performed to investigate the effect of the number of genome examples on statistical CGP performance. The range of AUCs was considerably broader with fewer genome examples (Figure <figr fid="F2">2</figr>). However, a median AUC above 0.95 could still be achieved in the rediscovery of the <it>M </it>set genes with as few as 10 genome examples in each group, compared with a maximum of 0.97 using all 417 bacterial genomes (Figure <figr fid="F3">3</figr>), indicating that the method retains considerable power even with a very small genome sample.</p>
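            <p>The subsampling experiment can be sketched in Python under stated assumptions. The amss score of a gene is taken to be the mean of the fraction of positive-group genomes that contain the gene and the fraction of negative-group genomes that lack it. The AUC is taken to be the probability that a validation gene outscores a non-validation gene, with ties counted as one half. All function names and the presence/absence profile layout are hypothetical; the simulation code used in this study is not reproduced here.</p>

```python
import random


def auc(scores, labels):
    """Probability that a positive item outscores a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def amss(profile, pos_idx, neg_idx):
    """Arithmetic mean of sensitivity and specificity for one gene profile.

    profile[i] is 1 if the gene is present in genome i and 0 otherwise.
    """
    sens = sum(profile[i] for i in pos_idx) / len(pos_idx)
    spec = sum(1 - profile[i] for i in neg_idx) / len(neg_idx)
    return (sens + spec) / 2


def subsample_auc(profiles, labels, pos_genomes, neg_genomes, n_pos, n_neg, rng):
    """Score all genes using a random subset of genomes and return the AUC."""
    pos_idx = rng.sample(pos_genomes, n_pos)
    neg_idx = rng.sample(neg_genomes, n_neg)
    gene_scores = [amss(p, pos_idx, neg_idx) for p in profiles]
    return auc(gene_scores, labels)
```

            <p>Calling subsample_auc repeatedly with a seeded random.Random instance for each combination of n_pos and n_neg, and taking the median of the returned values, would give one point of a surface such as the one described for the simulation.</p>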
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>AUC versus number of genome examples in statistical CGP on Streptococcus agalactiae 2603 peptidoglycan-related genes (amss scoring function)</p>
               </caption>
               <text>
                  <p><b>AUC versus number of genome examples in statistical CGP on Streptococcus agalactiae 2603 peptidoglycan-related genes (amss scoring function)</b>. This figure demonstrates how the number of genome examples may influence statistical CGP performance (<it>amss </it>scoring function). The ratio of positive to negative genome examples was fixed (400:17).</p>
               </text>
               <graphic file="1471-2105-10-86-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Three-dimensional surface plot demonstrating the effect of number of genome examples on the statistical CGP performance</p>
               </caption>
               <text>
                  <p><b>Three-dimensional surface plot demonstrating the effect of number of genome examples on the statistical CGP performance</b>. The median AUCs (<it>z</it>-axis) over 25 simulation runs are shown for each <it>N</it><sub><it>p </it></sub>(<it>x</it>-axis) and <it>N</it><sub><it>n </it></sub>(<it>y</it>-axis) combination. Statistical CGP (<it>amss </it>scoring function) was performed to discover peptidoglycan-related genes by prioritising all genes in the <it>Streptococcus agalactiae </it>2603 genome.</p>
               </text>
               <graphic file="1471-2105-10-86-3"/>
            </fig>
            <p>The same simulation performed on inductive CGP (with <it>SVM/Poly</it>) also achieved a median AUC >0.90 with only 20 genome examples (Figure <figr fid="F4">4</figr>). It was noted, however, that the AUCs peaked at 70 to 160 genome examples (with corresponding AUCs of 0.93&#8211;0.95) and that performance gradually declined as more genome examples were added to the profile. Considerable variation of AUC was also noted when the full 417 genomes were included in the profile panel (median AUC: 0.872; interquartile range: 0.858&#8211;0.898; range: 0.824&#8211;0.925).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Line plot illustrating the effect of number of genome examples on the inductive CGP performance</p>
               </caption>
               <text>
                  <p><b>Line plot illustrating the effect of number of genome examples on the inductive CGP performance</b>. This figure shows AUC versus the number of genome examples in inductive CGP (<it>SVM/Poly </it>algorithm) on <it>Streptococcus agalactiae </it>2603 genes related to peptidoglycan metabolism (M validation set). Twenty-five simulation runs of stratified 10-fold cross-validation were performed. The black line indicates the median AUC, the grey solid lines indicate the upper and lower quartiles, and the dotted grey lines indicate the maximum and minimum AUC of the simulation runs.</p>
               </text>
               <graphic file="1471-2105-10-86-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Case study 2: Anaerobic mixed-acid fermentation genes</p>
            </st>
            <p>Statistical CGP performed poorly on the anaerobic mixed-acid fermentation rediscovery task for EC-K12 (AUC: 0.46&#8211;0.77). However, the maximum probability enrichment was high (up to 108-fold, Table <tblr tid="T3">3</tblr>). Bacterial genes specific to anaerobic metabolism were ranked highly by <it>amss </it>(the <it>pfl </it>complex: above 0.27 <it>pct</it>; <it>adh</it>E: 2.1 <it>pct</it>; <it>ack</it>A: 2.1 <it>pct</it>; <it>pta</it>: 12 <it>pct</it>; see Additional files <supplr sid="S9">9</supplr> and <supplr sid="S10">10</supplr>). In contrast, genes shared with aerobic respiration, such as the fumarase genes (<it>fumABC</it>) and the phosphoenolpyruvate carboxylase gene (<it>ppc</it>), were ranked much lower (61&#8211;96 <it>pct </it>and 57 <it>pct </it>respectively). For genes encoding the fumarate reductase complex, the results were mixed: the membrane anchor subunits (<it>frd</it>CD) were ranked highly (10.1 and 7.8 <it>pct </it>respectively), whereas the catalytic subunits were placed at the bottom of the rank (<it>frd</it>AB, 93 and 99 <it>pct</it>). Inductive CGP performed better overall than statistical CGP (AUC: 0.70&#8211;0.86). The best AUCs with inductive prioritisation were produced by the <it>IBk </it>and <it>SVM/Poly </it>algorithms (0.86 and 0.85 respectively).</p>
            <suppl id="S9">
               <title>
                  <p>Additional file 9</p>
               </title>
               <caption>
                  <p>Gene rank produced by <it>amss </it>scoring function (anaerobic mixed-acid fermentation, Case study 2). </p>
               </caption>
               <text>
                  <p>
                     <b>The rank fraction (in <it>pct</it>) of genes prioritised by <it>amss </it>scoring function in the EC-K12 genome.</b>
                  </p>
               </text>
               <file name="1471-2105-10-86-S9.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S10">
               <title>
                  <p>Additional file 10</p>
               </title>
               <caption>
                  <p>Gene rank produced by <it>amss </it>scoring function (anaerobic mixed-acid fermentation). </p>
               </caption>
               <text>
                  <p>
                     <b>The rank positions, rank fractions (in <it>pct</it>), cluster of orthologous groups (COG), and the positions of candidate genes in the reference genome (<it>E. coli </it>K-12) prioritised by <it>amss </it>scoring function.</b>
                  </p>
               </text>
               <file name="1471-2105-10-86-S10.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>CGP performance on anaerobic mixed-acid fermentation genes (<it>Escherichia coli </it>K-12, 4131 genes). </p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c cspan="3" ca="center">
                        <p>Statistical CGP</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Inductive CGP</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>Scoring function</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                     <c ca="center">
                        <p>(<inline-formula><graphic file="1471-2105-10-86-i3.gif"/></inline-formula>/<it>&#951;</it><sub><it>max</it></sub>)</p>
                     </c>
                     <c ca="center">
                        <p>Algorithm</p>
                     </c>
                     <c ca="center">
                        <p>AUC</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>sens</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.634</p>
                     </c>
                     <c ca="right">
                        <p>(1.2/1.8)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>NB</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.695</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>spec</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.464</p>
                     </c>
                     <c ca="right">
                        <p>(0.8/1.5)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>LR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.796</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>ppv</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.519</p>
                     </c>
                     <c ca="right">
                        <p>(1.1/2.0)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>ADTree</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.780</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>npv</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.594</p>
                     </c>
                     <c ca="right">
                        <p>(1.8/11.0)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>IBk</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.860</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>amss</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.578</p>
                     </c>
                     <c ca="right">
                        <p>(2.4/96.6)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>J48</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.663</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>hmss</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.628</p>
                     </c>
                     <c ca="right">
                        <p>(2.4/95.1)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>SMO/Poly</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.848</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>OR</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.537</p>
                     </c>
                     <c ca="right">
                        <p>(1.2/2.3)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>SMO/RBF</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.782</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>chisq</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.767</p>
                     </c>
                     <c ca="right">
                        <p>(3.2/109)</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>bchisq</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.585</p>
                     </c>
                     <c ca="right">
                        <p>(2.5/109)</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>F</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.698</p>
                     </c>
                     <c ca="right">
                        <p>(2.5/69.9)</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Thirty-eight genes were labelled as known (out of 4131 genes of the EC-K12 genome). The AUCs for inductive CGP were calculated using stratified 10-fold cross-validation. Abbreviations: <it>sens</it>: sensitivity; <it>spec</it>: specificity; <it>ppv</it>: positive predictive value; <it>npv</it>: negative predictive value; <it>amss</it>: arithmetic mean of sensitivity and specificity; <it>hmss</it>: harmonic mean of sensitivity and specificity; <it>OR</it>: odds ratio; <it>chisq</it>: chi-square; <it>bchisq</it>: signed chi-square; <it>F</it>: F-measure; <it>NB</it>: na&#239;ve Bayes classifier; <it>LR</it>: logistic regression; <it>ADTree</it>: alternating decision tree; <it>IBk</it>: k-nearest neighbour classifier; <it>J48</it>: <it>J48 </it>decision tree; <it>SMO</it>: support vector machine trained by the sequential minimal optimisation algorithm; <it>Poly</it>: polynomial kernel; <it>RBF</it>: radial basis function kernel.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Case study 3: Inductive prioritisation of KEGG pathway genes</p>
            </st>
            <p>Inductive CGP was conducted on 31 KEGG pathways of SA-2603 using 7 algorithms. The best supervised machine learning algorithms identified 14 pathways (45%) with AUCs >0.90 and 28 pathways (87%) with AUCs >0.80 (Figure <figr fid="F5">5</figr> and see Additional file <supplr sid="S11">11</supplr>). The best-performing algorithm was <it>IBk</it>, which had the highest AUC in 10 pathways. <it>ADTree </it>and <it>SVM/Poly </it>also performed well, each producing the best AUC in 8 pathways. <it>SVM/RBF </it>achieved the best AUC in 4 pathways. <it>NB </it>and <it>J48 </it>did not produce the best AUC in any of the 31 pathways studied.</p>
            <suppl id="S11">
               <title>
                  <p>Additional file 11</p>
               </title>
               <caption>
                  <p>Stratified cross-validation results of inductive CGP algorithms on 31 selected KEGG pathways. </p>
               </caption>
               <text>
                  <p>
                     <b>This is the tabular representation of results in Figure</b>
                     <figr fid="F5">5</figr>
                     <b>, </b>
                     <b>showing the AUCs of 10-fold cross-validations in rediscovering genes in the 31 KEGG pathways evaluated in Case study 3.</b>
                  </p>
               </text>
               <file name="1471-2105-10-86-S11.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>The performance of inductive CGP in prioritising 31 KEGG metabolic pathways</p>
               </caption>
               <text>
                  <p><b>The performance of inductive CGP in prioritising 31 KEGG metabolic pathways</b>. The AUCs, estimated by stratified 10-fold cross-validation, were obtained from the rediscovery experiment in Case study 3. Genes of 31 metabolic pathways of the <it>S. agalactiae </it>2603 genome were obtained from KEGG and rediscovered by 7 machine learning algorithms.</p>
               </text>
               <graphic file="1471-2105-10-86-5"/>
            </fig>
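            <p>As an illustration of how an AUC can be obtained from a ranked gene list, the following minimal Python sketch uses the rank-sum (Mann-Whitney) formulation, with tied scores receiving averaged ranks. It is not the evaluation code used in this study: the function name <it>auc_from_scores </it>and its inputs are hypothetical, and the stratified cross-validation step is omitted.</p>

```python
def auc_from_scores(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen known gene
    scores higher than a randomly chosen other gene (ties count as half)."""
    # Sort genes by ascending score so that rank 1 is the lowest score.
    paired = sorted(zip(scores, labels), key=lambda pair: pair[0])
    n = len(paired)
    ranks = [0.0] * n
    i = 0
    while i != n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 != n and paired[j + 1][0] == paired[i][0]:
            j += 1
        # Every gene in the run receives the average of the 1-based ranks.
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, lab in paired if lab)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one known and one other gene")
    # Mann-Whitney U statistic from the rank sum of the known genes.
    rank_sum = sum(r for r, (_, lab) in zip(ranks, paired) if lab)
    u = rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Hypothetical example: two known genes (label 1) ranked above two others.
example_auc = auc_from_scores([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

            <p>In this formulation an AUC of 1.0 means every known gene outranks every other gene, and 0.5 corresponds to a random ordering.</p>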
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Successful prioritisation of bacterial genes by occurrence-based CGP methods</p>
            </st>
            <p>In this paper, we applied two approaches (statistical and inductive CGP) to prioritise candidate genes for functional discovery, based on the occurrence patterns of candidate genes in a selected set of bacterial genomes (phylogenetic profiles). Our findings demonstrate that both CGP methods can rediscover genes with high accuracy in two selected genomes of <it>E. coli </it>K-12 and <it>S. agalactiae </it>2603 (Figure <figr fid="F3">3</figr>).</p>
            <p>Interestingly, these methods seem relatively insensitive to the number of genome examples. In the peptidoglycan example with statistical CGP (case study 1), we were able to identify peptidoglycan genes with high accuracy despite the limited number of sequenced genomes among the negative examples. For inductive CGP, increasing the profile dimension beyond 200 genome examples apparently resulted in a decrease in median AUCs, implying that a proportion of the genome examples were less informative and might not have contributed to the identification of genes of interest. This finding agrees with the observation by Jothi <it>et al. </it>that increasing the phylogenetic profile dimension with redundant genomes did not necessarily improve the accuracy of eukaryotic gene function prediction <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>.</p>
            <p>In statistical CGP, we found that the scoring functions measuring gene occurrence in both positive and negative genome example groups (<it>amss</it>, <it>hmss</it>, <it>chisq</it>, and <it>bchisq</it>) consistently outperformed the scoring functions that measure only positive (<it>sens </it>and <it>ppv</it>), negative (<it>spec </it>and <it>npv</it>), or partial (<it>F</it>-measure) frequencies of the groups. This finding highlights the importance of including both positive and negative examples in comparative genomic studies. In inductive CGP, the rediscovery experiments favoured <it>ADTree</it>, <it>IBk</it>, and <it>SVM</it>s over the other algorithms. In case study 1, the performance of the best inductive and statistical CGP methods was comparable, suggesting that both approaches are capable of producing robust results.</p>
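            <p>To make the distinction between one-sided and two-sided scoring functions concrete, the sketch below computes several of them from a 2 x 2 occurrence table for a single candidate gene. The formulas are the standard textbook definitions and are an assumption here; the exact forms used in this study (including <it>bchisq </it>and the <it>F</it>-measure, which are omitted) may differ. The function name <it>score_gene </it>and the counts are hypothetical.</p>

```python
def score_gene(a, b, c, d):
    """Score one candidate gene from its occurrence counts.

    a: positive genomes containing the gene
    b: positive genomes lacking the gene
    c: negative genomes containing the gene
    d: negative genomes lacking the gene
    """
    def ratio(num, den):
        # Return 0.0 instead of failing when a margin of the table is empty.
        return num / den if den else 0.0

    sens = ratio(a, a + b)   # looks at positive genomes only
    spec = ratio(d, c + d)   # looks at negative genomes only
    n = a + b + c + d
    # Pearson chi-square statistic for a 2 x 2 table (no continuity correction).
    chisq = ratio(n * (a * d - b * c) ** 2,
                  (a + b) * (c + d) * (a + c) * (b + d))
    return {
        "sens": sens,
        "spec": spec,
        "ppv": ratio(a, a + c),
        "npv": ratio(d, b + d),
        "amss": (sens + spec) / 2.0,                    # arithmetic mean
        "hmss": ratio(2.0 * sens * spec, sens + spec),  # harmonic mean
        "chisq": chisq,
    }


# Hypothetical counts: 10 positive and 5 negative genome examples.
specific_gene = score_gene(10, 0, 0, 5)    # present only in positive genomes
ubiquitous_gene = score_gene(10, 0, 5, 0)  # present in every genome
```

            <p>With these hypothetical counts, both genes reach a sensitivity of 1.0, so <it>sens </it>alone cannot separate them, whereas <it>amss</it>, <it>hmss </it>and <it>chisq </it>rank the phenotype-specific gene above the ubiquitous one.</p>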
         </sec>
         <sec>
            <st>
               <p>Statistical CGP rediscovers genes specific to the function of interest</p>
            </st>
            <p>Our results demonstrated that statistical CGP can discover genes specific to a particular function or pathway. For example, by comparing the cell-walled bacteria with cell wall-less <it>Mycoplasma </it>spp. (case study 1), it was expected that the peptidoglycan genes, which are specific for the phenotypic trait, were among the genes lost in this evolutionary lineage <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> (see Additional file <supplr sid="S12">12</supplr>). Such genes were ranked very highly and yielded favourable aggregated performance (with AUC >0.95). In contrast, results from our anaerobic fermentation experiment (case study 2) suggested that genes specific to anaerobic respiration were ranked very highly (<it>&#951;</it><sub><it>max </it></sub>> 95-fold), whereas genes <it>shared </it>with the obligate aerobic bacteria (negative examples) were ranked much lower. As these shared genes are present in both phenotypic groups, finding them by applying only statistical CGP is a challenging task. Alternative methods are needed to aid in the discovery of such non-specific genes.</p>
            <suppl id="S12">
               <title>
                  <p>Additional file 12</p>
               </title>
               <caption>
                  <p>Statistical CGP of peptidoglycan-related genes on <it>Prochlorococcus marinus </it>MIT9313. </p>
               </caption>
               <text>
                  <p><b>In Case study 1, evaluation experiments were performed on candidate genes selected from one <it>S. agalactiae </it>and one <it>E. coli </it>genome.</b> These bacterial genomes belong to the divisions Firmicutes and Gamma-proteobacteria, both of which are represented by a large number of closely related sequences among the positive examples; this over-representation could have favourably biased the performance. This file describes an additional CGP experiment on a less well-represented genome from the NCBI database, <it>Prochlorococcus marinus </it>MIT9313, to investigate this effect.</p>
               </text>
               <file name="1471-2105-10-86-S12.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Evolutionary pressure may contribute to specific gene occurrence patterns</p>
            </st>
            <p>Both occurrence-based CGP methods performed well, suggesting that the genes encoding a complex phenotype are frequently co-present and co-absent across genomes, forming specific occurrence patterns that allow functional prediction. This co-occurrence phenomenon may reflect the process of natural selection and the adaptation of microorganisms to different evolutionary niches.</p>
            <p>For bacteria undergoing positive selection, the acquisition of a particular gene group may result in phenotypes that confer a survival advantage, allowing the microorganism to adapt to a new environment. It is known that genes contributing to symbiosis or pathogenesis are frequently organised into genomic islands, in which gene mobility is facilitated by horizontal gene transfer, conferring the ability to form a new relationship with the host <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. The good AUC achieved by inductive CGP in KEGG pathways (case study 3) suggests that specific functional co-occurrence patterns of genes do exist, regardless of the physical proximity of the genes or the presence of a mobile genetic structure.</p>
            <p>Similarly, negative selection can contribute to the co-absence of functional gene units across multiple genomes. For a complex phenotype encoded by multiple genes, the deletion of a critical gene could result in the non-expression of the phenotype, leading to the subsequent loss of the other, now non-functional, genes over time. Thus, the differential co-occurrence patterns of genes can be exploited in comparative genomics studies, as demonstrated by our methods, to assist our understanding of gene functions.</p>
         </sec>
         <sec>
            <st>
               <p>Factors affecting CGP performance</p>
            </st>
            <sec>
               <st>
                  <p>Sampling biases</p>
               </st>
               <p>Prioritising candidate genes with reliance on gene features from literature, ontology, or annotation as background knowledge may introduce literature or annotation biases <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B4">4</abbr><abbr bid="B9">9</abbr></abbrgrp>. While we minimised such biases, several sources of <it>sampling bias </it>could have limited the performance of our gene prioritisation methods. With statistical CGP, accuracy can be affected by <it>ad hoc </it>selection of examples, especially by the choice of positive and negative genome examples representing the variations in the study phenotype. With inductive CGP, performance may be impeded by incorporating inconsistent (or wrong) genes in the training set. Increasing the heterogeneity of the training genes may adversely influence prioritisation performance. An example can be found in a slight decrease of performance (in median AUC) from <it>C </it>to <it>M </it>validation sets in the peptidoglycan experiments in EC-K12 candidates, as shown in Figure <figr fid="F1">1</figr>.</p>
            </sec>
            <sec>
               <st>
                  <p>Using KEGG as a validation data source</p>
               </st>
               <p>There was considerable variation in inductive CGP performance across different KEGG categories (Case study 3). By manually inspecting the worst-performing functional category (phenylalanine, tyrosine, and tryptophan biosynthesis), we found that the phenylalanine and tyrosine tRNA synthetase genes were also included in the validation set. The tRNA synthetases have roles downstream of the biosynthesis pathways and thus are not involved in the anabolism of these essential amino acids. Removal of the unrelated genes improved overall performance (with the best AUC of 0.852 achieved by <it>SVM/Poly</it>, see Additional file <supplr sid="S13">13</supplr>). This contrasts with the best-performing pathways (for example, fatty acid biosynthesis and peptidoglycan synthesis pathways), where only function-specific genes were included in the validation set. Since KEGG is a commonly used resource for benchmarking computational methods of functional discovery <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B26">26</abbr></abbrgrp>, our finding suggests that validation sets must be constructed with care, as a mixture of distinct functional groups could lead to inconsistent results. This specific sampling bias needs to be considered when explaining variations in the prediction of gene functions by <it>in silico </it>methods.</p>
               <suppl id="S13">
                  <title>
                     <p>Additional file 13</p>
                  </title>
                  <caption>
                     <p>Potential sampling biases resulting from inaccurate selection of validation sets. </p>
                  </caption>
                  <text>
                     <p><b>There was considerable variation in inductive CGP performance in Case study 3, and some of this variation is attributable to statistical uncertainties or algorithmic differences.</b> The influence of pathway functions on CGP performance was, however, unclear. Nevertheless, it was observed that there may be limitations in using KEGG pathways as a validation source, where potential sampling biases could have explained a significant proportion of such variation. In this file, an additional experiment was performed to illustrate this effect.</p>
                  </text>
                  <file name="1471-2105-10-86-S13.pdf">
                     <p>Click here for file</p>
                  </file>
               </suppl>
            </sec>
            <sec>
               <st>
                  <p>The inclusion of paralogs in the occurrence matrix</p>
               </st>
               <p>Reciprocal best BLAST matches are frequently used in the search for orthologous genes. In our experiments, we applied a non-reciprocal BLAST E-value &lt; 10<sup>-5 </sup>as the criterion for determining the sequence similarity between genes. While our results supported its use in functional discovery at the gene level, such a criterion may include many paralogs and may affect the prioritisation performance for large gene families with diverse functions. Detecting and excluding paralogs may be required to refine the gene ranking, and this warrants further study.</p>
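               <p>The non-reciprocal criterion can be sketched as follows, assuming that the BLAST hits have already been parsed into (gene, genome, E-value) tuples. A gene is marked as present in a genome if any one-directional hit in that genome falls below the E-value threshold of 10<sup>-5</sup>; no reciprocal best-hit check is made, which is why paralogs can enter the profile. The function name <it>occurrence_profiles</it>, the tuple layout and the example hits are hypothetical and are not taken from this study.</p>

```python
def occurrence_profiles(hits, genomes, threshold=1e-5):
    """Build binary occurrence profiles from one-directional BLAST hits.

    hits: iterable of (gene, genome, evalue) tuples (hypothetical layout)
    genomes: ordered list of genome identifiers defining the profile columns
    """
    hits = list(hits)
    present = set()
    for gene, genome, evalue in hits:
        # Keep any hit whose E-value is below the threshold; the hit need not
        # be the best match, nor confirmed by a reverse search.
        if threshold > evalue:
            present.add((gene, genome))
    genes = sorted({gene for gene, _, _ in hits})
    return {
        gene: [1 if (gene, genome) in present else 0 for genome in genomes]
        for gene in genes
    }


# Hypothetical hits for two candidate genes against two genomes.
example_hits = [
    ("murA", "genome1", 1e-30),
    ("murA", "genome2", 0.5),    # too weak: treated as absent
    ("ftsZ", "genome2", 1e-8),
]
profiles = occurrence_profiles(example_hits, ["genome1", "genome2"])
```

               <p>Each resulting row is one phylogenetic profile. Replacing the threshold test with a reciprocal best-hit test would be one way to exclude paralogs, at the cost of a second BLAST search per genome.</p>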
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We developed statistical and inductive computational gene prioritisation methods, based on the concept of gene occurrence across a range of genomes, to improve the search efficiency in the functional discovery of bacterial genes. We designed a range of rediscovery experiments for benchmarking different CGP approaches. Testing on the rediscovery of peptidoglycan-related genes, mixed-acid fermentation genes, and a diverse range of bacterial metabolic pathways yielded promising results. These CGP methods could be generalised to other functional discovery tasks when a pair of positive and negative datasets is available (statistical CGP) or when a subset of genes with known functions can be used for training machine learning models (inductive CGP). As more genome sequences become available, we anticipate that the demand for such methods will grow, as many different scenarios can be formulated and analysed. In summary, occurrence-based gene prioritisation offers a simple yet effective framework for ranking candidate genes for functional discovery in prokaryotes. In addition, our experimental framework should provide a standardised benchmark for evaluating future CGP methods and algorithms when prioritising bacterial candidate genes.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>FL contributed to the conception of the CGP frameworks and performed the rediscovery experiments. EC and RL contributed to the experimental design. All authors (FL, EC, RL, and VS) contributed to the critical analysis, interpretation, and discussion. All authors contributed to the preparation of the manuscript. All authors read and approved the final version of the manuscript. We thank the anonymous reviewers for their valuable comments and criticisms.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>The authors are grateful to Mike Bain and Guy Tsafnat for their critical comments on the design of CGP experiments. We thank Fanrong Kong for his review and comments on peptidoglycan-related genes. This research is funded by the Australian National Health and Medical Research Council (NH&amp;MRC) project grant No. 358351.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Bacterial genomics and pathogen evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Raskin</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Seshadri</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Pukatzki</snm>
                  <fnm>SU</fnm>
               </au>
               <au>
                  <snm>Mekalanos</snm>
                  <fnm>JJ</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2006</pubdate>
            <volume>124</volume>
            <issue>4</issue>
            <fpage>703</fpage>
            <lpage>714</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.cell.2006.02.002</pubid>
                  <pubid idtype="pmpid" link="fulltext">16497582</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>A computational system to select candidate genes for complex human traits</p>
            </title>
            <aug>
               <au>
                  <snm>Gaulton</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Mohlke</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Vision</snm>
                  <fnm>TJ</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>23</volume>
            <issue>9</issue>
            <fpage>1132</fpage>
            <lpage>1140</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btm001</pubid>
                  <pubid idtype="pmpid" link="fulltext">17237041</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Association of genes to genetically inherited diseases using data mining</p>
            </title>
            <aug>
               <au>
                  <snm>Perez-Iratxeta</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Andrade</snm>
                  <fnm>MA</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2002</pubdate>
            <volume>31</volume>
            <issue>3</issue>
            <fpage>316</fpage>
            <lpage>319</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12006977</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Gene prioritization through genomic data fusion</p>
            </title>
            <aug>
               <au>
                  <snm>Aerts</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lambrechts</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Maity</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Van Loo</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Coessens</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>De Smet</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Tranchevent</snm>
                  <fnm>LC</fnm>
               </au>
               <au>
                  <snm>De Moor</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Marynen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hassan</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Carmeliet</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Moreau</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2006</pubdate>
            <volume>24</volume>
            <issue>5</issue>
            <fpage>537</fpage>
            <lpage>544</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nbt1203</pubid>
                  <pubid idtype="pmpid" link="fulltext">16680138</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Integration of text- and data-mining using ontologies successfully selects disease gene candidates</p>
            </title>
            <aug>
               <au>
                  <snm>Tiffin</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kelso</snm>
                  <fnm>JF</fnm>
               </au>
               <au>
                  <snm>Powell</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Pan</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Bajic</snm>
                  <fnm>VB</fnm>
               </au>
               <au>
                  <snm>Hide</snm>
                  <fnm>WA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <issue>5</issue>
            <fpage>1544</fpage>
            <lpage>1552</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1065256</pubid>
                  <pubid idtype="pmpid" link="fulltext">15767279</pubid>
                  <pubid idtype="doi">10.1093/nar/gki296</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Exploring relationships and mining data with the UCSC Gene Sorter</p>
            </title>
            <aug>
               <au>
                  <snm>Kent</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Hsu</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Karolchik</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kuhn</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Clawson</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Trumbower</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <issue>5</issue>
            <fpage>737</fpage>
            <lpage>741</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1088302</pubid>
                  <pubid idtype="pmpid" link="fulltext">15867434</pubid>
                  <pubid idtype="doi">10.1101/gr.3694705</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>POCUS: mining genomic sequence annotation to predict disease genes</p>
            </title>
            <aug>
               <au>
                  <snm>Turner</snm>
                  <fnm>FS</fnm>
               </au>
               <au>
                  <snm>Clutterbuck</snm>
                  <fnm>DR</fnm>
               </au>
               <au>
                  <snm>Semple</snm>
                  <fnm>CAM</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <issue>11</issue>
            <fpage>R75</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">329128</pubid>
                  <pubid idtype="pmpid" link="fulltext">14611661</pubid>
                  <pubid idtype="doi">10.1186/gb-2003-4-11-r75</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Speeding disease gene discovery by sequence based candidate prioritization</p>
            </title>
            <aug>
               <au>
                  <snm>Adie</snm>
                  <fnm>EA</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Porteous</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Pickard</snm>
                  <fnm>BS</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>55</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1274252</pubid>
                  <pubid idtype="pmpid" link="fulltext">15766383</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-55</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>SUSPECTS: enabling fast and effective prioritization of positional candidates</p>
            </title>
            <aug>
               <au>
                  <snm>Adie</snm>
                  <fnm>EA</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Porteous</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Pickard</snm>
                  <fnm>BS</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <issue>6</issue>
            <fpage>773</fpage>
            <lpage>774</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btk031</pubid>
                  <pubid idtype="pmpid" link="fulltext">16423925</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A similarity-based method for genome-wide prediction of disease-relevant human genes</p>
            </title>
            <aug>
               <au>
                  <snm>Freudenberg</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Propping</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 2</issue>
            <fpage>S110</fpage>
            <lpage>S115</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12385992</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Genome-wide identification of genes likely to be involved in human genetic disease</p>
            </title>
            <aug>
               <au>
                  <snm>L&#243;pez-Bigas</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <issue>10</issue>
            <fpage>3108</fpage>
            <lpage>3114</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">434425</pubid>
                  <pubid idtype="pmpid" link="fulltext">15181176</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh605</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Annotation, comparison and databases for hundreds of bacterial genomes</p>
            </title>
            <aug>
               <au>
                  <snm>M&#233;digue</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Moszer</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Res Microbiol</source>
            <pubdate>2007</pubdate>
            <volume>158</volume>
            <issue>10</issue>
            <fpage>724</fpage>
            <lpage>736</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.resmic.2007.09.009</pubid>
                  <pubid idtype="pmpid" link="fulltext">18031997</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Who's your neighbor? New computational approaches for functional genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Galperin</snm>
                  <fnm>MY</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Nat Biotechnol</source>
            <pubdate>2000</pubdate>
            <volume>18</volume>
            <issue>6</issue>
            <fpage>609</fpage>
            <lpage>613</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/76443</pubid>
                  <pubid idtype="pmpid" link="fulltext">10835597</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Assigning protein functions by comparative genome analysis: protein phylogenetic profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Thompson</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Yeates</snm>
                  <fnm>TO</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <issue>8</issue>
            <fpage>4285</fpage>
            <lpage>4288</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">16324</pubid>
                  <pubid idtype="pmpid" link="fulltext">10200254</pubid>
                  <pubid idtype="doi">10.1073/pnas.96.8.4285</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Phylogenetic detection of conserved gene clusters in microbial genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Zheng</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Anton</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Kasif</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>243</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1266350</pubid>
                  <pubid idtype="pmpid" link="fulltext">16202130</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-6-243</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Protein network inference from multiple genomic data: a supervised approach</p>
            </title>
            <aug>
               <au>
                  <snm>Yamanishi</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Vert</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Kanehisa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <issue>Suppl 1</issue>
            <fpage>i363</fpage>
            <lpage>i370</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth910</pubid>
                  <pubid idtype="pmpid" link="fulltext">15262821</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>A tree kernel to analyse phylogenetic profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Vert</snm>
                  <fnm>JP</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 1</issue>
            <fpage>S276</fpage>
            <lpage>S284</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12169557</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Localizing proteins in the cell from their phylogenetic profiles</p>
            </title>
            <aug>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Xenarios</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>van der Bliek</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <issue>22</issue>
            <fpage>12115</fpage>
            <lpage>12120</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">17303</pubid>
                  <pubid idtype="pmpid" link="fulltext">11035803</pubid>
                  <pubid idtype="doi">10.1073/pnas.220399497</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Gene annotation and network inference by phylogenetic profiling</p>
            </title>
            <aug>
               <au>
                  <snm>Wu</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hu</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>DeLisi</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>80</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1388238</pubid>
                  <pubid idtype="pmpid" link="fulltext">16503966</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-80</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>KEGG: Kyoto Encyclopedia of Genes and Genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Kanehisa</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Goto</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>27</fpage>
            <lpage>30</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102409</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592173</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.27</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <aug>
               <au>
                  <snm>Witten</snm>
                  <fnm>IH</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Data Mining: Practical machine learning tools and techniques</source>
            <publisher>San Francisco: Morgan Kaufmann</publisher>
            <edition>2</edition>
            <pubdate>2005</pubdate>
         </bibl>
         <bibl id="B22">
            <title>
               <p>NCBI bacterial genomes catalogue file</p>
            </title>
            <url>ftp://ftp.ncbi.nih.gov/genomes/Bacteria/</url>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Multidimensional annotation of the Escherichia coli K-12 genome</p>
            </title>
            <aug>
               <au>
                  <snm>Karp</snm>
                  <fnm>PD</fnm>
               </au>
               <au>
                  <snm>Keseler</snm>
                  <fnm>IM</fnm>
               </au>
               <au>
                  <snm>Shearer</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Latendresse</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Krummenacker</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Paley</snm>
                  <fnm>SM</fnm>
               </au>
               <au>
                  <snm>Paulsen</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Collado-Vides</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gama-Castro</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Peralta-Gil</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Santos-Zavaleta</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Pe&#241;aloza-Sp&#237;nola</snm>
                  <fnm>MI</fnm>
               </au>
               <au>
                  <snm>Bonavides-Martinez</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ingraham</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <issue>22</issue>
            <fpage>7577</fpage>
            <lpage>7590</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2190727</pubid>
                  <pubid idtype="pmpid" link="fulltext">17940092</pubid>
                  <pubid idtype="doi">10.1093/nar/gkm740</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Regulation of gene expression in fermentative and respiratory systems in Escherichia coli and related bacteria</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>Iuchi</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Annu Rev Genet</source>
            <pubdate>1991</pubdate>
            <volume>25</volume>
            <fpage>361</fpage>
            <lpage>387</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1146/annurev.ge.25.120191.002045</pubid>
                  <pubid idtype="pmpid" link="fulltext">1812811</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <aug>
               <au>
                  <snm>Michal</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Biochemical Pathways: An Atlas of Biochemistry and Molecular Biology</source>
            <publisher>Hoboken, NJ: Wiley-Spektrum</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment</p>
            </title>
            <aug>
               <au>
                  <snm>Jothi</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Przytycka</snm>
                  <fnm>TM</fnm>
               </au>
               <au>
                  <snm>Aravind</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>173</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1904249</pubid>
                  <pubid idtype="pmpid" link="fulltext">17521444</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-8-173</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>The minimal gene complement of Mycoplasma genitalium</p>
            </title>
            <aug>
               <au>
                  <snm>Fraser</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Gocayne</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>White</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Clayton</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Fleischmann</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Bult</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Kerlavage</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Sutton</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Kelley</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Fritchman</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Weidman</snm>
                  <fnm>JF</fnm>
               </au>
               <au>
                  <snm>Small</snm>
                  <fnm>KV</fnm>
               </au>
               <au>
                  <snm>Sandusky</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fuhrmann</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Nguyen</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Utterback</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Saudek</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Phillips</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Merrick</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Tomb</snm>
                  <fnm>JF</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Bott</snm>
                  <fnm>KF</fnm>
               </au>
               <au>
                  <snm>Hu</snm>
                  <fnm>PC</fnm>
               </au>
               <au>
                  <snm>Lucier</snm>
                  <fnm>TS</fnm>
               </au>
               <au>
                  <snm>Peterson</snm>
                  <fnm>SN</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>HO</fnm>
               </au>
               <au>
                  <snm>Hutchison</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>270</volume>
            <issue>5235</issue>
            <fpage>397</fpage>
            <lpage>403</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.270.5235.397</pubid>
                  <pubid idtype="pmpid" link="fulltext">7569993</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Horizontal gene transfer, genome innovation and evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Gogarten</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Townsend</snm>
                  <fnm>JP</fnm>
               </au>
            </aug>
            <source>Nat Rev Microbiol</source>
            <pubdate>2005</pubdate>
            <volume>3</volume>
            <issue>9</issue>
            <fpage>679</fpage>
            <lpage>687</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrmicro1204</pubid>
                  <pubid idtype="pmpid" link="fulltext">16138096</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
