<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-10-56</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Methodology article</dochead>
      <bibl>
         <title>
            <p>TACOA &#8211; Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Diaz</snm>
               <mi>N</mi>
               <fnm>Naryttza</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>ndiaz@CeBiTec.Uni-Bielefeld.DE</email>
            </au>
            <au id="A2">
               <snm>Krause</snm>
               <fnm>Lutz</fnm>
               <insr iid="I5"/>
               <email>Lutz.Krause@rdls.nestle.com</email>
            </au>
            <au id="A3">
               <snm>Goesmann</snm>
               <fnm>Alexander</fnm>
               <insr iid="I1"/>
               <insr iid="I4"/>
               <email>agoesman@CeBiTec.Uni-Bielefeld.DE</email>
            </au>
            <au id="A4">
               <snm>Niehaus</snm>
               <fnm>Karsten</fnm>
               <insr iid="I3"/>
               <email>karsten.niehaus@CeBiTec.Uni-Bielefeld.DE</email>
            </au>
            <au id="A5">
               <snm>Nattkemper</snm>
               <mi>W</mi>
               <fnm>Tim</fnm>
               <insr iid="I2"/>
               <email>tim.nattkemper@Uni-Bielefeld.DE</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany</p>
            </ins>
            <ins id="I2">
               <p>Biodata Mining &amp; Applied Neuroinformatics Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany</p>
            </ins>
            <ins id="I3">
               <p>Proteome and Metabolome Research, Faculty of Biology, Bielefeld University, Bielefeld, Germany</p>
            </ins>
            <ins id="I4">
               <p>Computational Genomics, Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany</p>
            </ins>
            <ins id="I5">
               <p>Nestl&#233; Research Center, BioAnalytical Science Department, Lausanne, Switzerland</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>1</issue>
         <fpage>56</fpage>
         <url>http://www.biomedcentral.com/1471-2105/10/56</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19210774</pubid>
               <pubid idtype="doi">10.1186/1471-2105-10-56</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>28</day>
               <month>5</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>11</day>
               <month>2</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>11</day>
               <month>2</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Diaz et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification, an essential task in the analysis of metagenomic data sets, is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the <it>k</it>-nearest neighbor with strategies from kernel-based learning.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Our novel strategy was extensively evaluated using leave-one-out cross validation on fragments of variable length (800 bp &#8211; 50 Kbp) from 373 completely sequenced genomes. TACOA is able to classify genomic fragments of length 800 bp and 1 Kbp with high accuracy down to rank class. For longer fragments (&#8805; 3 Kbp), accurate predictions are made at even deeper taxonomic ranks (order and genus). Remarkably, TACOA also produces reliable results when the taxonomic origin of a fragment is not represented in the reference set, assigning such fragments to their known broader taxonomic class or simply labeling them as "unknown". We compared the classification accuracy of TACOA with the latest intrinsic classifier PhyloPythia using 63 recently published complete genomes. For fragments of length 800 bp and 1 Kbp, the overall accuracy of TACOA is higher than that obtained by PhyloPythia at all taxonomic ranks. For all fragment lengths, both methods achieved comparably high specificity down to rank class, together with low false negative rates.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>An accurate multi-class taxonomic classifier was developed for environmental genomic fragments. TACOA can predict with high reliability the taxonomic origin of genomic fragments as short as 800 bp. The proposed method is transparent, fast and accurate, and the reference set can be easily updated as newly sequenced genomes become available. Moreover, the method proved competitive with the most recent classifier, PhyloPythia, and has the advantage that it can be installed locally and that its reference set can be kept up-to-date.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Metagenomics, or the direct sequencing of collective genomes, is paving the road to a better understanding of our ecosystems and the impact of microbes on human health. Researchers are now moving away from the genome-centric approach, which focussed on the isolation, cultivation and sequencing of a single species at a time, towards sequencing complete DNA samples from an environment, thus bypassing the isolation and cultivation steps. At present, most metagenomes are sequenced using the whole genome shotgun approach <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. When used in combination with the Sanger technique <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>, a collection of short sequence <it>reads </it>with an average length of 800 bp is generated <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Recovery of DNA fragments of several thousand base pairs is also possible using bacterial artificial chromosomes (BACs) <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Longer DNA fragments can also be obtained when short overlapping reads are assembled into larger DNA stretches referred to as <it>contigs</it>.</p>
         <p>An essential task in the metagenomic data analysis workflow is to predict the source organism or <it>taxonomic origin </it>of each read or assembled contig. This process is called taxonomic classification or <it>binning</it>. Predicting the taxonomic origin of reads or contigs can aid in linking gene functions to members of the community or in reconstructing the microbial composition of the studied sample. Knowledge of the taxonomic composition of a sample can be used to derive valuable ecological parameters at the community level (e.g. richness and evenness) <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp> or at the population level (e.g. effective genome size) <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
         <p>Two types of methods are used for the taxonomic classification of environmental fragments: composition-based and similarity-based methods. Similarity-based methods depend on a comparison with a reference set of genomic sequences, to which the metagenomic sequences are directly aligned, e.g. using BLAST <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. Composition-based methods rely on characteristics that can be extracted directly from the nucleotide sequences (e.g. oligonucleotide frequencies, GC-content, etc.). Recently, methods employing sequence-composition-based features have been gaining popularity <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. In particular, oligonucleotide frequencies have frequently been used because they carry a phylogenetic signal <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Karlin <it>et al</it>. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> showed that deviations in di-nucleotide or tetra-nucleotide frequencies were smaller within a genome than between genomes of different species.</p>
         <p>From a machine learning point of view, composition- and similarity-based methods can be further divided into supervised and unsupervised approaches. In the context of this work, supervised methods require a reference set of genomic sequences with known taxonomic origin. Supervised composition-based methods use the reference set to learn sequence characteristics of each taxonomic class during a training phase. Subsequently, the trained classifier is used to identify the taxonomic class of fragments of unknown origin. For example, methods such as a Bayesian classifier <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and PhyloPythia <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> fall into the supervised composition-based category. Although MEGAN <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> and CARMA <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> do not have a training phase, these similarity-based classifiers are supervised since they rely on the alignment of the genomic fragments to reference sequences with known taxonomic origin.</p>
         <p>The recently published CARMA software <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> has been developed to taxonomically classify short reads (80 bp &#8211; 400 bp) derived by the Pyrosequencing technique (454 &#8211; Life Sciences) <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. CARMA has been shown to be very accurate in taxonomically classifying reads that carry a complete or partial protein family contained in the Pfam database <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, but it is computationally expensive. MEGAN <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> performs well in classifying genomic fragments if closely related reference genomes are available, which may not always be the case for organisms contained in an environmental sample. In general, sequence-similarity-based classifiers, such as CARMA and MEGAN, have the disadvantage of being able to predict the taxonomic class only for those fragments carrying a partial gene or a protein domain. Compared to MEGAN and CARMA, our proposed strategy has the advantage of being easy to maintain, and the complete strategy can be run on a desktop computer in a reasonable time frame without preprocessing steps. PhyloPythia, a supervised composition-based method, uses over-represented oligonucleotide patterns as features to train a hierarchical collection of Support Vector Machines (SVMs), which is subsequently used to predict the taxonomic origin of genomic fragments as short as 1 Kbp <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. The Support Vector Machines were shown to achieve a high classification accuracy for fragments of length &#8805; 3 Kbp and moderate accuracy for 1 Kbp long fragments. However, the complete classifier needs to be retrained (a computationally expensive procedure) when newly sequenced genomes are added to the training set.</p>
         <p>Unsupervised learning approaches do not depend on reference sequences for classification; instead, characteristics are learned directly from the data set that is being analyzed. In the context of metagenomics, unsupervised learning methods are used to group genomic sequences such that all sequences originating from the same taxon are grouped into one cluster. Notably, this grouping can be done on different taxonomic ranks, ranging from superkingdom to species. Unsupervised methods are, for example, employed as a pre-processing step for assembly or to study the community composition of samples. Additionally, marker sequences of known taxonomic origin can be used to infer the taxonomic origin of each generated cluster. However, in this case the marker sequences are not involved in the classification process <it>per se </it><abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p>
         <p>Several unsupervised methods have been developed for the analysis of metagenomic data. The pioneering TETRA <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr></abbrgrp> used tetranucleotide-derived z-score correlations to taxonomically classify genomic fragments from metagenome libraries of low diversity. Abe <it>et al</it>. <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>, in subsequent work, showed that it is feasible to classify environmental genomic fragments with a minimal length of 5 Kbp using a self-organizing map (SOM). More recently, Chan <it>et al</it>. developed a seeded growing self-organizing map (S-GSOM) <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> to cluster metagenomic sequences.</p>
         <p>Currently, completely sequenced genomes, which could be used as a reference for the taxonomic classification of metagenomic sequences, are becoming available at an exponential rate. Therefore, the taxonomic classification of metagenomic data will greatly benefit from supervised methods that can be instantaneously updated when new genomes become available. Herein, we present a TAxonomic COmposition Analysis method (TACOA) able to predict the taxonomic origin of environmental genomic fragments of variable length in a supervised manner. TACOA can be easily installed and run on a desktop computer, offering more independence in the analysis of metagenomic data sets. Furthermore, the reference set used by the proposed classifier can easily be updated with newly sequenced genomes.</p>
         <p>TACOA applies the intuitive idea of the <it>k</it>-nearest neighbor (<it>k</it>-NN) approach <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and combines it with a smoother kernel function <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>. Compared to other less intuitive and more complex approaches, <it>k</it>-NN based methods have proven to yield competitive results in a large number of classification problems <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>, in particular if the classification problem has a multi-class nature. The kernelized <it>k</it>-NN approach used in TACOA makes it possible to build an accurate multi-class classifier. In general, <it>k</it>-NN is intuitive, does not make any assumptions about the distribution of the input data, and its reference set can be easily updated. For a wide range of practical applications it approximates the optimal classifier if the reference set is large enough. A further advantage is that the classification results can be easily interpreted. However, the traditional <it>k</it>-NN algorithm runs into problems when dealing with high-dimensional input data (the so-called curse of dimensionality) <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. In our extension of <it>k</it>-NN, the introduction of a Gaussian kernel helps to alleviate this problem <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. By using a smoother kernel function, the complete reference set is considered during the classification procedure instead of a strict neighborhood. We present our kernelized <it>k</it>-NN approach as an alternative to solve the problem of taxonomically classifying environmental genomic fragments.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>The idea behind our approach is to exploit the benefits of the case-based reasoning <it>k</it>-NN algorithm, which classifies vectors (i.e. Genomic Feature Vectors, GFVs) on the basis of the class labels observed for vectors in their neighborhood, while keeping the advantage of approximating the optimal classifier if the training set is large enough. In particular, we used a smoother kernel function with Gaussian density to profit from its implicit weighting scheme, thus allowing more flexibility in setting the neighborhood width and in handling high-dimensional input data. The weights given by a smoother kernel function decrease as the Euclidean distance between the classified GFV and the reference vector increases. The rate at which the weights decrease is controlled by the neighborhood width <it>&#955; </it><abbrgrp><abbr bid="B23">23</abbr></abbrgrp>.</p>
         <sec>
            <st>
               <p>Algorithm</p>
            </st>
            <p>In this study, a genomic fragment is defined as a DNA sequence of a given length (note that a completely sequenced genome can also be regarded as a genomic fragment). The total number of oligonucleotides of length <it>l </it>over the alphabet &#931; = {<it>a</it>, <it>t</it>, <it>c</it>, <it>g</it>} is given by 4<sup><it>l</it></sup>. Each genomic fragment is represented as a vector (i.e. GFV) using the Vector Space Model <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. For each of the 4<sup><it>l</it></sup> possible oligonucleotides, the vector stores the ratio of the observed frequency of that oligonucleotide to its expected frequency given the GC-content of that genomic fragment.</p>
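            <p>As a minimal sketch of how such a vector could be computed, the following Python snippet counts overlapping oligonucleotides of length l and divides each observed count by the count expected from the GC-content alone. The expected frequency is modeled here as a product of single-base probabilities (G and C each with probability GC/2, A and T each with probability (1 - GC)/2). This zero-order model, the lexicographic ordering of the vector entries and the function name genomic_feature_vector are assumptions made for illustration only.</p>
            <p>
```python
from collections import Counter
from itertools import product


def genomic_feature_vector(seq, l=4):
    """Return observed/expected ratios for all 4**l oligonucleotides of length l.

    Assumption: the expected count of an oligonucleotide is derived from the
    GC-content of the fragment with a zero-order (independent bases) model.
    """
    seq = seq.lower()
    if l > len(seq):
        raise ValueError("fragment is shorter than the oligonucleotide length")
    n_windows = len(seq) - l + 1
    gc = (seq.count("g") + seq.count("c")) / len(seq)
    base_prob = {"a": (1 - gc) / 2, "t": (1 - gc) / 2, "c": gc / 2, "g": gc / 2}
    # Count every overlapping window of length l in the fragment.
    observed = Counter(seq[i:i + l] for i in range(n_windows))
    vector = []
    for oligo in map("".join, product("acgt", repeat=l)):
        prob = 1.0
        for base in oligo:
            prob *= base_prob[base]
        expected = prob * n_windows
        # Guard against fragments without any G/C or without any A/T.
        vector.append(observed[oligo] / expected if expected > 0 else 0.0)
    return vector


gfv = genomic_feature_vector("acgtacgtacgt", l=2)
print(len(gfv))
```
            </p>
            <p>Entries above 1 mark oligonucleotides that occur more often than the base composition alone would suggest, and entries below 1 mark under-represented ones.</p>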
            <p>In order to predict the taxonomic origin of a query GFV, TACOA compares that query GFV to the reference GFVs. In our method, the reference GFVs are computed from completely sequenced reference genomes. In the following, the set of all computed reference GFVs is referred to as the reference set (<b>ref</b><sub><it>set</it></sub>). In this study, the reference set consisted of <it>T </it>= 373 genomes.</p>
            <p>More formally, let <b>ref</b><sub><it>set </it></sub>= {<b>x</b><sub><it>j </it></sub>} with 1 &#8804; <it>j </it>&#8804; <it>T </it>be the set of reference GFVs, where each <b>x</b><sub><it>j </it></sub>represents a GFV computed from a completely sequenced reference genome. Let <b>x </b>be a query GFV representing a genomic fragment to classify. The multi-class classification problem addressed herein consists in deciding to which of the taxonomic classes at rank <it>r </it>the query <b>x </b>belongs.</p>
            <p>For each taxonomic rank <it>r </it>among superkingdom, phylum, class, order and genus, and for each taxonomic class <it>i </it>at that rank, the algorithm computes a discriminant function <it>&#948;</it><sub><it>i</it></sub>(<b>x</b>), and then classifies <b>x </b>into the class with the highest value of its discriminant function.</p>
            <p>More precisely, for a given taxonomic rank <it>r</it>, let <it>i </it>be the class with the highest discriminant function <it>&#948;</it><sub><it>i</it></sub>(<b>x</b>). Then, <b>x </b>is classified into class <it>i </it>if <it>&#948;</it><sub><it>i</it></sub>(<b>x</b>) is at least half as large as the value of the second highest discriminant function on rank <it>r</it>; otherwise <b>x </b>is classified as "unclassified". This optimal cut-off value for the discriminant function at each taxonomic rank <it>r </it>was identified in a grid search. The discriminant function for a taxonomic class <it>i </it>is computed by:</p>
            <p>
               <display-formula id="M1">
                  <m:math name="1471-2105-10-56-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>&#948;</m:mi>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>x</m:mi>
                           </m:mstyle>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munder>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:msub>
                                       <m:mstyle mathvariant="bold" mathsize="normal">
                                          <m:mi>x</m:mi>
                                       </m:mstyle>
                                       <m:mi>j</m:mi>
                                    </m:msub>
                                    <m:mo>&#8712;</m:mo>
                                    <m:mstyle mathvariant="bold" mathsize="normal">
                                       <m:mi>r</m:mi>
                                       <m:mi>e</m:mi>
                                    </m:mstyle>
                                    <m:msub>
                                       <m:mstyle mathvariant="bold" mathsize="normal">
                                          <m:mi>f</m:mi>
                                       </m:mstyle>
                                       <m:mi>i</m:mi>
                                    </m:msub>
                                 </m:mrow>
                              </m:munder>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>K</m:mi>
                                    <m:mi>&#955;</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mstyle mathvariant="bold" mathsize="normal">
                                    <m:mi>x</m:mi>
                                 </m:mstyle>
                                 <m:mo>,</m:mo>
                                 <m:msub>
                                    <m:mstyle mathvariant="bold" mathsize="normal">
                                       <m:mi>x</m:mi>
                                    </m:mstyle>
                                    <m:mi>j</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <b>ref</b><sub><it>i </it></sub>= {<b>x</b><sub><it>j</it></sub>|<b>x</b><sub><it>j </it></sub>&#8712; <b>ref</b><sub><it>set </it></sub>and <b>x</b><sub><it>j </it></sub>stems from class <it>i</it>} is the set of all reference GFVs from class <it>i</it>. The smoother kernel <it>K</it><sub><it>&#955;</it></sub>(<b>x</b>, <b>x</b><sub><it>j</it></sub>) is based on the Gaussian density function, which decreases exponentially with the Euclidean distance from <b>x</b>:</p>
            <p>
               <display-formula id="M2">
                  <m:math name="1471-2105-10-56-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>K</m:mi>
                              <m:mi>&#955;</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>x</m:mi>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>x</m:mi>
                              </m:mstyle>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>d</m:mi>
                                          <m:mi>w</m:mi>
                                       </m:msub>
                                       <m:msup>
                                          <m:mrow>
                                             <m:mo stretchy="false">(</m:mo>
                                             <m:mstyle mathvariant="bold" mathsize="normal">
                                                <m:mi>x</m:mi>
                                             </m:mstyle>
                                             <m:mo>,</m:mo>
                                             <m:msub>
                                                <m:mstyle mathvariant="bold" mathsize="normal">
                                                   <m:mi>x</m:mi>
                                                </m:mstyle>
                                                <m:mi>j</m:mi>
                                             </m:msub>
                                             <m:mo stretchy="false">)</m:mo>
                                          </m:mrow>
                                          <m:mn>2</m:mn>
                                       </m:msup>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:mn>2</m:mn>
                                       <m:mi>&#955;</m:mi>
                                    </m:mrow>
                                 </m:mfrac>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>d</it><sub><it>w</it></sub>(<b>x</b>, <b>x</b><sub><it>j</it></sub>) is a weighted distance function as defined later in Equation (4) and <it>&#955; </it>controls the neighborhood width around <b>x </b>in the kernel function. Small values of <it>&#955; </it>result in decision boundaries with higher variance that fit the reference set closely, while large values yield smooth and stable decision boundaries that avoid overfitting and are more robust <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>.</p>
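            <p>The interplay of Equations (1) and (2) with the decision rule can be sketched in a few lines of Python. The snippet below is an illustration under stated assumptions, not the TACOA implementation: the function names are hypothetical, toy_distance is a plain Euclidean placeholder for the weighted distance of Equation (4), and the acceptance ratio is exposed as the hypothetical parameter min_ratio because the actual cut-off was obtained by grid search. The reference vectors and the values of lambda and min_ratio are arbitrary toy values.</p>
            <p>
```python
import math
from collections import defaultdict


def gaussian_kernel(distance, lam):
    """Equation (2) as printed: exp(-d^2 / (2 * lambda))."""
    return math.exp(-(distance ** 2) / (2.0 * lam))


def discriminants(query, reference, distance_fn, lam):
    """Equation (1): sum the kernel weights of all reference vectors per class."""
    delta = defaultdict(float)
    for label, ref_vector in reference:
        delta[label] += gaussian_kernel(distance_fn(query, ref_vector), lam)
    return dict(delta)


def classify(delta, min_ratio):
    """Accept the top class only if it dominates the runner-up by min_ratio."""
    ranked = sorted(delta.items(), key=lambda item: item[1], reverse=True)
    best_label, best_value = ranked[0]
    if len(ranked) == 1:
        return best_label
    runner_up_value = ranked[1][1]
    if best_value >= min_ratio * runner_up_value:
        return best_label
    return "unclassified"


def toy_distance(a, b):
    """Placeholder only: Euclidean distance, NOT the weighted distance d_w."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Toy reference set: (class label, vector) pairs at a single taxonomic rank.
reference = [("A", [0.0, 0.0]), ("A", [0.0, 1.0]), ("B", [3.0, 0.0])]
query = [0.0, 0.5]
for lam in (0.1, 10.0):
    delta = discriminants(query, reference, toy_distance, lam)
    print(lam, delta, classify(delta, min_ratio=4.0))
```
            </p>
            <p>With the small neighborhood width, the distant class B contributes almost nothing to the decision. With the large width, every reference vector receives a similar weight, the two discriminant values move closer together, and the same query may no longer pass the acceptance test.</p>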
            <p>In order to estimate how much a query GFV <b>x </b>differs from a reference GFV, the distance between the two vectors is determined. By normalizing each vector to unit length, differences in genomic vector lengths are corrected. The distance <it>d </it>between a query GFV <b>x </b>and each reference GFV <b>x</b><sub><it>j </it></sub>is computed using the dot-product between the normalized query GFV <inline-formula><m:math name="1471-2105-10-56-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mstyle mathvariant="bold" mathsize="normal"><m:mover accent="true"><m:mi>x</m:mi><m:mo>^</m:mo></m:mover></m:mstyle></m:semantics></m:math></inline-formula> and the normalized reference GFV <inline-formula><m:math name="1471-2105-10-56-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:mstyle mathvariant="bold" mathsize="normal"><m:mover accent="true"><m:mi>x</m:mi><m:mo>^</m:mo></m:mover></m:mstyle><m:mi>j</m:mi></m:msub></m:mrow></m:semantics></m:math></inline-formula>:</p>
            <p>
               <display-formula id="M3">
                  <m:math name="1471-2105-10-56-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>d</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>x</m:mi>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>x</m:mi>
                              </m:mstyle>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:mo>&lt;</m:mo>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mover accent="true">
                                 <m:mi>x</m:mi>
                                 <m:mo>^</m:mo>
                              </m:mover>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mover accent="true">
                                    <m:mi>x</m:mi>
                                    <m:mo>^</m:mo>
                                 </m:mover>
                              </m:mstyle>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo>&gt;</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemizaqMaeiikaGIaeCiEaGNaeiilaWIaeCiEaG3aaSbaaSqaaiabdQgaQbqabaGccqGGPaqkcqGH9aqpcqaIXaqmcqGHsislcqGH8aapcuWH4baEgaqcaiabcYcaSiqbhIha4zaajaWaaSbaaSqaaiabdQgaQbqabaGccqGH+aGpaaa@3F0F@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>The distance <it>d </it>was weighted to account for the imbalanced reference set used in this study, in which both majority and minority classes are present; for example, the bacteria are over-represented relative to the archaea in a proportion of 10:1.</p>
            <p>The weighted distance function is denoted as <it>d</it><sub><it>w </it></sub>and the weights are assigned using the following weighting scheme. Let <b>x</b><sub><it>j </it></sub>originate from class <it>i </it>and let <it>n</it><sub><it>i </it></sub>be the number of genomes in class <it>i</it>. Furthermore, let <it>T </it>be the number of genomes constituting the reference set. The weighted distance function <it>d</it><sub><it>w </it></sub>is given by:</p>
            <p>
               <display-formula id="M4">
                  <m:math name="1471-2105-10-56-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:msub>
                              <m:mi>d</m:mi>
                              <m:mi>w</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>x</m:mi>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>x</m:mi>
                              </m:mstyle>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mi>T</m:mi>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>n</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mfrac>
                           <m:mi>d</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>x</m:mi>
                           </m:mstyle>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>x</m:mi>
                              </m:mstyle>
                              <m:mi>j</m:mi>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemizaq2aaSbaaSqaaiabdEha3bqabaGccqGGOaakcqWH4baEcqGGSaalcqWH4baEdaWgaaWcbaGaemOAaOgabeaakiabcMcaPiabg2da9KqbaoaalaaabaGaemivaqfabaGaemOBa42aaSbaaeaacqWGPbqAaeqaaaaakiabdsgaKjabcIcaOiabhIha4jabcYcaSiabhIha4naaBaaaleaacqWGQbGAaeqaaOGaeiykaKcaaa@4470@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>This weighting scheme assigns small weights to the GFVs belonging to the majority classes and relatively larger weights to the GFVs belonging to the minority classes.</p>
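            <p>The distance <it>d </it>and its weighted variant <it>d</it><sub><it>w </it></sub>can be sketched in a few lines of Python. This is a minimal illustration, not the TACOA implementation: the function names, the toy vectors and the class counts (chosen to mirror the 10:1 proportion mentioned above) are assumptions made for the example.</p>

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def distance(x, x_j):
    """d(x, x_j) = 1 - dot product of the two normalized vectors."""
    x_hat, x_j_hat = normalize(x), normalize(x_j)
    return 1.0 - sum(a * b for a, b in zip(x_hat, x_j_hat))


def weighted_distance(x, x_j, n_i, T):
    """d_w(x, x_j) = (T / n_i) * d(x, x_j).

    n_i: number of genomes in the class of x_j
    T:   number of genomes in the whole reference set
    """
    return (T / n_i) * distance(x, x_j)


# Hypothetical toy data: 10 bacterial genomes and 1 archaeal genome (T = 11).
query = [3.0, 1.0, 0.0]
bacterial_ref = [2.0, 1.0, 1.0]
archaeal_ref = [2.0, 1.0, 1.0]

print(weighted_distance(query, bacterial_ref, n_i=10, T=11))
print(weighted_distance(query, archaeal_ref, n_i=1, T=11))
```

            <p>In this toy setting the same unweighted distance is multiplied by 11/10 for the majority class and by 11 for the minority class, which shows how the factor <it>T</it>/<it>n</it><sub><it>i </it></sub>depends on class size.</p>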
         </sec>
         <sec>
            <st>
               <p>Testing</p>
            </st>
            <p>As a proof of concept, the method was evaluated on a data set containing fragments from 373 completely sequenced genomes representing a vast majority of members from the archaeal and bacterial groups. All completely sequenced genomes available up to March 2008 were downloaded from the SEED database <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. The selected genomes represent 2 superkingdoms, 11 phyla, 21 classes, 45 orders and 61 genera. The taxonomic information for this data set was collected from the taxonomy database of the US National Center for Biotechnology Information (NCBI) <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Some of the genomes downloaded from SEED were unfinished and present as several contigs. In these cases, all contigs of a genome were joined together in arbitrary order.</p>
         </sec>
         <sec>
            <st>
               <p>Evaluation strategy</p>
            </st>
            <p>The classification accuracy of the presented method was assessed using a leave-one-out cross-validation strategy. In each round, one genome was used to generate fragments of a fixed length, and the taxonomic origin of each fragment was then predicted using the remaining 372 genomes as the reference set (Figure <figr fid="F1">1</figr>). This simulates the case in which the fragments to be classified stem from genomes that are not yet represented in the public genome databases. In a second experiment, we evaluated the classification accuracy of the method with the test genome included in the reference set, i.e. the fragments of each genome were taxonomically classified using all 373 genomes as a reference. This experiment simulates the case in which the fragments to be classified stem from genomes that are already represented in the reference set.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>A sketch of the leave-one-out cross validation strategy adopted in this study</p>
               </caption>
               <text>
                  <p><b>A sketch of the leave-one-out cross validation strategy adopted in this study</b>. A genome is selected from the data set comprising 373 genomes and fragmented subsequently. The collection of genomic fragments is regarded as the test set from which each fragment is drawn and subsequently classified. Classification of each test fragment is carried out using the remaining 372 organisms as a reference.</p>
               </text>
               <graphic file="1471-2105-10-56-1"/>
            </fig>
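            <p>The leave-one-out procedure can be sketched as follows. The sketch is illustrative only: the helper names, the use of consecutive non-overlapping fragments, and the classify callback (standing in for any fragment classifier) are assumptions and are not taken from the TACOA source code.</p>

```python
def fragment(sequence, length):
    """Cut a sequence into consecutive, non-overlapping fragments of a fixed
    length (assumption: a trailing remainder shorter than `length` is dropped)."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def leave_one_out(genomes, fragment_length, classify):
    """Run one leave-one-out round per genome.

    genomes:  dict mapping genome name -> (sequence, taxon)
    classify: callable (fragment, reference) -> predicted label, where
              `reference` is the genome dict without the held-out genome
    Returns a list of (held_out_name, true_taxon, predicted_label) tuples.
    """
    results = []
    for held_out, (sequence, taxon) in genomes.items():
        # The reference set is every genome except the one being tested.
        reference = {name: entry for name, entry in genomes.items()
                     if name != held_out}
        for frag in fragment(sequence, fragment_length):
            results.append((held_out, taxon, classify(frag, reference)))
    return results


# Hypothetical toy genomes and a dummy classifier that always returns the
# taxon of the alphabetically first reference genome.
toy_genomes = {
    "a": ("ACGTACGT", "X"),
    "b": ("GGGGCCCC", "Y"),
    "c": ("TTTTAAAA", "X"),
}


def dummy_classify(frag, reference):
    return reference[sorted(reference)[0]][1]


for row in leave_one_out(toy_genomes, 4, dummy_classify):
    print(row)
```

            <p>For the second experiment described above, the same loop would be run without removing the held-out genome from the reference.</p>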
         </sec>
         <sec>
            <st>
               <p>Parameter optimization</p>
            </st>
            <p>We investigated the oligonucleotide length parameter by testing different values of <it>l </it>(2 &#8804; <it>l </it>&#8804; 6) and selecting the length that resulted in the maximal classification accuracy. For short fragments, only small values of <it>l </it>were considered, to guarantee that all possible oligonucleotides can occur sufficiently often, i.e. 4<sup><it>l </it></sup>&lt; |<it>s</it>| in a genomic fragment <it>s </it>(see Methods). The optimal oligonucleotide length <it>l </it>was identified for each genomic fragment length at each taxonomic rank.</p>
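            <p>The length constraint can be checked directly. The short sketch below lists, for a given fragment length, the oligonucleotide lengths in the investigated range whose 4<sup><it>l </it></sup>possible oligonucleotides number fewer than the fragment length; the function name and its default bounds are illustrative assumptions.</p>

```python
def admissible_oligo_lengths(fragment_length, l_min=2, l_max=6):
    """Return every l in [l_min, l_max] for which the number of possible
    oligonucleotides, 4**l, is smaller than the fragment length."""
    return [l for l in range(l_min, l_max + 1) if fragment_length > 4 ** l]


for length in (800, 1000, 3000, 10000):
    print(length, admissible_oligo_lengths(length))
# 800 [2, 3, 4]
# 1000 [2, 3, 4]
# 3000 [2, 3, 4, 5]
# 10000 [2, 3, 4, 5, 6]
```

            <p>Under this bound alone, oligonucleotide lengths of 5 and above are excluded for fragments of 800 bp and 1 Kbp, because 4<sup>5</sup> = 1024 exceeds both fragment lengths.</p>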
            <p>Oligonucleotides of length 4 were sufficient to achieve high classification rates for genomic fragments of length 800 bp, 1 Kbp, and 3 Kbp. For genomic fragments of length 10 Kbp, 15 Kbp, and 50 Kbp, oligonucleotides of length 5 were best suited for classification. A general trend for all genomic fragment lengths was that both average specificity and average sensitivity dropped when oligonucleotides longer than 5 were analyzed. In Additional file <supplr sid="S1">1</supplr>, the oligonucleotide length-dependent classification accuracy is exemplified using sequences of length 800 bp and 50 Kbp. Conversely, the false negative rate increased when longer oligonucleotides were considered (Additional file <supplr sid="S1">1</supplr>). A detailed table summarizing average accuracy values and standard deviations for these two fragment lengths (800 bp and 50 Kbp) and for each oligonucleotide length analyzed is given in Additional file <supplr sid="S2">2</supplr>.</p>
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <text>
                   <p><b>Oligonucleotide length-dependent performance for two different genomic fragment lengths.</b> Achieved specificity (left), sensitivity (middle) and false negative rate (right) for different oligonucleotide lengths in genomic fragments of length 800 bp (a) and 50 Kbp (b). For clarity, the standard deviation is not depicted in these figures; it is given instead in Additional file <supplr sid="S2">2</supplr>.</p>
               </text>
               <file name="1471-2105-10-56-S1.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <text>
                   <p><b>Standard deviation for average accuracy and false negative rate achieved for different oligonucleotide lengths.</b> Standard deviation and average specificity, sensitivity and false negative rate are given for all oligonucleotide lengths and taxonomic ranks evaluated.</p>
               </text>
               <file name="1471-2105-10-56-S2.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>The kernel parameter <it>&#955; </it>governs the width of the local neighborhood and thus influences the local behavior of the decision boundary, making it possible to search for an optimal trade-off between a well-fitted and a more generalized classifier.</p>
            <p>A grid search (2 &#8804; <it>&#955; </it>&#8804; 1000) was employed to detect the values of <it>&#955; </it>resulting in maximal accuracy (<it>&#955;</it><sub><b>opt</b></sub>). In general, <it>&#955;</it><sub><b>opt </b></sub>is smaller at lower taxonomic ranks (Table <tblr tid="T1">1</tblr>). This observation may be explained by the drastic increase in the number of taxonomic classes at deeper ranks. If a large number of taxonomic classes occur at a deeper rank, the neighborhood considered in the classification task needs to be smaller (small <it>&#955;</it>) than for broader taxonomic ranks. If, on the other hand, a large <it>&#955; </it>is used and a large number of classes exist, the neighborhood of a query genomic vector may cover too many reference vectors from diverse taxonomic classes, with a negative impact on the classification accuracy. However, if the reference vectors from a taxonomic class are sparsely distributed around the query genomic vector, it is necessary to consider a bigger neighborhood (large <it>&#955;</it>). This may explain those cases in which a large <it>&#955;</it><sub><b>opt </b></sub>is obtained.</p>
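            <p>A grid search of this kind can be sketched generically as follows. The candidate grid and the scoring function are hypothetical: the exact grid used in this study is not reproduced here, and the toy score merely stands in for the cross-validated accuracy that would be measured for each value of <it>&#955;</it>.</p>

```python
def grid_search(candidates, evaluate):
    """Return (best_value, best_score) for the candidate that maximizes
    `evaluate`. Ties are resolved in favor of the earliest candidate."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        score = evaluate(value)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


# Hypothetical grid within the investigated range 2..1000.
lambda_grid = [2, 25, 100, 300, 500, 1000]


def toy_accuracy(lam):
    """Stand-in for a measured accuracy; it peaks at lam = 300 by construction."""
    return 1.0 / (1.0 + abs(lam - 300))


print(grid_search(lambda_grid, toy_accuracy))  # (300, 1.0)
```

            <p>In practice the scoring function would run the full evaluation for one fragment length and one taxonomic rank, so that a separate optimum is obtained for each combination.</p>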
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Optimized parameter obtained for each genomic fragment length at each taxonomic rank</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="5" ca="center">
                        <p>
                           <it>&#955;</it>
                           <sub>
                              <b>opt</b>
                           </sub>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>
                           <b>Fragment length</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>S</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>P</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>C</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>O</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>G</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>800 bp</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>1 Kbp</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>200</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>3 Kbp</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>400</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>10 Kbp</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>400</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>90</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>15 Kbp</p>
                     </c>
                     <c ca="center">
                        <p>400</p>
                     </c>
                     <c ca="center">
                        <p>300</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>200</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>50 Kbp</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>1000</p>
                     </c>
                     <c ca="center">
                        <p>400</p>
                     </c>
                     <c ca="center">
                        <p>500</p>
                     </c>
                     <c ca="center">
                        <p>80</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Optimal lambda parameter (<it>&#955;</it><sub><b>opt</b></sub>) is shown for each genomic fragment length at each taxonomic rank: Superkingdom (S), Phylum (P), Class (C), Order (O), and Genus (G).</p>
               </tblfn>
            </tbl>
            <p>During the optimization procedure, optimal parameters were chosen based on accuracy values averaged over all taxonomic classes at each taxonomic rank. The chosen parameters may therefore be suboptimal for some taxonomic classes at a given rank. As a consequence, the accuracy for some taxonomic classes can drop dramatically; this situation can be seen as "gaps" in Figure <figr fid="F2">2</figr>.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Classification accuracy achieved for genomic fragments of different lengths</p>
               </caption>
               <text>
                  <p><b>Classification accuracy achieved for genomic fragments of different lengths</b>. Bars depict detailed specificity and average values for specificity (Sp.), sensitivity (Sn.) and false negative rate (FNr.) for each fragment length on different taxonomic ranks. Each color represents a genomic fragment length.</p>
               </text>
               <graphic file="1471-2105-10-56-2"/>
            </fig>
            <p>From a practical perspective we regarded it to be more valuable to produce a low number of highly reliable predictions rather than a large number of predictions with low reliability. Therefore, in this study we favored parameters that produce a high specificity rather than a high sensitivity.</p>
         </sec>
         <sec>
            <st>
               <p>Classification accuracy for genomic fragments of variable length</p>
            </st>
            <p>The classification accuracy of TACOA was evaluated on genomic fragments of lengths ranging from 800 bp to 50 Kbp. A total of 11,730,382 genomic fragments from 373 different species were analyzed, comprising &#8776;42 Mb of sequence data. The classification accuracy for all evaluated genomic fragment lengths, taxonomic ranks, and taxonomic classes is given in detail in Figure <figr fid="F2">2</figr>.</p>
            <p>A high proportion of contigs (genomic fragments of length 3 Kbp, 10 Kbp, 15 Kbp, and 50 Kbp) was correctly classified, with an average sensitivity between 76% at rank superkingdom and 39% at rank genus (Figure <figr fid="F3">3</figr>). At the same time, fewer than 10% of contigs were misclassified (false negative rate) at all taxonomic ranks. For the remaining contigs the taxonomic origin could not be inferred, and hence these were assigned to the "unclassified" class. Overall, reliable predictions were obtained, with an average specificity ranging from 89% at superkingdom to 71% at rank genus. For the longest analyzed contig length (50 Kbp), TACOA achieved an average sensitivity of 82% at superkingdom and 46% at genus, and a specificity of 93% (superkingdom) and 77% (genus) (Figure <figr fid="F2">2</figr>, Additional file <supplr sid="S3">3</supplr>). A high classification accuracy was also obtained for shorter contigs. For example, 74% of the contigs of length 3 Kbp were correctly classified at rank superkingdom and 31% at rank genus (Figure <figr fid="F2">2</figr>, Additional file <supplr sid="S3">3</supplr>). The specificity for contigs of length 3 Kbp reached values between 74% (superkingdom) and 31% (genus).</p>
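            <p>The three accuracy measures can be illustrated for a single taxonomic class with the sketch below. The formal definitions are those given in the Methods; the formulas used here are assumptions inferred from the usage in this section: sensitivity is taken as the fraction of a class's fragments that are correctly classified, the false negative rate as the fraction assigned to a wrong class, and specificity as the fraction of predictions for the class that are correct. The label names and toy data are hypothetical.</p>

```python
UNCLASSIFIED = "unclassified"


def class_metrics(true_labels, predicted_labels, target):
    """Return (sensitivity, specificity, false_negative_rate) for one class.

    Assumed definitions (see the lead-in):
      sensitivity         = TP / number of fragments truly in `target`
      specificity         = TP / (TP + FP)
      false negative rate = fragments of `target` assigned to a wrong
                            class / number of fragments truly in `target`
    Fragments labeled "unclassified" count toward neither TP nor the
    false negative rate.
    """
    pairs = list(zip(true_labels, predicted_labels))
    total = sum(1 for t, _ in pairs if t == target)
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs
             if t == target and p not in (target, UNCLASSIFIED))
    sensitivity = tp / total if total else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    false_negative_rate = fn / total if total else 0.0
    return sensitivity, specificity, false_negative_rate


# Hypothetical toy predictions for five fragments.
truth = ["B", "B", "B", "B", "A"]
predicted = ["B", "B", "A", UNCLASSIFIED, "B"]
print(class_metrics(truth, predicted, "B"))
```

            <p>With these assumed definitions, sensitivity, false negative rate and the unclassified fraction of a class sum to one, which matches the three-way split described above.</p>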
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <text>
                  <p><b>Fragment-length and rank dependent performance.</b> Sensitivity (left) and specificity (right) achieved by TACOA for each genomic fragment length and taxonomic rank evaluated. Single read lengths are simulated by fragments 800 bp and 1 Kbp long and contigs by fragment lengths between 3 Kbp and 50 Kbp.</p>
               </text>
               <file name="1471-2105-10-56-S3.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Overall performance for reads and contigs for each taxonomic rank</p>
               </caption>
               <text>
                  <p><b>Overall performance for reads and contigs for each taxonomic rank</b>. Average sensitivity (Sn.), specificity (Sp.), and false negative rate (FNr.) achieved for reads and contigs at each taxonomic rank.</p>
               </text>
               <graphic file="1471-2105-10-56-3"/>
            </fig>
            <p>In this evaluation, single reads were represented by genomic fragments of length 800 bp &#8211; 1 Kbp. TACOA is capable of accurately predicting the taxonomic origin of single reads up to the rank of class, despite the limited information contained in these short sequences. A high proportion of reads was correctly classified. For reads of length 800 bp, the average sensitivity ranged from 67% at superkingdom to 16% at rank class, and for reads of length 1 Kbp it ranged from 71% to 22%. Furthermore, on average only between 9% (superkingdom) and 5% (class) of reads were misclassified. Overall, reliable predictions were obtained, with an average specificity ranging from 73% (superkingdom) to 62% (class) for 800 bp reads and from 73% to 64% for reads of length 1 Kbp. In light of the limited information contained in fragments of length 800 bp &#8211; 1 Kbp and the complexity of the classification problem (e.g. 62 classes on rank genus), TACOA also achieves a surprisingly good performance for single reads at ranks order and genus (Additional file <supplr sid="S3">3</supplr>).</p>
            <p>However, in practice it is not recommended to interpret classification results of single reads on these ranks because only a small number of fragments may be represented in the currently available sequenced genomes. In real metagenomic data sets, already sequenced organisms may be contained in the studied sample. Therefore, the classification accuracy of TACOA was also assessed for fragments stemming from organisms included in the reference set (Additional file <supplr sid="S4">4</supplr>). As expected, having the source organisms of the classified fragments included in the reference set has a markedly positive impact on the accuracy at all taxonomic ranks. The sensitivity increased by up to 30%. Furthermore, the specificity increased substantially while the false negative rate was reduced (Additional file <supplr sid="S4">4</supplr>).</p>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <text>
                   <p><b>Classification accuracy achieved using two different reference sets. </b>Each colored bar depicts the accuracy achieved by TACOA with one of two different reference sets. The label "Taxonomic organism of test fragment absent from reference set" refers to the case in which the test fragment is classified using a reference set that does not contain the organism from which the test fragment originates.</p>
               </text>
               <file name="1471-2105-10-56-S4.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>As a general trend, the accuracy improved when longer genomic fragments were classified (Figure <figr fid="F2">2</figr>, Additional file <supplr sid="S3">3</supplr>). For example, on rank superkingdom the sensitivity increased from 67% for 800 bp reads to 82% for 50 Kbp contigs, and at rank genus from 5% to 46%. Conversely, the accuracy decreased as deeper taxonomic ranks were examined (Figure <figr fid="F3">3</figr>, Additional file <supplr sid="S3">3</supplr>, Additional file <supplr sid="S4">4</supplr>). In general, it is easy to predict classes that are well represented in the reference set, while detecting underrepresented taxonomic groups is more challenging (Figure <figr fid="F2">2</figr>). TACOA is capable of detecting a remarkably high number of different taxonomic classes if they are present in a studied sample. For example, for contigs of length 3 Kbp, TACOA achieved a sensitivity above 20% for all 11 phyla, for 18 of the 21 classes, for 30 of the 45 orders, and for 33 of the 61 genera represented in our test set (Additional file <supplr sid="S5">5</supplr> and Additional file <supplr sid="S6">6</supplr>).</p>
            <suppl id="S5">
               <title>
                  <p>Additional file 5</p>
               </title>
               <text>
                  <p><b>Intervals for specificity (left) and sensitivity (right) of predicted taxonomic classes for reads.</b> Classification accuracy intervals for genomic fragments of length 800 bp (top) and 1 Kbp (bottom).</p>
               </text>
               <file name="1471-2105-10-56-S5.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S6">
               <title>
                  <p>Additional file 6</p>
               </title>
               <text>
                  <p><b>Intervals for specificity (left) and sensitivity (right) of predicted taxonomic classes for contigs.</b> Classification accuracy intervals for genomic fragments of length 3 Kbp, 10 Kbp, 15 Kbp, and 50 Kbp (from top to bottom).</p>
               </text>
               <file name="1471-2105-10-56-S6.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Assessing the classification accuracy of TACOA and PhyloPythia for genomic fragments of variable length</p>
            </st>
            <p>We compared the classification accuracy (sensitivity, specificity and false negative rate) of our proposed kernelized <it>k</it>-NN classification method with PhyloPythia, which employs a hierarchical collection of SVMs for the taxonomic classification of environmental fragments. The set of completely sequenced genomes used for the comparison was selected as follows: at rank class, two different genomes were randomly chosen from each taxonomic class, guaranteeing that the data set used in the comparison is unbiased. Moreover, the genomes were randomly selected from the universe of all recently published genomes, ensuring that the test set is not contained in the training set of PhyloPythia or the reference set of TACOA. The selected test set closely resembles the situation in which the classifiers need to predict the taxonomic origin of organisms that have not yet been sequenced.</p>
            <p>In general, TACOA and PhyloPythia achieved comparable classification accuracies, but TACOA has a slightly improved performance for the classification of short DNA fragments. For the classification of reads of length 800 bp and 1 Kbp, TACOA has a higher sensitivity, while both tools achieve comparable false negative rate and specificity values (Figure <figr fid="F4">4</figr>). Remarkably, on ranks order and genus TACOA is still able to correctly classify between 3% and 17% of short fragments (sensitivity), while PhyloPythia cannot infer the taxonomic origin of any of the fragments and thus has an average sensitivity of 0%. For longer contigs (DNA fragments of length 10 Kbp), PhyloPythia is more sensitive on higher taxonomic ranks (superkingdom, phylum and class). In contrast, TACOA produces fewer misclassifications (false negative rate), making its predictions more reliable. On lower taxonomic ranks (genus and order), TACOA is able to correctly infer the taxonomic origin of about 10% to 17% of all contigs, while PhyloPythia has a sensitivity of 0% for all taxonomic groups at these ranks.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Classification accuracy obtained for TACOA and PhyloPythia</p>
               </caption>
               <text>
                  <p><b>Classification accuracy obtained for TACOA and PhyloPythia</b>. Sensitivity (top), specificity (middle) and false negative rate (bottom) achieved by TACOA and PhyloPythia for three different genomic fragment lengths and taxonomic ranks evaluated. Single read lengths are represented by fragments of length 800 bp and 1 Kbp and contigs by 10 Kbp long fragments. The accuracy achieved is depicted using green bars for TACOA and blue bars for PhyloPythia. The sensitivity and specificity charts are scaled between 0&#8211;100% and the false negative rate is scaled between 0&#8211;30%.</p>
               </text>
               <graphic file="1471-2105-10-56-4"/>
            </fig>
            <p>A closer analysis of the classification of short DNA fragments, across ranks superkingdom to class, reveals that TACOA achieved sensitivity values of 71% to 3% for 800 bp fragments and 76% to 11% for 1 Kbp fragments. On the other hand, at ranks superkingdom, phylum and class PhyloPythia obtained a slightly lower sensitivity of 66% to 6% for 800 bp fragments and 75% to 9% for 1 Kbp fragments. At deeper ranks order and genus, TACOA is able to correctly classify between 3% and 7% of all short fragments (sensitivity), while only between 1% and 2.43% of fragments are misclassified (false negative rate). In contrast, PhyloPythia was not able to predict any taxonomic class, resulting in a sensitivity of 0% for all groups on these two ranks. Overall, for short fragments TACOA is more sensitive at almost all taxonomic ranks, in particular at ranks order and genus. The only exception is at rank class, at which PhyloPythia is more sensitive for the classification of 800 bp fragments. At the same time, for the classification of short fragments TACOA has a slightly lower false negative rate for almost all taxonomic ranks. The only exception is rank phylum, at which PhyloPythia has a lower false negative rate for 800 bp fragments. For the classification of contigs of length 10 Kbp, TACOA achieved a sensitivity between 73% and 30% at ranks superkingdom to class, while PhyloPythia correctly classified between 82% and 47%. According to these results PhyloPythia was between 9% and 17% more sensitive than TACOA. However, for the same contig length and ranks, TACOA is between 10% and 9% more specific than PhyloPythia. In addition, a high percentage of misclassifications was also observed for PhyloPythia (18.64% on average) in contrast to that achieved by TACOA (4.30% on average). 
At lower taxonomic ranks, TACOA achieved average sensitivity values between 17% (order) and 10% (genus) for the classification of 10 Kbp contigs, while PhyloPythia was not able to predict any taxonomic class for these long contigs, thus obtaining a sensitivity of 0% (Figure <figr fid="F4">4</figr>). Although PhyloPythia was not able to make predictions for ranks order and genus, a marginal misclassification rate was observed (0.14% at rank order and 0.10% at rank genus) for a fragment length of 10 Kbp. Detailed sensitivity, specificity and false negative rate values for all taxonomic ranks and evaluated lengths are given in Additional file <supplr sid="S7">7</supplr>, Additional file <supplr sid="S8">8</supplr> and Additional file <supplr sid="S9">9</supplr>.</p>
            <suppl id="S7">
               <title>
                  <p>Additional file 7</p>
               </title>
               <text>
                  <p><b>Detailed accuracy obtained for genomic fragments of length 800 bp using TACOA and PhyloPythia classifiers.</b> At each taxonomic rank, the classification accuracy (specificity and sensitivity) achieved by two different intrinsic classifiers, TACOA and PhyloPythia, is given. The symbol (-) refers to the cases where the respective value cannot be mathematically defined.</p>
               </text>
               <file name="1471-2105-10-56-S7.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S8">
               <title>
                  <p>Additional file 8</p>
               </title>
               <text>
                  <p><b>Detailed accuracy obtained for genomic fragments of length 1 Kbp using TACOA and PhyloPythia classifiers.</b> At each taxonomic rank, the classification accuracy (specificity and sensitivity) achieved by two different intrinsic classifiers, TACOA and PhyloPythia, is given. The symbol (-) refers to the cases where the respective value cannot be mathematically defined.</p>
               </text>
               <file name="1471-2105-10-56-S8.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <suppl id="S9">
               <title>
                  <p>Additional file 9</p>
               </title>
               <text>
                  <p><b>Detailed accuracy obtained for genomic fragments of length 10 Kbp using TACOA and PhyloPythia classifiers.</b> At each taxonomic rank, the classification accuracy (specificity and sensitivity) achieved by two different intrinsic classifiers, TACOA and PhyloPythia, is given. The symbol (-) refers to the cases where the respective value cannot be mathematically defined.</p>
               </text>
               <file name="1471-2105-10-56-S9.pdf">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Influence of horizontal gene transfer on the classification accuracy of an intrinsic-based classifier</p>
            </st>
            <p>The classification accuracy of methods using composition-based features might be influenced by heterogeneous nucleotide composition within the DNA sequence of the analyzed genomic fragment. Although differences in the nucleotide composition of DNA sequences can be linked to a number of genomic attributes, including codon usage, DNA base-stacking energy, DNA structural conformation, strand asymmetry and even relic features of the primary genetic information, horizontal gene transfer (HGT) events are one of the most common causes <abbrgrp><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>. The work of Brown <it>et al</it>. also suggests that despite the rapid changes in the nucleotide composition of recently transferred DNA chunks, the phylogenetic signal from the donor can still be detected if the HGT event is recent rather than ancient <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. Since HGT events have been gaining increasing attention lately <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, we investigated their influence on the accuracy of the intrinsic-based classifier TACOA.</p>
            <p>One of the findings of this work is that tetranucleotides were best suited to analyze genomic fragments &#8804; 3 Kbp. However, it has also been reported that tetranucleotide frequencies are a good measure to detect horizontally transferred regions <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. Therefore, any classifier that predicts the taxonomic origin of genomic fragments from tetranucleotide features is prone to "wrongly" assign a genomic fragment obtained via HGT to the taxonomic class of its donor. To explore the influence of HGT events on the classification accuracy of TACOA, we selected fragments of length 1 Kbp from two genomes (one archaeal and one bacterial). Several studies <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp> have reported acquisition of large stretches of DNA via HGT events for <it>Thermoplasma acidophilum </it>(archaea) and for <it>Thermotoga maritima </it>(bacteria).</p>
            <p>In particular, the archaeal genome of <it>Thermoplasma acidophilum </it>has been reported to have acquired &#8776;12% of its content via HGT. The main donors seem to be bacterial organisms, but some archaeal donors have also been detected <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr></abbrgrp>. It has been suggested that <it>T. acidophilum </it>has received genes via HGT from <it>Sulfolobus solfataricus</it>, a distantly related crenarchaeon living in the same ecological niche <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>. The sensitivity achieved by TACOA for <it>T. acidophilum </it>was 43% for reads 800 bp long and 51% for reads of length 1 Kbp.</p>
            <p>In order to evaluate the taxonomic distribution of misclassifications for <it>T. acidophilum </it>genomic fragments, we fragmented its genome into pieces of length 1 Kbp and predicted their taxonomic origin. For the 1,564 fragments analyzed, we found that 1% (16 of 1,564) were misclassified into the order sulfolobales, another 3% (47 of 1,564) into other members of the euryarchaeota group, 7% (110 of 1,564) into a variety of members of the bacterial group, and 38% (601 of 1,564) could not be classified (Figure <figr fid="F5">5</figr>). Of the genomic fragments that were "erroneously" classified, the largest fraction (7%) was placed into the bacterial group. The taxonomic distribution of "misclassifications" made by TACOA for <it>T. acidophilum </it>is in close agreement with previous studies <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr></abbrgrp>. Hence, the low number of correctly classified fragments obtained for <it>T. acidophilum </it>at rank genus may be partially explained by laterally transferred DNA from other species.</p>
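            <p>The proportions reported above can be re-derived from the raw counts. The following Python sketch is only a sanity check: the dictionary labels are illustrative, and it assumes that the four listed categories are disjoint and that every remaining fragment was assigned correctly, which is not stated explicitly in the text.</p>

```python
# Raw counts reported for the 1 Kbp fragments of T. acidophilum.
total = 1564
assigned = {
    "Sulfolobales": 16,
    "other Euryarchaeota": 47,
    "Bacteria": 110,
    "unclassified": 601,
}

# Percentage of all fragments in each category, rounded to whole percent.
percent = {label: round(100 * n / total) for label, n in assigned.items()}

# ASSUMPTION: everything not listed above counts as correctly classified.
remaining = total - sum(assigned.values())
remaining_percent = round(100 * remaining / total)

for label, value in percent.items():
    print(f"{label}: {value}%")
print(f"remaining (assumed correct): {remaining} fragments, {remaining_percent}%")
```

            <p>Under these assumptions the remainder amounts to 790 fragments, about 51% of all fragments, which agrees with the sensitivity quoted above for 1 Kbp reads of <it>T. acidophilum</it>.</p>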
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Distribution of taxonomic assignments for Thermoplasma acidophilum</p>
               </caption>
               <text>
                  <p><b>Distribution of taxonomic assignments for Thermoplasma acidophilum</b>. Proportions of genomic fragments originating from the <it>T. acidophilum </it>genome that are misclassified into other taxonomic groups.</p>
               </text>
               <graphic file="1471-2105-10-56-5"/>
            </fig>
            <p>We also explored the bacterial genome of <it>Thermotoga maritima</it>, which is another organism with a high number of candidate genes that have presumably been acquired from archaea via HGT <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. A total of 1,860 genomic fragments of length 1 Kbp each were classified using TACOA and analyzed (Additional file <supplr sid="S8">8</supplr>). A high number of misclassified genomic fragments were "wrongly" assigned to the archaeal group (91 of 1,860), a small fraction (27 of 1,860) was erroneously assigned to the sulfolobus group, and 27% (503 of 1,860) could not be classified. In contrast to <it>T. acidophilum</it>, the genome of <it>T. maritima </it>seems to be a recipient of DNA originating mainly from archaeal species, as suggested by other authors <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>. These two case studies strongly suggest that horizontally transferred stretches of DNA can affect the accuracy of a classifier that uses composition-based features to infer the taxonomic origin of genomic fragments. A possible explanation for this observation is that the nucleotide composition of transferred DNA chunks still carries phylogenetic signals from the donor genome after the HGT event has occurred, as suggested by Brown <it>et al</it>. <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion and conclusion</p>
         </st>
         <p>Our novel strategy named TACOA can accurately predict the taxonomic origin of genomic fragments from metagenomic data sets by combining the advantages of the <it>k</it>-NN approach with a smoothing kernel function. The reference set used by our proposed method can be updated by simply adding the Genomic Feature Vectors (GFVs) of new genomes, without any retraining. Our standalone tool TACOA can also be easily installed and run on a desktop computer, allowing researchers to analyze their metagenomic sequence data locally or to integrate the tool into their pipelines.</p>
         <p>Analogous to PhyloPythia, researchers can easily incorporate sample-specific models from particular organisms into the framework of TACOA. The use of sample-specific models can greatly support the identification of organisms of special interest. Such models can be incorporated using the following approach: genomic fragments carrying phylogenetic marker genes (such as rRNA genes) or fragments with high similarity to reference sequences of known origin (identified using a BLAST search) can be taxonomically annotated in a pre-processing step. Subsequently, these annotated fragments can be added to the reference set of TACOA. This can be easily done with the "addReferenceGenome" program provided by TACOA. The use of sample-specific models will improve the accuracy of the classifier for those species that have a reference sequence in public databases (i.e. because the test set is contained in the reference set). In this work, we demonstrated that having the test set in the reference set can improve sensitivity and specificity by up to 30%, while at the same time a decline in the false negative rate is observed (Additional file <supplr sid="S4">4</supplr>).</p>
         <p>As a whole, we evaluated the classification accuracy at five different taxonomic ranks: superkingdom, phylum, class, order and genus. TACOA can correctly classify genomic fragments as short as 800 bp up to rank class. Our proposed method can be used to predict the taxonomic origin of genomic fragments obtained with any sequencing technology producing fragments &#8805; 800 bp. Our strategy also produced reliable predictions for genomic fragments originating from taxonomic groups that are absent from the reference set (simulating fragments stemming from genomes not yet sequenced). On average and over all taxonomic ranks, 77% of these fragments were correctly classified as "unknown".</p>
         <p>TACOA compares well to PhyloPythia, currently the most sophisticated taxonomic classifier for environmental fragments. In terms of percentage of correctly classified fragments (sensitivity), TACOA slightly outperforms PhyloPythia for reads of length 800 bp and 1 Kbp at all taxonomic ranks evaluated, except for reads of 800 bp at rank class. However, the very low false negative rate (0.16%) and the high specificity (86%) of TACOA make the accuracy for reads of length 800 bp (at rank class) comparable to that obtained by PhyloPythia. Compared to TACOA, the overall reduced sensitivity obtained by PhyloPythia (evident for the analyzed read lengths) is partially due to the absence of the phyla Chloroflexi and Thermotogae from its training set. This example illustrates the positive effect of an updated training or reference set on the prediction of known taxonomic classes.</p>
         <p>For contigs of length 10 Kbp, TACOA achieved lower sensitivity, lower false negative rate and higher specificity values than PhyloPythia. Although PhyloPythia achieves higher sensitivity values for contigs of length 10 Kbp, the overall performance is comparable for both classifiers at ranks superkingdom, phylum and class.</p>
         <p>At deeper taxonomic ranks (order and genus), for all evaluated lengths TACOA was still able to provide correct classifications for several taxonomic classes (average sensitivity of about 7%) while PhyloPythia failed to make any taxonomic assignments (sensitivity of 0%). With an average sensitivity of 17% (order) and 10% (genus) and an average false negative rate of 1.45% (order) and 2.29% (genus), TACOA can provide a more detailed view of the taxonomic composition of an environmental sample. Note that in practice it is not recommended to draw conclusions at such deep ranks for reads &#8804; 1 Kbp because only a small number of fragments may be represented in the currently available sequenced genomes.</p>
         <p>An interesting observation made during this work was that the classification of genomic fragments is possible using only GFVs computed from completely sequenced genomes rather than computing the vectors on fragments from genomes. Similar observations have already been made by Abe <it>et al</it>. in 2005 and 2006 and more recently by McHardy <it>et al</it>. in 2007, where the developed classifiers were trained with genomic fragments longer than the ones being tested. Here we demonstrated that even complete genomes can be used as reference to classify environmental genomic DNA fragments.</p>
         <p>This study supports the findings that frequencies of short oligonucleotides (i.e. tetra- and penta-oligonucleotides) are best suited to capture taxon-specific differences among prokaryotic genomes <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B16">16</abbr><abbr bid="B20">20</abbr></abbrgrp>. Moreover, our parameter search analysis strongly suggests that tetra- or penta-oligonucleotide frequencies are optimal features for TACOA to classify environmental genomic fragments as short as 800 bp. This observation is in accordance with that reported by Bohlin <it>et al</it>. <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, who proposed that little additional information about phylogenetic relationships is gained from oligonucleotide sizes larger than hexa-nucleotides.</p>
         <p>We showed that recent HGT events can affect the accuracy of a composition-based classifier. The correct classification of horizontally transferred regions into their "current" taxon is difficult if these still carry a strong phylogenetic signal from the donor genome. This was illustrated by classifying fragments of length 1 Kbp from the archaeon <it>T. acidophilum </it>and the bacterium <it>T. maritima</it>. Notably, HGT is not the only phenomenon that causes variations in the oligonucleotide frequencies within genomes and hence affects the classification performance.</p>
         <p>TACOA combines the ability to predict the taxonomic origin of genomic fragments with high accuracy with the advantage of being a tool that can easily be installed and used on a desktop computer, removing the dependencies and limitations that web server services may bring. Altogether, our results strongly suggest that TACOA offers great potential to assist in the exploration of the taxonomic composition of metagenomic data sets.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Computation of genomic feature vectors (GFV) using the oligonucleotide frequency deviation</p>
            </st>
            <p>In the following, the computation of GFVs used by the TACOA classifier is described in detail. Computation of the GFVs is performed for each genome in the reference set and for each read and contig to be classified.</p>
            <p>An oligonucleotide <it>o </it>is defined as a string over the alphabet &#931; = {<it>a</it>, <it>t</it>, <it>c</it>, <it>g</it>}. The total number of possible oligonucleotides of length <it>l </it>is given by 4<sup><it>l</it></sup>, e.g. for <it>l </it>= 3 oligonucleotides can take the form of <it>o</it><sup>[1] </sup>= <it>aaa</it>, <it>o</it><sup>[2] </sup>= <it>aat</it>, ..., <it>o</it><sup>[64] </sup>= <it>ggg</it>. To build a GFV for a genomic fragment, for each oligonucleotide the oligonucleotide deviation score is computed as the ratio between the observed oligonucleotide frequency in the fragment and the expected oligonucleotide frequency in that fragment given its GC-content. The GC-content has a profound impact on the sequence composition of genomes but a low phylogenetic signal. It has been shown that closely related organisms coming from different environments may show profound differences in GC-content <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>.</p>
            <p>More formally, given a genomic fragment <it>s</it>, for each oligonucleotide <it>o</it><sup>[<it>y</it>]</sup>(<it>y </it>= 1, 2, 3, ..., 4<sup><it>l</it></sup>) we count the number of occurrences of <it>o</it><sup>[<it>y</it>] </sup>in <it>s</it>. The counting of the oligonucleotide frequencies is conducted in a sliding window approach with step size of 1 and window size <it>l</it>. This counting is carried out on both the forward and the reverse DNA strand.</p>
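            <p>The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code shipped with TACOA: the function names are hypothetical, the reverse strand is taken to be the reverse complement of the fragment, and windows containing characters outside the alphabet (for example <it>n</it>) are skipped.</p>

```python
from collections import Counter
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_oligonucleotides(seq, l):
    """Count all 4**l words of length l with window size l and step size 1.

    Counts from the forward strand and from the reverse complement are summed.
    """
    seq = seq.lower()
    # Enumerate words in the order aaa, aat, ..., ggg used in the text.
    counts = Counter({"".join(word): 0 for word in product("atcg", repeat=l)})
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - l + 1):
            word = strand[start:start + l]
            if word in counts:  # ASSUMPTION: ignore windows with ambiguous bases
                counts[word] += 1
    return counts


counts = count_oligonucleotides("aacg", 2)
print(counts["cg"], sum(counts.values()))
```

            <p>Each strand of a fragment of length |<it>s</it>| contributes |<it>s</it>| - (<it>l </it>- 1) windows, so the toy fragment <it>aacg </it>with <it>l </it>= 2 yields three windows per strand.</p>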
            <p>In order to more efficiently recover the phylogenetic signal contained in the oligonucleotide frequency deviation, we correct for biases introduced by the GC-content of the genomic fragments. The expected frequency for a certain oligonucleotide <it>o </it>in a genomic fragment <it>s </it>can be estimated by:</p>
            <p>
               <display-formula id="M5">
                  <m:math name="1471-2105-10-56-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>E</m:mi>
                           <m:mo stretchy="false">[</m:mo>
                           <m:mi>o</m:mi>
                           <m:mo stretchy="false">]</m:mo>
                           <m:mo>&#8776;</m:mo>
                           <m:mo>|</m:mo>
                           <m:mi>s</m:mi>
                           <m:mo>|</m:mo>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8719;</m:mo>
                                 <m:mrow>
                                    <m:mi>q</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mrow>
                                    <m:mo>|</m:mo>
                                    <m:mi>o</m:mi>
                                    <m:mo>|</m:mo>
                                 </m:mrow>
                              </m:munderover>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mo stretchy="false">(</m:mo>
                                 <m:msub>
                                    <m:mi>o</m:mi>
                                    <m:mi>q</m:mi>
                                 </m:msub>
                                 <m:mo stretchy="false">)</m:mo>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyrauKaei4waSLaem4Ba8Maeiyxa0LaeyisISRaeiiFaWNaem4CamNaeiiFaW3aaebCaeaacqWGWbaCcqGGOaakcqWGVbWBdaWgaaWcbaGaemyCaehabeaakiabcMcaPaWcbaGaemyCaeNaeyypa0JaeGymaedabaGaeiiFaWNaem4Ba8MaeiiFaWhaniabg+Givdaaaa@4759@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>o</it><sub><it>q </it></sub>is the nucleotide at position q of <it>o </it>and <it>p</it>(<it>o</it><sub><it>q</it></sub>) defines the probability to observe <it>o</it><sub><it>q </it></sub>in the analyzed genomic fragment, given its GC-content. The length of a genomic fragment is defined as |<it>s</it>| and |<it>o</it>| is the length of an oligonucleotide. Let <it>O</it>[<it>o</it>] be the observed occurrence of oligonucleotide <it>o </it>in the analyzed genomic fragment, then <it>p</it>(<it>o</it><sub><it>q</it></sub>) is estimated by <inline-formula><m:math name="1471-2105-10-56-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>p</m:mi><m:mo stretchy="false">(</m:mo><m:msub><m:mi>o</m:mi><m:mi>q</m:mi></m:msub><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mfrac><m:mrow><m:mi>O</m:mi><m:mo stretchy="false">[</m:mo><m:mi>o</m:mi><m:mo stretchy="false">]</m:mo></m:mrow><m:mrow><m:mo>|</m:mo><m:mi>s</m:mi><m:mo>|</m:mo><m:mo>&#8722;</m:mo><m:mo stretchy="false">(</m:mo><m:mi>l</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn><m:mo stretchy="false">)</m:mo></m:mrow></m:mfrac></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiCaaNaeiikaGIaem4Ba82aaSbaaSqaaiabdghaXbqabaGccqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabd+eapjabcUfaBjabd+gaVjabc2faDbqaaiabcYha8jabdohaZjabcYha8jabgkHiTiabcIcaOiabdYgaSjabgkHiTiabigdaXiabcMcaPaaaaaa@42F6@</m:annotation></m:semantics></m:math></inline-formula>. For each oligonucleotide <it>o</it>, a deviation score <it>g</it>(<it>o</it>) is computed in a given genomic fragment, which is normalized by the GC-content. The deviation score <it>g</it>(<it>o</it>) distinguishes under- from over-represented oligonucleotides in a genomic fragment. The deviation score <it>g</it>(<it>o</it>) is given by:</p>
            <p>
               <display-formula id="M6">
                  <m:math name="1471-2105-10-56-i9" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>g</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>o</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mrow>
                              <m:mo>{</m:mo>
                              <m:mrow>
                                 <m:mtable>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mn>0</m:mn>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mtext>if</m:mtext>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>O</m:mi>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mi>o</m:mi>
                                             <m:mo stretchy="false">]</m:mo>
                                             <m:mo>=</m:mo>
                                             <m:mn>0</m:mn>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:mi>O</m:mi>
                                                   <m:mo stretchy="false">[</m:mo>
                                                   <m:mi>o</m:mi>
                                                   <m:mo stretchy="false">]</m:mo>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>E</m:mi>
                                                   <m:mo stretchy="false">[</m:mo>
                                                   <m:mi>o</m:mi>
                                                   <m:mo stretchy="false">]</m:mo>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mtext>if</m:mtext>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>O</m:mi>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mi>o</m:mi>
                                             <m:mo stretchy="false">]</m:mo>
                                             <m:mo>></m:mo>
                                             <m:mi>E</m:mi>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mi>o</m:mi>
                                             <m:mo stretchy="false">]</m:mo>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                    <m:mtr>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mo>&#8722;</m:mo>
                                             <m:mfrac>
                                                <m:mrow>
                                                   <m:mi>E</m:mi>
                                                   <m:mo stretchy="false">[</m:mo>
                                                   <m:mi>o</m:mi>
                                                   <m:mo stretchy="false">]</m:mo>
                                                </m:mrow>
                                                <m:mrow>
                                                   <m:mi>O</m:mi>
                                                   <m:mo stretchy="false">[</m:mo>
                                                   <m:mi>o</m:mi>
                                                   <m:mo stretchy="false">]</m:mo>
                                                </m:mrow>
                                             </m:mfrac>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mtext>if</m:mtext>
                                          </m:mrow>
                                       </m:mtd>
                                       <m:mtd>
                                          <m:mrow>
                                             <m:mi>O</m:mi>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mi>o</m:mi>
                                             <m:mo stretchy="false">]</m:mo>
                                             <m:mo>&#8804;</m:mo>
                                             <m:mi>E</m:mi>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mi>o</m:mi>
                                             <m:mo stretchy="false">]</m:mo>
                                          </m:mrow>
                                       </m:mtd>
                                    </m:mtr>
                                 </m:mtable>
                              </m:mrow>
                           </m:mrow>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4zaCMaeiikaGIaem4Ba8MaeiykaKIaeyypa0ZaaiqaaeaafaqabeWadaaabaGaeGimaadabaGaeeyAaKMaeeOzaygabaGaem4ta8Kaei4waSLaem4Ba8Maeiyxa0Laeyypa0JaeGimaadajuaGbaWaaSaaaeaacqWGpbWtcqGGBbWwcqWGVbWBcqGGDbqxaeaacqWGfbqrcqGGBbWwcqWGVbWBcqGGDbqxaaaakeaacqqGPbqAcqqGMbGzaeaacqWGpbWtcqGGBbWwcqWGVbWBcqGGDbqxcqGH+aGpcqWGfbqrcqGGBbWwcqWGVbWBcqGGDbqxaeaacqGHsisljuaGdaWcaaqaaiabdweafjabcUfaBjabd+gaVjabc2faDbqaaiabd+eapjabcUfaBjabd+gaVjabc2faDbaaaOqaaiabbMgaPjabbAgaMbqaaiabd+eapjabcUfaBjabd+gaVjabc2faDjabgsMiJkabdweafjabcUfaBjabd+gaVjabc2faDbaaaiaawUhaaaaa@6FD8@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>The computed <it>g</it>(<it>o</it>) for each possible <it>o</it><sup>[<it>y</it>] </sup>of length <it>l </it>in a given genomic fragment is summarized in a GFV <b>x </b>(Equation 7). This approach is also referred to as the vector representation model <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>.</p>
            <p>
               <display-formula id="M7">
                  <m:math name="1471-2105-10-56-i10" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>x</m:mi>
                           </m:mstyle>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mrow>
                                 <m:mrow>
                                    <m:mo>(</m:mo>
                                    <m:mrow>
                                       <m:mi>g</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msup>
                                          <m:mi>o</m:mi>
                                          <m:mrow>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mn>1</m:mn>
                                             <m:mo stretchy="false">]</m:mo>
                                          </m:mrow>
                                       </m:msup>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mo>,</m:mo>
                                       <m:mi>g</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msup>
                                          <m:mi>o</m:mi>
                                          <m:mrow>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:mn>2</m:mn>
                                             <m:mo stretchy="false">]</m:mo>
                                          </m:mrow>
                                       </m:msup>
                                       <m:mo stretchy="false">)</m:mo>
                                       <m:mo>,</m:mo>
                                       <m:mn>...</m:mn>
                                       <m:mo>,</m:mo>
                                       <m:mi>g</m:mi>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:msup>
                                          <m:mi>o</m:mi>
                                          <m:mrow>
                                             <m:mo stretchy="false">[</m:mo>
                                             <m:msup>
                                                <m:mn>4</m:mn>
                                                <m:mi>l</m:mi>
                                             </m:msup>
                                             <m:mo stretchy="false">]</m:mo>
                                          </m:mrow>
                                       </m:msup>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mo>)</m:mo>
                                 </m:mrow>
                              </m:mrow>
                              <m:mi>T</m:mi>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeCiEaGNaeyypa0ZaaeWaaeaacqWGMbGzcqGGOaakcqWGVbWBdaahaaWcbeqaaiabcUfaBjabigdaXiabc2faDbaakiabcMcaPiabcYcaSiabdEgaNjabcIcaOiabd+gaVnaaCaaaleqabaGaei4waSLaeGOmaiJaeiyxa0faaOGaeiykaKIaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4zaCMaeiikaGIaem4Ba82aaWbaaSqabeaacqGGBbWwcqaI0aandaahaaadbeqaaiabdYgaSbaaliabc2faDbaakiabcMcaPaGaayjkaiaawMcaamaaCaaaleqabaGaemivaqfaaaaa@50BE@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
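            <p>As an illustration of Equations 6 and 7, the following minimal sketch builds a GFV from a single fragment. The function names, the use of overlapping windows for counting, the lexicographic ordering of oligonucleotides, and the flat expected counts in the toy example are assumptions made for this sketch and are not taken from the TACOA implementation. The expected counts E[o] are passed in by the caller rather than computed here.</p>

```python
from itertools import product


def g_score(observed, expected):
    """Piecewise ratio g(o) following Equation 6."""
    if observed == 0:
        return 0.0
    if observed > expected:
        return observed / expected
    return -expected / observed


def genomic_feature_vector(fragment, l, expected):
    """Return g(o) for all 4**l oligonucleotides in lexicographic order.

    `expected` maps each oligonucleotide to its expected count E[o];
    how E[o] is obtained is left to the caller (assumption of this sketch).
    """
    oligos = ["".join(p) for p in product("ACGT", repeat=l)]
    observed = dict.fromkeys(oligos, 0)
    # Count overlapping windows; windows with ambiguous bases are skipped.
    for start in range(len(fragment) - l + 1):
        window = fragment[start:start + l]
        if window in observed:
            observed[window] += 1
    return [g_score(observed[o], expected[o]) for o in oligos]


# Toy example: dinucleotides with a flat, made-up expectation of 1.0 each.
flat = {"".join(p): 1.0 for p in product("ACGT", repeat=2)}
x = genomic_feature_vector("AAAC", 2, flat)
```

            <p>In this toy example the fragment contains AA twice and AC once, so only the first two entries of the vector are non-zero.</p>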
         </sec>
         <sec>
            <st>
               <p>Measuring the classification accuracy</p>
            </st>
            <p>We selected different genomic fragment lengths to simulate DNA fragments obtained in real metagenomic sequencing projects. Genomic fragments of length 800 bp and 1 Kbp were chosen to resemble single reads derived by the Sanger technology. Assembled contigs were simulated by selecting fragment lengths of 3 Kbp, 10 Kbp, 15 Kbp, and 50 Kbp. Genomic fragments were generated as follows: for each completely sequenced genome and for each chosen genomic fragment length, 3000 non-overlapping fragments were extracted from the selected genome and subsequently included in the test set.</p>
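            <p>One possible way to obtain non-overlapping fragments of a fixed length is sketched below. The tiling strategy, the random choice among tiles, and the function name are assumptions made for illustration; the text above does not specify how fragment positions were chosen.</p>

```python
import random


def sample_fragments(genome, length, n, seed=0):
    """Draw up to n non-overlapping fragments of a fixed length."""
    # Tile the genome into consecutive windows so no two fragments overlap.
    starts = list(range(0, len(genome) - length + 1, length))
    rng = random.Random(seed)
    chosen = rng.sample(starts, min(n, len(starts)))
    return [genome[s:s + length] for s in sorted(chosen)]
```

            <p>With this scheme, a genome shorter than n times the fragment length yields fewer than n fragments.</p>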
            <p>We estimated the classification accuracy of the presented method (TACOA) based on the leave-one-out cross-validation strategy. We selected one genome from the 373 different organisms, generated genomic fragments of a given length |<it>s</it>|, represented them as GFVs and predicted their taxonomic origin using the remaining 372 organisms as the reference set (<b>ref</b><sub><it>set</it></sub>). Here, each of the 372 genomes in the reference set is represented as a GFV. This procedure was repeated for each of the 373 completely sequenced genomes present in the data set (Figure <figr fid="F1">1</figr>).</p>
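            <p>The leave-one-out procedure described above can be sketched as a simple loop over genomes. The classifier is represented by a hypothetical callable named classify, and genomes are identified by name only; both are placeholders for this sketch and do not correspond to the TACOA code.</p>

```python
def leave_one_genome_out(genomes, classify):
    """Yield (held_out_name, predictions) for each genome in turn.

    genomes:  dict mapping genome name to a list of query fragments
    classify: hypothetical callable (fragment, reference_names) -> label
    """
    for held_out, fragments in genomes.items():
        # The reference set is every genome except the one being tested.
        reference = [name for name in genomes if name != held_out]
        yield held_out, [classify(f, reference) for f in fragments]
```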
            <p>The classification accuracy of the presented method was assessed at each taxonomic rank. At each taxonomic rank, the predicted class of each query genomic fragment was compared to its known taxonomic class. We evaluated the classification accuracy for those genomes having at least two different representatives per taxonomic class. Furthermore, we also evaluated the classification accuracy for those genomes having only one member per taxonomic class, in which case the method should assign them to the "unknown" class. The latter evaluation mimics the situation of organisms without a reference genome because they have not yet been sequenced.</p>
            <p>In this study, we employed the adapted definition of sensitivity and specificity proposed by Baldi <it>et al</it>. in 2000 <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>. The classification accuracy was evaluated for each taxonomic class. Let the <it>i</it>-th taxonomic class of taxonomic rank <it>r </it>be denoted as class <it>i</it>. Further, let <it>Z</it><sub><it>i </it></sub>be the total number of genomic fragments from class <it>i</it>, the true positives (<it>TP</it><sub><it>i</it></sub>) the number of genomic fragments correctly assigned to class <it>i</it>, and the false positives (<it>FP</it><sub><it>i</it></sub>) the number of fragments from any class <it>j </it>&#8800; <it>i </it>that are wrongly assigned to <it>i</it>. The false negatives (<it>FN</it><sub><it>i</it></sub>) are defined as the number of fragments from class <it>i </it>that are erroneously assigned to any other class <it>j </it>&#8800; <it>i</it>. A genomic fragment whose taxonomic class cannot be inferred is labeled "unclassified" by the algorithm. The unclassified (<it>U</it><sub><it>i</it></sub>) are the number of fragments from class <it>i </it>that cannot be assigned to a taxonomic class, so <it>Z</it><sub><it>i </it></sub>= <it>TP</it><sub><it>i </it></sub>+ <it>FN</it><sub><it>i </it></sub>+ <it>U</it><sub><it>i</it></sub>.</p>
            <p>The sensitivity (<b>Sn</b><sub><it>i</it></sub>) for a taxonomic class <it>i </it>is defined as the percentage of fragments from class <it>i </it>correctly classified and it is computed by:</p>
            <p>
               <display-formula id="M8">
                  <m:math name="1471-2105-10-56-i11" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>S</m:mi>
                           </m:mstyle>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>n</m:mi>
                              </m:mstyle>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>T</m:mi>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>Z</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeC4uamLaeCOBa42aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpjuaGdaWcaaqaaiabdsfaujabdcfaqnaaBaaabaGaemyAaKgabeaaaeaacqWGAbGwdaWgaaqaaiabdMgaPbqabaaaaaaa@3883@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>The reliability (expressed in percentage) of the predictions made by the classifier for class <it>i </it>is denoted as specificity (<b>Sp</b><sub><it>i</it></sub>) and it is measured using the following equation:</p>
            <p>
               <display-formula id="M9">
                  <m:math name="1471-2105-10-56-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>S</m:mi>
                           </m:mstyle>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>p</m:mi>
                              </m:mstyle>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>T</m:mi>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:mi>T</m:mi>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                                 <m:mo>+</m:mo>
                                 <m:mi>F</m:mi>
                                 <m:msub>
                                    <m:mi>P</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeC4uamLaeCiCaa3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpjuaGdaWcaaqaaiabdsfaujabdcfaqnaaBaaabaGaemyAaKgabeaaaeaacqWGubavcqWGqbaudaWgaaqaaiabdMgaPbqabaGaey4kaSIaemOrayKaemiuaa1aaSbaaeaacqWGPbqAaeqaaaaaaaa@3E40@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Note that the specificity for class <it>i </it>is undefined for those cases when the terms <it>TP</it><sub><it>i </it></sub>and <it>FP</it><sub><it>i </it></sub>are both zero (marked as (-) in Additional figures 7 &#8211; 9). The overall specificity is computed over those classes that have a defined specificity value.</p>
            <p>We use the false negative rate (<b>FNr</b><sub><it>i</it></sub>) to measure the percentage of items from class <it>i </it>that are misclassified into any class <it>j </it>&#8800; <it>i</it>, which is given by:</p>
            <p>
               <display-formula id="M10">
                  <m:math name="1471-2105-10-56-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mstyle mathvariant="bold" mathsize="normal">
                              <m:mi>F</m:mi>
                              <m:mi>N</m:mi>
                           </m:mstyle>
                           <m:msub>
                              <m:mstyle mathvariant="bold" mathsize="normal">
                                 <m:mi>r</m:mi>
                              </m:mstyle>
                              <m:mi>i</m:mi>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:mi>F</m:mi>
                                 <m:msub>
                                    <m:mi>N</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                              <m:mrow>
                                 <m:msub>
                                    <m:mi>Z</m:mi>
                                    <m:mi>i</m:mi>
                                 </m:msub>
                              </m:mrow>
                           </m:mfrac>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeCOrayKaeCOta4KaeCOCai3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpjuaGdaWcaaqaaiabdAeagjabd6eaonaaBaaabaGaemyAaKgabeaaaeaacqWGAbGwdaWgaaqaaiabdMgaPbqabaaaaaaa@397A@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
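            <p>The three measures in Equations 8 to 10 can be computed from paired lists of known and predicted classes, as in the following minimal sketch. The function name, the label used for unclassified fragments, and the choice to return fractions rather than percentages are assumptions made for illustration. An undefined value (for example, specificity when both TP and FP are zero) is returned as None.</p>

```python
def class_metrics(true_labels, predicted_labels, cls, unclassified="unclassified"):
    """Return (Sn, Sp, FNr) for one class as fractions; None where undefined."""
    tp = fp = fn = u = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == cls:
            if pred == cls:
                tp += 1          # correctly assigned to the class
            elif pred == unclassified:
                u += 1           # no class could be inferred
            else:
                fn += 1          # assigned to some other class
        elif pred == cls:
            fp += 1              # fragment of another class assigned here
    z = tp + fn + u              # Z_i = TP_i + FN_i + U_i
    sn = tp / z if z else None
    fnr = fn / z if z else None
    sp = tp / (tp + fp) if (tp + fp) else None
    return sn, sp, fnr
```

            <p>Multiplying the returned fractions by 100 gives the percentages used in the text. Because unclassified fragments count towards neither TP nor FN, sensitivity and false negative rate need not sum to one.</p>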
         </sec>
         <sec>
            <st>
               <p>Measuring the classification accuracy in the comparison of PhyloPythia and TACOA</p>
            </st>
            <p>The set of completely sequenced genomes used for comparison was selected as follows: at rank class, two different genomes were randomly chosen from each taxonomic class, guaranteeing that the data set used in the comparison is unbiased. This procedure yielded a set of 63 genomes that were downloaded from the NCBI genome database <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. For each evaluated fragment length and for each selected genome, ten non-overlapping genomic fragments were randomly extracted for classification. We evaluated both classification strategies at five different taxonomic ranks using three different genomic fragment lengths: 800 bp, 1 Kbp, and 10 Kbp. The PhyloPythia web server with the built-in generic model was employed to predict the taxonomic origin of genomic fragments generated from the 63 selected genomes. To predict the taxonomic origin of fragments from the same set of 63 selected genomes, TACOA was executed using the default parameters. Note that this evaluation aims to investigate the performance that researchers should expect when analyzing their metagenomic data. The evaluation is not intended to assess the theoretical classification power of a kernelized <it>k</it>-NN against SVMs.</p>
            <p>The accuracy of both classifiers was assessed using the sensitivity, false negative rate and specificity. Values of sensitivity, specificity and false negative rate were computed as previously described in this section. In analyzing the comparison between PhyloPythia and TACOA, we placed more emphasis on the sensitivity and the false negative rate (FNr, or misclassifications) to account for possible compositional biases of the data set. The sensitivity and the FNr measured for one class do not depend on the composition of the remaining classes (since the false positive term is absent from the equations of sensitivity and FNr). Hence, the sensitivity and FNr measured for each taxonomic group are not affected by possible biases of the test set. In contrast, the specificity measured for a class is strongly affected by the composition of the test set since it includes the false positives obtained from other classes.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Availability</p>
         </st>
         <p>TACOA can be downloaded at <url>http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html</url></p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>NND conceived, implemented, and performed the computational work, evaluated and analyzed the data, and drafted the manuscript. LK contributed to the implementation. KN and TWN supervised this work. AG provided the computational infrastructure for data generation and processing. All authors contributed to the editing of the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>NND was supported by the Deutscher Akademischer Austauschdienst. The authors wish to thank Torsten Kasch, Achim Neumann, Ralf Nolte, Bj&#246;rn Fischer and Volker T&#246;lle as members of the Bioinformatics Resource Facility for providing the computational and technical support to accomplish this work. We thank I. Rigoutsos from the Bioinformatics and Pattern Discovery Group, IBM Thomas J Watson Research Center for all the help in using the PhyloPythia web server.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Shotgun sequencing of the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Sutton</snm>
                  <fnm>GG</fnm>
               </au>
               <au>
                  <snm>Kerlavage</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>HO</fnm>
               </au>
               <au>
                  <snm>Hunkapiller</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>280</volume>
            <fpage>1540</fpage>
            <lpage>1542</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.280.5369.1540</pubid>
                  <pubid idtype="pmpid" link="fulltext">9644018</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>DNA sequencing with chain-terminating inhibitors</p>
            </title>
            <aug>
               <au>
                  <snm>Sanger</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Nicklen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Coulson</snm>
                  <fnm>AR</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci</source>
            <pubdate>1977</pubdate>
            <volume>74</volume>
            <fpage>5463</fpage>
            <lpage>5467</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1073/pnas.74.12.5463</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Whole-genome random sequencing and assembly of Haemophilus influenzae Rd</p>
            </title>
            <aug>
               <au>
                  <snm>Fleischmann</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>White</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Clayton</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Kirkness</snm>
                  <fnm>EF</fnm>
               </au>
               <au>
                  <snm>Kerlavage</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Bult</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Tomb</snm>
                  <fnm>JF</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Merrick</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>269</volume>
            <fpage>496</fpage>
            <lpage>512</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.7542800</pubid>
                  <pubid idtype="pmpid" link="fulltext">7542800</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Community structure and metabolism through reconstruction of microbial genomes from the environment</p>
            </title>
            <aug>
               <au>
                  <snm>Tyson</snm>
                  <fnm>GW</fnm>
               </au>
               <au>
                  <snm>Chapman</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hugenholtz</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Allen</snm>
                  <fnm>EE</fnm>
               </au>
               <au>
                  <snm>Ram</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Richardson</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Solovyev</snm>
                  <fnm>VV</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Rokhsar</snm>
                  <fnm>DS</fnm>
               </au>
               <au>
                  <snm>Banfield</snm>
                  <fnm>JF</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>428</volume>
            <fpage>37</fpage>
            <lpage>43</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature02340</pubid>
                  <pubid idtype="pmpid" link="fulltext">14961025</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon</p>
            </title>
            <aug>
               <au>
                  <snm>Stein</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Marsh</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Shizuya</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>DeLong</snm>
                  <fnm>EF</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>1996</pubdate>
            <volume>178</volume>
            <fpage>591</fpage>
            <lpage>599</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">177699</pubid>
                  <pubid idtype="pmpid" link="fulltext">8550487</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Phylogenetic classification of short environmental DNA fragments</p>
            </title>
            <aug>
               <au>
                  <snm>Krause</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Diaz</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Goesmann</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kelley</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Nattkemper</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rohwer</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2008</pubdate>
            <volume>36</volume>
            <fpage>2230</fpage>
            <lpage>2239</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2367736</pubid>
                  <pubid idtype="pmpid" link="fulltext">18285365</pubid>
                  <pubid idtype="doi">10.1093/nar/gkn038</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Taxonomic composition and gene content of a methane-producing microbial community isolated from a biogas reactor</p>
            </title>
            <aug>
               <au>
                  <snm>Krause</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Diaz</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gartemann</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kr&#246;meke</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Neuweger</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>P&#252;hler</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Runte</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Schl&#252;ter</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Szczepanowski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Tauch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Goesmann</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>J Biotechnol</source>
            <pubdate>2008</pubdate>
            <volume>136</volume>
            <fpage>91</fpage>
            <lpage>101</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jbiotec.2008.06.003</pubid>
                  <pubid idtype="pmpid" link="fulltext">18611419</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Get the most out of your metagenome: computational analysis of environmental sequence data</p>
            </title>
            <aug>
               <au>
                  <snm>Raes</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Foerstner</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Curr Opin Microbiol</source>
            <pubdate>2007</pubdate>
            <volume>10</volume>
            <fpage>490</fpage>
            <lpage>498</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.mib.2007.09.001</pubid>
                  <pubid idtype="pmpid" link="fulltext">17936679</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples</p>
            </title>
            <aug>
               <au>
                  <snm>Abe</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sugawara</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Kinouchi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Kanaya</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ikemura</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>DNA Res</source>
            <pubdate>2005</pubdate>
            <volume>12</volume>
            <fpage>281</fpage>
            <lpage>290</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/dnares/dsi015</pubid>
                  <pubid idtype="pmpid" link="fulltext">16769690</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A novel bioinformatics tool for phylogenetic classification of genomic sequence fragments derived from mixed genomes of uncultured environmental microbes</p>
            </title>
            <aug>
               <au>
                  <snm>Abe</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sugawara</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Kanaya</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ikemura</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Polar Biosci</source>
            <pubdate>2006</pubdate>
            <volume>20</volume>
            <fpage>103</fpage>
            <lpage>112</lpage>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Accurate phylogenetic classification of variable-length DNA fragments</p>
            </title>
            <aug>
               <au>
                  <snm>McHardy</snm>
                  <fnm>AC</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>HG</fnm>
               </au>
               <au>
                  <snm>Tsirigos</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hugenholtz</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rigoutsos</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Nat Methods</source>
            <pubdate>2007</pubdate>
            <volume>4</volume>
            <fpage>63</fpage>
            <lpage>72</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nmeth976</pubid>
                  <pubid idtype="pmpid" link="fulltext">17179938</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Binning sequences using very sparse labels within a metagenome</p>
            </title>
            <aug>
               <au>
                  <snm>Chan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hsu</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Halgamuge</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Tang</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>215</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2383919</pubid>
                  <pubid idtype="pmpid" link="fulltext">18442374</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-9-215</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Compositional biases of bacterial genomes and evolutionary implications</p>
            </title>
            <aug>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mrazek</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>AM</fnm>
               </au>
            </aug>
            <source>J Bacteriol</source>
            <pubdate>1997</pubdate>
            <volume>179</volume>
            <fpage>3899</fpage>
            <lpage>3913</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">179198</pubid>
                  <pubid idtype="pmpid" link="fulltext">9190805</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Campbell</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mrazek</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1999</pubdate>
            <volume>96</volume>
            <fpage>9184</fpage>
            <lpage>9189</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">17754</pubid>
                  <pubid idtype="pmpid" link="fulltext">10430917</pubid>
                  <pubid idtype="doi">10.1073/pnas.96.16.9184</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Capturing whole-genome characteristics in short sequences using a na&#239;ve Bayesian classifier</p>
            </title>
            <aug>
               <au>
                  <snm>Sandberg</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Winberg</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Br&#228;nden</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kaske</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ernberg</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>C&#246;ster</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>1404</fpage>
            <lpage>1409</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">311094</pubid>
                  <pubid idtype="pmpid" link="fulltext">11483581</pubid>
                  <pubid idtype="doi">10.1101/gr.186401</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>MEGAN analysis of metagenomic data</p>
            </title>
            <aug>
               <au>
                  <snm>Huson</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Auch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Qi</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Schuster</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2007</pubdate>
            <volume>17</volume>
            <fpage>377</fpage>
            <lpage>386</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1800929</pubid>
                  <pubid idtype="pmpid" link="fulltext">17255551</pubid>
                  <pubid idtype="doi">10.1101/gr.5969107</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Genome sequencing in microfabricated high-density picolitre reactors</p>
            </title>
            <aug>
               <au>
                  <snm>Margulies</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Egholm</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Altman</snm>
                  <fnm>WE</fnm>
               </au>
               <au>
                  <snm>Attiya</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bader</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Bemben</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>Berka</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Braverman</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>YJ</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Dewell</snm>
                  <fnm>SB</fnm>
               </au>
               <au>
                  <snm>Du</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Fierro</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Gomes</snm>
                  <fnm>XV</fnm>
               </au>
               <au>
                  <snm>Godwin</snm>
                  <fnm>BC</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Helgesen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ho</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Irzyk</snm>
                  <fnm>GP</fnm>
               </au>
               <au>
                  <snm>Jando</snm>
                  <fnm>SC</fnm>
               </au>
               <au>
                  <snm>Alenquer</snm>
                  <fnm>MLI</fnm>
               </au>
               <au>
                  <snm>Jarvie</snm>
                  <fnm>TP</fnm>
               </au>
               <au>
                  <snm>Jirage</snm>
                  <fnm>KB</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>JB</fnm>
               </au>
               <au>
                  <snm>Knight</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Lanza</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Leamon</snm>
                  <fnm>JH</fnm>
               </au>
               <au>
                  <snm>Lefkowitz</snm>
                  <fnm>SM</fnm>
               </au>
               <au>
                  <snm>Lei</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lohman</snm>
                  <fnm>KL</fnm>
               </au>
               <au>
                  <snm>Lu</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Makhijani</snm>
                  <fnm>VB</fnm>
               </au>
               <au>
                  <snm>McDade</snm>
                  <fnm>KE</fnm>
               </au>
               <au>
                  <snm>McKenna</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Nickerson</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Nobile</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Plant</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Puc</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Ronan</snm>
                  <fnm>MT</fnm>
               </au>
               <au>
                  <snm>Roth</snm>
                  <fnm>GT</fnm>
               </au>
               <au>
                  <snm>Sarkis</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Simons</snm>
                  <fnm>JF</fnm>
               </au>
               <au>
                  <snm>Simpson</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Srinivasan</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Tartaro</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Tomasz</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vogt</snm>
                  <fnm>KA</fnm>
               </au>
               <au>
                  <snm>Volkmer</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>SH</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Weiner</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Begley</snm>
                  <fnm>RF</fnm>
               </au>
               <au>
                  <snm>Rothberg</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2005</pubdate>
            <volume>437</volume>
            <fpage>376</fpage>
            <lpage>380</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1464427</pubid>
                  <pubid idtype="pmpid" link="fulltext">16056220</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>The Pfam protein families database</p>
            </title>
            <aug>
               <au>
                  <snm>Finn</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Tate</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mistry</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Coggill</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sammut</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hotz</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Ceric</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Forslund</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2008</pubdate>
            <volume>36</volume>
            <fpage>D281</fpage>
            <lpage>288</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2238907</pubid>
                  <pubid idtype="pmpid" link="fulltext">18039703</pubid>
                  <pubid idtype="doi">10.1093/nar/gkm960</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Application of tetranucleotide frequencies for the assignment of genomic fragments</p>
            </title>
            <aug>
               <au>
                  <snm>Teeling</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Meyerdierks</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bauer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Amann</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gl&#246;ckner</snm>
                  <fnm>FO</fnm>
               </au>
            </aug>
            <source>Environ Microbiol</source>
            <pubdate>2004</pubdate>
            <volume>6</volume>
            <fpage>938</fpage>
            <lpage>947</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1111/j.1462-2920.2004.00624.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">15305919</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Teeling</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Waldmann</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lombardot</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bauer</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gl&#246;ckner</snm>
                  <fnm>FO</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>163</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">529438</pubid>
                  <pubid idtype="pmpid" link="fulltext">15507136</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-5-163</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Nearest Neighbor Pattern Classification</p>
            </title>
            <aug>
               <au>
                  <snm>Cover</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>IEEE Transactions on Information Theory</source>
            <pubdate>1967</pubdate>
            <volume>13</volume>
            <fpage>21</fpage>
            <lpage>27</lpage>
         </bibl>
         <bibl id="B23">
            <aug>
               <au>
                  <snm>Hastie</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Friedman</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>The Elements of Statistical Learning</source>
            <publisher>New York: Springer-Verlag</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B24">
            <title>
               <p>KNN-kernel density-based clustering for high-dimensional multivariate data</p>
            </title>
            <aug>
               <au>
                  <snm>Tran</snm>
                  <fnm>TN</fnm>
               </au>
               <au>
                  <snm>Wehrens</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Buydens</snm>
                  <fnm>LM</fnm>
               </au>
            </aug>
            <source>Computational Statistics &amp; Data Analysis</source>
            <pubdate>2006</pubdate>
            <volume>51</volume>
            <issue>2</issue>
            <fpage>513</fpage>
            <lpage>525</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.csda.2005.10.001</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Instance-based concept learning from multiclass DNA microarray data</p>
            </title>
            <aug>
               <au>
                  <snm>Berrar</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Bradbury</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Dubitzky</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>73</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1402330</pubid>
                  <pubid idtype="pmpid" link="fulltext">16483361</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-73</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>In silico prediction of yeast deletion phenotypes</p>
            </title>
            <aug>
               <au>
                  <snm>Saha</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Heber</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Genet Mol Res</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <issue>1</issue>
            <fpage>224</fpage>
            <lpage>232</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16755513</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>A regression-based K nearest neighbor algorithm for gene functions prediction from heterogeneous data</p>
            </title>
            <aug>
               <au>
                  <snm>Yao</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Ruzzo</snm>
                  <fnm>WL</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7 Suppl 1</volume>
            <fpage>S11</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16723004</pubid>
                  <pubid idtype="pmcid">1810312</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-S1-S11</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Using machine learning algorithms to guide rehabilitation planning for home care clients</p>
            </title>
            <aug>
               <au>
                  <snm>Zhu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Hirdes</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>Stolee</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>BMC Medical Informatics and Decision Making</source>
            <pubdate>2007</pubdate>
            <volume>7</volume>
            <fpage>41</fpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/1472-6947-7-41</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>A vector space model for automatic indexing</p>
            </title>
            <aug>
               <au>
                  <snm>Salton</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Communications of the ACM</source>
            <pubdate>1975</pubdate>
            <volume>18</volume>
            <fpage>613</fpage>
            <lpage>620</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1145/361219.361220</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Overbeek</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Begley</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Butler</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Choudhuri</snm>
                  <fnm>JV</fnm>
               </au>
               <au>
                  <snm>Chuang</snm>
                  <fnm>HY</fnm>
               </au>
               <au>
                  <snm>Cohoon</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>de Crecy-Lagard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Diaz</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Disz</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fonstein</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>ED</fnm>
               </au>
               <au>
                  <snm>Gerdes</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Glass</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Goesmann</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hanson</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Iwata-Reuyl</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Jensen</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Jamshidi</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Krause</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kubal</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Larsen</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Linke</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>McHardy</snm>
                  <fnm>AC</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Neuweger</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Olsen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Olson</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Osterman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Portnoy</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Pusch</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Rodionov</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Ruckert</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Steiner</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Stevens</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Thiele</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Vassieva</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Ye</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Zagnitko</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Vonstein</snm>
                  <fnm>V</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>5691</fpage>
            <lpage>5702</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1251668</pubid>
                  <pubid idtype="pmpid" link="fulltext">16214803</pubid>
                  <pubid idtype="doi">10.1093/nar/gki866</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Database resources of the National Center for Biotechnology Information: 2002 update</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Leipe</snm>
                  <fnm>DD</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Pontius</snm>
                  <fnm>JU</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Schriml</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rapp</snm>
                  <fnm>BA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>13</fpage>
            <lpage>16</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">99094</pubid>
                  <pubid idtype="pmpid" link="fulltext">11752242</pubid>
                  <pubid idtype="doi">10.1093/nar/30.1.13</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Investigations of oligonucleotide usage variance within and between prokaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Bohlin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Skjerve</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Ussery</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>PLoS Comput Biol</source>
            <pubdate>2008</pubdate>
            <volume>4</volume>
            <fpage>e1000057</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2289840</pubid>
                  <pubid idtype="pmpid" link="fulltext">18421372</pubid>
                  <pubid idtype="doi">10.1371/journal.pcbi.1000057</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Characteristics of oligonucleotide frequencies across genomes: Conservation versus variation, strand symmetry, and evolutionary implications</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>SH</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>YZ</fnm>
               </au>
            </aug>
            <source>Nature Precedings</source>
            <pubdate>2008</pubdate>
            <fpage>1</fpage>
            <lpage>28</lpage>
            <url>http://hdl.handle.net/10101/npre.2008.2146.1</url>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Ancient horizontal gene transfer</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nature Reviews Genetics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>121</fpage>
            <lpage>132</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg1000</pubid>
                  <pubid idtype="pmpid" link="fulltext">12560809</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Horizontal gene transfer in eukaryotic evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Keeling</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Palmer</snm>
                  <fnm>JD</fnm>
               </au>
            </aug>
            <source>Nature Reviews Genetics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>605</fpage>
            <lpage>618</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg2386</pubid>
                  <pubid idtype="pmpid" link="fulltext">18591983</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Bohlin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Skjerve</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Ussery</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2008</pubdate>
            <volume>9</volume>
            <fpage>104</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2289816</pubid>
                  <pubid idtype="pmpid" link="fulltext">18307761</pubid>
                  <pubid idtype="doi">10.1186/1471-2164-9-104</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Horizontal gene transfer in prokaryotes: quantification and classification</p>
            </title>
            <aug>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Makarova</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Aravind</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Annu Rev Microbiol</source>
            <pubdate>2001</pubdate>
            <volume>55</volume>
            <fpage>709</fpage>
            <lpage>742</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1146/annurev.micro.55.1.709</pubid>
                  <pubid idtype="pmpid" link="fulltext">11544372</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>DarkHorse: a method for genome-wide prediction of horizontal gene transfer</p>
            </title>
            <aug>
               <au>
                  <snm>Podell</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Gaasterland</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2007</pubdate>
            <volume>8</volume>
            <fpage>R16</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1852411</pubid>
                  <pubid idtype="pmpid" link="fulltext">17274820</pubid>
                  <pubid idtype="doi">10.1186/gb-2007-8-2-r16</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>The genome sequence of the thermoacidophilic scavenger <it>Thermoplasma acidophilum</it></p>
            </title>
            <aug>
               <au>
                  <snm>Ruepp</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Graml</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Santos-Martinez</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Koretke</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Volker</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Mewes</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Frishman</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Stocker</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lupas</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Baumeister</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>407</volume>
            <fpage>508</fpage>
            <lpage>513</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35035069</pubid>
                  <pubid idtype="pmpid" link="fulltext">11029001</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Horizontal gene transfer in bacterial and archaeal complete genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Garcia-Vallve</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Romeu</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Palau</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>1719</fpage>
            <lpage>1725</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310969</pubid>
                  <pubid idtype="pmpid" link="fulltext">11076857</pubid>
                  <pubid idtype="doi">10.1101/gr.130000</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Environments shape the nucleotide composition of genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Foerstner</snm>
                  <fnm>KU</fnm>
               </au>
               <au>
                  <snm>von Mering</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hooper</snm>
                  <fnm>SD</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>EMBO Rep</source>
            <pubdate>2005</pubdate>
            <volume>6</volume>
            <fpage>1208</fpage>
            <lpage>1213</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1369203</pubid>
                  <pubid idtype="pmpid" link="fulltext">16200051</pubid>
                  <pubid idtype="doi">10.1038/sj.embor.7400538</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Assessing the accuracy of prediction algorithms for classification: an overview</p>
            </title>
            <aug>
               <au>
                  <snm>Baldi</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Chauvin</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Andersen</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Nielsen</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <fpage>412</fpage>
            <lpage>424</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.5.412</pubid>
                  <pubid idtype="pmpid" link="fulltext">10871264</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
