<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-9-479</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Database</dochead>
      <bibl>
         <title>
            <p>Extension of the COG and arCOG databases by amino acid and nucleotide sequences</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Meereis</snm>
               <fnm>Florian</fnm>
               <insr iid="I1"/>
               <email>florian@meereis.com</email>
            </au>
            <au ca="yes" id="A2">
               <snm>Kaufmann</snm>
               <fnm>Michael</fnm>
               <insr iid="I1"/>
               <email>mika@uni-wh.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>The Protein Chemistry Group, Witten/Herdecke University, Stockumer Str. 10, 58448 Witten, Germany</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2008</pubdate>
         <volume>9</volume>
         <issue>1</issue>
         <fpage>479</fpage>
         <url>http://www.biomedcentral.com/1471-2105/9/479</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19014535</pubid>
               <pubid idtype="doi">10.1186/1471-2105-9-479</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>03</day>
               <month>6</month>
               <year>2008</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>13</day>
               <month>11</month>
               <year>2008</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>13</day>
               <month>11</month>
               <year>2008</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2008</year>
         <collab>Meereis and Kaufmann; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and in addition the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at <url>http://www.uni-wh.de/nucocog</url>.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>NUCOCOG makes it possible to analyze any sequence-related property in the context of the COG and arCOG frameworks simply by applying scripting languages such as PERL to a single, albeit large, XML document.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification id="endnote" subtype="user_supplied_xml" type="bmc"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The concept of assigning protein sequences to Clusters of Orthologous Groups (COGs) on the basis of sequence similarity, originally introduced by Tatusov <it>et al. </it>in 1997 <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, led to the establishment of the COG database, which has been updated repeatedly as more completely sequenced genomes became available <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. During the last decade, the COG database has become an established tool in comparative and functional genomics <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. The recently published Archaeal Clusters of Orthologous Genes (arCOG) are a refinement and update of the archaeal sequences, generated with a new computational pipeline <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. However, although the original protein sequences used to construct the databases are available via FTP <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>, protein and protein domain sequences are not directly assigned to entries within the databases, and nucleotide sequences are completely absent. Since the COG and arCOG databases are excellent frameworks for studying sequence-specific aspects such as amino acid composition, GC content, or codon usage in both a functional and a phylogenetic context, versions that link sequence information directly to every protein or protein domain are a desirable improvement. Here we present the latest update of the COG database, the arCOG database, and a combination of the two as XML files (nucocog.xml, arnucocog.xml, and nucocog_2.xml, respectively) (Figs <figr fid="F1">1</figr> and <figr fid="F2">2</figr>) that include both amino acid and nucleotide sequences directly assigned to their respective protein names and GI-numbers.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>First lines of the NUCOCOG 242 MB XML file</p>
            </caption>
            <text>
               <p>
                  <b>First lines of the NUCOCOG 242 MB XML file.</b>
               </p>
            </text>
            <graphic file="1471-2105-9-479-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Screenshot of the NUCOCOG retrieval utility at <url>http://www.uni-wh.de/nucocog/</url><abbrgrp><abbr bid="B21">21</abbr></abbrgrp></p>
            </caption>
            <text>
               <p><b>Screenshot of the NUCOCOG retrieval utility at</b> <url>http://www.uni-wh.de/nucocog/</url><abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
            </text>
            <graphic file="1471-2105-9-479-2"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Construction and content</p>
         </st>
         <sec>
            <st>
               <p>Construction of the NUCOCOG database</p>
            </st>
            <p>The NUCOCOG database (nucocog.xml) was constructed by repeatedly running a set of different PERL scripts to read, create and manipulate files containing ASCII data. All files containing the information required to create the NUCOCOG database were obtained via FTP from the NCBI. Three files related to the COG-database (<it>whog</it>, <it>myva</it>, <it>myva=gb</it>) were taken from <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> and 159 GBK files containing the genome information of the 66 organisms currently present in the COG database were taken from <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. To obtain a maximum quality of nucleotide sequences we used current GBK files rather than the outdated ones corresponding to the original protein sequences. In principle, this may lead to errors due to the methods described below. However, those errors are very improbable and, if occurring at all, are negligible compared to all the ambiguous amino acids and nucleotides present even in current GBK versions (see table <tblr tid="T1">1</tblr>). The NUCOCOG database was built up by the following five main procedures: (i) The complete COG database as assembled in <it>whog</it>, was converted to an XML file which served as the scaffold that was subsequently stepwise extended to eventually represent the final NUCOCOG database. (ii) The amino acid sequences were extracted from <it>myva </it>by searching for their respective protein names as the unambiguous search keys. (iii) In the same manner, GI-numbers for all <it>complete </it>protein sequences were obtained from <it>myva=gb</it>. However, no protein <it>fragments </it>(domains) at all could be detected because their names are extended by an underscore followed by a consecutive number in <it>whog </it>which is not the case in <it>myva=gb</it>. We derived GI-numbers for those entries during a second run after the extensions of their names were truncated. 
(iv) Nucleotide sequences were then added separately for each of the 66 organisms, searching only the GBK files associated with the respective query organism. The amino acid sequences as annotated in <it>myva </it>were used as search keys to locate their corresponding nucleotide sequences in the GBK files. Four different search approaches were necessary to detect all sequences. The vast majority (98.4 %) of all nucleotide sequences were extracted by searching for matching amino acid sequences as annotated in the GBK files. Because some of the GBK files used in this work have been updated since the latest release of the COG database, not every sequence could be detected by this method. We discovered additional sequences (1.4 %) by searching for matches with conceptually translated nucleotide sequences. Those sequences were located and their reading frames were determined according to the CDS information in the respective GBK file. In addition, the nucleotide sequences were extended by 300 adjacent nucleotides derived from the genome sequence in both directions prior to translation. Further sequences (0.1 %) were found by searching for matching amino acid sequences in translations of whole genomes in all six reading frames. The remaining 121 missing nucleotide sequences were included by manually editing the first qualifiers of the CDS feature key in the respective GBK file, followed by a further search in conceptually translated nucleotide sequences. To inspect and edit GBK files, the genome annotation tool ARTEMIS <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp> was used in combination with the pairwise sequence alignment software JAligner <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. 
Mainly, we deleted annotated frameshifts, introduced new frameshifts and in some cases replaced mismatching amino acids by using coding information of the nearest matching amino acid sequence within the same reading frame. For each nucleotide sequence, the respective mapping method was recorded in the XML files, <it>i. e. </it>"original CDS", "extension of original CDS", "conceptual translation of whole genome", or "created or edited CDS manually". (v) After this first version of the complete NUCOCOG database had been finished, some PERL scripts were run for verification and validation purposes, with the main focus on translating the nucleotide sequences and comparing the translations to their respective annotations. For four entries (MK0324, MK0315, MK0689, MK0809), even a manual search for more information, <it>e. g. </it>by BLAST <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, yielded no GI-number, and we adopted the entry "gi?" from <it>myva=gb</it>.</p>
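            <p>The six-frame search of step (iv) can be illustrated with a minimal Python sketch. It is not the PERL code used to build NUCOCOG. It assumes the standard genetic code, a lower-case genome string, and exact matching of the amino acid sequence; the function names are hypothetical. Because alternative start codons are translated literally here (a GTG start yields V rather than M), a real search would probably have to treat the first residue separately.</p>
            <p>
```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order (assumption:
# the same amino acid assignments apply to all genomes searched).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of a lower-case nucleotide string."""
    return seq.translate(str.maketrans("acgt", "tgca"))[::-1]


def translate(seq):
    """Translate full codons only; codons with other symbols become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_in_six_frames(genome, protein):
    """Return (strand, frame, offset) for every exact match of protein.

    The offset is counted in nucleotides on the strand that was read, so
    for the minus strand it refers to the reverse complement.
    """
    genome = genome.lower()
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            translated = translate(seq[frame:])
            pos = translated.find(protein)
            while pos != -1:
                hits.append((strand, frame, frame + 3 * pos))
                pos = translated.find(protein, pos + 1)
    return hits
```
            </p>
            <p>Called with a genome string and an amino acid sequence taken from <it>myva</it>, the function returns the strand, reading frame and nucleotide offset of each match, from which the coding nucleotide sequence can be cut out.</p>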
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Content of the three NUCOCOG databases</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>NUCOCOG</p>
                     </c>
                     <c ca="left">
                        <p>arNUCOCOG</p>
                     </c>
                     <c ca="left">
                        <p>NUCOCOG_2</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>domain sequences</p>
                     </c>
                     <c ca="left">
                        <p>144,320</p>
                     </c>
                     <c ca="left">
                        <p>81,616</p>
                     </c>
                     <c ca="left">
                        <p>204,890</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nucleotides</p>
                     </c>
                     <c ca="left">
                        <p>142,675,176</p>
                     </c>
                     <c ca="left">
                        <p>72,324,636</p>
                     </c>
                     <c ca="left">
                        <p>195,633,198</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>stop codons</p>
                     </c>
                     <c ca="left">
                        <p>94</p>
                     </c>
                     <c ca="left">
                        <p>62</p>
                     </c>
                     <c ca="left">
                        <p>115</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. a. a.: B</p>
                     </c>
                     <c ca="left">
                        <p>41</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>41</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. a. a.: U</p>
                     </c>
                     <c ca="left">
                        <p>24</p>
                     </c>
                     <c ca="left">
                        <p>27</p>
                     </c>
                     <c ca="left">
                        <p>42</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. a. a.: X</p>
                     </c>
                     <c ca="left">
                        <p>1,243</p>
                     </c>
                     <c ca="left">
                        <p>89</p>
                     </c>
                     <c ca="left">
                        <p>1,288</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. a. a.: Z</p>
                     </c>
                     <c ca="left">
                        <p>12</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>12</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: b</p>
                     </c>
                     <c ca="left">
                        <p>9</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: d</p>
                     </c>
                     <c ca="left">
                        <p>9</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>9</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: h</p>
                     </c>
                     <c ca="left">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: k</p>
                     </c>
                     <c ca="left">
                        <p>189</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>189</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: m</p>
                     </c>
                     <c ca="left">
                        <p>163</p>
                     </c>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>164</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: n</p>
                     </c>
                     <c ca="left">
                        <p>195</p>
                     </c>
                     <c ca="left">
                        <p>110</p>
                     </c>
                     <c ca="left">
                        <p>301</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: r</p>
                     </c>
                     <c ca="left">
                        <p>328</p>
                     </c>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>328</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: s</p>
                     </c>
                     <c ca="left">
                        <p>258</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>258</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: v</p>
                     </c>
                     <c ca="left">
                        <p>7</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: w</p>
                     </c>
                     <c ca="left">
                        <p>113</p>
                     </c>
                     <c ca="left">
                        <p>-</p>
                     </c>
                     <c ca="left">
                        <p>113</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>a. n.: y</p>
                     </c>
                     <c ca="left">
                        <p>660</p>
                     </c>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>660</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The abbreviations used are a. a. a. for ambiguous amino acids and a. n. for ambiguous nucleotides.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Construction of the arNUCOCOG database</p>
            </st>
            <p>The arNUCOCOG database (arnucocog.xml) was constructed essentially as described above, using the information from ar40.fa and arCOG.csv <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> to build the initial XML file. The current version of arCOGs includes the genome of <it>Thermoproteus tenax</it>, which had not been published at the time of its release; at the request of the sequencing consortium, those proteins were removed from the ar40.fa file and are therefore not contained in the arNUCOCOG database either. In addition, 34 sequences listed in arCOG.csv could not be located in ar40.fa. For various reasons these proteins had not been translated, and the arCOG authors detected them by tBLASTn, using an orthologous sequence from a close relative as a query (Kira Makarova, personal communication). We included those sequences manually by reproducing that procedure. Searching for matching amino acid sequences in the GBK files recovered 99.8 % of all nucleotide sequences. The remaining sequences were detected by the alternative methods described above, and only three sequences needed to be searched manually. Many of the arCOGs are new and consequently not assigned to a classical COG-number. In all those cases, we included "NO_COG" between the respective tags. Because arCOG.csv uses protein gi-numbers as domain-ids, split sequences derived from the same protein do not receive unique domain-ids. We resolved this by appending consecutively numbered suffixes, separated by an underscore, to those gi-numbers, <it>e.g. </it>&lt;DOMAINNAME>118430839_1&lt;/DOMAINNAME>.</p>
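            <p>The suffix scheme for split sequences can be sketched in Python as follows. This is an illustration only, not the script used for arNUCOCOG. It assumes that the gi-numbers are processed in file order and that a gi-number occurring only once keeps its bare form; the function name and the second gi-number in the usage note are hypothetical.</p>
            <p>
```python
from collections import Counter


def assign_domain_ids(gi_numbers):
    """Make domain-ids unique by suffixing repeated gi-numbers with _1, _2, ...

    gi_numbers: list of gi-number strings, one per domain entry, in file order.
    """
    totals = Counter(gi_numbers)   # how often each gi-number occurs overall
    seen = Counter()               # how often it has been emitted so far
    domain_ids = []
    for gi in gi_numbers:
        if totals[gi] == 1:
            # Assumption: unsplit proteins keep the bare gi-number.
            domain_ids.append(gi)
        else:
            seen[gi] += 1
            domain_ids.append(f"{gi}_{seen[gi]}")
    return domain_ids
```
            </p>
            <p>For the input list 118430839, 15678001, 118430839 the function yields 118430839_1, 15678001 and 118430839_2.</p>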
         </sec>
         <sec>
            <st>
               <p>Combining NUCOCOG with arNUCOCOG (NUCOCOG_2)</p>
            </st>
            <p>We also combined NUCOCOG and arNUCOCOG resulting in nucocog_2.xml. For that purpose, we removed all sequences from the 13 archaeal genomes from NUCOCOG and included all data from arNUCOCOG instead. In addition, we added those sequences from ar40.fa that according to the information from arCOG.csv are assigned to classical COGs but are not part of any arCOG. Finally, for those amino acid sequences their corresponding nucleotide sequences were included as described above and "NO_COG" was written between the arCOG-tags.</p>
         </sec>
         <sec>
            <st>
               <p>Content of the NUCOCOG database</p>
            </st>
            <p>The content of the three database files is summarized in table <tblr tid="T1">1</tblr>. As can be seen, some nucleotide sequences contain stop codons within coding regions, and both ambiguous amino acids (a. a. a.) and ambiguous nucleotides (a. n.) occur. In those cases, a codon cannot be translated unambiguously to an amino acid. For that reason, we did not delete the amino acid sequences from our files after the databases had been constructed, even though this results in larger, partly redundant database files.</p>
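            <p>Why an ambiguous nucleotide can prevent an unambiguous translation is shown by the following minimal Python sketch. It expands each IUPAC nucleotide code listed in table 1 into the bases it stands for and collects every amino acid the codon could encode. The standard genetic code is assumed, and the function name is hypothetical.</p>
            <p>
```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order (assumption).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}


def possible_amino_acids(codon):
    """Return the set of amino acids (or * for stop) a codon may encode."""
    options = [IUPAC[base] for base in codon.lower()]
    return {CODON_TABLE["".join(c)] for c in product(*options)}
```
            </p>
            <p>A codon such as gcn still encodes alanine only, because the third position is fully degenerate, whereas ray may encode asparagine or aspartate. Keeping the annotated amino acid sequence alongside the nucleotide sequence avoids having to guess in cases of the second kind.</p>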
         </sec>
      </sec>
      <sec>
         <st>
            <p>Utility</p>
         </st>
         <p>Most users will probably use the set of databases for the purpose it was primarily constructed for, <it>i. e. </it>by downloading the XML files and analyzing them with their own software tools to address their own research questions. Nevertheless, we also provide a web-based utility to browse the databases for sequence retrieval by COG-number, arCOG-number, protein name, and GI-number. For that purpose, we used the Apache HTTP Server and an SQL backend. The XML files were converted to tables of an SQL database, one for all COG data and the others for the nucleotide and amino acid sequences, respectively. The names of the proteins or protein domains were used as unique keys. Queries are made through a frontend written in PHP, which provides the option to select certain entries and display their corresponding amino acid and nucleotide sequences.</p>
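         <p>The three-table layout described above can be sketched with Python and an in-memory SQLite database. This is only an illustration: the SQL backend actually used is not specified here, and the table names, column names and function names below are hypothetical. The protein or domain name serves as the primary key of every table.</p>
         <p>
```python
import sqlite3


def build_database(entries):
    """Create the three tables and fill them.

    entries: iterable of (name, cog, gi, amino_acids, nucleotides) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE cog_data (name TEXT PRIMARY KEY, cog TEXT, gi TEXT);
        CREATE TABLE aa_seq (name TEXT PRIMARY KEY, sequence TEXT);
        CREATE TABLE nt_seq (name TEXT PRIMARY KEY, sequence TEXT);
        """
    )
    for name, cog, gi, amino_acids, nucleotides in entries:
        con.execute("INSERT INTO cog_data VALUES (?, ?, ?)", (name, cog, gi))
        con.execute("INSERT INTO aa_seq VALUES (?, ?)", (name, amino_acids))
        con.execute("INSERT INTO nt_seq VALUES (?, ?)", (name, nucleotides))
    con.commit()
    return con


def sequences_for_cog(con, cog):
    """Return (name, amino acid sequence, nucleotide sequence) rows of a COG."""
    rows = con.execute(
        "SELECT c.name, a.sequence, n.sequence FROM cog_data c "
        "JOIN aa_seq a ON a.name = c.name "
        "JOIN nt_seq n ON n.name = c.name "
        "WHERE c.cog = ? ORDER BY c.name",
        (cog,),
    )
    return rows.fetchall()
```
         </p>
         <p>Retrieval by GI-number or protein name would follow the same pattern with a different WHERE clause.</p>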
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>The COG and arCOG databases represent excellent collections of proteins (or protein domains). The version presented here, which includes amino acid and nucleotide sequences, allows sequence-related questions to be addressed with respect to orthologous proteins, <it>i. e. </it>proteins that are assumed to exhibit identical functions. For instance, one may ask whether enzymes involved in a certain metabolic pathway have constraints in their amino acid composition. This has been described for enzymes involved in tryptophan biosynthesis, since the 5 protein chains of the <it>E. coli trp </it>operon contain only 5 tryptophan residues <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. Indeed, the "cognate bias hypothesis", which states that early in evolutionary history the biosynthetic enzymes for amino acid X gradually lost residues of X <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, could elegantly be tested using the NUCOCOG files presented here. Questions related to deviations in nucleotide sequence composition, such as codon usage or GC-content, depending on the functions of the respective proteins could also be answered by exploring the XML files provided here. Furthermore, the COG framework has proved to be a powerful tool in conjunction with phylogenetic protein sequence distributions <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. The possibility to examine clade-specific features of nucleotide or amino acid sequences within the COG context could also yield more precise data than those obtained by simply comparing the sequences of whole genomes. For example, several studies deal with differences in sequence-specific properties between (hyper)thermophiles and mesophiles by comparing the sequence data of their complete genomes <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. 
Those surveys do not account for possible differences in sequence signatures that depend solely on the function of the respective protein rather than the phylogenetic relationship of the organisms under investigation. To refine such studies, only proteins derived from different organisms but exhibiting identical biochemical functions should be compared on a large scale rather than just comparing complete genomes. We constructed NUCOCOG with that intention, and our future work will address exactly this refinement: detecting thermophile-specific sequence signatures while accounting for possible distortions that arise from comparing functionally different proteins.</p>
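         <p>As an example of such a function-aware analysis, the following Python sketch averages the GC content of nucleotide sequences separately for each COG. It assumes that pairs of COG-number and nucleotide sequence have already been read from one of the XML files; the function names and the COG-numbers in the usage note are hypothetical. Ambiguous nucleotides are excluded from the denominator.</p>
         <p>
```python
from collections import defaultdict


def gc_content(seq):
    """Fraction of g and c among the unambiguous bases, or None if there are none."""
    seq = seq.lower()
    counted = sum(seq.count(base) for base in "acgt")
    if counted == 0:
        return None
    return (seq.count("g") + seq.count("c")) / counted


def mean_gc_by_cog(records):
    """Average GC content per COG.

    records: iterable of (cog_number, nucleotide_sequence) pairs.
    """
    values = defaultdict(list)
    for cog_number, seq in records:
        gc = gc_content(seq)
        if gc is not None:
            values[cog_number].append(gc)
    return {cog: sum(v) / len(v) for cog, v in values.items()}
```
         </p>
         <p>Restricting the records to thermophilic or mesophilic organisms before the call would allow the two groups to be compared COG by COG instead of genome by genome.</p>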
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>NUCOCOG is a version of the current COG and arCOG databases assembled in single XML files containing both amino acid and nucleotide sequences associated with their respective entries. In-depth analysis of these XML files makes it possible to investigate any sequence-specific property in the COG context, taking into account functional and phylogenetic relationships.</p>
      </sec>
      <sec>
         <st>
            <p>Availability and requirements</p>
         </st>
         <p>The NUCOCOG database can be browsed with any web browser at <url>http://www.uni-wh.de/nucocog</url> <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. In addition, the three XML database files and the source code are freely available at the same URL.</p>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>ASCII: American standard code for information interchange; BLAST: basic local alignment search tool; CDS: coding sequence; COG: cluster of orthologous groups; FTP: file transfer protocol; GBK: GenBank (file-extension <it>.gbk</it>); GC: guanine-cytosine; GI: geninfo identifier; HTTP: hypertext transfer protocol; MB: megabyte; NCBI: National Center for Biotechnology Information; NUCOCOG: nucleotide sequences containing COG; PERL: practical extraction and report language; PHP: PHP hypertext pre-processor; SQL: structured query language; URL: uniform resource locator; XML: extensible markup language</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>FM wrote the web interface and implemented the databases on our server. MK constructed the NUCOCOG database, conceived of the study, and wrote the manuscript. Both authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Daniela Kaufmann for her help in editing the manuscript and the staff of the Bereich f&#252;r Informationstechnologie at Witten/Herdecke University for their support in implementing NUCOCOG on our web servers.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>A genomic perspective on protein families</p>
            </title>
            <aug>
               <au>
                  <snm>Tatusov</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Science (New York, NY)</source>
            <pubdate>1997</pubdate>
            <volume>278</volume>
            <issue>5338</issue>
            <fpage>631</fpage>
            <lpage>637</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pubmed">9381173</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The COG database: an updated version includes eukaryotes</p>
            </title>
            <aug>
               <au>
                  <snm>Tatusov</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Fedorova</snm>
                  <fnm>ND</fnm>
               </au>
               <au>
                  <snm>Jackson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Jacobs</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Kiryutin</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Krylov</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Mazumder</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Mekhedov</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Nikolskaya</snm>
                  <fnm>AN</fnm>
               </au>
               <etal/>
            </aug>
            <source>BMC bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>41</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">222959</pubid>
                  <pubid idtype="pmpid" link="fulltext">12969510</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-4-41</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The COG database: a tool for genome-scale analysis of protein functions and evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Tatusov</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Galperin</snm>
                  <fnm>MY</fnm>
               </au>
               <au>
                  <snm>Natale</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <issue>1</issue>
            <fpage>33</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102395</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592175</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.33</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>The COG database: new developments in phylogenetic classification of proteins from complete genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Tatusov</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Natale</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Garkavtsev</snm>
                  <fnm>IV</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Shankavaram</snm>
                  <fnm>UT</fnm>
               </au>
               <au>
                  <snm>Rao</snm>
                  <fnm>BS</fnm>
               </au>
               <au>
                  <snm>Kiryutin</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Galperin</snm>
                  <fnm>MY</fnm>
               </au>
               <au>
                  <snm>Fedorova</snm>
                  <fnm>ND</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <issue>1</issue>
            <fpage>22</fpage>
            <lpage>28</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29819</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125040</pubid>
                  <pubid idtype="doi">10.1093/nar/29.1.22</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>The Role of the COG Database in Comparative and Functional Genomics</p>
            </title>
            <aug>
               <au>
                  <snm>Kaufmann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Current Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>1</volume>
            <issue>3</issue>
            <fpage>291</fpage>
            <lpage>300</lpage>
            <xrefbib>
               <pubid idtype="doi">10.2174/157489306777828017</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea</p>
            </title>
            <aug>
               <au>
                  <snm>Makarova</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Sorokin</snm>
                  <fnm>AV</fnm>
               </au>
               <au>
                  <snm>Novichkov</snm>
                  <fnm>PS</fnm>
               </au>
               <au>
                  <snm>Wolf</snm>
                  <fnm>YI</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
            </aug>
            <source>Biology direct</source>
            <pubdate>2007</pubdate>
            <volume>2</volume>
            <fpage>33</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2222616</pubid>
                  <pubid idtype="pmpid" link="fulltext">18042280</pubid>
                  <pubid idtype="doi">10.1186/1745-6150-2-33</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <url>ftp://ftp.ncbi.nih.gov/pub/COG/</url>
         </bibl>
         <bibl id="B8">
            <url>ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/arCOG</url>
         </bibl>
         <bibl id="B9">
            <url>ftp://ftp.ncbi.nih.gov/genomes/</url>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Artemis: sequence visualization and annotation</p>
            </title>
            <aug>
               <au>
                  <snm>Rutherford</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Parkhill</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Crook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Horsnell</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Rice</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Rajandream</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Barrell</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Bioinformatics (Oxford, England)</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>10</issue>
            <fpage>944</fpage>
            <lpage>945</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.10.944</pubid>
                  <pubid idtype="pmpid" link="fulltext">11120685</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Informatics Software: Artemis</p>
            </title>
            <url>http://www.sanger.ac.uk/Software/Artemis/</url>
         </bibl>
         <bibl id="B12">
            <title>
               <p>SourceForge.net:JAligner</p>
            </title>
            <url>http://sourceforge.net/projects/jaligner</url>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Basic local alignment search tool</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Journal of molecular biology</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <issue>3</issue>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">2231712</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Yeast gene TRP5: structure, function, regulation</p>
            </title>
            <aug>
               <au>
                  <snm>Zalkin</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Yanofsky</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>The Journal of biological chemistry</source>
            <pubdate>1982</pubdate>
            <volume>257</volume>
            <issue>3</issue>
            <fpage>1491</fpage>
            <lpage>1500</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">6276387</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Evolutionarily conserved optimization of amino acid biosynthesis</p>
            </title>
            <aug>
               <au>
                  <snm>Perlstein</snm>
                  <fnm>EO</fnm>
               </au>
               <au>
                  <snm>de Bivort</snm>
                  <fnm>BL</fnm>
               </au>
               <au>
                  <snm>Kunes</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Schreiber</snm>
                  <fnm>SL</fnm>
               </au>
            </aug>
            <source>Journal of molecular evolution</source>
            <pubdate>2007</pubdate>
            <volume>65</volume>
            <issue>2</issue>
            <fpage>186</fpage>
            <lpage>196</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-007-0013-x</pubid>
                  <pubid idtype="pmpid" link="fulltext">17684697</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>EPPS: mining the COG database by an extended phylogenetic patterns search</p>
            </title>
            <aug>
               <au>
                  <snm>Reichard</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kaufmann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Bioinformatics (Oxford, England)</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <issue>6</issue>
            <fpage>784</fpage>
            <lpage>785</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/btg089</pubid>
                  <pubid idtype="pmpid" link="fulltext">12691996</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>PCOGR: phylogenetic COG ranking as an online tool to judge the specificity of COGs with respect to freely definable groups of organisms</p>
            </title>
            <aug>
               <au>
                  <snm>Meereis</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Kaufmann</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>BMC bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>150</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">526202</pubid>
                  <pubid idtype="pmpid" link="fulltext">15488147</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-5-150</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Structural and genomic correlates of hyperthermostability</p>
            </title>
            <aug>
               <au>
                  <snm>Cambillau</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Claverie</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>The Journal of biological chemistry</source>
            <pubdate>2000</pubdate>
            <volume>275</volume>
            <issue>42</issue>
            <fpage>32383</fpage>
            <lpage>32386</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/jbc.C000497200</pubid>
                  <pubid idtype="pmpid" link="fulltext">10940293</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Genomic correlates of hyperthermostability, an update</p>
            </title>
            <aug>
               <au>
                  <snm>Suhre</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Claverie</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>The Journal of biological chemistry</source>
            <pubdate>2003</pubdate>
            <volume>278</volume>
            <issue>19</issue>
            <fpage>17198</fpage>
            <lpage>17202</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/jbc.M301327200</pubid>
                  <pubid idtype="pmpid" link="fulltext">12600994</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content</p>
            </title>
            <aug>
               <au>
                  <snm>Singer</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Hickey</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>2003</pubdate>
            <volume>317</volume>
            <issue>1&#8211;2</issue>
            <fpage>39</fpage>
            <lpage>47</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(03)00660-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">14604790</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>The Protein Chemistry Group&#183;NUCOCOG online</p>
            </title>
            <url>http://www.uni-wh.de/nucocog/</url>
         </bibl>
      </refgrp>
   </bm>
</art>
