<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2164-6-43</ui>
   <ji>1471-2164</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Protein encoding genes in an ancient plant: analysis of codon usage, retained genes and splice sites in a moss, <it>Physcomitrella patens</it></p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Rensing</snm>
               <mi>A</mi>
               <fnm>Stefan</fnm>
               <insr iid="I1"/>
               <email>stefan.rensing@biologie.uni-freiburg.de</email>
            </au>
            <au id="A2">
               <snm>Fritzowsky</snm>
               <fnm>Dana</fnm>
               <insr iid="I1"/>
               <email>dana@fritzowsky.net</email>
            </au>
            <au id="A3">
               <snm>Lang</snm>
               <fnm>Daniel</fnm>
               <insr iid="I1"/>
               <email>daniel.lang@biologie.uni-freiburg.de</email>
            </au>
            <au id="A4">
               <snm>Reski</snm>
               <fnm>Ralf</fnm>
               <insr iid="I1"/>
               <email>ralf.reski@biologie.uni-freiburg.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Plant Biotechnology, Faculty of Biology, University of Freiburg, Schaenzlestr. 1, 79104 Freiburg, Germany</p>
            </ins>
         </insg>
         <source>BMC Genomics</source>
         <issn>1471-2164</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>1</issue>
         <fpage>43</fpage>
         <url>http://www.biomedcentral.com/1471-2164/6/43</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15784153</pubid>
               <pubid idtype="doi">10.1186/1471-2164-6-43</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>24</day>
               <month>9</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>22</day>
               <month>3</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>22</day>
               <month>3</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Rensing et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The moss <it>Physcomitrella patens </it>is an emerging plant model system due to its high rate of homologous recombination, haploidy, simple body plan, physiological properties as well as phylogenetic position. Available EST data was clustered and assembled, and provided the basis for a genome-wide analysis of protein encoding genes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have clustered and assembled <it>Physcomitrella patens </it>EST and CDS data in order to represent the transcriptome of this non-seed plant. Clustering of the publicly available data and subsequent prediction resulted in a total of 19,081 non-redundant ORF. Of these putative transcripts, approximately 30% have a homolog in both the rice and <it>Arabidopsis </it>transcriptomes.</p>
               <p>More than 130 transcripts are not present in seed plants but can be found in other kingdoms. These potential "retained genes" might have been lost during seed plant evolution. Functional annotation of these genes reveals unequal distribution among taxonomic groups and intriguing putative functions such as cytotoxicity and nucleic acid repair.</p>
               <p>Whereas introns in the moss are larger on average than in the seed plant <it>Arabidopsis thaliana</it>, the position and number of introns are approximately the same. In contrast to <it>Arabidopsis</it>, where CDS contain on average 44% G/C, the average G/C content in <it>Physcomitrella </it>is 50%. Interestingly, moss orthologs of <it>Arabidopsis </it>genes show a significant drift of codon fraction usage towards the seed plant. While averaged codon bias is the same in <it>Physcomitrella </it>and <it>Arabidopsis</it>, the distribution pattern is different, with 15% of moss genes being unbiased.</p>
               <p>Species-specific, sensitive and selective splice site prediction for <it>Physcomitrella </it>has been developed using a dataset of 368 donor and acceptor sites, utilizing a support vector machine. The prediction accuracy is better than that achieved with tools trained on <it>Arabidopsis </it>data.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>Analysis of the moss transcriptome reveals differences in gene structure, codon and splice site usage in comparison with the seed plant <it>Arabidopsis</it>. Putative retained genes exhibit possible functions that might explain the peculiar physiological properties of mosses.</p>
               <p>Both the transcriptome representation (including a BLAST and retrieval service) and splice site prediction have been made available on <url>http://www.cosmoss.org</url>, setting the basis for assembly and annotation of the <it>Physcomitrella </it>genome, of which draft shotgun sequences will become available in 2005.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Flowering plants have developed from a common ancestor with mosses, liverworts, ferns, and gymnosperms over the last 450 million years <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Most recent angiosperms do not closely resemble their ancestors, as known from the fossil record. Quite a few gymnosperms (like <it>Ginkgo </it>or <it>Cycas</it>) still resemble the plants known from the fossil record, and this is even more true for "lower" land plants, namely mosses, liverworts and ferns <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. In addition, mosses seem to evolve with a slow molecular clock <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. So, if these plants appear to be more ancient than modern flowering plants as measured by morphological means and mutation rate, does this also hold true for how they employ their genetic system?</p>
         <p>Many comparative studies on protein encoding genes have already been carried out within and between the two major groups of flowering plants, mono- and dicotyledons, with rice (<it>Oryza sativa</it>) and <it>Arabidopsis thaliana </it>as the most prominent examples. Currently more than two million <it>Liliopsida </it>(monocotyledons) EST are publicly available; the corresponding number for the <it>Magnoliophyta </it>(dicotyledons) even exceeds four million sequences. However, sequence information for other plant phyla is still scarce. There are only about 160,000 EST sequences available for both <it>Coniferophyta </it>(part of the gymnosperms) and <it>Chlorophyta </it>(green algae), 130,000 from <it>Bryophyta </it>(mosses) and 3,700 from <it>Filicophyta </it>(ferns) (all numbers from Genbank). For the moss <it>Physcomitrella patens</it>, more than 102,000 nucleic acid sequences (mainly EST) are publicly available to date. This "ancient" land plant is therefore an ideal candidate to unravel some details about how simple plants encode proteins and whether they do so in a different manner from "modern" plants, as represented by the monocotyledon rice and the dicotyledon <it>Arabidopsis </it>in this study.</p>
         <p><it>Physcomitrella </it>is increasingly being used as a model plant because of its unrivalled capability among plants to integrate ectopic DNA into its genome by means of homologous recombination (see e.g. <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>), thus enabling gene replacement in a straightforward manner. As in all mosses, the haploid gametophyte is the dominant generation in the heteromorphic life cycle. In this respect the moss is different from seed plants (gymnosperms and flowering plants), in which the diploid sporophyte dominates the life cycle. It has been argued before <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp> that the set of genes of the respective dominant generation is equivalent, although a large proportion of moss transcripts cannot currently be assigned a putative function. These "orphan" genes might encode functions that are specific to mosses and are not present in other taxonomic groups. Besides species-specific orphan genes, mosses might also possess retained genes that have been lost in seed plants during evolution. Both types of genes are candidates to encode functions that make mosses unique in terms of physiology and metabolism. For example, <it>Physcomitrella </it>exhibits increased tolerance towards abiotic stresses <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>, uses proteins derived from the same gene in different cellular compartments by dual targeting <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp> and displays secondary metabolite pathways not known in seed plants <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>. In this study, following up on the initial analyses by Nishiyama et al. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, we aimed to increase our knowledge of the moss transcriptome.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Comparative BLAST searches</p>
            </st>
            <p>Around 30% of the <it>Physcomitrella </it>ORF have homologs in both the rice and <it>Arabidopsis </it>transcriptomes, whereas 80% of the <it>Arabidopsis </it>genes have a homolog in rice and 40% of the rice genes in <it>Arabidopsis </it>(Fig. <figr fid="F1">1</figr>). Although these numbers are lower than the actual number of sequence homologs because of filtering (see below), they indicate that <it>Physcomitrella </it>contains many as yet unknown protein encoding genes that might be specific for mosses. A homology search against the taxprot dataset (Table <tblr tid="T1">1</tblr>, Fig. <figr fid="F2">2</figr>) reveals that 45.8% of the predicted moss ORF find a hit in plants (E-value threshold 1E-4); after rigorous filtering 28.1% remain, and 21.7% are non-redundant, i.e. do not match multiple subject sequences. The rigorous filtering (see methods for details) for true homologs thus necessarily decreases the set of available sequences, so that false conclusions are not made based on comparison of non-homologous sequences.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Comparative BLAST searches between <it>Arabidopsis</it>, rice and moss</p>
               </caption>
               <text>
                  <p><b>Comparative BLAST searches between <it>Arabidopsis</it>, rice and moss</b>. Comparative BLAST searches of the <it>Arabidopsis </it>(At, yellow), rice (Os, cyan) and <it>Physcomitrella </it>(Pp, green) transcriptomes. Each search was done with the respective sets once as query and once as search space (subject). The area of the circles represents the percentage of the query/subject sequence space that yielded filtered hits.</p>
               </text>
               <graphic file="1471-2164-6-43-1"/>
            </fig>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Taxonomic constitution of the taxprot dataset</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>
                           <b>taxonomic group</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b>txid</b>
                        </p>
                     </c>
                     <c ca="right">
                        <p>
                           <b># of sequences</b>
                           <sup>1</sup>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Metazoa</p>
                     </c>
                     <c ca="right">
                        <p>33208</p>
                     </c>
                     <c ca="right">
                        <p>862,420</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Fungi</p>
                     </c>
                     <c ca="right">
                        <p>4751</p>
                     </c>
                     <c ca="right">
                        <p>184,282</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Viridiplantae (plants and green algae)</p>
                     </c>
                     <c ca="right">
                        <p>33090</p>
                     </c>
                     <c ca="right">
                        <p>293,156</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Non-green algae<sup>2</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>21,889</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Other Eukaryotes<sup>3</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>49,732</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Eubacteria (without Cyanobacteria)</p>
                     </c>
                     <c ca="right">
                        <p>2</p>
                     </c>
                     <c ca="right">
                        <p>1,386,089</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cyanobacteria</p>
                     </c>
                     <c ca="right">
                        <p>1117</p>
                     </c>
                     <c ca="right">
                        <p>94,920</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Archaea</p>
                     </c>
                     <c ca="right">
                        <p>2157</p>
                     </c>
                     <c ca="right">
                        <p>122,394</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Viruses</p>
                     </c>
                     <c ca="right">
                        <p>10239</p>
                     </c>
                     <c ca="right">
                        <p>331,246</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Total</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>
                           <b>3,346,128</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><sup>1</sup>Genbank amino acid sequences as of 2004-04-07, NCBI taxon ids are shown under "txid", all taxonomic crown groups with at least 100 sequence members were used; <sup>2</sup>Cercozoa [136419], Cryptophyta [3027], Euglenozoa [33682], Glaucocystophyceae [38254], Haptophyceae [2830], Rhodophyta [2763], Stramenopiles [33634]; <sup>3</sup>Acanthamoebidae [33677], Alveolata [33630], Diplomonadida [207245], Entamoebidae [33084], Heterolobosea [5752], Jakobidae [143015], Mycetozoa [142796], Parabasalidea [5719]</p>
               </tblfn>
            </tbl>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>BLAST hits of <it>Physcomitrella </it>protein genes against the taxprot dataset</p>
               </caption>
               <text>
                  <p><b>BLAST hits of <it>Physcomitrella </it>protein genes against the taxprot dataset</b>. a) Absolute number of hits against different taxonomic groups. b) Amount of non-redundant hits as percentage of the respective sequence space.</p>
               </text>
               <graphic file="1471-2164-6-43-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Full-length transcripts</p>
            </st>
            <p>The total number of clusters after EST clustering does not equal the number of protein encoding genes. This is mainly due to partial (as opposed to full-length) transcripts, i.e. a single gene is represented by more than one sequence because the partial sequences do not overlap. How many of the clustered public (PPP) transcripts represent full-length coding sequences? Of those sequences that yield a filtered hit against plant mRNAs, 7.9% are putatively full-length. Of those, 53.9% start with methionine; of the latter, 32.4% contain no X (X represents an indeterminable codon, which can be included by the ORF prediction).</p>
         </sec>
         <sec>
            <st>
               <p>Orthologs, paralogs and mapping</p>
            </st>
            <p>The filtered hits against the <it>Arabidopsis </it>transcriptome (1,994 in total) were divided into non-redundant orthologs (722) and paralogs (1,015). As <it>Arabidopsis </it>orthologs, we defined all those sequences for which the initial subject matches the query in the reverse search (reciprocal hit). Paralogs were defined as those sequences for which this rule does not apply. This method of detecting potential orthologs has been used previously for cross-species comparisons (e.g. <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>). The three sequence sets were mapped against the <it>Arabidopsis </it>chromosomes using BLAST (Fig. <figr fid="F3">3</figr>). The distribution pattern clearly reveals the centromeric regions but otherwise does not display significant differences. Although there are some chromosome and sequence set-specific differences in the rate of hits per Mbp, these are not significant as measured by absolute average deviation.</p>
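            <p>The reciprocal-hit rule described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: it assumes that the best filtered hit of every sequence has already been extracted for both search directions, and the function name and all sequence identifiers are hypothetical.</p>
            <p>
```python
from typing import Dict, Set, Tuple


def split_orthologs_paralogs(
    forward_best: Dict[str, str],
    reverse_best: Dict[str, str],
) -> Tuple[Set[str], Set[str]]:
    """Split queries with a filtered hit into orthologs and paralogs.

    forward_best maps each moss query to its best Arabidopsis subject;
    reverse_best maps each Arabidopsis sequence to its best moss hit when
    the search is run in the opposite direction.
    """
    orthologs: Set[str] = set()
    paralogs: Set[str] = set()
    for query, subject in forward_best.items():
        # Reciprocal hit: the subject's own best hit points back at the query.
        if reverse_best.get(subject) == query:
            orthologs.add(query)
        else:
            paralogs.add(query)
    return orthologs, paralogs


# Hypothetical identifiers, for illustration only.
forward = {"Pp_1": "At_a", "Pp_2": "At_a", "Pp_3": "At_b"}
reverse = {"At_a": "Pp_1", "At_b": "Pp_9"}
orthologs, paralogs = split_orthologs_paralogs(forward, reverse)
print(sorted(orthologs))  # ['Pp_1']
print(sorted(paralogs))   # ['Pp_2', 'Pp_3']
```
            </p>
            <p>In this sketch, two moss sequences that share the same best <it>Arabidopsis </it>subject cannot both be orthologs, because the reverse search returns only one of them.</p>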
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Mapping of <it>Physcomitrella </it>transcripts to the <it>Arabidopsis </it>chromosomes</p>
               </caption>
               <text>
                  <p><b>Mapping of <it>Physcomitrella </it>transcripts to the <it>Arabidopsis </it>chromosomes</b>. Mapping of filtered BLAST hits (grey), paralogs (red) and orthologs (green) against the five <it>Arabidopsis </it>chromosomes (left to right / top to bottom). a) Hits per Mbp; error bars: average absolute deviation (AAD); column 6: mean values. b) Graphical representation using a finer granularity (100 kbp), each vertical step represents one hit.</p>
               </text>
               <graphic file="1471-2164-6-43-3"/>
            </fig>
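            <p>The hit density and the average absolute deviation (AAD) shown in the figure can be computed along the following lines. This is a sketch under stated assumptions: the AAD is taken here as the mean absolute deviation from the arithmetic mean, calculated across chromosomes, and the hit counts and chromosome lengths are placeholders rather than values from this study.</p>
            <p>
```python
from statistics import mean
from typing import Dict, List


def average_absolute_deviation(values: List[float]) -> float:
    """Mean of the absolute deviations from the arithmetic mean."""
    centre = mean(values)
    return mean(abs(v - centre) for v in values)


def hits_per_mbp(hits: Dict[str, int], length_mbp: Dict[str, float]) -> Dict[str, float]:
    """Normalise the hit count of each chromosome by its length in Mbp."""
    return {name: hits[name] / length_mbp[name] for name in hits}


# Placeholder counts and lengths, NOT values from the paper.
hits = {"chr1": 60, "chr2": 30, "chr3": 50}
length_mbp = {"chr1": 30.0, "chr2": 20.0, "chr3": 25.0}

density = hits_per_mbp(hits, length_mbp)
aad = average_absolute_deviation(list(density.values()))
print(density)        # {'chr1': 2.0, 'chr2': 1.5, 'chr3': 2.0}
print(round(aad, 3))  # 0.222
```
            </p>
            <p>A difference between two densities that is smaller than the AAD would, by this measure, not be considered significant.</p>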
         </sec>
         <sec>
            <st>
               <p>Taxonomic distribution and retained genes</p>
            </st>
            <p>The highest total number of non-redundant, filtered BLAST hits is (as expected) derived from the plant subset of the taxprot dataset (Table <tblr tid="T1">1</tblr>, Fig. <figr fid="F2">2a</figr>), followed by the animal and fungi subsets. When looking at the hits as percentage of the search space size (Fig. <figr fid="F2">2b</figr>), it becomes evident that a considerable proportion of the sequence space of lower eukaryotes (including non-green algae) is covered. The comparatively high coverage of this "ancient" gene space suggests that mosses share many specialized genes with unicellular organisms.</p>
            <p>A total of 134 <it>Physcomitrella </it>ORF have their best BLAST hit outside the plants (Fig. <figr fid="F4">4a</figr>). Consequently, these are candidates for horizontal gene transfer or, more likely, retained genes that were lost in seed plants during evolution. We took a closer look at the 57 transcripts that are specific to a single taxonomic group, namely bacteria, cyanobacteria, animals or fungi (unique hits). For 25 of those, a putative function could be assigned manually (Fig. <figr fid="F4">4b</figr>, Table <tblr tid="T2">2</tblr>). The broad functional categories of these taxon-specific retained genes are to some extent unevenly distributed. Whereas transport associated proteins are found solely among fungi, signal transduction gene products are found in both bacteria and animals. Transport and metabolism associated gene products support the wealth of secondary pathways found in moss (e.g., <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>), whereas the signal transduction genes also separate the moss from seed plants in this regard. Of special interest are two other functional categories among these candidate retained genes: cytotoxicity and nucleic acid modification. A broad range of cytotoxic abilities might explain why mosses can survive in moist environments mainly unplagued by microbial parasites, without the protection of a cuticle. Furthermore, it has so far been puzzling why <it>Physcomitrella </it>is able to integrate ectopic DNA into the genome by homologous recombination at an extraordinarily high rate <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp> so far only found in bacteria and yeast, but in no other plant or any animal. Hints to unravel this mystery might be found in the presence of genes involved in DNA repair, binding and modification, such as those found in this study.</p>
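            <p>Expressing hits as a percentage of the search space amounts to dividing the number of non-redundant hits per taxonomic group by the size of that group in the taxprot dataset. The following sketch uses the group sizes from Table 1; the hit counts are placeholders and not results of this study, and the function name is hypothetical.</p>
            <p>
```python
from typing import Dict

# Sequence counts per taxonomic group, copied from Table 1.
TAXPROT_SIZE: Dict[str, int] = {
    "Metazoa": 862_420,
    "Fungi": 184_282,
    "Viridiplantae": 293_156,
    "Non-green algae": 21_889,
    "Other Eukaryotes": 49_732,
    "Eubacteria": 1_386_089,
    "Cyanobacteria": 94_920,
    "Archaea": 122_394,
    "Viruses": 331_246,
}


def percent_of_search_space(hits: Dict[str, int]) -> Dict[str, float]:
    """Express non-redundant hits as a percentage of each group's size."""
    return {group: 100.0 * n / TAXPROT_SIZE[group] for group, n in hits.items()}


# The group sizes add up to the total given in Table 1.
total = sum(TAXPROT_SIZE.values())
print(total)  # 3346128

# Placeholder hit counts, NOT results from the paper.
example_hits = {"Viridiplantae": 4000, "Non-green algae": 400}
for group, pct in percent_of_search_space(example_hits).items():
    print(group, round(pct, 2))
```
            </p>
            <p>Because the groups differ in size by almost two orders of magnitude, a small absolute number of hits against a small group can correspond to a larger covered fraction than many hits against a large group.</p>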
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Retained genes in moss: taxonomic distribution and functional categories</p>
               </caption>
               <text>
                  <p><b>Retained genes in moss: taxonomic distribution and functional categories</b>. a) <it>Physcomitrella </it>transcripts which have their best BLAST hit not among plants, divided by taxonomic category, further subdivided into specific hits (unique to a single taxonomic group &#8211; yellow) and those that could be assigned a putative function by means of homology searches (green). b) Distribution of functional categories among those taxonomic groups that yielded unique hits.</p>
               </text>
               <graphic file="1471-2164-6-43-4"/>
            </fig>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Functional annotation of retained genes into broad functional categories. Assembled transcripts can be retrieved via <url>http://www.cosmoss.org</url>.</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="center">
                        <p>
                           <b>Pp transcript (putative retained gene)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>taxonomic group</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>homolog</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>broad functional category (potential)</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>functional category</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>details</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_2925_C1</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>Membrane-bound lytic murein transglycosylase B</p>
                     </c>
                     <c ca="center">
                        <p>cytotoxicity</p>
                     </c>
                     <c ca="center">
                        <p>murein degradation</p>
                     </c>
                     <c ca="center">
                        <p>murein-degrading enzyme, may play a role in recycling of muropeptides during cell</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>BJ203770</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>putative protease</p>
                     </c>
                     <c ca="center">
                        <p>cytotoxicity</p>
                     </c>
                     <c ca="center">
                        <p>protease</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_4234_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>cytolysin I</p>
                     </c>
                     <c ca="center">
                        <p>cytotoxicity</p>
                     </c>
                     <c ca="center">
                        <p>cytotoxicity</p>
                     </c>
                     <c ca="center">
                        <p>involved in pore-formation</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_3510_C1</p>
                     </c>
                     <c ca="center">
                        <p>cyano</p>
                     </c>
                     <c ca="center">
                        <p>RTX toxins and related Ca2+-binding proteins</p>
                     </c>
                     <c ca="center">
                        <p>cytotoxicity</p>
                     </c>
                     <c ca="center">
                        <p>cytotoxicity</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_1172_C1</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>Enoyl-CoA hydratase/carnitine racemase</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>fatty acid metabolism</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_6629_C1</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>mannosylglycerate synthase</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>sugar metabolism</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_5746_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>L-kynurenine 3-monooxygenase Fpk</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>amino acid metabolism</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_8479_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>COMMD2</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>copper metabolism</p>
                     </c>
                     <c ca="center">
                        <p>COMM (copper metabolism MURR1) domain containing 2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>BJ173412</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>ubiquitin</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>protein metabolism</p>
                     </c>
                     <c ca="center">
                        <p>ribosomal protein in C. elegans dehydrogenases</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_3987_C1</p>
                     </c>
                     <c ca="center">
                        <p>fungi</p>
                     </c>
                     <c ca="center">
                        <p>MNN9</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>N-glycosylation</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_6514_C1</p>
                     </c>
                     <c ca="center">
                        <p>cyano</p>
                     </c>
                     <c ca="center">
                        <p>oxidoreductase</p>
                     </c>
                     <c ca="center">
                        <p>metabolism</p>
                     </c>
                     <c ca="center">
                        <p>energy metabolism</p>
                     </c>
                     <c ca="center">
                        <p>related to aryl-alcohol dehydrogenases</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_11394_C1</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>homolog of eukaryotic DNA ligase III</p>
                     </c>
                     <c ca="center">
                        <p>nucleic acid binding / modification</p>
                     </c>
                     <c ca="center">
                        <p>DNA repair</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>BJ191550</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>formamidopyrimidine-DNA glycosylase</p>
                     </c>
                     <c ca="center">
                        <p>nucleic acid binding / modification</p>
                     </c>
                     <c ca="center">
                        <p>DNA repair</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>BJ160862</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>Osa1 nuclear protein</p>
                     </c>
                     <c ca="center">
                        <p>nucleic acid binding / modification</p>
                     </c>
                     <c ca="center">
                        <p>DNA binding</p>
                     </c>
                     <c ca="center">
                        <p>chromatin regulation</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>BJ582496</p>
                     </c>
                     <c ca="center">
                        <p>cyano</p>
                     </c>
                     <c ca="center">
                        <p>SAM-dependent methyltransferase</p>
                     </c>
                     <c ca="center">
                        <p>nucleic acid binding / modification</p>
                     </c>
                     <c ca="center">
                        <p>nucleic acid modification</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_2586_C1</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>CarD protein</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>DNA binding</p>
                     </c>
                     <c ca="center">
                        <p>leucine zipper transcription factor, light- and starvation-induced response</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_3689_C1</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>serine/threonine protein kinase</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>BJ172132</p>
                     </c>
                     <c ca="center">
                        <p>bacteria</p>
                     </c>
                     <c ca="center">
                        <p>serine/threonine protein kinase</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_460_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>HLA-B-associated transcript</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_1041_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>calcium/calmodulin-dependent protein kinase II delta</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_6326_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>tumor suppressor tout-velu</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>involved in diffusion of hedgehog</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_11399_C1</p>
                     </c>
                     <c ca="center">
                        <p>metazoa</p>
                     </c>
                     <c ca="center">
                        <p>dual-specificity tyrosine phosphatase YVH1</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>signal transduction</p>
                     </c>
                     <c ca="center">
                        <p>Non-receptor class dual specificity subfamily</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_184_C1</p>
                     </c>
                     <c ca="center">
                        <p>fungi</p>
                     </c>
                     <c ca="center">
                        <p>high-affinity iron permease</p>
                     </c>
                     <c ca="center">
                        <p>transport</p>
                     </c>
                     <c ca="center">
                        <p>transport</p>
                     </c>
                     <c ca="center">
                        <p>high affinity iron uptake</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_7115_C2</p>
                     </c>
                     <c ca="center">
                        <p>fungi</p>
                     </c>
                     <c ca="center">
                        <p>uric acid-xanthine permease</p>
                     </c>
                     <c ca="center">
                        <p>transport</p>
                     </c>
                     <c ca="center">
                        <p>transport</p>
                     </c>
                     <c ca="center">
                        <p>belongs to the Xanthine/Uracil permeases family</p>
                     </c>
                  </r>
                  <r>
                     <c ca="center">
                        <p>PPP_11191_C1</p>
                     </c>
                     <c ca="center">
                        <p>fungi</p>
                     </c>
                     <c ca="center">
                        <p>inorganic phosphate transporter</p>
                     </c>
                     <c ca="center">
                        <p>transport</p>
                     </c>
                     <c ca="center">
                        <p>transport</p>
                     </c>
                     <c ca="center">
                        <p>probable inorganic phosphate transporter; yeast pho99 homologue</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Gene structure and splice sites</p>
            </st>
            <p>The average number of introns per gene (~5) is the same in <it>Physcomitrella</it>, <it>Arabidopsis </it>and human <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. The average <it>Physcomitrella </it>intron (252 bp) is longer than that of <it>Arabidopsis </it>(146 bp) but shorter than the typical human intron (740 bp). Furthermore, the average <it>Physcomitrella </it>intron is longer than the average exon, whereas in <it>Arabidopsis </it>the reverse is true. While the size distribution of <it>Arabidopsis </it>introns is centered around 70 bp, the longer moss introns cluster mainly around 180 bp (data not shown). This fits the weak correlation between intron length and genome size generally found in eukaryotic organisms <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Intron positions of close homologs between <it>Physcomitrella </it>and <it>Arabidopsis </it>are generally conserved (e.g., <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>).</p>
            <p>The <it>Physcomitrella </it>G/C content of 40% in the introns and 50% in the exons differs significantly from that of <it>Arabidopsis </it>(33% and 44%, respectively). Generally, <it>Physcomitrella </it>introns contain more thymine (T) than the exons. In terms of mononucleotide composition, T is overrepresented in the introns and C is underrepresented in the exons. In terms of dinucleotides, TT is significantly overrepresented in the introns. Among trinucleotides, the overrepresented TTT in the introns and the stop codon TGA in the exons stand out, whereas the other two stop codons, TAA and TAG, are underrepresented in the moss.</p>
            <p>A visualisation of the <it>Physcomitrella </it>donor and acceptor sites is shown in Fig. <figr fid="F5">5a</figr>. Comparison of the <it>Arabidopsis</it>-trained Netplantgene <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> and the <it>Physcomitrella</it>-trained svmsplice (Fig. <figr fid="F5">5b</figr>) reveals a better overall performance of the support vector machine. Although Netplantgene exhibits a high recall, its precision is low because of the large number of false-positive predictions. Svmsplice predicts fewer true positives (and thus has a lower recall); however, its precision is much better. The mean values of recall and precision for both donor and acceptor sites are higher for svmsplice, which makes it the method of choice for accurate prediction of <it>Physcomitrella </it>splice sites.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Splice site sequence logos and efficiency of splice site prediction</p>
               </caption>
               <text>
                  <p><b>Splice site sequence logos and efficiency of splice site prediction</b>. a) Sequence logos of <it>Physcomitrella </it>donor and acceptor sites. b) Prediction performance of Netplantgene and svmsplice for <it>Physcomitrella </it>splice sites. TP = true positive, FN = false negative, FP = false positive, measured on the left-hand (%) axis. Recall (sensitivity) = TP/(TP+FN), precision = TP/(TP+FP), measured on the right-hand axis.</p>
               </text>
               <graphic file="1471-2164-6-43-5"/>
            </fig>
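            <p>The recall and precision definitions given in the legend of Figure 5 can be written out as a short Python sketch. The function names and the counts used in the example are hypothetical and are not taken from Figure 5b; they only illustrate how a predictor with many false positives can combine high recall with low precision.</p>

```python
def recall(tp, fn):
    """Recall (sensitivity) = TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else 0.0


# Hypothetical counts for two predictors (NOT the values behind Figure 5b).
predictors = {
    "many false positives": {"tp": 90, "fn": 10, "fp": 210},
    "few false positives": {"tp": 70, "fn": 30, "fp": 10},
}

for name, c in predictors.items():
    r = recall(c["tp"], c["fn"])
    p = precision(c["tp"], c["fp"])
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
```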
         </sec>
         <sec>
            <st>
               <p>Composition of coding sequences and codon usage</p>
            </st>
            <p>Significant differences in codon fraction usage for the three above-mentioned sequence subsets (<it>Arabidopsis </it>orthologs and paralogs, retained genes) when compared with the averaged codon usage in <it>Physcomitrella </it>and <it>Arabidopsis </it>are shown in Fig. <figr fid="F6">6a</figr>. The average G/C content of the <it>Arabidopsis </it>CDS is ~43%, whereas it is ~50% for <it>Physcomitrella </it>(Table <tblr tid="T3">3</tblr>). It might be argued that the EST-based estimate of G/C content in <it>Physcomitrella </it>is too high because of potential decay of AT-rich sequences <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. However, when the G/C content is calculated for all 399 available full-length CDS from GenBank, the value is also ~50% (50.67%). This value is also found in the retained genes and the <it>Arabidopsis </it>paralogs (Table <tblr tid="T3">3</tblr>), whereas the ortholog fraction has a significantly lower G/C content of ~49%, i.e. it is shifted towards the <it>Arabidopsis </it>nucleotide composition. Codon bias in <it>Physcomitrella </it>is positively correlated with gene expression level and G/C content of the CDS <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. It has been argued that weak natural selection for translational efficiency, rather than mutational bias, is the driving force behind codon bias in the moss. Given the G/C content of 50% in the CDS, a mutational bias indeed seems unlikely.</p>
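            <p>As a minimal sketch of the G/C calculation discussed above, the following Python snippet computes the percentage G/C content of a single sequence and of a pooled sequence set. The function names are illustrative, and pooling over all bases (rather than averaging per-sequence values) is an assumption, since the text does not state which variant was used. The example sequences are made up.</p>

```python
def gc_content(sequence):
    """Return the G/C content in percent, ignoring symbols other than A, C, G, T."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    counted = gc + at
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / counted


def pooled_gc_content(sequences):
    """G/C content in percent over a whole sequence set.

    Assumption: every base is weighted equally, so long sequences
    contribute more than short ones.
    """
    return gc_content("".join(sequences))


# Made-up example sequences, not real Physcomitrella or Arabidopsis CDS.
example_set = ["ATGGCGTGCTAA", "ATGAAATTTTAG"]
for cds in example_set:
    print(cds, round(gc_content(cds), 2))
print("pooled", round(pooled_gc_content(example_set), 2))
```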
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Trinucleotide frequencies and codon usage</p>
               </caption>
               <text>
                  <p><b>Trinucleotide frequencies and codon usage</b>. a) The averaged <it>Physcomitrella </it>codon fraction usage measured as percentage of the total amount of counted codons is shown as grey diamonds, including a margin of 2&#215; average absolute deviation (AAD, error bars), in comparison with <it>Arabidopsis </it>(yellow circles). Significantly deviating codons of the sequence subsets are presented as colored circles, namely retained genes (blue), paralogs (red) and orthologs (green). b) The effective number of codons (enc) for <it>Physcomitrella </it>(green) and <it>Arabidopsis </it>(yellow) as a range distribution scatter plot (y axis: % of analysed genes) and as averaged values (horizontal bar chart; error bars: standard deviation).</p>
               </text>
               <graphic file="1471-2164-6-43-6"/>
            </fig>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Codon usage of <it>Physcomitrella </it>retained genes, orthologs and paralogs</p>
               </caption>
               <tblbdy cols="10">
                  <r>
                     <c ca="left">
                        <p>
                           <b>sequence set</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b># bases</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>mean # of each triplet</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>G/C content</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b># significant codon usage changes</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>codon usage towards At</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>codon usage away from At</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>codon over represented</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>codon under represented</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>significant changes per aa</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>At mRNAs</p>
                     </c>
                     <c ca="right">
                        <p>10,755,859</p>
                     </c>
                     <c ca="right">
                        <p>56,020</p>
                     </c>
                     <c ca="center">
                        <p>43.32</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Pp ORFs</p>
                     </c>
                     <c ca="right">
                        <p>7,638,122</p>
                     </c>
                     <c ca="right">
                        <p>39,782</p>
                     </c>
                     <c ca="center">
                        <p>49.94</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                     <c ca="center">
                        <p>n.a.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>retained genes</p>
                     </c>
                     <c ca="right">
                        <p>77,998</p>
                     </c>
                     <c ca="right">
                        <p>406</p>
                     </c>
                     <c ca="center">
                        <p>50.30</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>6</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>Phe under represented</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>paralogs</p>
                     </c>
                     <c ca="right">
                        <p>1,115,937</p>
                     </c>
                     <c ca="right">
                        <p>5,812</p>
                     </c>
                     <c ca="center">
                        <p>50.07</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>none</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>orthologs</p>
                     </c>
                     <c ca="right">
                        <p>953,293</p>
                     </c>
                     <c ca="right">
                        <p>4,965</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>49.04</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>8</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>Pro under represented</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>sum</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>7</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The predicted <it>Physcomitrella </it>ORFs were used as background to check for significant changes in percentage codon fraction usage in the orthologs, paralogs and retained genes (best BLAST hit not among plants). In case of a significant deviation from the total set (more than two times the average absolute deviation, AAD), the direction of the change relative to the <it>Arabidopsis </it>codon usage was checked. Significant deviations are shown enlarged. At = <it>Arabidopsis thaliana</it>, Pp = <it>Physcomitrella patens</it>.</p>
               </tblfn>
            </tbl>
            <p>In retained genes, phenylalanine codons are underrepresented; in the orthologs, this is the case for proline codons. As can also be seen from the G/C content drift mentioned above, the majority of deviating codons in the orthologs changed in the direction of the <it>Arabidopsis </it>percentage usage. For retained genes, the opposite holds: the significantly deviating codons in these genes point away from the <it>Arabidopsis </it>codon fraction usage. Orthologs are thought to be functionally equivalent across taxonomic groups. The common ancestor of land plants might have had a G/C content similar to that of mosses, i.e. around 50%. In order to preserve efficient functioning of orthologs, it might have been necessary to evolve a slightly different codon usage for these genes in mosses, as is the case, for example, in <it>Arabidopsis</it>. The retained genes, on the other hand, are not found in seed plants and do not reflect the codon usage found there.</p>
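            <p>The significance test described in the footnote of Table 3 (a deviation of more than two times the AAD from the total set, followed by a check of the direction relative to <it>Arabidopsis</it>) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout and all numbers are hypothetical, the AAD values are taken as given inputs because the text does not specify how they were computed, and a codon whose background value already equals the <it>Arabidopsis </it>value is arbitrarily labelled "away".</p>

```python
def classify_deviations(background, aad, subset, reference, factor=2.0):
    """Flag codons whose usage in a subset deviates significantly.

    background: codon -> percentage usage in the total set (here: Pp ORFs)
    aad:        codon -> average absolute deviation (assumed to be given)
    subset:     codon -> percentage usage in the subset under test
    reference:  codon -> percentage usage in the reference species (here: At)

    Returns codon -> (representation, direction) for significant codons only.
    """
    flagged = {}
    for codon, base in background.items():
        delta = subset[codon] - base
        if abs(delta) > factor * aad[codon]:
            representation = "over" if delta > 0 else "under"
            # "towards" means the change has the same sign as the
            # difference between the reference and the background.
            towards = delta * (reference[codon] - base) > 0
            flagged[codon] = (representation, "towards" if towards else "away")
    return flagged


# Hypothetical percentages for three codons (not values from Table 3 or Fig. 6a).
pp_total = {"TTT": 2.0, "CCG": 1.0, "GCT": 3.0}
pp_aad = {"TTT": 0.2, "CCG": 0.1, "GCT": 0.3}
orthologs = {"TTT": 2.5, "CCG": 0.7, "GCT": 3.1}
at_total = {"TTT": 2.2, "CCG": 1.5, "GCT": 2.8}

for codon, call in classify_deviations(pp_total, pp_aad, orthologs, at_total).items():
    print(codon, call)
```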
            <p>The average number of synonymous codons that are used in <it>Physcomitrella </it>and <it>Arabidopsis </it>CDS is not significantly different (Fig. <figr fid="F6">6b</figr>, bar chart). However, the percentage distribution of synonymous codon usage, as measured by the effective number of codons (enc), is surprisingly dissimilar (Fig. <figr fid="F6">6b</figr>, scatter plot). Most <it>Arabidopsis </it>coding sequences use many synonymous codons (enc 45&#8211;59), whereas <it>Physcomitrella </it>displays a linear percentage increase from low to high values. Interestingly, around 15% of the moss genes show no codon bias at all (enc = 61).</p>
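            <p>To illustrate the enc scale used above, the following Python sketch computes the effective number of codons for a single coding sequence. It assumes that enc refers to Wright's effective number of codons under the standard genetic code, which ranges from 20 (one codon per amino acid) to 61 (no bias); the text does not state which implementation was used, so the function name and the handling of missing codon classes are illustrative choices only.</p>

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
DEGENERACY = Counter(AMINO)  # amino acid -> number of synonymous codons


def effective_number_of_codons(cds):
    """Wright's effective number of codons for one in-frame coding sequence."""
    seq = cds.upper()
    counts = defaultdict(Counter)
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3])
        if aa is not None and aa != "*":
            counts[aa][seq[i:i + 3]] += 1

    # Collect the homozygosity F per amino acid, grouped by degeneracy class.
    f_values = defaultdict(list)
    for aa, codon_counts in counts.items():
        k = DEGENERACY[aa]
        n = sum(codon_counts.values())
        if k == 1 or n == 1:
            continue  # Met, Trp, or too few observations
        sum_sq = sum((c / n) ** 2 for c in codon_counts.values())
        f = (n * sum_sq - 1) / (n - 1)
        if f > 0:
            f_values[k].append(f)

    mean_f = {k: sum(v) / len(v) for k, v in f_values.items()}
    # If isoleucine is unusable, approximate F3 by the mean of F2 and F4.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    missing = [k for k in (2, 3, 4, 6) if k not in mean_f]
    if missing:
        raise ValueError(f"too few codons for degeneracy classes {missing}")

    nc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(61.0, nc)  # values above 61 are set to 61
```

            <p>Because the estimate needs several observations per amino acid, it is not meaningful for very short sequences; the sketch raises an error rather than returning a value in that case.</p>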
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>The genome of the ancient land plant <it>Physcomitrella patens</it>, a moss, harbours genes of which at least 30% have a detectable homolog in seed plants. EST clustering yielded a database that covers a large proportion of the transcriptome; approximately 8% of the virtual transcripts contain a full-length CDS.</p>
         <p>Transcripts that are clear homologs of <it>Arabidopsis </it>genes were mapped against the <it>Arabidopsis </it>chromosomes, along with the set of paralogs and orthologs between the two organisms. All three sequence sets could be mapped evenly across the chromosomes, revealing neither hot nor cold spots (apart from centromeric regions) nor differences in gene density.</p>
         <p>While moss genes resemble those of <it>Arabidopsis</it>, there are significant differences. Introns are larger than those of the seed plant and are also longer than exons within moss, which is not the case in <it>Arabidopsis</it>. The G/C content of exons equals the A/T content. This might reflect a certain tenacity of the mosses' genetic system and its slow mutational rate. These might be necessary characteristics, as due to haploidy in the dominant gametophyte, the chance for the propagation of a disadvantageous change is higher than in a polyploid organism.</p>
         <p>Whereas orthologs display a codon fraction usage drift towards <it>Arabidopsis</it>, the contrary is the case for retained genes. Thus, evolution of codon usage seems to be correlated with evolutionary history of protein genes. Mutational bias does not seem to play a role in the evolution of moss coding sequences. While the majority of <it>Physcomitrella </it>CDS displays codon bias, there is a significant fraction (~15%) of genes that is not biased at all, possibly representing a more ancient nucleotide composition than oberved in <it>Arabidopsis</it>. Splice sites in the moss resemble those in <it>Arabidopsis</it>, however, species-specific prediction models, like the one presented here, are necessary in order to avoid false positives. The same is true for the prediction of ORF based on EST data.</p>
         <p>A high proportion of the sequence space of unicellular eukaryotes is covered by moss homologs, which apparently have not been lost since the days of the last common ancestor. The majority of moss genes find their best scoring homolog in plants. However, there are 134 putative retained genes that have their best BLAST hit among other taxonomic groups. Of those, 57 genes are specific to a single taxonomic group; putative functional annotation could be carried out for 25 of these proteins. The functional annotation revealed deviations in the taxonomic distributions: certain sets of genes seem to be shared with specific taxonomic groups, for example, transport proteins with fungi or signal transduction genes with bacteria and animals. Of special interest are genes that are possibly involved in cytotoxicity, metabolism and nucleic acid repair. These genes might account for some of the extraordinary capabilities of mosses, namely resistance against microbial pathogens, additional secondary pathways (as compared with seed plants) and a high rate of homologous recombination.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Clustering of EST data</p>
            </st>
            <p>All publicly available protein encoding DNA sequences of <it>Physcomitrella </it>were retrieved using Entrez <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and divided into 399 "seeds" (full length mRNA sequences) as well as 102,535 EST and other sequences. This dataset is called the <it>Physcomitrella patens </it>public set, or PPP.</p>
            <p>A set of 17 moss-specific repetitive elements, detected mainly in the untranslated regions of <it>Physcomitrella </it>genes <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and used for filtering (see below), is available via cosmoss.org <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Filtering, clustering and assembly of EST data were done using the Paracel transcript assembler, PTA <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. A species-specific parameter set has been developed and is available upon request.</p>
            <p>For sequences where electropherograms were available, base-calling was carried out using phred <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. Base quality values of EST sequences without available sequencer raw data were set arbitrarily to a low value, 10%, and in the case of seed sequences to a higher confidence value of 50%. Filtering included steps for removal of synthetic (vector/linker) and low quality sequences as well as of contaminants (homologs of <it>E. coli </it>as well as <it>Physcomitrella </it>mitochondrial, rRNA and chloroplast genes). Low-complexity regions were annotated together with poly-A tails, untranslated regions (UTR, UTRdb see <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>) and repetitive elements (repeats, repbase see <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>), so that they would not disturb clustering and assembly. In a final step, sequences containing fewer than 150 bases of sense characters were removed. For PPP, a total of 100,079 sequences went into the clustering.</p>
            <p>Prior to clustering, homologs of the seed sequences were pulled out of the sequence pool and assembled independently. Where possible, sequences were placed into 5' and 3' partitions based on detected poly-A tails and inherent annotated information. Both during clustering and assembly, putative chimeras (cloning artefacts) were detected and tagged. During assembly, contigs were built within clusters and putative splice variants detected. After clustering and assembly, the PPP set contained a total of 26,131 sequences. By using only the longest sequence in each cluster, a non-redundant set of 22,218 sequences was produced. The PP dataset contained 63,685 sequences in the complete and 48,961 sequences in the non-redundant set.</p>
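            <p>The reduction to a non-redundant set (longest sequence per cluster) can be illustrated with a minimal Python sketch. The function name, the record layout and the toy identifiers are hypothetical and are not part of the PTA software; ties are resolved here by keeping the first sequence encountered, which is an assumption.</p>

```python
def longest_per_cluster(records):
    """Return {cluster_id: (seq_id, sequence)}, keeping the longest sequence of each cluster.

    `records` is an iterable of (cluster_id, seq_id, sequence) tuples
    (hypothetical layout, chosen only for this illustration).
    """
    best = {}
    for cluster_id, seq_id, sequence in records:
        current = best.get(cluster_id)
        # Replace only when strictly longer, so the first sequence wins ties.
        if current is None or len(sequence) > len(current[1]):
            best[cluster_id] = (seq_id, sequence)
    return best


# Toy input: two contigs in cluster c1 and one singlet in cluster c2.
toy_records = [
    ("c1", "contig_a", "ATGGCC"),
    ("c1", "contig_b", "ATGGCCAAGT"),
    ("c2", "singlet_c", "ATGA"),
]
nonredundant = longest_per_cluster(toy_records)
```

            <p>In this toy example the two sequences of cluster c1 collapse to the longer one, so three assembled sequences yield two non-redundant entries.</p>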
         </sec>
         <sec>
            <st>
               <p>Splice site prediction</p>
            </st>
            <p>For the splice site prediction, all publicly available pairs of genomic and cDNA/mRNA sequences were retrieved (40 genes). Together with 29 unpublished sequences, these sequences were aligned using MGAlign 1.3.6 <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> in order to determine the splice sites. The procedure yielded a total of 438 exons and 368 introns. The complete dataset is available via cosmoss.org <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <p>The sequence logos (Fig. <figr fid="F5">5a</figr>) were created via the web interface at <abbrgrp><abbr bid="B38">38</abbr></abbrgrp> using 10 nucleotides up- and downstream of the donor / acceptor sites.</p>
            <p>Support vector machine: The software packages used for training and classification were SVMlight <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, libsvm <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> and svmsplice <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. The complete set of splice sites was divided into training/testing sets of sizes 10&#8211;90%; for each set size, three samples were drawn. The set containing 90% of the sites for training yielded the best results. Parameters were optimized by 10-fold cross-validation, plotting precision vs. recall and choosing the best curve. The best performing model was constructed using 50 nucleotides up- and downstream of the splice sites as context with the basepairing feature set of svmsplice and a polynomial kernel function of 4<sup>th </sup>order.</p>
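            <p>The repeated partitioning into training and testing sets can be sketched in Python as follows. This is an illustration only and does not reproduce the SVMlight, libsvm or svmsplice workflow; the function name, the random seed and the step of 10 percentage points between set sizes are assumptions, since only the range of training set sizes and the three samples per size are stated above.</p>

```python
import random


def draw_splits(sites, percents=range(10, 100, 10), samples=3, seed=0):
    """Draw `samples` random training/testing partitions for each training percentage.

    Returns a list of (percent, training_sites, testing_sites) tuples.
    The 10-point step between percentages is an assumption of this sketch.
    """
    rng = random.Random(seed)
    splits = []
    for percent in percents:
        n_train = round(len(sites) * percent / 100)
        for _ in range(samples):
            shuffled = list(sites)
            rng.shuffle(shuffled)
            # The first n_train shuffled sites train the model, the rest test it.
            splits.append((percent, shuffled[:n_train], shuffled[n_train:]))
    return splits


# Toy usage with 20 placeholder site identifiers.
toy_sites = [f"site_{i}" for i in range(20)]
toy_splits = draw_splits(toy_sites)
```

            <p>Each partition keeps training and testing sites disjoint, so a model trained on one part can be evaluated on sites it has not seen.</p>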
         </sec>
         <sec>
            <st>
               <p>BLAST searches and filtering</p>
            </st>
            <p>BLAST searches were carried out using Paracel BLAST <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, a parallelized version of BLAST 2 <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, at the amino acid level whenever applicable. In order to exclude random hits that are not based on true sequence homology, alignments had to contain at least 30% identical positions and have a minimum length of 100 amino acid characters. This rigorous filtering excludes some true positive hits but removes almost all false positives <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. Putative full-length CDS had to pass the same filtering. In addition, in this case only those hits were counted that covered at least 90% of the subject's length. For the determination of identical sequences, BLASTN was performed and hits were filtered to be at least 95% identical and 300 nucleotides long. Non-redundant hits were counted by removing all subjects that were present more than once in the search result.</p>
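            <p>The hit filters for the amino acid searches can be expressed as a short Python sketch. The Hit record and both function names are hypothetical, and computing subject coverage as alignment length divided by subject length is an assumption; the thresholds (30% identity, 100 aligned positions, 90% coverage) are those given above.</p>

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit (hypothetical record layout for this illustration)."""
    query: str
    subject: str
    percent_identity: float  # identical positions in the alignment, in percent
    alignment_length: int    # aligned amino acid positions
    subject_length: int      # full length of the subject protein


def passes_homology_filter(hit, min_identity=30.0, min_length=100):
    """Keep hits with at least 30% identity over at least 100 aligned positions."""
    return hit.percent_identity >= min_identity and hit.alignment_length >= min_length


def passes_full_length_filter(hit, min_coverage=0.90):
    """Additionally require the alignment to cover at least 90% of the subject.

    Coverage is approximated as alignment length / subject length (assumption).
    """
    coverage = hit.alignment_length / hit.subject_length
    return passes_homology_filter(hit) and coverage >= min_coverage
```

            <p>For example, a hit with 40% identity over 100 positions of a 200-residue subject would count as a homolog but not as a putative full-length CDS under these assumptions.</p>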
         </sec>
         <sec>
            <st>
               <p>Additional sequence datasets</p>
            </st>
            <p>The predicted coding sequences of rice (56,056 sequences) and <it>Arabidopsis </it>(28,581 sequences) genes were taken from release 1.0 and 4.0 of the TIGR database <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, respectively. The taxprot dataset (3,346,100 sequences, see Table <tblr tid="T1">1</tblr> for details) was created by downloading the respective sequences from Genbank <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> using appropriate Entrez queries. All three datasets consist of amino acid sequences. The <it>Arabidopsis thaliana </it>chromosome sequences were retrieved from Genbank <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>ORF prediction</p>
            </st>
            <p>ESTScan 2.0 <abbrgrp><abbr bid="B46">46</abbr></abbrgrp> was used to predict open reading frames. The species-specific model for <it>Physcomitrella </it>was built by using the 399 public full length seed sequences (complete mRNAs) mentioned above. ORF were predicted from the clustered EST data (non-redundant datasets). For the PPP set, 19,081 ORF were predicted; 34,981 for the PP set. Predictions were done using both the <it>Arabidopsis </it>and the <it>Physcomitrella </it>model for comparison. Manual inspection of several known CDS revealed that the <it>Arabidopsis</it>-based prediction contained false-positive stretches, which was not the case for the <it>Physcomitrella</it>-based prediction. Although the <it>Physcomitrella </it>model predicted a lower number of ORF, it was used in order to keep false-positives to a minimum.</p>
         </sec>
         <sec>
            <st>
               <p>Codon usage</p>
            </st>
            <p>Four different sets of coding sequences were used (see Table <tblr tid="T3">3</tblr>). A set of 7,765 well annotated <it>Arabidopsis </it>mRNAs was retrieved using Entrez. The <it>Physcomitrella </it>datasets contained the predicted ORF for the complete PPP set (19,081 sequences), the <it>Arabidopsis </it>paralogs (1,659) and orthologs (1,476) described above as well as the putatively retained genes not found in higher plants (134). The smallest set contained 77,998 nucleotides and thus a theoretical average of 406 instances of each triplet, which allows significant analyses.</p>
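            <p>The theoretical average for the smallest set can be checked with a few lines of Python. The variable names are illustrative; the calculation simply divides the number of complete triplets by the 64 possible codons.</p>

```python
# Smallest coding sequence set: 77,998 nucleotides (value taken from the text).
nucleotides = 77_998

# Number of complete triplets contained in that many nucleotides.
codons = nucleotides // 3

# Expected count per triplet if all 64 codons were used equally often.
average_per_triplet = codons / 64

print(codons, round(average_per_triplet))
```

            <p>Under uniform usage this gives 25,999 codons and about 406 instances per triplet, matching the figure stated above.</p>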
            <p>Nucleotide frequencies were calculated with the program composition of the GCG 10.3 <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> software package. Codon usage fractions for individual datasets were calculated as percentages of the respective total number of counted codons. Absolute deviations in comparison to the full <it>Physcomitrella </it>ORF set were calculated for the three subsets (retained genes, <it>Arabidopsis </it>orthologs and paralogs). The computed mean value over all sets (average absolute deviation) was 0.069. Codon fraction usage deviation was counted as significant only if it differed from the full set by at least twice this value (+/- 0.138%).</p>
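            <p>The calculation of codon usage fractions and the significance threshold can be sketched in Python as follows. This is not the GCG software; the function names are hypothetical, all in-frame triplets (including stop codons) are counted, ambiguous bases are not handled, and the default threshold of 0.138 percentage points is the value given above.</p>

```python
from collections import Counter


def codon_fractions(cds_list):
    """Return the percentage of each codon among all codons counted in a set of CDS."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        # Count complete in-frame triplets only; a trailing partial codon is ignored.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: 100.0 * n / total for codon, n in counts.items()}


def significant_deviations(subset, full, threshold=0.138):
    """Return codons whose fraction in `subset` deviates from `full` by at least `threshold`.

    Both arguments are dictionaries as returned by codon_fractions;
    deviations are signed (subset minus full), in percentage points.
    """
    codons = set(subset) | set(full)
    deviations = {c: subset.get(c, 0.0) - full.get(c, 0.0) for c in codons}
    return {c: d for c, d in deviations.items() if abs(d) >= threshold}


# Toy usage with made-up sequences (not data from this study).
full_set = codon_fractions(["ATGGCCGCC", "ATGGCTGCC"])
sub_set = codon_fractions(["ATGGCTGCT"])
flagged = significant_deviations(sub_set, full_set)
```

            <p>In the toy example only GCC and GCT are flagged, because the ATG fraction is identical in both sets.</p>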
            <p>The effective number of codons (enc) was calculated using CodonW (J. Peden, <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>). The enc values range from 20 (maximum bias, i.e. only one synonymous codon is used per amino acid) to 61 (no bias, all synonymous codons are being used).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>CDS = coding sequence(s), EST = expressed sequence tag(s), ORF = open reading frame(s)</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>SAR carried out most of the analyses, drafted the manuscript and designed the work. DF carried out the splice site prediction. DL participated in the analyses and generated the databases and the web interface. RR participated in drafting and conception of the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We would like to thank Sven Degroeve (University of Gent, Belgium) for providing svmsplice and assisting with its use, as well as Hans Sten&#248;ien, Colette Matthewman and several anonymous reviewers for helpful comments on the manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Why don't mosses flower?</p>
            </title>
            <aug>
               <au>
                  <snm>Theissen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>M&#252;nster</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Henschel</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>New Phytologist</source>
            <pubdate>2001</pubdate>
            <volume>150</volume>
            <fpage>1</fpage>
            <lpage>8</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1046/j.1469-8137.2001.00089.x</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Tertiary and quaternary fossils</p>
            </title>
            <aug>
               <au>
                  <snm>Miller</snm>
                  <fnm>ND</fnm>
               </au>
            </aug>
            <source>New manual of Bryology</source>
            <publisher>Miyazaki: Hattori Bot Lab</publisher>
            <editor>Schuster RM</editor>
            <pubdate>1984</pubdate>
            <volume>2</volume>
            <fpage>1194</fpage>
            <lpage>1232</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Moose &#8211; lebende Fossilien</p>
            </title>
            <aug>
               <au>
                  <snm>Frahm</snm>
                  <fnm>J-P</fnm>
               </au>
            </aug>
            <source>BuZ</source>
            <pubdate>1994</pubdate>
            <volume>24</volume>
            <issue>3</issue>
            <fpage>120</fpage>
            <lpage>124</lpage>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Molecular evolution and phylogeny of the atpB-rbcL spacer of chloroplast DNA in the true mosses</p>
            </title>
            <aug>
               <au>
                  <snm>Chiang</snm>
                  <fnm>TY</fnm>
               </au>
               <au>
                  <snm>Schaal</snm>
                  <fnm>BA</fnm>
               </au>
            </aug>
            <source>Genome</source>
            <pubdate>2000</pubdate>
            <volume>43</volume>
            <issue>3</issue>
            <fpage>417</fpage>
            <lpage>426</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1139/gen-43-3-417</pubid>
                  <pubid idtype="pmpid" link="fulltext">10902703</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>An improved and highly standardised transformation procedure allows efficient production of single and multiple targeted gene-knockouts in a moss, <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Hohe</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Egener</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Lucht</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Holtorf</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Reinhard</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Schween</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Curr Genet</source>
            <pubdate>2004</pubdate>
            <volume>44</volume>
            <issue>6</issue>
            <fpage>339</fpage>
            <lpage>347</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00294-003-0458-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">14586556</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Quick guide: <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Cove</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Curr Biology</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>R261</fpage>
            <lpage>R262</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.cub.2004.03.016</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>A New Moss Genetics: Targeted Mutagenesis in <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Schaefer</snm>
                  <fnm>DG</fnm>
               </au>
            </aug>
            <source>Annual Review of Plant Physiology</source>
            <pubdate>2002</pubdate>
            <volume>53</volume>
            <fpage>477</fpage>
            <lpage>501</lpage>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Comparative genomics of <it>Physcomitrella patens </it>gametophytic transcriptome and <it>Arabidopsis thaliana</it>: implication for land plant evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Nishiyama</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Fujita</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Shin</snm>
                  <fnm>IT</fnm>
               </au>
               <au>
                  <snm>Seki</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Nishide</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Uchiyama</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Kamiya</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Carninci</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hayashizaki</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Shinozaki</snm>
                  <fnm>K</fnm>
               </au>
               <etal/>
            </aug>
            <source>Proceedings of the National Academy of Sciences of the United States of America</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <issue>13</issue>
            <fpage>8007</fpage>
            <lpage>8012</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">164703</pubid>
                  <pubid idtype="pmpid" link="fulltext">12808149</pubid>
                  <pubid idtype="doi">10.1073/pnas.0932694100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Moss transcriptome and beyond</p>
            </title>
            <aug>
               <au>
                  <snm>Rensing</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Rombauts</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Trends in Plant Science</source>
            <pubdate>2002</pubdate>
            <volume>7</volume>
            <issue>12</issue>
            <fpage>535</fpage>
            <lpage>538</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1360-1385(02)02363-4</pubid>
                  <pubid idtype="pmpid" link="fulltext">12475493</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p><it>Physcomitrella patens </it>is highly tolerant against drought, salt and osmotic stress</p>
            </title>
            <aug>
               <au>
                  <snm>Frank</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Ratnadewi</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Planta</source>
            <pubdate>2005</pubdate>
            <volume>220</volume>
            <fpage>384</fpage>
            <lpage>394</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00425-004-1351-1</pubid>
                  <pubid idtype="pmpid" link="fulltext">15322883</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Abiotic stress response in the moss <it>Physcomitrella patens</it>: evidence for an evolutionary alteration in signaling pathways in land plants</p>
            </title>
            <aug>
               <au>
                  <snm>Kroemer</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Frank</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Plant Cell Reports</source>
            <pubdate>2004</pubdate>
            <volume>22</volume>
            <issue>11</issue>
            <fpage>864</fpage>
            <lpage>870</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00299-004-0785-z</pubid>
                  <pubid idtype="pmpid" link="fulltext">15034746</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Two RpoT genes of <it>Physcomitrella patens </it>encode phage-type RNA polymerases with dual targeting to mitochondria and plastids</p>
            </title>
            <aug>
               <au>
                  <snm>Richter</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Kiessling</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hedtke</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Decker</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Borner</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Weihe</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>2002</pubdate>
            <volume>290</volume>
            <issue>1&#8211;2</issue>
            <fpage>95</fpage>
            <lpage>105</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(02)00583-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">12062804</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Dual targeting of plastid division protein FtsZ to chloroplasts and the cytoplasm</p>
            </title>
            <aug>
               <au>
                  <snm>Kiessling</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Gremillon</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Rensing</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Nick</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sarnighausen</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Decker</snm>
                  <fnm>EL</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>EMBO Rep</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <issue>9</issue>
            <fpage>889</fpage>
            <lpage>894</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/sj.embor.7400238</pubid>
                  <pubid idtype="pmpid" link="fulltext">15319781</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Identification of a novel delta 6-acyl-group desaturase by targeted gene disruption in <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Girke</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Schmidt</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Zahringer</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Heinz</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>The Plant Journal</source>
            <pubdate>1998</pubdate>
            <volume>15</volume>
            <issue>1</issue>
            <fpage>39</fpage>
            <lpage>48</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-313X.1998.00178.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">9744093</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Functional knockout of the adenosine 5'-phosphosulfate reductase gene in <it>Physcomitrella patens </it>revives an old route of sulfate assimilation</p>
            </title>
            <aug>
               <au>
                  <snm>Koprivova</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Schween</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Herschbach</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Kopriva</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Journal of Biological Chemistry</source>
            <pubdate>2002</pubdate>
            <volume>277</volume>
            <issue>35</issue>
            <fpage>32195</fpage>
            <lpage>32201</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/jbc.M204971200</pubid>
                  <pubid idtype="pmpid" link="fulltext">12070175</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Cloning and functional characterisation of an enzyme involved in the elongation of Delta6-polyunsaturated fatty acids from the moss <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Zank</snm>
                  <fnm>TK</fnm>
               </au>
               <au>
                  <snm>Zahringer</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Beckmann</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Pohnert</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Boland</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Holtorf</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Lerchl</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Heinz</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>The Plant Journal</source>
            <pubdate>2002</pubdate>
            <volume>31</volume>
            <issue>3</issue>
            <fpage>255</fpage>
            <lpage>268</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-313X.2002.01354.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">12164806</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Large-scale taxonomic profiling of eukaryotic model organisms: a comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Mushegian</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Garey</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>LX</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <issue>6</issue>
            <fpage>590</fpage>
            <lpage>598</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9647634</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Automatic clustering of orthologs and in-paralogs from pairwise species comparisons</p>
            </title>
            <aug>
               <au>
                  <snm>Remm</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Storm</snm>
                  <fnm>CE</fnm>
               </au>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2001</pubdate>
            <volume>314</volume>
            <issue>5</issue>
            <fpage>1041</fpage>
            <lpage>1052</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2000.5197</pubid>
                  <pubid idtype="pmpid" link="fulltext">11743721</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Cloning of the PpMSH-2 cDNA of <it>Physcomitrella patens</it>, a moss in which gene targeting by homologous recombination occurs at high frequency</p>
            </title>
            <aug>
               <au>
                  <snm>Brun</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Gonneau</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Doutriaux</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Laloue</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Nogue</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Biochimie</source>
            <pubdate>2001</pubdate>
            <volume>83</volume>
            <issue>11&#8211;12</issue>
            <fpage>1003</fpage>
            <lpage>1008</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0300-9084(01)01350-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">11879728</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Isolation of cDNAs encoding typical and novel types of phosphoinositide-specific phospholipase C from the moss <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Mikami</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Repp</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Graebe-Abts</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Hartmann</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Journal of Experimental Botany</source>
            <pubdate>2004</pubdate>
            <volume>55</volume>
            <issue>401</issue>
            <fpage>1437</fpage>
            <lpage>1439</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/jxb/erh140</pubid>
                  <pubid idtype="pmpid" link="fulltext">15073208</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Calmodulin-binding proteins in bryophytes: identification of abscisic acid-, cold-, and osmotic stress-induced genes encoding novel membrane-bound transporter-like proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Takezawa</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Minami</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Biochemical and Biophysical Research Communications</source>
            <pubdate>2004</pubdate>
            <volume>317</volume>
            <issue>2</issue>
            <fpage>428</fpage>
            <lpage>436</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.bbrc.2004.03.052</pubid>
                  <pubid idtype="pmpid" link="fulltext">15063776</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>The moss <it>Physcomitrella patens</it> releases a tetracyclic diterpene</p>
            </title>
            <aug>
               <au>
                  <snm>Von Schwartzenberg</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Schultze</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kassner</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Plant Cell Reports</source>
            <pubdate>2004</pubdate>
            <volume>22</volume>
            <issue>10</issue>
            <fpage>780</fpage>
            <lpage>786</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00299-004-0754-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">14963693</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Plant nuclear gene knockout reveals a role in plastid division for the homolog of the bacterial cell division protein FtsZ, an ancestral tubulin</p>
            </title>
            <aug>
               <au>
                  <snm>Strepp</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Scholz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kruse</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Speth</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proceedings of the National Academy of Sciences of the United States of America</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <issue>8</issue>
            <fpage>4368</fpage>
            <lpage>4373</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">22495</pubid>
                  <pubid idtype="pmpid" link="fulltext">9539743</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.8.4368</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>A tool for understanding homologous recombination in plants</p>
            </title>
            <aug>
               <au>
                  <snm>Hohe</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Plant Cell Reports</source>
            <pubdate>2003</pubdate>
            <volume>21</volume>
            <issue>12</issue>
            <fpage>1135</fpage>
            <lpage>1142</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00299-003-0644-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">12910366</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Splice site prediction in <it>Arabidopsis thaliana</it> pre-mRNA by combining local and global sequence information</p>
            </title>
            <aug>
               <au>
                  <snm>Hebsgaard</snm>
                  <fnm>SM</fnm>
               </au>
               <au>
                  <snm>Korning</snm>
                  <fnm>PG</fnm>
               </au>
               <au>
                  <snm>Tolstrup</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Engelbrecht</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rouze</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1996</pubdate>
            <volume>24</volume>
            <issue>17</issue>
            <fpage>3439</fpage>
            <lpage>3452</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146109</pubid>
                  <pubid idtype="pmpid" link="fulltext">8811101</pubid>
                  <pubid idtype="doi">10.1093/nar/24.17.3439</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Intron-exon structures of eukaryotic model organisms</p>
            </title>
            <aug>
               <au>
                  <snm>Deutsch</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Long</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <issue>15</issue>
            <fpage>3219</fpage>
            <lpage>3228</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148551</pubid>
                  <pubid idtype="pmpid" link="fulltext">10454621</pubid>
                  <pubid idtype="doi">10.1093/nar/27.15.3219</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Diversification of ftsZ during early land plant evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Rensing</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Kiessling</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Decker</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2004</pubdate>
            <volume>58</volume>
            <issue>2</issue>
            <fpage>154</fpage>
            <lpage>162</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-003-2535-1</pubid>
                  <pubid idtype="pmpid" link="fulltext">15042335</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Identification and prevention of a GC content bias in SAGE libraries</p>
            </title>
            <aug>
               <au>
                  <snm>Margulies</snm>
                  <fnm>EH</fnm>
               </au>
               <au>
                  <snm>Kardia</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Innis</snm>
                  <fnm>JW</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <issue>12</issue>
            <fpage>E60</fpage>
            <lpage>60</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">55759</pubid>
                  <pubid idtype="pmpid" link="fulltext">11410683</pubid>
                  <pubid idtype="doi">10.1093/nar/29.12.e60</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Adaptive basis of codon usage in the haploid moss <it>Physcomitrella patens</it></p>
            </title>
            <aug>
               <au>
                  <snm>Stenoien</snm>
                  <fnm>HK</fnm>
               </au>
            </aug>
            <source>Heredity</source>
            <pubdate>2005</pubdate>
            <volume>94</volume>
            <fpage>87</fpage>
            <lpage>93</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/sj.hdy.6800547</pubid>
                  <pubid idtype="pmpid" link="fulltext">15483656</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>NCBI Entrez</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/Entrez</url>
         </bibl>
         <bibl id="B31">
            <title>
               <p>In silico prediction of UTR repeats using clustered EST data</p>
            </title>
            <aug>
               <au>
                  <snm>Rensing</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Lang</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Reski</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Proceedings of the German Conference on Bioinformatics: 2003</source>
            <publisher>Munich, Germany: Belleville Verlag Michael Farin</publisher>
            <pubdate>2003</pubdate>
            <fpage>117</fpage>
            <lpage>122</lpage>
         </bibl>
         <bibl id="B32">
            <title>
               <p>cosmoss.org</p>
            </title>
            <url>http://www.cosmoss.org</url>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Paracel</p>
            </title>
            <url>http://www.paracel.com</url>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Base-calling of automated sequencer traces using phred. I. Accuracy assessment</p>
            </title>
            <aug>
               <au>
                  <snm>Ewing</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hillier</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Wendl</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <issue>3</issue>
            <fpage>175</fpage>
            <lpage>185</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9521921</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Databases of mRNA untranslated regions for metazoa</p>
            </title>
            <aug>
               <au>
                  <snm>Pesole</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Grillo</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Liuni</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Computers &amp; Chemistry</source>
            <pubdate>1996</pubdate>
            <volume>20</volume>
            <issue>1</issue>
            <fpage>141</fpage>
            <lpage>144</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0097-8485(96)80016-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">8867845</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Repbase update: a database and an electronic journal of repetitive elements</p>
            </title>
            <aug>
               <au>
                  <snm>Jurka</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Trends in Genetics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <issue>9</issue>
            <fpage>418</fpage>
            <lpage>420</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(00)02093-X</pubid>
                  <pubid idtype="pmpid" link="fulltext">10973072</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>MGAlignIt: A web service for the alignment of mRNA/EST and genomic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>BT</fnm>
               </au>
               <au>
                  <snm>Tan</snm>
                  <fnm>TW</fnm>
               </au>
               <au>
                  <snm>Ranganathan</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <issue>13</issue>
            <fpage>3533</fpage>
            <lpage>3536</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">168968</pubid>
                  <pubid idtype="pmpid" link="fulltext">12824360</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg561</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Sequence logo</p>
            </title>
            <url>http://www.cbs.dtu.dk/~gorodkin/appl/slogo.html</url>
         </bibl>
         <bibl id="B39">
            <title>
               <p>SVMlight</p>
            </title>
            <url>http://svmlight.joachims.org</url>
         </bibl>
         <bibl id="B40">
            <title>
               <p>libsvm</p>
            </title>
            <url>http://www.csie.ntu.edu.tw/~cjlin/libsvm</url>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Feature subset selection for splice site prediction</p>
            </title>
            <aug>
               <au>
                  <snm>Degroeve</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>De Baets</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Van De Peer</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Rouze</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>Suppl 2</issue>
            <fpage>S75</fpage>
            <lpage>S83</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">12385987</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <issue>17</issue>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Twilight zone of protein sequence alignments</p>
            </title>
            <aug>
               <au>
                  <snm>Rost</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Protein Eng</source>
            <pubdate>1999</pubdate>
            <volume>12</volume>
            <issue>2</issue>
            <fpage>85</fpage>
            <lpage>94</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/protein/12.2.85</pubid>
                  <pubid idtype="pmpid" link="fulltext">10195279</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B44">
            <title>
               <p>TIGR</p>
            </title>
            <url>http://www.tigr.org</url>
         </bibl>
         <bibl id="B45">
            <title>
               <p>GenBank eukaryotic genomes</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html</url>
         </bibl>
         <bibl id="B46">
            <title>
               <p>ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Iseli</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Jongeneel</snm>
                  <fnm>CV</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>International Conference on Intelligent Systems for Molecular Biology: 1999</source>
            <pubdate>1999</pubdate>
            <fpage>138</fpage>
            <lpage>148</lpage>
         </bibl>
         <bibl id="B47">
            <title>
               <p>Accelrys</p>
            </title>
            <url>http://www.accelrys.com</url>
         </bibl>
         <bibl id="B48">
            <title>
               <p>CodonW</p>
            </title>
            <url>http://www.molbiol.ox.ac.uk/cu/</url>
         </bibl>
      </refgrp>
   </bm>
</art>
