<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-8-308</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>Analysis of the role of retrotransposition in gene evolution in vertebrates</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Yu</snm>
               <fnm>Zhan</fnm>
               <insr iid="I1"/>
               <email>zyu9@po-box.mcgill.ca</email>
            </au>
            <au id="A2">
               <snm>Morais</snm>
               <fnm>David</fnm>
               <insr iid="I1"/>
               <email>david.delimamorais@mail.mcgill.ca</email>
            </au>
            <au id="A3">
               <snm>Ivanga</snm>
               <fnm>Mahine</fnm>
               <insr iid="I1"/>
               <email>mahine.ivanga@crchul.ulaval.ca</email>
            </au>
            <au id="A4" ca="yes">
               <snm>Harrison</snm>
               <mi>M</mi>
               <fnm>Paul</fnm>
               <insr iid="I1"/>
               <email>paul.harrison@mcgill.ca</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Biology, McGill University, Stewart Biology Building, 1205 Docteur Penfield Ave., Montreal, QC, H3A 1B1 Canada</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>1</issue>
         <fpage>308</fpage>
         <url>http://www.biomedcentral.com/1471-2105/8/308</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17718914</pubid>
               <pubid idtype="doi">10.1186/1471-2105-8-308</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>07</day>
               <month>4</month>
               <year>2007</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>24</day>
               <month>8</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>24</day>
               <month>8</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Yu et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The dynamics of gene evolution are influenced by several genomic processes. One such process is retrotransposition, where an mRNA transcript is reverse-transcribed and reintegrated into the genomic DNA.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We have surveyed eight vertebrate genomes (human, chimp, dog, cow, rat, mouse, chicken and the puffer-fish <it>T. nigroviridis</it>) for putatively retrotransposed copies of genes. To gain a complete picture of the role of retrotransposition, a robust strategy to identify putative retrogenes (<it>PRs</it>) was derived, in tandem with an adaptation of previous procedures to annotate processed pseudogenes, also called retropseudogenes (<it>R&#968;Gs</it>). Mammalian genomes are estimated to contain 400&#8211;800 <it>PRs </it>(corresponding to ~3% of genes), with fewer <it>PRs </it>and <it>R&#968;Gs </it>in the non-mammalian vertebrates. Focussing on human and mouse, we aged the <it>PRs</it>, analysed them for evidence of transcription and selection pressures, and assigned functional categories. Compared to the proteome in general, the <it>PRs </it>have significantly less transcription evidence mappable to them, are significantly less likely to arise from alternatively-spliced genes, and are statistically overrepresented for ribosomal-protein genes. We find evidence for spurts of gene retrotransposition in human and mouse since the lineage of either species split from the dog lineage, with >200 <it>PRs </it>formed in mouse since its divergence from rat. To test for selection, we calculated: <it>(i) </it>K<sub>a</sub>/K<sub>s </sub>values (ratios of non-synonymous to synonymous substitution rates in codons), and <it>(ii) </it>the significance of conservation of reading frames in <it>PRs</it>. In both human and mouse, we found >50 <it>PRs </it>formed since divergence from dog that are under pressure to maintain the integrity of their coding sequences. For different subsets of <it>PRs </it>formed at different stages of mammalian evolution, we find some evidence for non-neutral evolution, despite significantly less expression evidence for these sequences.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>These results indicate that retrotranspositions are a significant source of novel coding sequences in mammalian gene evolution.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="bmc" subtype="user_supplied_xml" id="endnote"/>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Genes are subject to many different processes that give rise to novel sequences, such as segmental and local duplication, gene conversion, and retrotransposition. The extent to which these different processes contribute to gene evolution is unclear. In the present paper, we focus on the phenomenon of <it>gene retrotransposition</it>. Retrotransposition entails the reverse transcription of an mRNA transcript and the subsequent re-integration of the resulting cDNA into genomic DNA, in germ-line cells <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. There is substantial genomic evidence for large-scale retrotransposition of mRNAs in mammalian genomes, from detection of thousands of apparent retropseudogenes in human, mouse and rat <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Such retropseudogenes (<it>R&#968;Gs</it>) are decayed or disabled gene sequence copies (typically bearing frameshifts and stop codons) that demonstrate the hallmark characteristics of retrotransposition, namely lack of introns of the parental gene, and also 3' polyadenine tails, if formed more recently <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. Other features include short direct repeats flanking the sequence (for young retrotranspositions) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, frequent 5' truncations, and genomic location different from that of the parent gene <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. It has been demonstrated experimentally that <it>R&#968;G</it>s can be formed through the action of LINE-1 reverse transcriptases <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. The computational comparison of LINEs and <it>R&#968;G</it>s also supports the generation of <it>R&#968;G</it>s by LINEs <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. The poly(A) tails and frequent truncations found at the 5' end in the <it>R&#968;G</it>s are typical for LINEs <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. 
Moreover, LINEs and <it>R&#968;G</it>s share similar structures, including a common TT|AAAA insertion motif <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>.</p>
         <p>Since the substantial majority of these retrosequences bear disablements (frameshifts and stop codons), or have codon substitution patterns indicative of decay <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B5">5</abbr><abbr bid="B3">3</abbr></abbrgrp>, gene retrotransposition appears generally to lead to non-functional sequences that decay in the genomic DNA as evolution progresses <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B2">2</abbr><abbr bid="B9">9</abbr></abbrgrp>. However, even though the promoters of these gene retrosequences are not transferred, a small minority of them appear to be transcribed <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. For the human genome, there is a small population of at least ~200 transcribed processed pseudogenes, which show signs of lacking coding ability despite evidence of transcription, and are significantly likely to be found near other genes (as would be expected if they are co-opting promoters) <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>.</p>
         <p>Generation of a new functional gene is also a possible outcome of retrotransposition <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. There is an increasing number of transcribed, functionally characterized genes in mammalian and invertebrate animal genomes reported to bear the characteristics of retrosequences <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. Over ninety such <it>retrogenes </it>have been annotated in the human and mouse genomes <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Most of the functional retrogenes identified are expressed in testis and may have provided important raw material for rapid testis evolution in primates <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.</p>
         <p>Here, to derive an overview of the role of <it>gene retrotransposition </it>in the genome evolution of vertebrates, and particularly mammals, we derive and apply a robust procedure to annotate gene retrotranspositions, built on our previous analyses of retropseudogenes <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B3">3</abbr><abbr bid="B2">2</abbr><abbr bid="B14">14</abbr></abbrgrp>. Our strategy incorporates a new rapid procedure for annotating retrocopies in the genomic DNA, in tandem with a pipeline to identify them in existing gene annotations. This pipeline for annotating putative retrogenes (<it>PRs</it>) incorporates aging of the sequences through evolutionary rate analysis relative to putative parents and their orthologs, as well as analysis of the chromosomal milieus of these sequences and their putative parents. We find evidence for, on average, several hundred <it>PRs </it>in each proteome. Focussing on human and mouse, we find evidence for spurts of gene retrotransposition in both species since their divergence from dog. A small number (>50) of <it>PRs </it>that have formed in both mouse and human since divergence from dog show signs of being under selection to maintain their coding sequences.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Genome data</p>
            </st>
            <p>The genome sequences and annotations of seven of the organisms analyzed in this paper (human, dog, cow, mouse, rat, chicken and <it>Tetraodon nigroviridis</it>) were downloaded from the Ensembl Web site <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> in January 2005. Version 2.1 of the chimpanzee assembly (downloaded in April 2006) was also used. Putative retrogenes (<it>PRs</it>) were identified in the annotated proteomes of the eight vertebrates using the pipeline in Figure <figr fid="F1">1</figr>. This procedure is described in detail below. In tandem, putative retropseudogenes (<it>R&#968;Gs</it>), and additional <it>PRs </it>outside of current protein annotations, were assigned using a modification of previous procedures (Figures <figr fid="F1">1</figr> &amp; <figr fid="F2">2</figr>) <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B2">2</abbr></abbrgrp>. Genes from which <it>PR</it>s and <it>R&#968;G</it>s are thought to have originated are called <it>parent genes</it>.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Pipeline summarizing the annotation of PRs and retropseudogenes</p>
               </caption>
               <text>
                  <p><b>Pipeline summarizing the annotation of PRs and retropseudogenes</b>. The pipeline for PR annotation is summarized. An inset at the bottom summarizes the tests for local gene order and chromosomal milieu.</p>
               </text>
               <graphic file="1471-2105-8-308-1"/>
            </fig>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Rapid annotation of retropseudogenes</p>
               </caption>
               <text>
                  <p><b>Rapid annotation of retropseudogenes</b>. (1) TBLASTN matches (e-value &#8804; 10<sup>-4</sup>) of the annotated proteome against the genomic DNA are sorted by coordinates and collated for each protein to form a set of matches {M}. (2) The sets {M} are filtered using length-based heuristics. (3) Each protein is realigned to the genomic DNA using FASTY, and the best-matching proteins at each point that have disablements and that match >70% of the length of the parent sequence are picked as retropseudogene annotations.</p>
               </text>
               <graphic file="1471-2105-8-308-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Rapid identification of retropseudogenes (<it>R&#968;Gs</it>)</p>
            </st>
            <p>Retropseudogenes were annotated on the Ensembl genome versions used in the present analysis, using a rapid adaptation of previous procedures for identifying retropseudogenes (summarised in Figure <figr fid="F2">2</figr>) <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B5">5</abbr><abbr bid="B14">14</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Identification of putative retrotransposed genes (<it>PRs</it>)</p>
            </st>
            <p>(1) <it>Homology detection</it>: Each proteome was compared against itself using BLAST to find similarities with e-value &#8804; 10<sup>-4 </sup><abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Any match to a potential pseudogene contaminant in the proteome annotations was removed (Figure <figr fid="F1">1</figr>).</p>
            <p>(2) <it>Exon seam analysis</it>: Exon boundary information for each protein was extracted from the appropriate Ensembl genome annotation files. The positioning of exon boundaries in encoded protein sequences, <it>i.e</it>., 'exon seams', was then deduced <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Using the positions of the exon seams, the BLAST matches between proteins were filtered to pick out alignments between a protein encoded by a multiple-exon gene and a single exon of another gene. To define <it>PRs</it>, the length of the exon was required to be between 0.9 and 1.1 times the full length of the multiple-exon protein. (This is stricter than the criterion of 0.7 of the length used for annotation of retropseudogenes <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.)</p>
            <p><it>(3) Assignment of parent genes: </it>To assign parent genes to <it>PRs</it>, we calculated substitution rates at synonymous codon sites (<it>i.e</it>., Ks values) for all matches to <it>PRs </it>using the package PAML, for all instances where the amino-acid sequence identity for the pair of sequences is &#8805; 70%. The sequence with the smallest Ks value was chosen as the 'parent gene'. For sequence identities &lt;70%, saturation of substitutions is likely <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>, and so the sequence with the highest BLAST bitscore in alignment with the <it>PR </it>was chosen as the putative parent.</p>
            <p>(4) <it>Additional filtering: </it><it>PRs </it>were also discarded if they matched olfactory receptors (ORs) with BLAST e-value &#8804; 10<sup>-4 </sup>over &#8805; 0.5 of the length of the OR, since recent ORs have probably originated through mechanisms other than retrotransposition <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. Olfactory receptor sequences were taken from ORDB <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>.</p>
            <p>(5) <it>Local gene order test for similarity of the chromosomal milieus for PRs and parent genes: </it>To check that the <it>PRs </it>did not arise from local or segmental duplication, we derived a 'local gene order' test. For this test, we compared the chromosomal milieus of <it>PRs </it>and parents for significant similarity, as follows. Proteins encoded by genes adjacent to <it>PRs </it>in the chromosomes were BLASTed against the corresponding proteins from genes adjacent to putative parents, for a given window (<it>w</it><sub><it>genes</it></sub>) of number of genes in either direction (5' and 3'). A <it>w</it><sub><it>genes </it></sub>size of 7 (the <it>PR </it>or parent, plus 3 genes in either direction), with an allowance for one gap between the positions of matches within <it>w</it><sub><it>genes</it></sub>, was found to be suitable. The number of significant homologous matches <it>N</it><sub><it>homologs </it></sub>(BLAST e-value &#8804; 10<sup>-4</sup>, sequence identity >40%, and match &#8805; 0.6 length of both <it>PR </it>and parent) between the milieus of <it>PR </it>and parent was tallied. An expected benchmark distribution for <it>N</it><sub><it>homologs </it></sub>was derived for the chromosomal milieus of 1,000 randomly-sampled pairs of proteins that have any significant BLAST match to each other (e-value &#8804; 10<sup>-4</sup>). From examination of this distribution, we found that 80% of such random pairs have <it>N</it><sub><it>homologs </it></sub>&lt;1, and 87% have <it>N</it><sub><it>homologs </it></sub>&#8804; 1. We thus chose <it>N</it><sub><it>homologs </it></sub>= 1, as a suitable threshold for local similarity arising from duplication of genomic DNA. However, the results differ little if a threshold of <it>N</it><sub><it>homologs </it></sub>= 0 is used. This procedure was applied to the genomes of the human and mouse. 
Interestingly, application of this criterion resulted in the exclusion of many sequences with large individual exons (<it>PRs </it>with <it>FLE</it><sub><it>parent </it></sub>&#8805; 0.8, where <it>FLE</it><sub><it>parent </it></sub>is the fraction of the parent's length taken up by its largest exon, as defined below; 36/86 = 42% of those excluded) that may be false positives in our data set of <it>PRs</it>. A large fraction of these sequences (68%) have long, tandem arrays of Zn-finger domains covering more than a third of their sequences (Additional Figure <figr fid="F1">1</figr>).</p>
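            <p>As an illustration only, the length criterion of step (2) and the parent assignment of step (3) can be sketched as follows. This is a minimal sketch and not the pipeline used in this study: the <it>Match </it>record, its field names and both function names are hypothetical, and the sketch assumes that the bitscore fallback applies only when no match reaches 70% identity.</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Match:
    """One BLAST match of a candidate PR to a multiple-exon protein (hypothetical record)."""
    subject_id: str          # the multiple-exon protein (potential parent)
    exon_length: int         # length of the single matching exon, in residues
    protein_length: int      # full length of the multiple-exon protein
    identity: float          # amino-acid sequence identity, 0 to 1
    bitscore: float          # BLAST bitscore of the alignment
    ks: Optional[float]      # Ks from PAML, or None if it was not computed


def passes_length_filter(m: Match, low: float = 0.9, high: float = 1.1) -> bool:
    """Step (2): the exon must be 0.9 to 1.1 times the length of the protein."""
    ratio = m.exon_length / m.protein_length
    return high >= ratio >= low


def choose_parent(matches: List[Match], identity_cutoff: float = 0.70) -> Optional[Match]:
    """Step (3): take the smallest Ks among matches at or above the identity
    cutoff; otherwise (assumed here) fall back to the highest bitscore."""
    reliable = [m for m in matches
                if m.identity >= identity_cutoff and m.ks is not None]
    if reliable:
        return min(reliable, key=lambda m: m.ks)
    if matches:
        return max(matches, key=lambda m: m.bitscore)
    return None
```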
         </sec>
         <sec>
            <st>
               <p>Additional filtering and annotation</p>
            </st>
            <p>The following additional analysis was performed on the <it>PRs</it>:</p>
            <p><it>(i) Fraction of largest exon in parent: </it>We calculated the fraction of the length of any parent gene that is taken up by its largest exon. This is denoted <it>FLE</it><sub><it>parent</it></sub>. We found no particular tendency for the parents of <it>PRs </it>to have a single large exon (which would yield a tendency for high FLE values) [Additional Figure <figr fid="F1">1</figr>].</p>
            <p><it>(ii) Overlap with retropseudogene annotations: </it>Retropseudogenes were annotated on the Ensembl genome versions used in the present analysis, using the rapid adaptation of previous procedures for identifying retro(pseudo)genes described above. Any <it>PR </it>that overlapped one of these annotations was flagged (Table <tblr tid="T1">1</tblr>).</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Overview of gene retrotransposition analysis for eight vertebrates</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Species</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Number of genes *</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Number of PRs</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>PR matches retrotransposed TE **</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>PR overlaps pseudogene annotation</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>FLE &#8804; 0.8 ***</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Number of retro- pseudogenes (<it>R&#968;Gs</it>)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>PRs passing local gene order test</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Matching Refseq mRNA or Unigene consensus ****</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Human</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>22219</p>
                     </c>
                     <c ca="left">
                        <p>631 (3%)</p>
                     </c>
                     <c ca="left">
                        <p>78 (12%)</p>
                     </c>
                     <c ca="left">
                        <p>36 (6%)</p>
                     </c>
                     <c ca="left">
                        <p>504</p>
                     </c>
                     <c ca="left">
                        <p>2493</p>
                     </c>
                     <c ca="left">
                        <p>545</p>
                     </c>
                     <c ca="left">
                        <p>145/631 (23%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Chimp</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>20980</p>
                     </c>
                     <c ca="left">
                        <p>476 (2%)</p>
                     </c>
                     <c ca="left">
                        <p>17 (4%)</p>
                     </c>
                     <c ca="left">
                        <p>5 (1%)</p>
                     </c>
                     <c ca="left">
                        <p>339</p>
                     </c>
                     <c ca="left">
                        <p>1889</p>
                     </c>
                     <c ca="left">
                        <p>----</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Dog</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>18199</p>
                     </c>
                     <c ca="left">
                        <p>409 (2%)</p>
                     </c>
                     <c ca="left">
                        <p>18 (4%)</p>
                     </c>
                     <c ca="left">
                        <p>25 (6%)</p>
                     </c>
                     <c ca="left">
                        <p>363</p>
                     </c>
                     <c ca="left">
                        <p>3505</p>
                     </c>
                     <c ca="left">
                        <p>----</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Cow</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>23147</p>
                     </c>
                     <c ca="left">
                        <p>790 (3%)</p>
                     </c>
                     <c ca="left">
                        <p>46 (6%)</p>
                     </c>
                     <c ca="left">
                        <p>104 (13%)</p>
                     </c>
                     <c ca="left">
                        <p>479</p>
                     </c>
                     <c ca="left">
                        <p>1996</p>
                     </c>
                     <c ca="left">
                        <p>----</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Mouse</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>25021</p>
                     </c>
                     <c ca="left">
                        <p>663 (3%)</p>
                     </c>
                     <c ca="left">
                        <p>31 (5%)</p>
                     </c>
                     <c ca="left">
                        <p>75 (11%)</p>
                     </c>
                     <c ca="left">
                        <p>518</p>
                     </c>
                     <c ca="left">
                        <p>2969</p>
                     </c>
                     <c ca="left">
                        <p>533</p>
                     </c>
                     <c ca="left">
                        <p>58/663 (9%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Rat</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>22157</p>
                     </c>
                     <c ca="left">
                        <p>567 (3%)</p>
                     </c>
                     <c ca="left">
                        <p>21 (4%)</p>
                     </c>
                     <c ca="left">
                        <p>62 (11%)</p>
                     </c>
                     <c ca="left">
                        <p>492</p>
                     </c>
                     <c ca="left">
                        <p>4520</p>
                     </c>
                     <c ca="left">
                        <p>----</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Chicken</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>17707</p>
                     </c>
                     <c ca="left">
                        <p>321 (2%)</p>
                     </c>
                     <c ca="left">
                        <p>15 (5%)</p>
                     </c>
                     <c ca="left">
                        <p>26 (8%)</p>
                     </c>
                     <c ca="left">
                        <p>267</p>
                     </c>
                     <c ca="left">
                        <p>720</p>
                     </c>
                     <c ca="left">
                        <p>----</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Tetraodon </it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>28005</p>
                     </c>
                     <c ca="left">
                        <p>227 (1%)</p>
                     </c>
                     <c ca="left">
                        <p>4 (2%)</p>
                     </c>
                     <c ca="left">
                        <p>10 (4%)</p>
                     </c>
                     <c ca="left">
                        <p>203</p>
                     </c>
                     <c ca="left">
                        <p>644</p>
                     </c>
                     <c ca="left">
                        <p>----</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>* The number of gene annotations (both those labelled 'known' and 'novel') for the genome version downloaded from Ensembl [15]; see <it>Methods </it>for details.</p>
                  <p>** TE = transposable element; see <it>Methods </it>for details.</p>
                  <p>*** FLE = fraction of largest exon; see <it>Methods </it>for details.</p>
                  <p>**** Number of PRs with complete Refseq mRNAs or complete Unigene consensus sequences (percentage of these in brackets); see <it>Methods </it>for details. Significantly less of this transcription evidence maps to PRs than to the whole annotated proteome for these organisms.</p>
               </tblfn>
            </tbl>
            <p><it>(iii) Filtering for potential transposable elements (TEs): </it>Each proteome was compared using TBLASTN to libraries of transposable elements taken from the RepeatMasker distribution <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, using an e-value threshold of &#8804; 10<sup>-4</sup>. Any proteins containing SINEs, or near-complete matches to LINEs (&#8805; 0.8 of their lengths), were labeled as potentially TE-containing.</p>
            <p><it>(iv) Whether single-exon gene or multiple-exon gene: </it>The <it>PRs </it>were labeled as either <it>single-exon genes </it>or part of <it>multiple-exon genes</it>.</p>
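            <p>The <it>FLE </it>measure of item (i) above reduces to a one-line calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the length of a gene is taken as the sum of its exon lengths, which the text does not state explicitly.</p>

```python
from typing import Sequence


def fraction_largest_exon(exon_lengths: Sequence[int]) -> float:
    """Return the share of the summed exon length taken up by the largest exon.

    Assumption: gene length is the sum of exon lengths (all in the same units).
    """
    if not exon_lengths or 0 >= min(exon_lengths):
        raise ValueError("expected at least one exon, all with positive length")
    return max(exon_lengths) / sum(exon_lengths)
```

            <p>Under this definition a single-exon gene has an FLE of 1, and a gene with many exons of equal size has a low FLE.</p>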
         </sec>
         <sec>
            <st>
               <p>Orthologs</p>
            </st>
            <p>Orthologs of parent genes were identified using the bi-directional best hits method, using BLAST (e-value &#8804; 10<sup>-4</sup>, amino-acid sequence identity &#8805; 40%, and requiring the alignment to cover &#8805; 0.6 of the length of each sequence). The bi-directional best hits method is a common procedure for guarding against mistaking paralogs for orthologs.</p>
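            <p>A minimal sketch of bi-directional best hits with the thresholds given above is shown here for illustration. The <it>Hit </it>record and all function names are hypothetical, and ranking hits by bitscore is an assumption, since the text does not say which score defines the best hit.</p>

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Hit:
    """One BLAST hit between proteins of two genomes (hypothetical record)."""
    query: str
    subject: str
    evalue: float
    identity: float           # amino-acid identity, 0 to 1
    query_coverage: float     # fraction of the query covered by the alignment
    subject_coverage: float   # fraction of the subject covered by the alignment
    bitscore: float


def passes_thresholds(h: Hit) -> bool:
    """E-value at most 1e-4, identity at least 40%, coverage at least 0.6 of both."""
    return (1e-4 >= h.evalue and h.identity >= 0.40
            and h.query_coverage >= 0.6 and h.subject_coverage >= 0.6)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its best-scoring subject among hits passing the thresholds."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if not passes_thresholds(h):
            continue
        current = best.get(h.query)
        if current is None or h.bitscore > current.bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def bidirectional_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Dict[str, str]:
    """Keep pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if backward.get(b) == a}
```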
         </sec>
         <sec>
            <st>
               <p>Analysis of K<sub>s </sub>and K<sub>a</sub>/K<sub>s </sub>values, and derivation of genome- and lineage-specific gene lists</p>
            </st>
            <p>The package PAML <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> was used to calculate maximum-likelihood Ka, Ks and Ka/Ks values for pairs of sequences (either <it>PR </it>versus computed ancestral sequences, or <it>PR </it>versus parent). In addition, branch-specific maximum-likelihood Ka/Ks values were calculated for three-way alignments of <it>PR</it>, parent and the parent's ortholog from another closely related species.</p>
            <p>We applied three different strategies based on analysis of Ks, to determine lists of genome-specific and lineage-specific <it>PRs</it>. For example, for the human genome, we calculated <it>human-specific </it>lists relative to the chimpanzee genome. Also, we calculated <it>primate-specific </it>lists for human <it>plus </it>chimpanzee, relative to a mammalian outgroup, such as dog or cow. To determine genome-specific lists of <it>PRs</it>, we investigated each of the following three methods ("<it>parent's ortholog</it>" refers to the ortholog of the parent in the most closely related genome):</p>
            <p><it>(1) </it>The distribution of Ks values for orthologous genes in the two organisms was calculated, and the median value <it>m </it>derived from this. If Ks [<it>PR</it>&#8592;&#8594;<it>parent</it>] &lt;<it>m </it>and Ks [<it>parent </it>&#8592;&#8594;<it>parent's ortholog</it>] ><it>m</it>, then a <it>PR </it>is labeled genome-specific;</p>
            <p><it>(2) </it>Secondly, a <it>PR </it>could be labeled genome-specific if Ks [<it>PR </it>&#8592;&#8594;<it>parent</it>] &lt; Ks [<it>parent </it>&#8592;&#8594;<it>parent's ortholog</it>];</p>
            <p><it>(3) </it>Thirdly, a <it>PR </it>could be labeled genome-specific if, in a three-way tree of <it>PR</it>, parent and parent's ortholog, the branch-specific Ks [<it>PR</it>] is &lt; (Ks [<it>parent</it>] + Ks [<it>parent's ortholog</it>])/2.</p>
            <p>Lineage-specific lists were derived in a similar fashion. Additional Figure <figr fid="F2">2</figr> shows how these three methods overlap with each other. Based on the overlaps observed, we used Method (3) for further analysis.</p>
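            <p>The three labelling rules listed above reduce to simple comparisons of Ks values, which can be written out as a short Python sketch. The function and argument names are hypothetical and chosen only for illustration; the Ks values themselves would come from a separate maximum-likelihood calculation.</p>
            <p>
```python
from statistics import median


def method1(ks_pr_parent, ks_parent_ortholog, ortholog_ks_values):
    """Rule (1): compare both Ks values with the median Ks of orthologous genes."""
    m = median(ortholog_ks_values)
    return m > ks_pr_parent and ks_parent_ortholog > m


def method2(ks_pr_parent, ks_parent_ortholog):
    """Rule (2): the PR is closer to its parent than the parent is to its ortholog."""
    return ks_parent_ortholog > ks_pr_parent


def method3(ks_branch_pr, ks_branch_parent, ks_branch_ortholog):
    """Rule (3): branch-specific Ks of the PR is below the mean of the other two branches."""
    return (ks_branch_parent + ks_branch_ortholog) / 2 > ks_branch_pr
```
            </p>
            <p>Each function returns True when a <it>PR </it>would be labeled genome-specific under the corresponding rule.</p>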
         </sec>
         <sec>
            <st>
               <p>Analysis of reading frame conservation</p>
            </st>
            <p>We assessed the reading frame conservation (RFC) in sequences using simulations of insertion and deletion governed by power-law insertion/deletion (indel) statistics <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Power-law statistics for indels were extracted from recently-formed <it>R&#968;G</it>s having &#8805; 85% amino-acid identity with their parent sequences. Power-law relationships were fitted, omitting points for any indels of size 3<it>n</it>, with <it>n </it>any positive integer. Expected ratios for insertions versus deletions were taken from these data; the expected number of indels per nucleotide substitution for several mammals was culled from the literature <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>. The program DNADIST <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> was used to calculate the nucleotide-level divergence of the PRs from ancestral sequences (calculated using PAML <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>; see section on Ka/Ks analysis above). This divergence value was used as a target in simulations. For each <it>PR</it>, 1000 simulations of the evolution of the ancestral sequence towards the present day were performed using a Kimura two-parameter model. In each case, the resulting simulated protein coding sequence was checked for frame disablements (stop codons and frameshifts). <it>PR </it>sequences whose simulations yielded frame-disrupted sequences &#8805; 99% of the time were labeled as having significant RFC.</p>
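            <p>The final two steps of this procedure (checking a simulated sequence for disablements, and applying the 99% criterion) can be illustrated with a short Python sketch. It does not reproduce the substitution or indel simulation itself. The function names are hypothetical, and the sketch makes the simplifying assumption that any indel whose length is not a multiple of three counts as a frameshift, ignoring the possibility of compensating indels.</p>
            <p>
```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_frame_disrupted(cds, indel_sizes=()):
    """True if a simulated coding sequence carries a frameshift or premature stop."""
    # Simplifying assumption: a single indel of length not divisible by 3
    # is enough to call a frameshift.
    if any(size % 3 != 0 for size in indel_sizes):
        return True
    # Scan in-frame codons, excluding the final (terminal) codon position.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    return any(codon.upper() in STOP_CODONS for codon in codons)


def has_significant_rfc(disrupted_flags, min_fraction=0.99):
    """Apply the criterion from the text to the outcomes of repeated simulations."""
    flags = list(disrupted_flags)
    return sum(flags) / len(flags) >= min_fraction
```
            </p>
            <p>Here <it>disrupted_flags </it>would hold one True/False outcome per simulation (1000 per <it>PR </it>in the procedure described above).</p>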
         </sec>
         <sec>
            <st>
               <p>Assignment of functional categories</p>
            </st>
            <p>GO (Gene Ontology; <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>) functional categories were taken from the annotation files provided on the Ensembl <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> and Gene Ontology websites <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Further GO term annotations were derived by mapping functional GO annotations for the PDB (also downloaded from the GO website) onto Ensembl protein annotations, with thresholds of 50% sequence identity and 0.8 fractional sequence coverage (for the protein domain), based on alignments made by the program BLASTP (e-value &#8804; 0.0001) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. These thresholds were benchmarked on the complete SCOP protein domain sequence database <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, to give a 2% false positive rate for GO term transfer.</p>
         </sec>
         <sec>
            <st>
               <p>Mapping of cDNAs/mRNAs</p>
            </st>
            <p>Refseq mRNAs and complete Unigene consensus sequences were downloaded from the NCBI website <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, for both human and mouse. These were mapped to the coding sequences of Ensembl gene annotations, using blastn (e-value &#8804; 1 &#215; 10<sup>-10 </sup>for alignments &#8805; 100 nucleotides) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. All mappings that match with &#8805; 99% sequence identity over &#8805; 0.99 of the sequence length of the cDNA or mRNA, after removal of any polyadenylation, were retained. Further restricting the analysis to cDNA/mRNA mappings that do not match their putative parent sequences with >95% sequence identity does not change the trends in transcription evidence reported below.</p>
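            <p>The retention rule for mappings can be sketched in Python as below. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, and the minimum length of an adenine run treated as a poly(A) tail (<it>min_run</it>) is an assumed value that the text does not specify.</p>
            <p>
```python
import re


def strip_poly_a(seq, min_run=10):
    """Remove a trailing run of at least min_run adenines (min_run is assumed)."""
    return re.sub(r"A{%d,}$" % min_run, "", seq.upper())


def keep_mapping(identity_pct, aligned_length, transcript_seq,
                 min_identity=99.0, min_coverage=0.99):
    """Apply the identity and coverage thresholds quoted in the text."""
    trimmed_length = len(strip_poly_a(transcript_seq))
    if trimmed_length == 0:
        return False
    # Coverage is measured against the transcript after poly(A) removal.
    coverage = aligned_length / trimmed_length
    return identity_pct >= min_identity and coverage >= min_coverage
```
            </p>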
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>The pipelines for annotating the complement of gene retrotranspositions (both <it>retropseudogenes </it>(<it>R&#968;G</it>s) and putative <it>retrogenes </it>(<it>PRs</it>)) were applied to eight vertebrates. In particular, we focused on the mammals, to analyse the ages of putative retrogenes (<it>PRs</it>), to derive genome- and lineage-specific lists and to check for spurts of gene retrotransposition activity. We then examined the <it>PRs </it>for evidence of transcription (mRNA and cDNA mapping), involvement in alternative splicing, selection pressures (significant Ka/Ks values and reading-frame conservation), and functional categorizations of parent genes.</p>
         <sec>
            <st>
               <p>Overview of gene retrotranspositions in vertebrates</p>
            </st>
            <p>Our analysis suggests that up to ~3% of the genes encoded in a vertebrate genome contain a <it>PR </it>(Table <tblr tid="T1">1</tblr>), with the smallest percentages in the chicken and puffer fish <it>T. nigroviridis</it>. By comparison, mammalian genomes have ~2,000&#8211;5,000 retropseudogenes (<it>R&#968;G</it>s) that have at least 70% of the coding sequence of their parent genes, again with smaller numbers in non-mammal vertebrates (just 644 <it>R&#968;G</it>s in <it>T. nigroviridis</it>) (Table <tblr tid="T1">1</tblr>). Together, these results indicate that recent gene-retrotransposition activity has been lower in the two non-mammal vertebrates. These observations tally well with other evidence for less retrotransposition activity in chicken and <it>Tetraodon</it>. In chicken, there appear to be few or no SINEs <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>, and only ~8% of the genomic DNA consists of the CR1 ('chicken repeat 1') LINE-1 <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>, whose reverse transcriptase is thought not to copy polyadenylated mRNAs <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. In <it>Tetraodon</it>, retrotransposons make up &lt;1% of the genome, so gene retrotransposition should consequently be less likely <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. For the eight genomes studied here, there are no significant linear correlations between the number of genes <it>or </it>proteins in a genome and the number of <it>PRs </it><it>or </it><it>R&#968;G</it>s (data not shown). Small percentages of the <it>PRs </it>could be classified as homologs of retrotransposed transposable elements, such as LINEs (2&#8211;12%), or as overlapping pseudogene annotations (4&#8211;13%).</p>
            <p>As described in detail in <it>Methods</it>, we applied a 'local gene order' test, to set aside any <it>PRs </it>that may have arisen through local or segmental duplication, specifically for the human and mouse genomes (Table <tblr tid="T1">1</tblr>). This filter allows for at most one homologous protein encoded within a window of +/-3 genes along the genomic DNA (<it>i.e</it>., N<sub><it>homologs </it></sub>&#8804; 1) (Table <tblr tid="T1">1</tblr>). The substantial majority of human and mouse <it>PRs </it>pass this filter (80&#8211;87%).</p>
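            <p>The window-based filter just described can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that genes on a chromosome have been numbered in order along the genomic DNA and that the positions of genes encoding homologous proteins are already known; the function and argument names are hypothetical.</p>
            <p>
```python
def passes_local_gene_order(pr_index, homolog_indices, window=3, max_homologs=1):
    """Count homologs within +/- window genes of the PR and apply the threshold.

    pr_index: position of the PR in the chromosome's gene order.
    homolog_indices: positions of homologous genes on the same chromosome.
    """
    n_homologs = sum(
        1
        for i in homolog_indices
        if i != pr_index and window >= abs(i - pr_index)
    )
    return max_homologs >= n_homologs
```
            </p>
            <p>Setting <it>max_homologs </it>to 0 gives the stricter form of the test, in which no homolog may lie within the window.</p>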
         </sec>
         <sec>
            <st>
               <p>Ages of primate and rodent gene retrotranspositions</p>
            </st>
            <p>How old are these <it>PRs</it>? Is there any evidence for spurts of gene retrotransposition activity in mammalian evolution? To answer these questions, we examined the distribution of Ks values for <it>PRs </it>compared to their assigned parent genes, in the human and mouse genomes. (Only <it>PRs </it>passing the <it>local gene order </it>test, with threshold N<sub><it>homologs </it></sub>&#8804; 1, were analysed.) Ks is the number of synonymous substitutions per synonymous site in codons, and is widely used to date coding sequences. By comparing Ks values for <it>PRs</it>, their parents, and orthologs of their parents, we have also been able to derive lists of genome-specific and lineage-specific <it>PRs </it>(Figure <figr fid="F4">4</figr>; see <it>Methods </it>for details).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Lineage-specific lists of PRs</p>
               </caption>
               <text>
                  <p><b>Lineage-specific lists of PRs</b>: The number of species-specific PRs relative to other species. PRs specific relative to another species were obtained by comparing the <it>Ks </it>between the PR and its parent (<it>Ks</it><sub><it>PR_parent</it></sub>) with the <it>Ks </it>between the parent and the ortholog of the parent in the other species (<it>Ks</it><sub><it>parent_ortholog</it></sub>). PRs with <it>Ks</it><sub><it>PR_parent </it></sub>&lt;<it>Ks</it><sub><it>parent_ortholog </it></sub>were defined as specific PRs relative to the other species. Only PRs whose amino-acid identity to their parents is more than 70%, and that have an ortholog in the other species, were subjected to this calculation. The orthology criteria used are 40% identity over 60% length overlap. 'Human-specific' and 'Chimp-specific' PRs are those formed since the species diverged from each other; similarly, for 'Mouse-specific' and 'Rat-specific' PRs. 'Other primate-specific' are any other PRs formed in human or chimp since divergence from dog (in <b>bold </b>typeface), or from cow (in <it>italic </it>typeface); similarly, for 'Other rodent-specific'.</p>
               </text>
               <graphic file="1471-2105-8-308-4"/>
            </fig>
            <p>In human <it>PRs</it>, we see that there is a bimodal distribution of Ks (Figure <figr fid="F3">3A</figr>). The median Ks values for lists of <it>PRs </it>that are <it>human-specific </it>or that have otherwise been formed since divergence from dog are labelled on the Ks histogram. The peak at Ks ~0.06&#8211;0.08 corresponds approximately to the median Ks value for <it>PRs </it>formed between human divergence from dog and from chimpanzee. This peak has been noted previously in analyses of retropseudogenes and total retrosequence populations <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B12">12</abbr></abbrgrp>, peaking at approximately the point of human lineage divergence from the New World Monkeys, ~40 million years ago <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. The peak at 0.0&#8211;0.02 (containing 21% of <it>PRs</it>) corresponds to human-specific <it>PRs</it>. Some of these <it>PRs </it>may simply be too young to be distinguished as either a <it>PR </it>or a retropseudogene: they have not been around long enough to acquire (apparent) reading-frame disablements. Evidence for selection pressures on these sequences is discussed below.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Ks distributions:(A) Ks distribution for human PRs meeting the local gene order test with threshold of N<sub><it>homologs </it></sub>= 0, from comparison to their parent sequences</p>
               </caption>
               <text>
                  <p><b>Ks distributions: (A) </b>Ks distribution for human PRs meeting the local gene order test with threshold of N<sub><it>homologs </it></sub>= 0, from comparison to their parent sequences. Labelled are the median values for the 'Human-specific' set, and those PRs formed between divergence from dog and from chimp [see panel (C)]. A similar distribution is observed with an N<sub><it>homologs </it></sub>threshold of &#8804; 1 for the local gene order test. <b>(B) </b>Ks distribution for mouse PRs meeting the local gene order test with threshold of N<sub><it>homologs </it></sub>= 0, from comparison to their parent sequences. Labelled are the median values for the 'Mouse-specific' set, and those PRs formed between divergence from dog and from chimp [see panel (C)]. A similar distribution is observed with an N<sub><it>homologs </it></sub>threshold of &#8804; 1 for the local gene order test.</p>
               </text>
               <graphic file="1471-2105-8-308-3"/>
            </fig>
            <p>By comparison, in the rodents, there is more very recent gene retrotransposition activity. In mouse, we find proportionately more genome-specific <it>PRs </it>(relative to rat), with 44% having Ks &#8804; 0.02 (Figure <figr fid="F3">3B</figr>). In the two rodents, mouse and rat, there are >200 genome-specific <it>PRs</it>, compared to ~40 in each of the primates human and chimp. However, setting aside genome-specific examples, there are more gene retrotranspositions appearing in the primate lineage since its divergence from the dog or cow lineage (Figure <figr fid="F4">4</figr>).</p>
            <p>These observations are in keeping with the apparent maintenance of greater levels of LINE and SINE retrotransposition activity in the rodents <abbrgrp><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>; they also tally well with previous observations of a general fall-off in such retrotransposition activity in the primate lineage <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B33">33</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Transcription evidence</p>
            </st>
            <p>Focussing on human and mouse, we examined the proportion of <it>PRs </it>that could be mapped to a complete Unigene consensus cDNA or a complete Refseq mRNA from the NCBI (<abbrgrp><abbr bid="B27">27</abbr></abbrgrp>; see <it>Methods </it>for details). For both organisms, we found that significantly less of this transcription evidence maps to the <it>PRs </it>than to the whole proteome (P &lt; 0.001, using the z-score for distribution of the sample mean). In human, only 23% of <it>PRs </it>mapped to a Refseq mRNA or Unigene cDNA consensus sequence (compared to 41% for the whole proteome). This may be due to lower transcription levels, because they are novel gene sequences using co-opted promoter elements at a site distal to the genomic location of their parent genes <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. This general reduction in transcription is also to be expected, if some of the sequences are recent pseudogenes without disablements.</p>
            <p>In addition, we examined how many <it>PRs </it>arise from alternatively-spliced genes. To do this, we cross-referenced the PR data with alternative splicings classified in the Alternative Splicing Database (ASD) at the EBI <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. We found that 24% of human PRs arose from an alternatively spliced gene, whereas 59% of genes overall are alternatively spliced (significantly less, P &lt; 0.001 using the z-score for distribution of the sample mean). A significant reduction in representation from alternatively-spliced genes was also observed in mouse (4%, compared to 29% overall).</p>
            <p>We examined the divergence of <it>PRs </it>from their putative parents, in the context of transcription evidence. This is illustrated in Figure <figr fid="F5">5</figr> for those <it>PRs </it>that pass the local gene order test for both mouse and human, with N<sub><it>homologs </it></sub>&#8804; 1. In human, there is a marked difference in the behaviour of transcribed <it>PRs </it>(purple bars in Figure <figr fid="F5">5A</figr>), compared to those without transcription evidence (blue bars in Figure <figr fid="F5">5A</figr>). There are relatively few transcribed <it>PRs </it>with high sequence identities (<it>i.e</it>., that formed relatively recently). The bimodal character of these plots may arise because some of the PRs: (i) are in a younger population of <it>PRs </it>that are not under selection pressures, but which, simply by chance, have not accumulated deleterious mutations; <it>i.e</it>., they are pseudogenes without disablements; <it>or </it>(ii) are in a state of relaxed selection, and thus concomitantly have low transcription levels. Similarly, only a small fraction of the <it>PRs </it>calculated to have formed since divergence from the dog lineage (Figure <figr fid="F4">4</figr>), in either human (15/207, 7%) or mouse (20/292, 7%), are transcribed (significantly fewer, P &lt; 0.05 using &#967;<sup>2 </sup>tests for both cases).</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Distributions of percentage protein sequence identity between PRs and parents</p>
               </caption>
               <text>
                  <p><b>Distributions of percentage protein sequence identity between PRs and parents</b>. <b>(A) </b>Distribution of % protein sequence identity for all human PRs that pass the local gene order test (N<sub><it>homologs </it></sub>&#8804; 1). These are broken down into 'transcribed' and 'not transcribed'. <b>(B) </b>The fraction that are transcribed in each bin of the histogram in panel A. <b>(C) </b>Distribution of % protein sequence identity for all mouse PRs that pass the local gene order test (N<sub><it>homologs </it></sub>&#8804; 1). These are broken down into 'transcribed' and 'not transcribed'. <b>(D) </b>The fraction that are transcribed in each bin of the histogram in panel C.</p>
               </text>
               <graphic file="1471-2105-8-308-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Ka/Ks and reading frame conservation (RFC) analysis</p>
            </st>
            <p>Is there any evidence for selection pressures on the <it>PRs </it>in human and mouse? We investigated this question for <it>PRs </it>that have been formed in human and mouse, since their divergences from dog. One standard indicator of selection pressures is the Ka/Ks ratio. This is the number of non-synonymous mutations per non-synonymous site, over the number of synonymous mutations per synonymous site, in codons. Negative (or 'purifying') selection in a specific lineage is indicated by a value significantly &lt;1.0, whereas positive ('diversifying') selection is indicated by a value significantly >1.0. We calculated Ka/Ks values for <it>PRs </it>relative to ancestral sequences for the parents of the <it>PRs </it>(see <it>Methods </it>for details). We tested whether any of these Ka/Ks values were significantly &lt; or >1.0 by generating 500 random pairs of sequences as diverged as the PR and ancestral sequence (to calculate expected means and standard deviations for the Ka/Ks values), and then deriving a P-value for the observed Ka/Ks.</p>
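            <p>One way to turn the simulated means and standard deviations into a P-value is a z-score under a normal approximation, sketched below in Python. The normal approximation and the two-sided form of the test are assumptions made for this illustration, since the text does not state how the P-value was derived; the function name is hypothetical.</p>
            <p>
```python
from statistics import NormalDist, mean, stdev


def kaks_p_value(observed, simulated):
    """Return (z, two-sided P) for an observed Ka/Ks against simulated values.

    simulated: Ka/Ks values from random sequence pairs at matched divergence.
    Assumption: simulated values are treated as approximately normal.
    """
    mu = mean(simulated)
    sigma = stdev(simulated)
    z = (observed - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p
```
            </p>
            <p>A strongly negative z would point towards purifying selection and a strongly positive z towards diversifying selection, before any correction for multiple hypothesis testing.</p>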
            <p>Strikingly, when we correct for multiple hypothesis testing in the Ka/Ks calculations, we find only one <it>PR </it>sequence (formed since divergence from dog) that is under significant selection at the codon level in the human genome, and none in the mouse genome. (The one significant human example is a PR under purifying selection, from a family of proteins with the GTP-binding SAR1 domain.)</p>
            <p>In addition, we calculated the distribution of Ka/Ks values from directly comparing the <it>PRs versus </it>their parents. From this specific sort of comparison, the neutral expectation for Ka/Ks is not ~1, because of non-synonymous mutations accumulating in the parent genes <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. A significant excess of Ka/Ks values &lt;0.5, however, may be indicative of purifying selection in the data set. For comparison, we also similarly calculated a Ka/Ks distribution for <it>R&#968;G</it>s <it>versus </it>their respective parents, carefully parsing out disablements (frameshifts and premature stop codons) from the <it>R&#968;G </it>sequences. This Ka/Ks distributional analysis is performed for both human and mouse (Figure <figr fid="F6">6</figr>). For human (Figure <figr fid="F6">6A</figr>), we find no significant excess of <it>PRs </it>with Ka/Ks values &lt;0.5 relative to <it>R&#968;G</it>s, either for the whole data set of <it>PRs</it>, or for the subset formed in the primate lineage, contrary to a previous report <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> (&#967;<sup>2 </sup>test or Fisher's exact test). This distribution is thus consistent with a set of largely neutral retrotranspositions, behaving like <it>R&#968;G</it>s. However, for mouse-specific <it>PRs</it>, there is a significant excess with Ka/Ks &lt;0.5 (P &#8804; 0.001, &#967;<sup>2 </sup>test and Fisher's exact test) (Figure <figr fid="F6">6B</figr>), indicating that some of these mouse <it>PR </it>sequences are under purifying selection.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Ka/Ks distributions for PRs and for retropseudogenes (<it>R&#968;G </it>s)</p>
               </caption>
               <text>
                  <p><b>Ka/Ks distributions for PRs and for retropseudogenes (<it>R&#968;G </it>s)</b>. <b>(A) </b>Distribution of Ka/Ks for human PRs (n = 262) meeting the local gene order test (N<sub><it>homologs </it></sub>&#8804; 1), compared to Ka/Ks for the <it>R&#968;G</it>s (n = 183). All sequences were required to have protein sequence identity &#8805; 60.0% with their parent sequences. <b>(B) </b>As in (A), but for mouse PRs (n = 318) and <it>R&#968;G</it>s (n = 220).</p>
               </text>
               <graphic file="1471-2105-8-308-6"/>
            </fig>
            <p>Conservation of open reading frames without disablements (frameshifts or stop codons) can also be an indicator of coding ability <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>. We derived a method for assessing significant conservation of open reading frames, using simulation with power-law insertion/deletion (indel) statistics <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Using simulations with calculated neutral rates of substitution, insertion and deletion, the likelihood of conservation of an open reading frame without interruption by frameshifts and stop codons can be determined (see <it>Methods </it>for details). To give sufficient power, a P-value threshold of &#8804; 0.01 was used as an indicator for significant reading-frame conservation (RFC). This calculation is complementary to the Ka/Ks analysis.</p>
            <p>The results are listed in Table <tblr tid="T2">2</tblr>. Even though there were no significant Ka/Ks values in mouse, we find over 30 mouse-specific <it>PRs </it>that have significant reading frame conservation, and a further 17 that were formed since divergence from dog (Table <tblr tid="T2">2</tblr>). In human, we find a total of 59 <it>PRs </it>with significant RFC that have arisen since divergence from dog (Table <tblr tid="T2">2</tblr>). A phylogenetic tree for an example of one of the mouse PRs with significant RFC, which is homologous to citrate synthase, is depicted in Additional Figure <figr fid="F3">3</figr>, with a depiction of the chromosomal milieu of this PR and of its parent in Additional Figure <figr fid="F4">4</figr>. Two further representative examples of human PRs are also shown in Additional Figure <figr fid="F4">4</figr> (one with significant RFC and the other without), with varying degrees of age and transcription evidence.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Results of analysis for reading-frame conservation (RFC)</p>
               </caption>
               <tblbdy cols="3">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Species</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Set &#8224;</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Number with significantly conserved reading frame &#8224;&#8224;</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Human</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Human-specific relative to chimp</p>
                     </c>
                     <c ca="left">
                        <p>10/40 (25%)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Others in human, that were formed since divergence from dog</p>
                     </c>
                     <c ca="left">
                        <p>49/171 (29%)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Other older PRs</p>
                     </c>
                     <c ca="left">
                        <p>162/378 (43%)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>TOTAL</p>
                     </c>
                     <c ca="left">
                        <p>221/589 (38%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Mouse</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Mouse-specific relative to rat</p>
                     </c>
                     <c ca="left">
                        <p>35/240 (15%)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Others in mouse, that were formed since divergence from dog</p>
                     </c>
                     <c ca="left">
                        <p>17/58 (29%)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Other older PRs</p>
                     </c>
                     <c ca="left">
                        <p>123/233 (53%)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>TOTAL</p>
                     </c>
                     <c ca="left">
                        <p>175/531 (33%)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>&#8224; These are the sets illustrated in Figure 4. All of the PRs that pass the local gene order test, allowing for N<sub><it>homologs </it></sub>&#8804; 1 were analysed for reading-frame conservation.</p>
                  <p>&#8224;&#8224; For PRs formed since divergence from dog, RFC simulations were performed using ancestral sequences calculated using PAML [20]. For PRs formed before divergence from dog, RFC simulations were conservatively performed to half of the nucleotide-level divergence between the PR and its assigned parent sequence.</p>
               </tblfn>
            </tbl>
            <p>These results are evidence for conservation of protein open reading frames, even though we found no evidence for purifying selection from examination of the sequences individually for Ka/Ks. This would arise if the <it>PRs </it>were generally under relaxed or positive selection pressures at the codon level. The existence of relaxed selection is consistent with the markedly low numbers of <it>PRs </it>found to be transcribed in both human and mouse, particularly those that were formed since divergence from dog.</p>
            <p>Out of those with significant RFC, is there any evidence for non-neutral Ka/Ks trends? We checked for significant excess of <it>PRs </it>with Ka/Ks values &lt;0.5, > 0.5, &lt;1.0 or >1.0 for each of the subsets listed (Table <tblr tid="T2">2</tblr>). In the human lineage, we find a significant excess of <it>PRs </it>with Ka/Ks >0.5 (40/59, P &lt; 0.05, &#967;<sup>2 </sup>test and Fisher's exact test, compared to an expectation from <it>R&#968;G </it>sequences) formed between divergence from dog and from chimp. This non-neutral trend is suggestive of positive selection distributed throughout this specific subset population of <it>PR </it>sequences. Out of the other subsets listed in Table <tblr tid="T2">2</tblr>, the only other significant non-neutral Ka/Ks tendency is for an excess of mouse-specific <it>PRs </it>to be under purifying selection (26/35 having Ka/Ks &lt;0.5, compared to an expectation from <it>R&#968;G </it>sequences, P &lt; 0.05, &#967;<sup>2 </sup>test and Fisher's exact test).</p>
         </sec>
         <sec>
            <st>
               <p>Functional categories</p>
            </st>
            <p>To assess whether the <it>PRs </it>and <it>R&#968;G</it>s have any unusual functional associations, we assigned functional categories using the Gene Ontology (GO) functional classification (Table <tblr tid="T3">3</tblr>). As noted previously, '<it>Structural constituent of ribosome</it>' is a prevalent functional category for <it>R&#968;G</it>s <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B14">14</abbr></abbrgrp>. Noticeably, there are more retropseudogenes in metabolic categories in mouse than in human (Table <tblr tid="T3">3</tblr>). The category '<it>structural constituent of ribosome</it>', whose prevalence is indicative of origin in retrotransposition, occurs in the top ten of all <it>PR </it>(sub)sets, and is ranked number one for <it>PRs </it>formed since divergence from dog, for both mouse and human (Table <tblr tid="T3">3</tblr>). <it>'Structural constituent of ribosome' </it>is also the only Gene Ontology term that is statistically overrepresented in all of the retrotransposed gene sets listed (Table <tblr tid="T3">3</tblr> legend; <it>P' </it>&lt;0.05, using binomial statistics and a Bonferroni correction for multiple hypothesis testing <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>). The functional category preferences are not caused by over-representation of any one parent: when representations of <it>PR</it>s are tallied on a parent-by-parent basis, we find only a very small number of parents giving rise to five or more <it>PR</it>s (Suppl. Table <tblr tid="T1">1</tblr>); the substantial majority of parents have only one PR offspring (286/353 [81%] for mouse, and 279/347 [80%] for human).</p>
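            <p>A binomial over-representation test with a Bonferroni correction can be sketched in Python as follows. The details are assumptions for illustration only: the sketch uses an upper-tail binomial probability with the background frequency of the term as the success probability, and multiplies by the number of terms tested. The function names are hypothetical and the exact procedure used in this study may differ.</p>
            <p>
```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    k: sequences in the set annotated with the term.
    n: total sequences in the set.
    p: background frequency of the term (assumed to come from all genes).
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```
            </p>
            <p>A term would be called over-represented in this sketch when the corrected value falls below the chosen significance level.</p>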
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Most common Gene Ontology (GO) functional terms for different sets of sequences *</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c cspan="4" ca="left">
                        <p>
                           <b>Human</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>All genes (Total = 33930)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Retropseudogenes (Total = 2493)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>All PRs (Total = 631)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>PRs </it><b>formed since divergence from dog lineage (Total = 211) **</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0005515, protein binding (2360)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>GO:0003735, structural constituent of ribosome (203) </it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0008270, zinc ion binding (49)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>GO:0003735, structural constituent of ribosome (11)</it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0008270, zinc ion binding (2069)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0008270, zinc ion binding (189)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (35)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (10)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (2029)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (166)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005509, calcium ion binding (25)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (9)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0005524, ATP binding (1687)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0003676, nucleic acid binding (132)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0005525, GTP binding (21)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005525, GTP binding (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (1339)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0003723, RNA binding (126)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0005515, protein binding (21)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003823, antigen binding (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0007165, signal transduction (1264)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005515, protein binding (114)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0004842, ubiquitin-protein ligase activity (21)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003676, nucleic acid binding (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0016740, transferase activity (1263)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (110)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (20)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0030145, manganese ion binding (4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0004872, receptor activity (1242)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005524, ATP binding (93)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003676, nucleic acid binding (20)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0020037, heme binding (4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0016787, hydrolase activity (1171)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0046872, metal ion binding (63)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>GO:0003735, structural constituent of ribosome (16) </it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0016757, transferase activity, transferring glycosyl groups (4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0003700, transcription factor activity (1052)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0000166, nucleotide binding (57)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003723, RNA binding (13)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005509, calcium ion binding (4)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="4" ca="left">
                        <p>
                           <b>Mouse</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>All genes (Total = 32442)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Retropseudogenes (Total = 2969)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b><it>PRs </it>(Total = 663)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>PRs </it>
                           <b>formed since divergence from dog lineage (Total = 298) **</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0005515, protein binding (2502)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0003676, nucleic acid binding (273)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0005515, protein binding (17)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>GO:0003735, structural constituent of ribosome (16) </it>
                           </b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0004872, receptor activity (1923)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0051287, NAD binding (243)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>GO:0003735, structural constituent of ribosome (16)</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0005524, ATP binding (8)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (1571)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0008943, glyceraldehyde-3-phosphate dehydrogenase activity (243)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0008270, zinc ion binding (12)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005515, protein binding (7)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0008270, zinc ion binding (1481)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0004365, glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) activity (243)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (12)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0016740, transferase activity (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0005524, ATP binding (1252)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0008270, zinc ion binding (235)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0005524, ATP binding (12)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0016491, oxidoreductase activity (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0016740, transferase activity (1036)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>GO:0003735, structural constituent of ribosome (201) </it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0005509, calcium ion binding (12)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0006355, regulation of transcription, DNA-dependent (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (1017)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005515, protein binding (101)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0016740, transferase activity (10)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0016853, isomerase activity (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0016787, hydrolase activity (911)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0016491, oxidoreductase activity (94)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0016787, hydrolase activity (9)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0016787, hydrolase activity (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0000166, nucleotide binding (873)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0005524, ATP binding (78)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (9)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0003677, DNA binding (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO:0003676, nucleic acid binding (872)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GO:0004190, aspartic-type endopeptidase activity (77)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GO:0003676, nucleic acid binding (9)</p>
                     </c>
                     <c ca="left">
                        <p>GO:0016874, ligase activity (3)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>* The most abundant Gene Ontology 'molecular function' terms are listed for each set of sequences, in decreasing order of abundance. The GO term number and a brief description are followed by the number of occurrences (in brackets). Significant overrepresentation of GO terms was calculated as described previously, using binomial statistics with a Bonferroni correction for multiple hypothesis testing (<it>P' </it>&lt; 0.05) <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. '<it>Structural constituent of ribosome</it>' (in italics) is the only term that is significantly overrepresented in all three sets of putatively retrotransposed sequences.</p>
                  <p>** '<it>Structural constituent of ribosome</it>' remains the most abundant GO category in this column when PRs from parents with large exons (FLE>0.67) are removed, or when a more stringent threshold of N<sub>homologs </sub>= 0 is used.</p>
               </tblfn>
            </tbl>
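            <p>The over-representation test named in the Table 3 legend (binomial statistics with a Bonferroni correction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in this study: the observed count (11 of 211) and the total number of human genes (33930) are taken from Table 3, whereas the background count of the GO term among all genes (300) and the number of GO terms tested (100) are hypothetical placeholders, and the function names are illustrative only.</p>

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P-value (P'), capped at 1."""
    return min(1.0, p_value * n_tests)


# Values from Table 3 (human): GO:0003735 occurs 11 times among the
# 211 PRs formed since divergence from the dog lineage.
observed, set_size = 11, 211
total_genes = 33930

# HYPOTHETICAL placeholders (not values from the paper):
background_count = 300  # assumed occurrences of the term among all genes
n_terms_tested = 100    # assumed number of GO terms tested

background_freq = background_count / total_genes
p_raw = binomial_upper_tail(observed, set_size, background_freq)
p_adj = bonferroni(p_raw, n_terms_tested)
print(f"raw P = {p_raw:.3g}, Bonferroni P' = {p_adj:.3g}, significant: {0.05 > p_adj}")
```

            <p>Under these assumed inputs, a term is called overrepresented when its Bonferroni-adjusted value P' falls below 0.05; replacing the two placeholders with the real background frequency and the real number of tested terms would be needed to reproduce any value reported here.</p>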
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have developed two parallel pipelines for the annotation of putative retrogenes (PRs) and retropseudogenes (R&#968;G) in whole genomes. The new pipeline for retropseudogene annotation employs length-based heuristics to speed up the processing of sequence alignment data. We applied these pipelines to vertebrate genomes here, but they are readily applicable to any genome and its set of gene/protein annotations. Genome annotation is constantly in flux; as vertebrate genome assemblies and their annotations are refined, we will be able to improve our retrotransposition analyses further and to remove errors arising from missing gene annotations, small genome assembly gaps, <it>etc</it>.</p>
         <p>We focussed on the annotation of retro(pseudo)genes in mouse and human. We were particularly interested in the retro(pseudo)genes formed since divergence from an 'outgroup' genome, that of the dog. We found evidence for excess, recent gene-retrotransposition activity in both human and mouse since their divergence from the dog lineage. We also found some evidence for selection on PRs at different phases of mouse and human genome evolution. In human, there is statistical evidence for non-neutral evolution (suggestive of positive selection) for the population of PRs that have significantly conserved reading frames and that formed since divergence from the dog lineage. In addition, human PRs formed since divergence from the dog lineage have significantly less transcription evidence, which is consistent with the possibility that they are pseudogenes, or are in some intermediate phase of relaxed selection. Such a state of low expression coupled with relaxed selection may also arise for alternatively-spliced exons <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>. In summary, our genomic analysis suggests that some human PRs formed since divergence from the dog lineage are undergoing a form of non-neutral evolution, but that the majority are either young pseudogenes (that remain undisabled simply by chance) or weakly expressed coding sequences in a state of 'relaxed' selection.</p>
         <p>Further information on the PRs and R&#968;Gs is available at the website: <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>PR, putative retrogene; <it>R&#968;G</it>, retropseudogene; RFC, reading frame conservation; TE, transposable element; GO, Gene Ontology; PDB, Protein Data Bank; SCOP, Structural Classification of Proteins; LINE, Long Interspersed Element; SINE, Short Interspersed Element; FLE, fraction of largest exon.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>ZY developed the pipelines, performed most of the data analysis, and wrote initial drafts of the manuscript. DM performed some data analysis of alternative splicing events. MI performed phylogenetic analysis. PH conceived of the project, performed evolutionary analyses, and wrote later drafts of the manuscript.</p>
         <suppl id="S1">
            <title>
               <p>Additional file 1</p>
            </title>
            <text>
               <p>Additional Figure <figr fid="F1">1</figr>: Distributions of the fraction of the largest exon (FLE). The fraction of the parent sequences that are taken up by their largest exons (denoted FLE) is plotted for various sets of sequences.</p>
            </text>
            <file name="1471-2105-8-308-S1.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional file 2</p>
            </title>
            <text>
               <p>Additional Figure <figr fid="F2">2</figr>: Overlap of the three methods for determining species-specific or lineage-specific lists of PRs, for Human (top panel) and Mouse (bottom panel). Three different methods for determining the relative age of sequences were used for generating species-specific lists. This figure demonstrates the overlap between these methods.</p>
            </text>
            <file name="1471-2105-8-308-S2.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Additional file 3</p>
            </title>
            <text>
               <p>Additional Figure <figr fid="F3">3</figr>: Example of a PR of citrate synthase in Mouse. A phylogenetic analysis was performed for a citrate synthase PR from mouse.</p>
            </text>
            <file name="1471-2105-8-308-S3.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S4">
            <title>
               <p>Additional file 4</p>
            </title>
            <text>
               <p>Additional Figure <figr fid="F4">4</figr>: Chromosomal milieus for three PR examples. Screenshots taken from the Ensembl database, depicting nearby features for three PRs and their putative parents.</p>
            </text>
            <file name="1471-2105-8-308-S4.pdf">
               <p>Click here for file</p>
            </file>
         </suppl>
         <suppl id="S5">
            <title>
               <p>Additional file 5</p>
            </title>
            <text>
               <p>Parents with the most PRs. The parent genes that give rise to the most PRs are listed.</p>
            </text>
            <file name="1471-2105-8-308-S5.doc">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This work was funded by a Discovery Grant awarded to P.M.H. from the Natural Sciences and Engineering Research Council of Canada.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Pseudogenes in metazoa: origin and features</p>
            </title>
            <aug>
               <au>
                  <snm>D'Errico</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Gadaleta</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Saccone</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Briefings in functional genomics &amp; proteomics</source>
            <pubdate>2004</pubdate>
            <volume>3</volume>
            <issue>2</issue>
            <fpage>157</fpage>
            <lpage>167</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bfgp/3.2.157</pubid>
                  <pubid idtype="pmpid" link="fulltext">15355597</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Harrison</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <issue>12</issue>
            <fpage>2541</fpage>
            <lpage>2558</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">403796</pubid>
                  <pubid idtype="pmpid" link="fulltext">14656962</pubid>
                  <pubid idtype="doi">10.1101/gr.1429003</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Harrison</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <issue>10</issue>
            <fpage>1466</fpage>
            <lpage>1482</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">187539</pubid>
                  <pubid idtype="pmpid" link="fulltext">12368239</pubid>
                  <pubid idtype="doi">10.1101/gr.331902</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>5338</fpage>
            <lpage>5348</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">203328</pubid>
                  <pubid idtype="pmpid" link="fulltext">12954770</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg745</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Studying genomes through the aeons: protein families, pseudogenes and proteome evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2002</pubdate>
            <volume>318</volume>
            <issue>5</issue>
            <fpage>1155</fpage>
            <lpage>1174</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0022-2836(02)00109-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">12083509</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Retroposed new genes out of the X in Drosophila</p>
            </title>
            <aug>
               <au>
                  <snm>Betran</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Long</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome research</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <issue>12</issue>
            <fpage>1854</fpage>
            <lpage>1859</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">187566</pubid>
                  <pubid idtype="pmpid" link="fulltext">12466289</pubid>
                  <pubid idtype="doi">10.1101/gr.6049</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Human LINE retrotransposons generate processed pseudogenes</p>
            </title>
            <aug>
               <au>
                  <snm>Esnault</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Maestre</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Heidmann</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Nature genetics</source>
            <pubdate>2000</pubdate>
            <volume>24</volume>
            <issue>4</issue>
            <fpage>363</fpage>
            <lpage>367</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/74184</pubid>
                  <pubid idtype="pmpid" link="fulltext">10742098</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Processed pseudogenes of human endogenous retroviruses generated by LINEs: their integration, stability, and distribution</p>
            </title>
            <aug>
               <au>
                  <snm>Pavlicek</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Paces</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Elleder</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Hejnar</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome research</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <issue>3</issue>
            <fpage>391</fpage>
            <lpage>399</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">155283</pubid>
                  <pubid idtype="pmpid" link="fulltext">11875026</pubid>
                  <pubid idtype="doi">10.1101/gr.216902</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>A genome-wide survey of human pseudogenes</p>
            </title>
            <aug>
               <au>
                  <snm>Torrents</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Suyama</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Zdobnov</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Genome research</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <issue>12</issue>
            <fpage>2559</fpage>
            <lpage>2567</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">403797</pubid>
                  <pubid idtype="pmpid" link="fulltext">14656963</pubid>
                  <pubid idtype="doi">10.1101/gr.1455503</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements</p>
            </title>
            <aug>
               <au>
                  <snm>Brosius</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>1999</pubdate>
            <volume>238</volume>
            <issue>1</issue>
            <fpage>115</fpage>
            <lpage>134</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(99)00227-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">10570990</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Zheng</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Carriero</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>2374</fpage>
            <lpage>2383</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1087782</pubid>
                  <pubid idtype="pmpid" link="fulltext">15860774</pubid>
                  <pubid idtype="doi">10.1093/nar/gki531</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Emergence of young human genes after a burst of retroposition in primates</p>
            </title>
            <aug>
               <au>
                  <snm>Marques</snm>
                  <fnm>AC</fnm>
               </au>
               <au>
                  <snm>Dupanloup</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Vinckenbosch</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Reymond</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kaessmann</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>PLoS biology</source>
            <pubdate>2005</pubdate>
            <volume>3</volume>
            <fpage>e357</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1251493</pubid>
                  <pubid idtype="pmpid" link="fulltext">16201836</pubid>
                  <pubid idtype="doi">10.1371/journal.pbio.0030357</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Extensive gene traffic on the mammalian X chromosome</p>
            </title>
            <aug>
               <au>
                  <snm>Emerson</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Kaessmann</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Betran</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Long</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2004</pubdate>
            <volume>303</volume>
            <issue>5657</issue>
            <fpage>537</fpage>
            <lpage>540</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1090042</pubid>
                  <pubid idtype="pmpid" link="fulltext">14739461</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Hegyi</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Balasubramanian</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Luscombe</snm>
                  <fnm>NM</fnm>
               </au>
               <au>
                  <snm>Bertone</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Echols</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <issue>2</issue>
            <fpage>272</fpage>
            <lpage>280</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">155275</pubid>
                  <pubid idtype="pmpid" link="fulltext">11827946</pubid>
                  <pubid idtype="doi">10.1101/gr.207102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Ensembl database</p>
            </title>
            <fpage>[ http://www.ensembl.org ] </fpage>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <issue>17</issue>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Patterns and rates of indel evolution in processed pseudogenes from humans and murids</p>
            </title>
            <aug>
               <au>
                  <snm>Ophir</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Graur</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>1997</pubdate>
            <volume>205</volume>
            <fpage>191</fpage>
            <lpage>202</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(97)00398-3</pubid>
                  <pubid idtype="pmpid">9461394</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>ORDB </p>
            </title>
            <fpage>[ http://senselab.med.yale.edu/senselab/ORDB ]</fpage>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Repeatmasker</p>
            </title>
            <fpage>[ http://www.repeatmasker.org ] </fpage>
         </bibl>
         <bibl id="B20">
            <title>
               <p>PAML: a program package for phylogenetic analysis by maximum likelihood</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1997</pubdate>
            <volume>13</volume>
            <fpage>555</fpage>
            <lpage>556</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid">9367129</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Characterization of evolutionary rates and constraints in three mammalian genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Cooper</snm>
                  <fnm>GM</fnm>
               </au>
               <au>
                  <snm>Brudno</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Stone</snm>
                  <fnm>EA</fnm>
               </au>
               <au>
                  <snm>Dubchak</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Batzoglou</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sidow</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>539</fpage>
            <lpage>548</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">383297</pubid>
                  <pubid idtype="pmpid" link="fulltext">15059994</pubid>
                  <pubid idtype="doi">10.1101/gr.2034704</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Majority of divergence between closely-related DNA samples is due to indels</p>
            </title>
            <aug>
               <au>
                  <snm>Britten</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Rowen</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Cameron</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>PNAS</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <fpage>4665</fpage>
            <lpage>4670</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.0330964100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>PHYLIP - Phylogeny Inference Package (Version 3.2)</p>
            </title>
            <aug>
               <au>
                  <snm>Felsenstein</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Cladistics</source>
            <pubdate>1989</pubdate>
            <volume>5</volume>
            <fpage>164</fpage>
            <lpage>166</lpage>
         </bibl>
         <bibl id="B24">
            <title>
               <p>The Gene Ontology (GO) database and informatics resource</p>
            </title>
            <aug>
               <au>
                  <snm>Consortium</snm>
                  <fnm>GO</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D258</fpage>
            <lpage>D261</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308770</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681407</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh036</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Gene Ontology</p>
            </title>
            <fpage>[ http://www.geneontology.org ] </fpage>
         </bibl>
         <bibl id="B26">
            <title>
               <p>The ASTRAL Compendium in 2004</p>
            </title>
            <aug>
               <au>
                  <snm>Chandonia</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Hon</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Walker</snm>
                  <fnm>NS</fnm>
               </au>
               <au>
                  <snm>Lo Conte</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Koehl</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Levitt</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Nucleic acids research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D189</fpage>
            <lpage>D192</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308768</pubid>
                  <pubid idtype="pmpid" link="fulltext">14681391</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh034</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>NCBI </p>
            </title>
            <fpage>[ http://www.ncbi.nih.gov ] </fpage>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Consortium</snm>
                  <fnm>ICGS</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>432</volume>
            <fpage>695</fpage>
            <lpage>716</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03154</pubid>
                  <pubid idtype="pmpid" link="fulltext">15592404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Chicken repeat 1 (CR1) elements, which define an ancient family of vertebrate non-LTR retrotransposons, contain two closely spaced open reading frames</p>
            </title>
            <aug>
               <au>
                  <snm>Haas</snm>
                  <fnm>NB</fnm>
               </au>
               <au>
                  <snm>Grabowski</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Sivitz</snm>
                  <fnm>AB</fnm>
               </au>
               <au>
                  <snm>Burch</snm>
                  <fnm>JB</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>1997</pubdate>
            <volume>197</volume>
            <fpage>305</fpage>
            <lpage>309</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(97)00276-X</pubid>
                  <pubid idtype="pmpid">9332379</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Characterization and repeat analysis of the compact genome of the freshwater pufferfish Tetraodon nigroviridis</p>
            </title>
            <aug>
               <au>
                  <snm>Roest Crollius</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Jaillon</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Dasilva</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ozouf-Costaz</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fizames</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fischer</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bouneau</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Billault</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Quetier</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Saurin</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Bernot</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Weissenbach</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>939</fpage>
            <lpage>949</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">310905</pubid>
                  <pubid idtype="pmpid" link="fulltext">10899143</pubid>
                  <pubid idtype="doi">10.1101/gr.10.7.939</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Genome sequence of the Brown Norway rat yields insights into mammalian evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Consortium</snm>
                  <fnm>RGSP</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>428</volume>
            <issue>6982</issue>
            <fpage>493</fpage>
            <lpage>521</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature02426</pubid>
                  <pubid idtype="pmpid" link="fulltext">15057822</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Initial sequencing and comparative analysis of the mouse genome</p>
            </title>
            <aug>
               <au>
                  <snm>Consortium</snm>
                  <fnm>MGS</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>420</volume>
            <issue>6915</issue>
            <fpage>520</fpage>
            <lpage>562</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature01262</pubid>
                  <pubid idtype="pmpid" link="fulltext">12466850</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>The sequence of the human genome</p>
            </title>
            <aug>
               <au>
                  <cnm>Celera Genomics</cnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>291</volume>
            <issue>5507</issue>
            <fpage>1304</fpage>
            <lpage>1351</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1058040</pubid>
                  <pubid idtype="pmpid" link="fulltext">11181995</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>The ENCODE (ENCyclopedia Of DNA Elements) Project</p>
            </title>
            <aug>
               <au>
                  <snm>Consortium</snm>
                  <fnm>ENCODEP</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2004</pubdate>
            <volume>306</volume>
            <issue>5696</issue>
            <fpage>636</fpage>
            <lpage>640</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1105136</pubid>
                  <pubid idtype="pmpid" link="fulltext">15499007</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Carriero</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Liu</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Gerstein</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2003</pubdate>
            <volume>333</volume>
            <fpage>885</fpage>
            <lpage>892</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.jmb.2003.09.016</pubid>
                  <pubid idtype="pmpid" link="fulltext">14583187</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Sequencing and comparison of yeast species to identify genes and regulatory elements</p>
            </title>
            <aug>
               <au>
                  <snm>Kellis</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Patterson</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Endrizzi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Birren</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2003</pubdate>
            <volume>423</volume>
            <fpage>241</fpage>
            <lpage>254</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature01644</pubid>
                  <pubid idtype="pmpid" link="fulltext">12748633</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>PM</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>441</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1618407</pubid>
                  <pubid idtype="pmpid" link="fulltext">17032452</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-7-441</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss</p>
            </title>
            <aug>
               <au>
                  <snm>Modrek</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2003</pubdate>
            <volume>34</volume>
            <issue>2</issue>
            <fpage>177</fpage>
            <lpage>180</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng1159</pubid>
                  <pubid idtype="pmpid" link="fulltext">12730695</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Origin and evolution of new exons in rodents</p>
            </title>
            <aug>
               <au>
                  <snm>Wang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <cnm>Zheng H</cnm>
               </au>
               <au>
                  <cnm>Yang S</cnm>
               </au>
               <au>
                  <cnm>Yu H</cnm>
               </au>
               <au>
                  <cnm>Li J</cnm>
               </au>
               <au>
                  <cnm>Jiang H</cnm>
               </au>
               <au>
                  <cnm>Su J</cnm>
               </au>
               <au>
                  <cnm>Yang L</cnm>
               </au>
               <au>
                  <cnm>Zhang J</cnm>
               </au>
               <au>
                  <cnm>McDermott J</cnm>
               </au>
               <au>
                  <cnm>Samudrala R</cnm>
               </au>
               <au>
                  <cnm>Wang J</cnm>
               </au>
               <au>
                  <cnm>Yang H</cnm>
               </au>
               <au>
                  <cnm>Yu J</cnm>
               </au>
               <au>
                  <cnm>Kristiansen K</cnm>
               </au>
               <au>
                  <cnm>Wong GK</cnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <fpage>1258</fpage>
            <lpage>1264</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1199540</pubid>
                  <pubid idtype="pmpid" link="fulltext">16109974</pubid>
                  <pubid idtype="doi">10.1101/gr.3929705</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>http://biology.mcgill.ca/faculty/harrison/retro </p>
            </title>
            <fpage>[ http://biology.mcgill.ca/faculty/harrison/retro ] </fpage>
         </bibl>
      </refgrp>
   </bm>
</art>
