<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2164-11-S4-S10</ui>
   <ji>1471-2164</ji>
   <fm>
      <dochead>Proceedings</dochead>
      <bibl>
         <title>
            <p>Evolutionary patterns of amino acid substitutions in 12 <it>Drosophila</it> genomes</p>
         </title>
         <aug>
            <au ca="yes" id="A1">
               <snm>Yampolsky</snm>
               <mi>Y</mi>
               <fnm>Lev</fnm>
               <insr iid="I1"/>
               <email>yampolsk@etsu.edu</email>
            </au>
            <au id="A2">
               <snm>Bouzinier</snm>
               <mi>A</mi>
               <fnm>Michael</fnm>
               <insr iid="I2"/>
               <email>mbouzin@intersystems.com</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Biological Sciences, East Tennessee State University, Johnson City, TN 37614, USA</p>
            </ins>
            <ins id="I2">
               <p>InterSystems Corporation, One Memorial Drive, Cambridge, MA 02142, USA</p>
            </ins>
         </insg>
         <source>BMC Genomics</source>
         <supplement>
            <title>
               <p>Ninth International Conference on Bioinformatics (InCoB2010): Computational Biology</p>
            </title>
            <editor>Christian Sch&#246;nbach, Kenta Nakai, Tin Wee Tan and Shoba Ranganathan</editor>
            <note>Proceedings</note>
            <url>http://www.biomedcentral.com/content/pdf/1471-2164-11-S4-info.pdf</url>
         </supplement>
         <conference>
            <title>
               <p>Asia Pacific Bioinformatics Network (APBioNet) Ninth International Conference on Bioinformatics (InCoB2010)</p>
            </title>
            <location>Tokyo, Japan</location>
            <date-range>26-28 September 2010</date-range>
            <url>http://incob.apbionet.org/incob10/</url>
         </conference>
         <issn>1471-2164</issn>
         <pubdate>2010</pubdate>
         <volume>11</volume>
         <issue>Suppl 4</issue>
         <fpage>S10</fpage>
         <url>http://www.biomedcentral.com/1471-2164/11/S4/S10</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">21143793</pubid>
               <pubid idtype="doi">10.1186/1471-2164-11-S4-S10</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <pub>
            <date>
               <day>2</day>
               <month>12</month>
               <year>2010</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2010</year>
         <collab>Yampolsky and Bouzinier; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Harnessing vast amounts of genomic data in phylogenetic context stemming from massive sequencing of multiple closely related genomes requires new tools and approaches. We present a tool for the genome-wide analysis of frequencies and patterns of amino acid substitutions in multiple alignments of genes&#8217; coding regions, and a database of amino acid substitutions in the phylogeny of 12 <it>Drosophila</it> genomes. We illustrate the use of these resources to address three types of evolutionary genomics questions: about fluxes in amino acid composition in proteins, about asymmetries in amino acid substitutions and about patterns of molecular evolution in duplicated genes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We demonstrate that amino acid composition of <it>Drosophila</it> proteins underwent a significant shift over the last 70 million years encompassed by the studied phylogeny, with less common amino acids (Cys, Met, His) increasing in frequency and more common ones (Ala, Leu, Glu) becoming less frequent. These fluxes are strongly correlated with polarity of source and destination amino acids, resulting in overall systematic decrease of mean polarity of amino acids found in <it>Drosophila</it> proteins. Frequency and radicality of amino acid substitutions are higher in paralogs than in orthologous single-copy genes and are higher in gene families with paralogs than in gene families without surviving duplications. Rate and radicality of substitutions, as expected, are negatively correlated with overall level and uniformity of gene expression. However, these correlations are not observed for substitutions occurring in duplicated genes, indicating a different selective constraint on the evolution of paralogous sequences. Clades resulting from duplications show a marked asymmetry in rate and radicality of amino acid substitutions, possibly a signal of widespread neofunctionalization. These patterns differ among protein families of different functionality, with genes coding for RNA-binding proteins differing from most other functional groups in terms of amino acid substitution patterns in duplicated and single-copy genes.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>We demonstrate that deep phylogenetic analysis of amino acid substitutions can reveal interesting genome-wide patterns. Amino acid composition of drosophilid proteins is shaped by fluxes similar to those previously observed in prokaryotic, yeast and mammalian genomes, indicating globally present patterns. Increased frequency and radicality of amino acid substitutions in duplicated genes and the presence of asymmetry of these parameters between paralogous clades indicate widespread neofunctionalization among paralogs as the mechanism of duplication retention.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Until recently, evolutionary genomics questions, including questions about amino acid composition of proteins, patterns of stabilizing and positive selection and mechanisms of retention of duplicated genes and new function evolution, were typically answered either by analyzing phylogenies of select gene families <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp> or by full-genome analysis of triplets of genomes, with two ingroup genomes compared to measure evolutionary rates and the third, outgroup, genome used to polarize the observed changes <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. As the strategy of genome sequencing shifts from broad taxonomic coverage to sequencing multiple closely related genomes <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, a need arises for a set of tools to accomplish a phylogenetic analysis of amino acid substitutions in coding portions of a large number of protein families simultaneously and to address the question of generality of patterns observed in a limited and possibly biased set of select gene families. Questions that can be asked using such an approach include, but are certainly not limited to, enquiries about long-term changes in amino acid composition of proteins, about selective constraints and pressures across the genome and about the evolution of novel gene functions by retention and modification of duplicated genes. Here we present a tool to accomplish phylogenetic analysis of amino acid substitutions on the whole-genome scale using multiple amino acid alignments of over 11,000 gene families from twelve completely sequenced <it>Drosophila</it> genomes and illustrate its utility by the analysis of the resulting database of amino acid substitutions spanning 70 million years of drosophilid protein evolution.</p>
         <p>The global amino acid composition of proteins is thought not to be at detailed balance; rather, it appears to be gradually evolving by consistently adding rare amino acids to, and removing common amino acids from, the amino acid repertoire of protein sequences <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B5">5</abbr></abbrgrp>. There is an on-going debate on whether this pattern reflects the order in which amino acids have been added to the genetic code <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B6">6</abbr></abbrgrp> or is caused by biases in mutability of particular codons <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. As pointed out by <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, one way to address this controversy is to analyze the observed trends in a range of genomes of increasing degree of divergence: if the observed patterns are caused by the effect of amino acid polymorphism reflecting mutation-selection biases, they are expected to become less pronounced as divergence between genomes increases. Furthermore, there may be substantial differences in selection pressures on reciprocal amino acid substitutions <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>: changes from polar to non-polar amino acids in human proteins are more permissive than vice versa. Such asymmetry, and the degree to which it can contribute to the large-scale changes in amino acid composition, have not yet been measured on the scale of several genomes.</p>
         <p>Differences in patterns of selective pressure have also been predicted between evolutionarily retained duplicated genes and single-copy genes <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>. Duplicated genes can persist in genomes either because one of the copies has acquired a new function (neofunctionalization <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>), or because both copies are now needed to perform the function or functions previously accomplished by a single copy (subfunctionalization). Subfunctionalization can occur either by means of partitioning of the ancestral functions between the two copies (for example by loss of one of the alternative promoters in each copy), or by means of balanced degradation, i.e., fixation of hypomorphic alleles in each copy <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>. Each of these mechanisms implies relaxation of stabilizing selection, resulting in faster evolution in paralogs than in single-copy genes. Specifically, pure neofunctionalization occurs by accumulation of mutations in one of the copies, while the other remains under stabilizing selection <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp>. Subfunctionalization occurring through balanced degradation, on the other hand, is accompanied by accumulation of deleterious mutations in both paralogs. Finally, subfunctionalization occurring by tissue- or developmental stage-specialization of gene expression without a change in functionality would result in retention of stabilizing selection acting on both paralogs.
It is much harder to make predictions about other types of subfunctionalization, such as subdivision of pre-existing multiple substrate specificity between duplicated genes, because the two functions may depend on different parts of the coding portion of the gene and, therefore, retaining one but not the other may relax selective constraints acting on at least part of the sequence. Previous studies of duplicated genes in <it>Drosophila</it> genomes (e.g., <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>) detected an elevated signal of positive selection in a subset of gene families with duplications using the K<sub>a</sub>/K<sub>s</sub> approach. Here we report a genome-wide analysis of differences between duplicated and single copy genes in frequency and spectrum of amino acid substitutions.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Application of AcidMiner to <it>Drosophila</it> data: a database of amino acid substitutions in 12 genomes</p>
            </st>
            <p>The main purpose of AcidMiner is to extract amino acid substitution data from multiple alignments and to expand them in the form of relational tables so that standard SQL can be used to perform queries by any combination of criteria and to calculate aggregates. AcidMiner takes raw data in the form of multiple alignments and Newick protein and species trees, processes them to produce derivative data such as parsimony-based polarization of substitutions and stores the result in a relational database structure. The raw data for the analysis reported here was a set of multiple amino acid alignments from 12 completely sequenced <it>Drosophila</it> genomes (<abbrgrp><abbr bid="B4">4</abbr><abbr bid="B19">19</abbr></abbrgrp>; see Methods). A set of SQL queries can be run against this database to produce custom datasets with given restrictions and/or to calculate any aggregates, including statistical parameters, on different datasets. In addition, for tasks not easily expressible in SQL, data already in the database can be processed to produce further derivative data. Examples of such tasks are: defining clades for each duplication, calculating the number of substitutions in each clade (including cases when we cannot unambiguously determine exactly which substitution has occurred), calculating protein lengths in clades, and calculating ages (timing data) of substitutions and duplications.</p>
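            <p>As an illustration of what parsimony-based polarization means, the following minimal Python sketch handles the simplest three-taxon case at a single alignment site: two sister sequences and one outgroup. The function name and its return convention are hypothetical, and the sketch is not AcidMiner's implementation, which works on full protein and species trees.</p>
            <p>
```python
def polarize(sister_a, sister_b, outgroup):
    """Polarize a difference between two sister sequences at one site.

    Returns (source, destination, lineage) when the outgroup state matches
    exactly one sister, "none" when the sisters agree, and "ambiguous"
    when the outgroup matches neither sister.
    """
    if sister_a == sister_b:
        return "none"
    if outgroup == sister_a:
        # The state shared with the outgroup is taken as ancestral,
        # so the change happened on the lineage leading to sister b.
        return (sister_a, sister_b, "b")
    if outgroup == sister_b:
        return (sister_b, sister_a, "a")
    return "ambiguous"


# Toy residues, not data from the study.
unambiguous_case = polarize("A", "S", "A")
ambiguous_case = polarize("A", "S", "T")
```
            </p>
            <p>In this toy case an Ala/Ser difference with Ala in the outgroup is read as an Ala to Ser substitution on one lineage, while a third state in the outgroup leaves the direction unresolved, which corresponds to the ambiguous substitutions that the database records separately.</p>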
            <p>The resulting database in its current form includes 3,697,627 amino acid substitutions occurring in 12 drosophilid genomes spanning 11,258 gene families. It consists of 14 tables defining the base data model. Two additional tables contain preloaded data for gene ontology and amino acid substitution properties, such as pair-wise change in polarity. Main tables include a Families table, Tree Structure tables for protein and species trees with a separate record for each tree node and the branch terminating in this node, a Substitutions table with a record for each unambiguous and ambiguous substitution including a reference to the branch where it occurred (or might have occurred, for ambiguous substitutions) and a Duplications table, which includes phylogenetic information about each duplication and the two clades generated by it. The database is available for download from the AcidMiner website <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> in the form of a virtual machine. Any standard SQL tool can be used; most of the queries we used for this study are also available in the AcidMiner repository, along with the source code and a detailed description of the database structure.</p>
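            <p>The kind of aggregate query described above can be sketched as follows, using Python's built-in sqlite3 module. The table and column names are hypothetical and much simpler than the actual AcidMiner schema, which should be taken from the database description in the repository; the rows are toy data, not results.</p>
            <p>
```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical Substitutions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE substitutions ("
        "family_id INTEGER, branch_id INTEGER, "
        "source_aa TEXT, dest_aa TEXT, unambiguous INTEGER)"
    )
    rows = [
        (1, 10, "A", "S", 1),
        (1, 11, "A", "S", 1),
        (2, 20, "S", "A", 1),
        (2, 21, "L", "M", 0),  # ambiguous; excluded by the query below
    ]
    conn.executemany("INSERT INTO substitutions VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def count_unambiguous_pairs(conn):
    """Count unambiguous substitutions for each (source, destination) pair."""
    query = (
        "SELECT source_aa, dest_aa, COUNT(*) FROM substitutions "
        "WHERE unambiguous = 1 "
        "GROUP BY source_aa, dest_aa "
        "ORDER BY source_aa, dest_aa"
    )
    return conn.execute(query).fetchall()


conn = build_demo_db()
pair_counts = count_unambiguous_pairs(conn)
```
            </p>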
         </sec>
         <sec>
            <st>
               <p>Fluxes and asymmetries in amino acid substitutions</p>
            </st>
            <p>Figure <figr fid="F1">1</figr> shows the results of the amino acid flux analysis (data available in Additional file <supplr sid="S1">1</supplr>). As has been previously shown <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, frequent amino acids, in particular alanine, glutamic acid, leucine and proline, tend to be lost more often than created in protein sequences, while rare amino acids (in particular cysteine, histidine and methionine) are created more often than lost (Fig. <figr fid="F1">1 A, B</figr>). There is a strong rank correlation between relative gain of amino acids in this study and in Ref <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, based on a variety of genome triplets, mostly prokaryotic (Fig. <figr fid="F1">1C</figr>). The general pattern of relative gain-loss is the same in the entire 12-genome phylogeny (Fig. <figr fid="F1">1A</figr>, red bars) and in pairs of sister species of different divergence depth (Fig. <figr fid="F1">1A</figr>, blue bars); however, there are exceptions. For example, phenylalanine and asparagine, which are moderate gainers in the entire phylogeny, show a net loss in the shallowest branch (<it>D. persimilis/D. pseudoobscura</it>), while arginine, a weak loser in the whole phylogeny, shows a strong net gain in the shallow branches.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Fluxes of amino acids in 12 <it>Drosophila</it> genomes</p>
               </caption>
               <text>
                  <p><b>Fluxes of amino acids in 12 <it>Drosophila</it> genomes</b>. A: Loser and gainer amino acids in the whole phylogeny (red bars) and terminal branches of different depth leading to sister species (blue bars; colour darkness increases with the depth of terminal branches). D_pseper &#8211; substitutions in <it>D. pseudoobscura</it> and <it>D. persimilis</it> branches; D_simsec &#8211; in <it>D. simulans</it> and <it>D. sechellia</it> branches; D_yakere &#8211; in <it>D. yakuba</it> and <it>D. erecta</it> branches; D_virmoj &#8211; in <it>D. virilis</it> and <it>D. mojavensis</it> branches. Relative amino acid gain D = (Gain-Loss)/(Gain+Loss) <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. B: Relationship between relative amino acid gain (D) and frequency of each amino acid in 12 <it>Drosophila</it> genomes. Red circles &#8211; all 12 genomes, blue circles &#8211; only substitutions in the shallowest branches (in <it>D. pseudoobscura</it> and <it>D. persimilis</it>). C: Relationship between relative amino acid gain (D) and gain-loss rank in Ref <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Symbols as in Fig. 1B. D: Rank (Spearman) correlation between relative amino acid gain (D) in branches of different depths in this study and in Ref <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> (&#961;; black circles); Pearson coefficient of correlation between D and amino acid frequency in 12 <it>Drosophila</it> genomes (r, green diamonds). E: Mean pair-wise asymmetry of reciprocal substitutions (|D|, red squares). Branch depth (K<sub>s</sub>) in parts D and E is in synonymous substitutions per 4-fold degenerate site <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p>
               </text>
               <graphic file="1471-2164-11-S4-S10-1"/>
            </fig>
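            <p>The relative gain statistic defined in the legend of Figure 1, D = (Gain-Loss)/(Gain+Loss), can be computed from polarized substitution counts as in the minimal Python sketch below. The function names and the counts are illustrative assumptions, not values from the study.</p>
            <p>
```python
def gains_and_losses(substitutions):
    """Tally gains and losses per amino acid from (source, dest, count) rows."""
    gain, loss = {}, {}
    for source, dest, count in substitutions:
        loss[source] = loss.get(source, 0) + count
        gain[dest] = gain.get(dest, 0) + count
    return gain, loss


def relative_gain(gain, loss):
    """D = (Gain - Loss) / (Gain + Loss); ranges from -1 to +1."""
    total = gain + loss
    if total == 0:
        raise ValueError("no substitutions observed for this amino acid")
    return (gain - loss) / total


# Toy counts of polarized substitutions (source, destination, count).
toy = [("A", "S", 30), ("S", "A", 10), ("A", "C", 5)]
gain, loss = gains_and_losses(toy)
d_values = {
    aa: relative_gain(gain.get(aa, 0), loss.get(aa, 0))
    for aa in sorted(set(gain) | set(loss))
}
```
            </p>
            <p>With these toy counts alanine is a net loser (negative D), serine a net gainer, and cysteine, which is only ever created, reaches the upper bound D = 1.</p>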
            <suppl id="S1">
               <title>
                  <p>Additional file 1</p>
               </title>
               <caption>
                  <p>Data by amino acids (terminal branches)</p>
               </caption>
               <text>
                  <p>Excel spreadsheet with pair-wise amino acid substitution frequencies mapped to terminal branches of the phylogeny, by species.</p>
               </text>
               <file name="1471-2164-11-S4-S10-S1.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>Contrary to the prediction based on the effect of intraspecific polymorphism <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B9">9</abbr></abbrgrp>, the observed gain-loss pattern does not become less pronounced as the divergence between genomes increases (Fig. <figr fid="F1">1 D, E</figr>; Additional file <supplr sid="S2">2</supplr>). Rank correlation with the global gain-loss pattern from Ref <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> slightly increases with branch depth, while mean pair-wise asymmetry (|D| calculated for each amino acid pair) and correlation with amino acid frequency remain essentially flat. There is a slight tendency towards a decrease of mean asymmetry (|D|) with the depth of the phylogeny (Fig. <figr fid="F1">1 E</figr>), but none of the pair-wise comparisons of shallow vs. deeper branches is significant.</p>
            <suppl id="S2">
               <title>
                  <p>Additional file 2</p>
               </title>
               <caption>
                  <p>Data by amino acids (entire phylogeny; terminal vs. non-terminal branches)</p>
               </caption>
               <text>
                  <p>Excel spreadsheet with pair-wise amino acid substitution frequencies, separately for terminal and non-terminal branches.</p>
               </text>
               <file name="1471-2164-11-S4-S10-S2.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
            <p>Pair-wise asymmetry of amino acid gains and losses had a clear manifestation in terms of average change in amino acid polarity. Amino acid pairs with the largest polarity gain had the highest asymmetry towards net gain of the less polar amino acid (Fig. <figr fid="F2">2A</figr>). The degree of polarity asymmetry differed among genes of different functionality (Fig. <figr fid="F2">2B</figr>): nucleic acid- and nucleotide-binding proteins had the strongest asymmetry towards net gain of non-polar amino acids, while in receptor and transporter proteins such asymmetry was not observed. Likewise, net loss of polarity was the highest in proteins with intracellular localization, intermediate in proteins with extracellular localization and the lowest in membrane proteins (Fig. <figr fid="F2">2C</figr>), indicating an effect of the hydrophobicity of the protein&#8217;s cellular environment on the relative gain and loss rates of polar and non-polar amino acids.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Amino acid polarity and asymmetry of net gains and losses</p>
               </caption>
               <text>
                  <p><b>Amino acid polarity and asymmetry of net gains and losses</b>. A. Correlation between relative net gain (D) and difference in polarity (Destination-Source) for 190 pairs of amino acids. B. Net decrease of mean amino acid polarity due to substitutions in proteins of different molecular functions. C. Net decrease of mean amino acid polarity due to substitutions in proteins of different cellular localization.</p>
               </text>
               <graphic file="1471-2164-11-S4-S10-2"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Frequencies and radicality of amino acid substitutions in duplicated genes</p>
            </st>
            <p>Duplicated genes appeared to accumulate more amino acid changes since duplication (per unit of time measured in units of synonymous substitutions per 4-fold degenerate site) than single copy genes (Fig. <figr fid="F3">3</figr>). Although the difference was statistically significant, it was not drastic: among 1701 gene families with duplications and with at least 1 substitution in both duplicated and unduplicated parts of the phylogeny, paralogs accumulated more substitutions per unit of branch length than single copy genes in 988 families (58%; sign test P&lt;0.00001). This relationship also varied across functional groups of genes, being the strongest in non-TF DNA-binding proteins, weaker in enzymes and protein-binding proteins and undetectable or reversed in other functional groups of proteins. Overall, the rate of substitutions was the greatest in paralogs and the lowest in unduplicated sections of phylogenies of gene families with duplications, both when all substitutions and when only unambiguous substitutions were considered (Fig. <figr fid="F3">3</figr> inset).</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Relative frequency of amino acid substitutions in single copy and duplicated genes</p>
               </caption>
               <text>
                  <p><b>Relative frequency of amino acid substitutions in single copy and duplicated genes</b>. K<sub>a</sub> = number of amino acid substitutions per amino acid site; K<sub>s</sub> = cumulative number of synonymous substitutions per 4-fold degenerate site <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, i.e. cumulative length of branches leading to either single copy or duplicated genes. K<sub>a</sub> / K<sub>s</sub> for single copy and duplicated branches was calculated for each gene family separately and averaged by molecular function without weighting. Standard errors shown reflect variance among gene families. Red bars: single copy genes in gene families without duplications; orange bars: single copy genes in gene families with duplication; green bars: duplicated genes. Inset: All substitutions and unambiguous substitutions for all gene families combined.</p>
               </text>
               <graphic file="1471-2164-11-S4-S10-3"/>
            </fig>
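            <p>The sign test quoted above (paralogs faster in 988 of 1701 families) can be checked with an exact binomial calculation, sketched here in Python using only the standard library. The sketch assumes a two-sided test against a 50:50 null; the text does not state which variant of the sign test was used.</p>
            <p>
```python
import math


def sign_test_two_sided(successes, n):
    """Exact two-sided sign test against a 50:50 null (binomial, p = 0.5)."""
    k = max(successes, n - successes)
    # Number of equally likely outcomes at least as extreme as observed,
    # counted in the upper tail with exact integer arithmetic.
    tail = sum(math.comb(n, i) for i in range(k, n + 1))
    # Double the tail for a two-sided test and cap the result at 1.
    return min(1.0, 2 * tail / 2 ** n)


# Counts reported in the text.
p_value = sign_test_two_sided(988, 1701)
fraction = 988 / 1701
```
            </p>
            <p>Under these assumptions the resulting P value falls well below the reported bound of 0.00001, even though the fraction of families in which paralogs evolve faster is only about 58%.</p>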
            <p>Paralogs also evolved by more radical substitutions. Across functional groups of proteins (with the exception of transporter proteins) duplicated portions of phylogenies accumulated amino acid substitutions with greater average absolute change in polarity (Fig. <figr fid="F4">4A</figr>), while single copy genes typically did not differ significantly from gene families without duplications. Likewise, both overall and in every single functional category, paralogs differed by amino acid pairs with lower Exchangeability <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> (Fig. <figr fid="F4">4B</figr>). Again, single copy genes in families with duplications were intermediate between genes with no duplications and paralogs overall and typically did not differ from genes with no duplications within each functional category.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Radicality of amino acid substitutions in single copy and duplicated genes</p>
               </caption>
               <text>
                  <p><b>Radicality of amino acid substitutions in single copy and duplicated genes.</b> A: mean absolute change of polarity between destination and source amino acids in gene families with different molecular function. B: mean exchangeability <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Colours as in Fig. <figr fid="F3">3</figr>. Insets: comparison of ambiguous and unambiguous substitutions.</p>
               </text>
               <graphic file="1471-2164-11-S4-S10-4"/>
            </fig>
            <p>As expected, substitution rates and radicality decreased with mean expression rate in the whole fly and increased with the coefficient of variation of expression rate across larval and adult tissues <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> (Fig. <figr fid="F5">5</figr>), corroborating previously observed patterns of stronger selective constraints in highly expressed genes and in housekeeping genes <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. However, both effects were much less pronounced in paralogs than in single-copy genes; neither regression over mean expression level was significant (Fig. <figr fid="F5">5 A, B</figr>) and, while the relative rate of substitutions increased with the CV of expression rates across tissues, difference in polarity showed no correlation in paralogs. To summarize this pattern, the rate and radicality of duplicated gene evolution appeared to be uniformly high, independent of gene expression rate and ubiquity. Data on rates and radicality of amino acid substitutions organized by gene family are available in Additional file <supplr sid="S3">3</supplr>.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Rates and radicality of amino acid substitution vs. expression level and ubiquity</p>
               </caption>
               <text>
                  <p><b>Rates and radicality of amino acid substitution vs. expression level and ubiquity.</b> Relationship between relative substitution rate (K<sub>a</sub>/K<sub>s</sub>; A, C) and mean absolute change of polarity (|&#916;P|; B, D) and log mean gene expression rate in whole fly (A, B) and coefficient of variation of expression rate across larval and adult tissues (C, D). Expression data from <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Solid lines: regressions significant at P&lt;0.0001; dashed lines: regressions without significant terms (shown for comparison). Second-degree polynomial regression lines are shown when the quadratic term is significant; otherwise a linear regression is used.</p>
               </text>
               <graphic file="1471-2164-11-S4-S10-5"/>
            </fig>
            <suppl id="S3">
               <title>
                  <p>Additional file 3</p>
               </title>
               <caption>
                  <p>Data by gene family</p>
               </caption>
               <text>
                  <p>Excel spreadsheet with data on rates and radicalities of substitutions by gene family.</p>
               </text>
               <file name="1471-2164-11-S4-S10-S3.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
         <sec>
            <st>
               <p>Clade asymmetries in duplicated genes</p>
            </st>
            <p>Table <tblr tid="T1">1</tblr> summarizes the extent of asymmetry among clades resulting from duplication events. Substitution counts show a significant clade asymmetry in a large number of duplications. Asymmetry in radicality measures (|&#916;Polarity| and Exchangeability) survives multiple-test correction in a smaller number of tests. The total number of tests differs because asymmetry in substitution counts was tested for all duplications, whereas the radicality measures were tested only for duplications in which both clades had at least 2 unambiguous substitutions. Excluding terminal branches of the phylogeny, which are potentially contaminated by substitutions in pseudogenes and therefore biased towards clade asymmetry, does not change the result.</p>
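            <p>To make the two multiple-test adjustments concrete, the following Python sketch counts how many tests in a set of P values withstand a Bonferroni adjustment and a false discovery rate threshold. The false discovery rate step uses the Benjamini-Hochberg step-up procedure, which is an assumption, since the text does not name the procedure used; the P values are toy numbers, not results.</p>
            <p>
```python
def bonferroni_count(p_values, alpha=0.01):
    """Number of tests with p * m not exceeding alpha (m = number of tests)."""
    m = len(p_values)
    return sum(1 for p in p_values if alpha >= p * m)


def benjamini_hochberg_count(p_values, alpha=0.01):
    """Number of tests rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    passed = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Keep the largest rank whose p-value is within rank * alpha / m;
        # all tests up to that rank are rejected.
        if rank * alpha / m >= p:
            passed = rank
    return passed


# Toy P values for five hypothetical duplications.
toy_p = [0.0001, 0.003, 0.005, 0.2, 0.9]
n_bonferroni = bonferroni_count(toy_p)
n_fdr = benjamini_hochberg_count(toy_p)
```
            </p>
            <p>In this toy example one test withstands the Bonferroni adjustment and three pass the false discovery rate threshold, reflecting the general property that the Bonferroni adjustment is the more conservative of the two.</p>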
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Summary of clade asymmetries: the number of tests withstanding false discovery rate and Bonferroni adjustments for multiple tests. Tests: number of substitutions &#8211; &#967;<sup>2</sup> test for heterogeneity; |&#916;Polarity| and Exchangeability &#8211; t-test.</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p/>
                     </c>
                     <c ca="center" cspan="3">
                        <p>All duplications</p>
                     </c>
                     <c ca="center" cspan="3">
                        <p>Terminal duplications excluded</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Asymmetry parameter</p>
                     </c>
                     <c ca="center">
                        <p>Number of duplications tested</p>
                     </c>
                     <c ca="center">
                        <p>FDR = 0.01</p>
                     </c>
                     <c ca="center">
                        <p>Bonferroni adjusted P = 0.01</p>
                     </c>
                     <c ca="center">
                        <p>Number of duplications tested</p>
                     </c>
                     <c ca="center">
                        <p>FDR = 0.01</p>
                     </c>
                     <c ca="center">
                        <p>Bonferroni adjusted P = 0.01</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total substitutions</p>
                     </c>
                     <c ca="center">
                        <p>4646</p>
                     </c>
                     <c ca="center">
                        <p>908</p>
                     </c>
                     <c ca="center">
                        <p>805</p>
                     </c>
                     <c ca="center">
                        <p>3118</p>
                     </c>
                     <c ca="center">
                        <p>804</p>
                     </c>
                     <c ca="center">
                        <p>741</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Unambiguous substitutions</p>
                     </c>
                     <c ca="center">
                        <p>4646</p>
                     </c>
                     <c ca="center">
                        <p>721</p>
                     </c>
                     <c ca="center">
                        <p>621</p>
                     </c>
                     <c ca="center">
                        <p>3118</p>
                     </c>
                     <c ca="center">
                        <p>613</p>
                     </c>
                     <c ca="center">
                        <p>543</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>|DPolarity|</p>
                     </c>
                     <c ca="center">
                        <p>2964</p>
                     </c>
                     <c ca="center">
                        <p>66</p>
                     </c>
                     <c ca="center">
                        <p>39</p>
                     </c>
                     <c ca="center">
                        <p>2351</p>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Exchangeability</p>
                     </c>
                     <c ca="center">
                        <p>2964</p>
                     </c>
                     <c ca="center">
                        <p>62</p>
                     </c>
                     <c ca="center">
                        <p>38</p>
                     </c>
                     <c ca="center">
                        <p>2351</p>
                     </c>
                     <c ca="center">
                        <p>58</p>
                     </c>
                     <c ca="center">
                        <p>30</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
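            <p>The two multiple test adjustments named in the caption of Table 1 can be illustrated with a short sketch. This is not the authors' code: the text does not state which false discovery rate procedure was used, so the Benjamini-Hochberg step-up rule is assumed here, and the function names and P-values are hypothetical.</p>

```python
def bonferroni_count(p_values, alpha=0.01):
    """Count tests whose P-value does not exceed alpha / m (Bonferroni)."""
    m = len(p_values)
    if m == 0:
        return 0
    threshold = alpha / m
    return sum(1 for p in p_values if threshold >= p)


def benjamini_hochberg_count(p_values, q=0.01):
    """Count tests rejected by the Benjamini-Hochberg step-up rule at FDR q.

    Assumption: the source only says "false discovery rate"; BH is one
    common procedure and is used here purely for illustration.
    """
    m = len(p_values)
    rejected = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Find the largest rank k with p_(k) not exceeding k * q / m;
        # every test ranked at or below k is then rejected.
        if rank * q / m >= p:
            rejected = rank
    return rejected


# Hypothetical P-values, one per duplication tested.
example_p = [0.0001, 0.0004, 0.0019, 0.0095, 0.02, 0.03, 0.2, 0.5, 0.7, 0.9]
print("Bonferroni:", bonferroni_count(example_p, alpha=0.01))
print("FDR (BH):", benjamini_hochberg_count(example_p, q=0.01))
```

            <p>With these made-up values the FDR rule retains three tests and the Bonferroni adjustment retains two, which mirrors the ordering of the two columns in the table: the Bonferroni count can never exceed the FDR count at the same nominal level.</p>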
            <p>Clade asymmetries by molecular function categories are presented in Fig. <figr fid="F6">6</figr>. Protein- and RNA-binding proteins were characterized by the highest asymmetry of substitution rates, while nucleotide-binding proteins and transcription factors had the lowest (although only the enzymes vs. protein-binding proteins comparison is significant by the Tukey-Kramer test). Nucleotide-binding proteins, on the other hand, demonstrated the highest asymmetry in both absolute polarity change and exchangeability of substitutions in the two clades, along with transcription factors, enzymes and structural proteins. The lowest radicality clade asymmetry was seen in RNA-binding and transporter proteins. Data on rates and radicality of amino acid substitutions organized by duplications are available in Additional file <supplr sid="S4">4</supplr>.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Clade asymmetries in families with duplications</p>
               </caption>
               <text>
                  <p><b>Clade asymmetries in families with duplications.</b> Clade asymmetry (A) in relative substitution rate (top), absolute change in polarity (middle) and exchangeability (bottom) by molecular function. Molecular function category means were calculated by unweighted averaging over families. One-way ANOVA, respectively: F = 3.91, P &lt; 0.00001; F = 7.63, P &lt; 0.00001; F = 2.99, P &lt; 0.001. Different letters signify categories different by Tukey-Kramer test, P = 0.05.</p>
               </text>
               <graphic file="1471-2164-11-S4-S10-6"/>
            </fig>
            <suppl id="S4">
               <title>
                  <p>Additional file 4</p>
               </title>
               <caption>
                  <p>Data by duplications</p>
               </caption>
               <text>
                  <p>Excel spreadsheet with data on rates and radicalities of substitutions by duplication, with separate columns for each of the two clades resulting from each duplication event.</p>
               </text>
               <file name="1471-2164-11-S4-S10-S4.xls">
                  <p>Click here for file</p>
               </file>
            </suppl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Several caveats in the data and analysis require attention. Firstly, the alignments we used may contain pairs of paralogs in which one of the copies is undergoing pseudogenization and is no longer expressed, but has not yet acquired a frameshift that would allow it to be recognized as a pseudogene. Indeed, there is a significant excess of nonsense mutations (per missense) in the terminal branches of the phylogeny (data not presented), indicating the presence of pseudogenes in the alignments. Pairs of paralogs in which one gene copy is undergoing pseudogenization will demonstrate clade asymmetry, mimicking the signature of neofunctionalization. However, such paralogs are almost certainly present only in the most terminal branches of the Drosophila phylogeny, which spans over 70 million years, because the half-life of duplications in which one of the copies undergoes pseudogenization is 2-4 million years (12; 26). Terminal branches include a minority of duplications in our database, and excluding such branches from the analysis does not alter the results (Table <tblr tid="T1">1</tblr>). This indicates that the observed clade asymmetry is not an artefact of pseudogenes. A direct comparison of clade asymmetries in terminal vs. non-terminal duplications is not possible for two reasons. Firstly, there are far fewer substitutions in the terminal branches, so there is an intrinsic difference in statistical power. Secondly, clade asymmetry analysis is based on unambiguous substitutions, and the frequency of unambiguous substitutions increases with the depth of the phylogeny, possibly biasing such a comparison.</p>
         <p>On the other hand, some true functional paralogs may be missing from the alignments, particularly those resulting from ancient duplications, due to homology below the threshold used by the reciprocal BLAST algorithm (see Methods). This creates a bias towards less divergent paralogs, reducing our ability to detect elevated rates of evolution in duplicated genes. The relative magnitude of these opposing biases remains unknown.</p>
         <p>Further, results presented in Table <tblr tid="T1">1</tblr> do not necessarily indicate that clade asymmetries are more likely to manifest themselves in substitution rates than in substitution radicality. The number of tests surviving multiple test correction probably reflects differences in statistical power rather than a true biological phenomenon.</p>
         <p>Systematic loss/gain asymmetry in amino acid composition in 12 <it>Drosophila</it> genomes corroborates patterns previously observed in a variety of taxonomically diverse triplets of genomes <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. This pattern does not become less pronounced as more and more distant genomes are included into consideration, indicating that it is not caused by the effect of polymorphisms reflecting mutation-selection balance influenced by mutational asymmetries <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B9">9</abbr></abbrgrp>.</p>
         <p>We also demonstrate that this net loss/gain asymmetry is strongly correlated with source and destination amino acid polarities: substitutions of polar amino acids by non-polar ones have a higher net rate than the reciprocal substitutions. In the past we have demonstrated a similar polarity-related asymmetry in selection coefficients against amino acid substitutions in human proteins <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>; however this asymmetry was largely limited to strong selection (i.e., selection against clinically important phenotypes) and was not seen in evolutionary substitution rates.</p>
         <p>One may hypothesise that replacing polar amino acids by any other amino acid is less disruptive for protein function because polar amino acids have a lower tendency to be located internally in the tertiary protein structure <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. If so, we would expect the decrease of polarity due to amino acid substitutions to be the lowest in membrane proteins, in which polar amino acids in within-membrane domains tend to be internally located. Indeed, the decrease of polarity due to substitutions is the weakest in receptor and transporter proteins, many of which have membrane-embedded hydrophobic regions (Fig. <figr fid="F2">2 B</figr>), and in proteins with membrane localization (Fig. <figr fid="F2">2 C</figr>).</p>
         <p>A question remains: how could asymmetry in amino acid gains and losses have systematically removed polar amino acids more often than non-polar ones (Fig. <figr fid="F2">2A</figr>) over 70 million years of drosophilid evolution (and, in fact, over a much longer period of evolution of proteins across a much broader taxonomic spectrum <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>)? There is no evidence that the relationship shown in Fig. <figr fid="F2">2A</figr> tends to weaken in the most recent branches of the phylogeny (data not reported), which would have indicated an approach to an equilibrium. Rather, the frequencies of amino acids in proteins appear to be far from equilibrium, and we observe a constant turnover of polar amino acids due to the more relaxed selective constraint acting on externally located amino acids. One may further speculate that such systematic loss of surface polar amino acids would gradually change protein folding as external sites become occupied by more hydrophobic amino acid residues. This process may be a potentially important mechanism of acquiring new functions by duplicated genes.</p>
         <p>We have demonstrated that, in a genome-wide assessment, duplicated genes evolve both faster (higher K<sub>a</sub>/K<sub>s</sub>) and through more radical amino acid substitutions (higher |DPolarity|, lower exchangeability) than single copy genes (Figs <figr fid="F3">3</figr> and <figr fid="F4">4</figr>). Likewise, single copy genes in families with extant duplications tend to evolve faster and more radically than single copy genes in families without extant duplications, indicating that duplications are more likely to be retained in gene families with weaker selective constraints.</p>
         <p>As with the signed polarity change, the absolute change of polarity is not significantly different between duplicated and single copy genes among genes coding for transporter proteins, corroborating the hypothesis of the importance of relaxed selective constraint on surface sites of water-soluble proteins (Fig. <figr fid="F4">4 A</figr>). (This difference is, however, significant for receptor proteins.) The exchangeability index, on the other hand, is significantly lower in duplicated transporter proteins, suggesting that paralogs in these gene families do evolve through more radical substitutions, just without a systematic net loss of polar residues.</p>
         <p>Data on the asymmetry of clades resulting from duplications support the hypothesis of widespread neofunctionalization accompanying retention of duplicated genes: over 1/3 of all duplications show a significant asymmetry in amino acid substitution rates with false discovery rate 0.05 and almost 1/5 of all duplications show asymmetry that withstands Bonferroni correction (Table <tblr tid="T1">1</tblr>). Far fewer duplications show a significant asymmetry in radicality of substitutions, although about 6% have a significant asymmetry in absolute polarity change (with false discovery rate 0.05). Gene families of different functionality differ from each other in the degree of clade asymmetry, with a hint of a negative correlation between asymmetry in rates (Fig. <figr fid="F6">6</figr>, top) and asymmetry in radicality (Fig. <figr fid="F6">6</figr>, middle and bottom). No molecular function category stands out in terms of its tendency to display signatures of neofunctionalization, although RNA-binding proteins have the lowest (non-significant) difference in rates and radicality of substitutions between duplicated and single copy genes (Figs <figr fid="F3">3</figr> and <figr fid="F4">4</figr>) and the lowest clade asymmetry of substitution radicality in paralogs (Fig. <figr fid="F6">6</figr>), indicating that, perhaps, neofunctionalization is less common in these proteins. Interestingly, transcription factors appear to show a low neofunctionalization signal in terms of substitution rates (no difference between duplicated and single-copy genes, Fig. <figr fid="F3">3</figr>; low asymmetry between paralogs, Fig. <figr fid="F6">6</figr>, top), but a strong neofunctionalization signal in terms of substitution radicality (Fig. <figr fid="F4">4</figr>; Fig. <figr fid="F6">6</figr>, middle and bottom). One may hypothesize that positive selection for a novel functionality can operate either by increasing the rate of substitutions, or by favouring more radical changes without an increase in rates.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>We have designed a tool that allows a detailed phylogenetic analysis of amino acid substitutions in a large number of multiple alignments with or without duplicated genes present. The algorithm is able to polarize and establish the phylogenetic position of all substitutions for which this is possible (unambiguous substitutions) and to list all possible alternatives for the remaining, ambiguous substitutions. It results in a database that can be used to answer questions about patterns of amino acid substitutions genome-wide or in particular categories of genes, such as those defined by molecular function or duplication status.</p>
         <p>The analysis of such a database of substitutions in 12 <it>Drosophila</it> genomes confirmed previously observed non-equilibrium patterns of net losses and gains of individual amino acids, demonstrated that these patterns do not weaken with the depth of the phylogeny and revealed a strong correlation between the polarity of an amino acid and its propensity to display a net loss. We hypothesize that this effect can be explained by relaxed selective constraints on externally located amino acid sites occupied by polar residues. Evolution of duplicated genes is characterized by both a higher relative rate of substitution and a more radical nature of these substitutions, as compared to single copy genes. The rate and radicality in paralogs display a weaker relation with mean expression rate and variance of expression rates across tissues than in single copy genes. This pattern, along with the strong asymmetry between clades resulting from duplication events, indicates widespread neofunctionalization of retained duplications.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Algorithm, data provenance and phylogenetic analysis</p>
            </st>
            <p>A new phylogenetic analysis tool, AcidMiner <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, is used to convert raw data in the form of protein alignments and Newick protein and species trees into a relational database of amino acid substitutions that is searchable by standard SQL queries and contains a number of preset queries. Additionally, it allows further derivative data to be produced for tasks not easily expressible in SQL. Code for such purposes can be written either in Java or as stored procedures in the DBMS proprietary language, which in some cases results in faster processing. AcidMiner Java code, custom DBMS procedures and most of the complex SQL queries used in this study are also available <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.</p>
            <p>Protein alignments and corresponding phylogenies were acquired from the Dfam database at Indiana University <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B27">27</abbr></abbrgrp>. These alignments have been obtained by means of a modified reciprocal BLAST method <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B19">19</abbr></abbrgrp>. Briefly (see <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> for details), the results of an all-by-all comparison between the 12 genomes using BLASTP are filtered to retain as homologs all hits with E-values within two orders of magnitude of the highest hit. Gene families (clusters of homologs) are then determined by finding the maximally connected clusters that are disjoint from one another while discarding nonreciprocal relationships <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
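            <p>The family-building steps summarized in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions, not the Dfam pipeline: "the highest hit" is read as the hit with the lowest E-value for each query, "maximally connected clusters" is read as connected components of the reciprocal-hit graph, and all gene names, E-values and function names are hypothetical.</p>

```python
def retained_hits(hits, orders=2):
    """For each query keep subjects whose E-value lies within `orders`
    orders of magnitude of that query's best (lowest) E-value.

    hits: dict mapping query gene to a dict of subject gene and E-value.
    """
    kept = {}
    for query, subjects in hits.items():
        if not subjects:
            kept[query] = set()
            continue
        limit = min(subjects.values()) * 10 ** orders
        kept[query] = {s for s, e in subjects.items() if limit >= e}
    return kept


def gene_families(hits, orders=2):
    """Cluster genes into families using reciprocal retained hits only."""
    kept = retained_hits(hits, orders)
    graph = {gene: set() for gene in kept}
    for a, subjects in kept.items():
        for b in subjects:
            # Discard nonreciprocal relationships (and self-hits).
            if a != b and a in kept.get(b, set()):
                graph[a].add(b)
                graph[b].add(a)
    families, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.add(node)
            stack.extend(graph[node] - seen)
        families.append(sorted(members))
    return families


# Hypothetical all-by-all BLASTP results (query, subject, E-value).
example_hits = {
    "melA": {"simA": 1e-50, "simB": 1e-49, "yakC": 1e-10},
    "simA": {"melA": 1e-52, "simB": 1e-51},
    "simB": {"melA": 1e-49, "simA": 1e-50},
    "yakC": {"melA": 1e-10},
}
print(gene_families(example_hits))
```

            <p>In this toy input the weak hit between melA and yakC is retained only from the yakC side, so it is nonreciprocal and yakC ends up in a family of its own.</p>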
            <p>NOTUNG phylogenies reconciling topological incongruence between species trees and protein trees <abbrgrp><abbr bid="B28">28</abbr></abbrgrp> were used to map duplications and substitutions. We considered 11,258 gene families (with at least 6 species represented), which contained 8,766,256 amino acid sites. Areas of alignments with more than one indel in a row in one or more species were excluded from the analysis. Of the amino acid sites retained for the analysis, 2,131,864 sites had at least one substitution in at least one clade. These sites contained a total of 3,697,627 substitutions. A substitution was called unambiguous if it could be unequivocally polarized and placed on the phylogeny by the genotype of the outgroup clade; there were 2,004,536 such substitutions. Substitutions without a single most parsimonious placement were called ambiguous; such substitutions were included in the calculation of rates, but excluded from the analysis of radicality of substitutions. Substitution data arranged by amino acids, by gene families and by duplications are available in supplemental materials or by request.</p>
            <p>Paralogs were identified as homologs present in the same genome and substitutions were considered to have been acquired by duplicated genes if their most parsimonious placement on the phylogeny is more terminal than the placement of the duplication event. Conversely, substitutions occurring in branches basal to the most ancient surviving duplication in a clade were considered to have occurred in a single-copy gene.</p>
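            <p>The rule for assigning substitutions to duplicated or single-copy genes can be illustrated with a small sketch. It assumes, hypothetically, that the reconciled tree is stored as a child-to-parent mapping, that duplication events are marked on internal nodes, and that each substitution is placed on the branch leading to a named node; none of these names or structures come from AcidMiner.</p>

```python
def occurred_in_duplicated_gene(branch_child, parent, duplication_nodes):
    """Return True if the branch leading to `branch_child` lies below
    (is more terminal than) at least one duplication node.

    parent: dict mapping each node to its parent (the root has no entry).
    duplication_nodes: set of internal nodes annotated as duplications.
    """
    node = parent.get(branch_child)
    while node is not None:
        if node in duplication_nodes:
            return True
        node = parent.get(node)
    return False


# Hypothetical reconciled tree: root -> S1 -> (D1 -> geneA, geneB), geneC
example_parent = {
    "S1": "root",
    "D1": "S1",
    "geneC": "S1",
    "geneA": "D1",
    "geneB": "D1",
}
example_duplications = {"D1"}

for branch in ["geneA", "geneB", "D1", "geneC"]:
    status = occurred_in_duplicated_gene(branch, example_parent, example_duplications)
    print(branch, "duplicated" if status else "single-copy")
```

            <p>Here substitutions on the branches leading to geneA and geneB follow the duplication D1 and count as acquired by duplicated genes, whereas those on the branch leading to D1 itself are basal to the duplication and count as single-copy.</p>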
         </sec>
         <sec>
            <st>
               <p>Fluxes, asymmetries, radicality and substitutions rates</p>
            </st>
            <p>Net relative gain (or loss) of amino acids through substitutions (flux) was characterized by the parameter D = (C-R)/(C+R), where C is the number of times each amino acid was created and R is the number of times the same amino acid was removed by substitutions <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. The parameter D was calculated separately for each amino acid pair, or for each amino acid as a marginal value. Change of amino acid polarity due to substitutions was calculated as the mean difference between source and destination amino acid polarities (polarity values taken from AAIndex, Ref. <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>). The absolute value of this difference, |DPolarity|, was used as a measure of radicality of each amino acid substitution; an alternative, inverse measure of radicality used was the Exchangeability index <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
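            <p>As a worked illustration of these definitions, the sketch below computes the marginal flux D for each amino acid and the mean signed and absolute polarity change from a list of polarized substitutions. The function names are hypothetical, the polarity numbers are placeholders rather than AAIndex values, and the sign convention (destination minus source, so that a loss of polarity is negative) is an assumption.</p>

```python
from collections import Counter


def flux_by_amino_acid(substitutions):
    """Marginal D = (C - R) / (C + R) for every amino acid observed.

    substitutions: iterable of (source, destination) one-letter codes.
    """
    created, removed = Counter(), Counter()
    for source, destination in substitutions:
        removed[source] += 1
        created[destination] += 1
    flux = {}
    for aa in set(created) | set(removed):
        c, r = created[aa], removed[aa]
        flux[aa] = (c - r) / (c + r)
    return flux


def polarity_changes(substitutions, polarity):
    """Return (mean signed change, mean absolute change) in polarity."""
    signed = [polarity[d] - polarity[s] for s, d in substitutions]
    if not signed:
        raise ValueError("no substitutions supplied")
    mean_signed = sum(signed) / len(signed)
    mean_absolute = sum(abs(x) for x in signed) / len(signed)
    return mean_signed, mean_absolute


# Hypothetical unambiguous substitutions and placeholder polarity values.
example_subs = [("S", "A"), ("S", "A"), ("A", "S"), ("S", "L")]
placeholder_polarity = {"S": 9.0, "A": 8.0, "L": 5.0}

print(flux_by_amino_acid(example_subs))
print(polarity_changes(example_subs, placeholder_polarity))
```

            <p>In this toy data set serine is removed three times and created once, giving D = -0.5 (a net loss), while leucine is only ever created, giving D = 1.</p>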
            <p>Each gene family was characterized by a K<sub>a</sub>/K<sub>s</sub> value, obtained in the following manner. K<sub>a</sub> was estimated as the ratio of the number of substitutions (in either the whole tree, or separately for duplicated and unduplicated portions of the tree) to the number of amino acid sites in the alignment. K<sub>s</sub> was calculated as the sum of branch lengths of the corresponding section of the tree expressed as the frequency of synonymous substitutions per 4-fold degenerate site <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p>
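            <p>The calculation just described amounts to a simple ratio, sketched below for one portion of a tree. The function name and all numbers are hypothetical and serve only to make the arithmetic concrete.</p>

```python
def ka_over_ks(n_substitutions, n_sites, branch_lengths):
    """Ka/Ks for one section of a gene family tree.

    Ka: amino acid substitutions per amino acid site in the alignment.
    Ks: sum of branch lengths of the same section, each expressed as
        synonymous substitutions per 4-fold degenerate site.
    """
    if n_sites == 0:
        raise ValueError("alignment has no sites")
    ks = sum(branch_lengths)
    if ks == 0:
        raise ValueError("total synonymous branch length is zero")
    ka = n_substitutions / n_sites
    return ka / ks


# Hypothetical family: 600 aligned sites, with the duplicated and the
# single-copy portions of the tree treated separately.
duplicated = ka_over_ks(30, 600, [0.125, 0.125, 0.25])
single_copy = ka_over_ks(24, 600, [0.5, 0.25, 0.25])
print(duplicated, single_copy)
```

            <p>With these invented inputs the duplicated portion has the higher ratio, which is the direction of the difference reported for duplicated versus single copy genes.</p>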
         </sec>
         <sec>
            <st>
               <p>Ontology and expression data and statistical analysis</p>
            </st>
            <p>Gene ontology and gene expression data were merged with amino acid substitution data by the FlyBase IDs of <it>D. melanogaster</it> genes <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>. Therefore, for all analyses involving molecular functions and gene expression level, gene families lacking a <it>D. melanogaster</it> gene were excluded. Conversely, families with duplicated <it>D. melanogaster</it> genes appeared in these types of analysis a number of times equal to the number of <it>D. melanogaster</it> paralogs they contained. Gene families were subdivided into the following molecular function categories using FlyBase ontology data <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>: structural proteins, enzymes, transcription factors, other DNA-binding proteins, RNA-binding proteins, ATP- and GTP-binding proteins, receptors and signal transduction proteins, transporters, proteins with other functions and proteins with unknown function. Gene expression data were obtained from the FlyAtlas database <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>LYY proposed the study methodology, accomplished data analysis and prepared the manuscript. MAB wrote software, generated the dataset and contributed to the manuscript preparation.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare that they have no competing interests.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We are grateful to M. Hahn for providing alignments and useful discussion and to A. Kondrashov, Y. Wolf and three anonymous reviewers for helpful suggestions on improving the analysis and the manuscript. Work was partially supported by NSF-0525447.</p>
            <p>This article has been published as part of <it>BMC Genomics</it> Volume 11 Supplement 4, 2010: Ninth International Conference on Bioinformatics (InCoB2010): Computational Biology. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2164/11?issue=S4</url>.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Rates of Conservative and Radical Nonsynonymous Nucleotide Substitutions in Mammalian Nuclear Genes</p>
            </title>
            <aug>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2000</pubdate>
            <volume>50</volume>
            <fpage>56</fpage>
            <lpage>68</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10654260</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Detecting excess radical replacements in phylogenetic trees</p>
            </title>
            <aug>
               <au>
                  <snm>Pupko</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sharan</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hasegawa</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shamir</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Graur</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>2003</pubdate>
            <volume>319</volume>
            <fpage>127</fpage>
            <lpage>135</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0378-1119(03)00802-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">14597178</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>A universal trend of amino acid gain and loss in protein evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Jordan</snm>
                  <fnm>IK</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>FA</fnm>
               </au>
               <au>
                  <snm>Adzhubei</snm>
                  <fnm>IA</fnm>
               </au>
               <au>
                  <snm>Wolf</snm>
                  <fnm>YI</fnm>
               </au>
               <au>
                  <snm>Koonin</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>AS</fnm>
               </au>
               <au>
                  <snm>Sunyaev</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2005</pubdate>
            <volume>433</volume>
            <fpage>633</fpage>
            <lpage>638</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03306</pubid>
                  <pubid idtype="pmpid" link="fulltext">15660107</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Drosophila 12 Genomes Consortium: Evolution of genes and genomes on the Drosophila phylogeny</p>
            </title>
            <source>Nature</source>
            <pubdate>2007</pubdate>
            <volume>450</volume>
            <fpage>203</fpage>
            <lpage>218</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature06341</pubid>
                  <pubid idtype="pmcid">2919768</pubid>
                  <pubid idtype="pmpid" link="fulltext">17994087</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Mutational trends and random processes in the evolution of informational macromolecules</p>
            </title>
            <aug>
               <au>
                  <snm>Zuckerkandl</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Derancourt</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Vogel</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1971</pubdate>
            <volume>59</volume>
            <fpage>473</fpage>
            <lpage>490</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0022-2836(71)90311-1</pubid>
                  <pubid idtype="pmpid" link="fulltext">5571595</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Evolution of proteomes: fundamental signatures and global trends in amino acid compositions</p>
            </title>
            <aug>
               <au>
                  <snm>Tekaia</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Yeramian</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>BMC Genomics</source>
            <pubdate>2006</pubdate>
            <volume>7</volume>
            <fpage>307</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2164-7-307</pubid>
                  <pubid idtype="pmcid">1764020</pubid>
                  <pubid idtype="pmpid" link="fulltext">17147802</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Causes of trends in amino acid gain and loss</p>
            </title>
            <aug>
               <au>
                  <snm>Hurst</snm>
                  <fnm>LD</fnm>
               </au>
               <au>
                  <snm>Feil</snm>
                  <fnm>EJ</fnm>
               </au>
               <au>
                  <snm>Rocha</snm>
                  <fnm>EPC</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2006</pubdate>
            <volume>442</volume>
            <fpage>E11</fpage>
            <lpage>E12</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature05137</pubid>
                  <pubid idtype="pmpid" link="fulltext">16929253</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Apparent trends of amino acid gain and loss in protein evolution due to nearly neutral variation</p>
            </title>
            <aug>
               <au>
                  <snm>McDonald</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2006</pubdate>
            <volume>23</volume>
            <fpage>240</fpage>
            <lpage>244</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msj026</pubid>
                  <pubid idtype="pmpid" link="fulltext">16195487</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>The universal trend of amino acid gain&#8211;loss is caused by CpG hypermutability</p>
            </title>
            <aug>
               <au>
                  <snm>Misawa</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kamatani</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kikuno</snm>
                  <fnm>RF</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2008</pubdate>
            <volume>67</volume>
            <fpage>334</fpage>
            <lpage>342</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-008-9141-1</pubid>
                  <pubid idtype="pmpid" link="fulltext">18810523</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Distribution of the strength of selection against amino acid replacements in human proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Yampolsky</snm>
                  <fnm>LY</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>FA</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>AS</fnm>
               </au>
            </aug>
            <source>Human Molecular Genetics</source>
            <pubdate>2005</pubdate>
            <volume>14</volume>
            <fpage>3191</fpage>
            <lpage>3201</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/hmg/ddi350</pubid>
                  <pubid idtype="pmpid" link="fulltext">16174645</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Evolution by gene duplication</p>
            </title>
            <aug>
               <au>
                  <snm>Ohno</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <publisher>Berlin (Germany): Springer-Verlag</publisher>
            <pubdate>1970</pubdate>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The evolutionary fate and consequences of duplicate genes</p>
            </title>
            <aug>
               <au>
                  <snm>Lynch</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Conery</snm>
                  <fnm>JS</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>290</volume>
            <fpage>1151</fpage>
            <lpage>1155</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.290.5494.1151</pubid>
                  <pubid idtype="pmpid" link="fulltext">11073452</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Distinguishing Among Evolutionary Models for the Maintenance of Gene Duplicates</p>
            </title>
            <aug>
               <au>
                  <snm>Hahn</snm>
                  <fnm>MW</fnm>
               </au>
            </aug>
            <source>J Hered</source>
            <pubdate>2009</pubdate>
            <volume>100</volume>
            <fpage>605</fpage>
            <lpage>617</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/jhered/esp047</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>The evolution of gene duplications: classifying and distinguishing between models</p>
            </title>
            <aug>
               <au>
                  <snm>Innan</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2010</pubdate>
            <volume>11</volume>
            <fpage>97</fpage>
            <lpage>108</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nrg2689</pubid>
                  <pubid idtype="pmpid" link="fulltext">20051986</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution</p>
            </title>
            <aug>
               <au>
                  <snm>He</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2005</pubdate>
            <volume>169</volume>
            <fpage>1157</fpage>
            <lpage>1164</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1534/genetics.104.037051</pubid>
                  <pubid idtype="pmcid">1449125</pubid>
                  <pubid idtype="pmpid" link="fulltext">15654095</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Consistent patterns of rate asymmetry and gene loss indicate widespread neofunctionalization of yeast genes after whole-genome duplication</p>
            </title>
            <aug>
               <au>
                  <snm>Byrne</snm>
                  <fnm>KP</fnm>
               </au>
               <au>
                  <snm>Wolfe</snm>
                  <fnm>KH</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2007</pubdate>
            <volume>175</volume>
            <fpage>1341</fpage>
            <lpage>1350</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1534/genetics.106.066951</pubid>
                  <pubid idtype="pmcid">1840088</pubid>
                  <pubid idtype="pmpid" link="fulltext">17194778</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Adaptive evolution of young gene duplicates in mammals</p>
            </title>
            <aug>
               <au>
                  <snm>Han</snm>
                  <fnm>MV</fnm>
               </au>
               <au>
                  <snm>Demuth</snm>
                  <fnm>JP</fnm>
               </au>
               <au>
                  <snm>McGrath</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Casola</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Hahn</snm>
                  <fnm>MW</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2009</pubdate>
            <volume>19</volume>
            <fpage>859</fpage>
            <lpage>867</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.085951.108</pubid>
                  <pubid idtype="pmcid">2675974</pubid>
                  <pubid idtype="pmpid" link="fulltext">19411603</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Preservation of duplicate genes by complementary, degenerative mutations</p>
            </title>
            <aug>
               <au>
                  <snm>Force</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lynch</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Pickett</snm>
                  <fnm>FB</fnm>
               </au>
               <au>
                  <snm>Amores</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yan</snm>
                  <fnm>Y-L</fnm>
               </au>
               <au>
                  <snm>Postlethwait</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>1999</pubdate>
            <volume>151</volume>
            <fpage>1531</fpage>
            <lpage>1545</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1460548</pubid>
                  <pubid idtype="pmpid" link="fulltext">10101175</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Gene Family Evolution across 12 Drosophila Genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Hahn</snm>
                  <fnm>MW</fnm>
               </au>
               <au>
                  <snm>Han</snm>
                  <fnm>MV</fnm>
               </au>
               <au>
                  <snm>Han</snm>
                  <fnm>SG</fnm>
               </au>
            </aug>
            <source>PLoS Genetics</source>
            <pubdate>2007</pubdate>
            <volume>3</volume>
            <fpage>2135</fpage>
            <lpage>2146</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1371/journal.pgen.0030197</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>AcidMiner</p>
            </title>
            <url>http://sourceforge.net/projects/acidminer</url>
         </bibl>
         <bibl id="B21">
            <title>
               <p>The exchangeability of amino acids in proteins</p>
            </title>
            <aug>
               <au>
                  <snm>Yampolsky</snm>
                  <fnm>LY</fnm>
               </au>
               <au>
                  <snm>Stoltzfus</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2005</pubdate>
            <volume>170</volume>
            <fpage>1459</fpage>
            <lpage>1472</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1534/genetics.104.039107</pubid>
                  <pubid idtype="pmcid">1449787</pubid>
                  <pubid idtype="pmpid" link="fulltext">15944362</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Using FlyAtlas to identify better Drosophila melanogaster models of human disease</p>
            </title>
            <aug>
               <au>
                  <snm>Chintapalli</snm>
                  <fnm>VR</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dow</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>2007</pubdate>
            <volume>39</volume>
            <fpage>715</fpage>
            <lpage>720</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/ng2049</pubid>
                  <pubid idtype="pmpid" link="fulltext">17534367</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Highly expressed genes in yeast evolve slowly</p>
            </title>
            <aug>
               <au>
                  <snm>P&#225;l</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Papp</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hurst</snm>
                  <fnm>LD</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2001</pubdate>
            <volume>158</volume>
            <fpage>927</fpage>
            <lpage>931</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1461684</pubid>
                  <pubid idtype="pmpid" link="fulltext">11430355</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Why highly expressed proteins evolve slowly</p>
            </title>
            <aug>
               <au>
                  <snm>Drummond</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Bloom</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Adami</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wilke</snm>
                  <fnm>CO</fnm>
               </au>
               <au>
                  <snm>Arnold</snm>
                  <fnm>FH</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>2005</pubdate>
            <volume>102</volume>
            <fpage>14338</fpage>
            <lpage>14343</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1073/pnas.0504070102</pubid>
                  <pubid idtype="pmcid">1242296</pubid>
                  <pubid idtype="pmpid" link="fulltext">16176987</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Drummond</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Wilke</snm>
                  <fnm>CO</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2008</pubdate>
            <volume>134</volume>
            <fpage>341</fpage>
            <lpage>352</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.cell.2008.05.042</pubid>
                  <pubid idtype="pmcid">2696314</pubid>
                  <pubid idtype="pmpid" link="fulltext">18662548</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Formation and longevity of chimeric and duplicate genes in Drosophila melanogaster</p>
            </title>
            <aug>
               <au>
                  <snm>Rogers</snm>
                  <fnm>RL</fnm>
               </au>
               <au>
                  <snm>Bedford</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hartl</snm>
                  <fnm>DL</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2009</pubdate>
            <volume>181</volume>
            <fpage>313</fpage>
            <lpage>322</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1534/genetics.108.091538</pubid>
                  <pubid idtype="pmcid">2621179</pubid>
                  <pubid idtype="pmpid" link="fulltext">19015547</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Dfam</p>
            </title>
            <url>http://www.indiana.edu/~hahnlab/fly/DfamDB/drosophila_frb.html</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>A Hybrid Micro&#8211;Macroevolutionary Approach to Gene Tree Reconstruction</p>
            </title>
            <aug>
               <au>
                  <snm>Durand</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Halld&#243;rsson</snm>
                  <fnm>BV</fnm>
               </au>
               <au>
                  <snm>Vernot</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2006</pubdate>
            <volume>13</volume>
            <fpage>320</fpage>
            <lpage>335</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1089/cmb.2006.13.320</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>AAindex: amino acid index database</p>
            </title>
            <aug>
               <au>
                  <snm>Kawashima</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kanehisa</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>374</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/28.1.374</pubid>
                  <pubid idtype="pmcid">102411</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592278</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>FlyBase: enhancing Drosophila Gene Ontology annotations</p>
            </title>
            <aug>
               <au>
                  <snm>Tweedie</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ashburner</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Falls</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Leyland</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>McQuilton</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Marygold</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Millburn</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Osumi-Sutherland</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Schroeder</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Seal</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <cnm>The FlyBase Consortium</cnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2009</pubdate>
            <volume>37</volume>
            <fpage>D555</fpage>
            <lpage>D559</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/gkn788</pubid>
                  <pubid idtype="pmcid">2686450</pubid>
                  <pubid idtype="pmpid" link="fulltext">18948289</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
