<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2007-8-10-r225</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Phylogenetic simulation of promoter evolution: estimation and modeling of binding site turnover events and assessment of their impact on alignment tools</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Huang</snm>
               <fnm>Weichun</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>weichun.huang@bc.edu</email>
            </au>
            <au id="A2">
               <snm>Nevins</snm>
               <mi>R</mi>
               <fnm>Joseph</fnm>
               <insr iid="I1"/>
               <email>j.nevins@duke.edu</email>
            </au>
            <au ca="yes" id="A3">
               <snm>Ohler</snm>
               <fnm>Uwe</fnm>
               <insr iid="I1"/>
               <email>uwe.ohler@duke.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Institute for Genome Sciences and Policy, Duke University, Durham, NC 27708, USA</p>
            </ins>
            <ins id="I2">
               <p>Current address: Department of Biology, Boston College, Chestnut Hill, MA 02467, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>10</issue>
         <fpage>R225</fpage>
         <url>http://genomebiology.com/2007/8/10/R225</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17956628</pubid>
               <pubid idtype="doi">10.1186/gb-2007-8-10-r225</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>11</day>
               <month>4</month>
               <year>2007</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>20</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>24</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>24</day>
               <month>10</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Huang et al; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>Phylogenetic simulation of promoter evolution</p>
      </shorttitle>
      <shortabs>
         <p>Phylogenetic simulation of promoter evolution was used to analyze functional site turnover in regulatory sequences.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The phenomenon of functional site turnover has important implications for the study of regulatory region evolution, such as for promoter sequence alignments and transcription factor binding site (TFBS) identification. At present, it remains difficult to estimate TFBS turnover rates on real genomic sequences, as reliable mappings of functional sites across related species are often not available. As an alternative, we introduce a flexible new simulation system, Phylogenetic Simulation of Promoter Evolution (PSPE), designed to study functional site turnovers in regulatory sequences.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Using PSPE, we study replacement turnover rates of different individual TFBSs and simple modules of two sites under neutral evolutionary functional constraints. We find that TFBS replacement turnover can happen rapidly in promoters, and turnover rates vary significantly among different TFBSs and modules. We assess the influence of different constraints such as insertion/deletion rate and translocation distances. Complementing the simulations, we give simple but effective mathematical models for TFBS turnover rate prediction. As one important application of PSPE, we also present a first systematic evaluation of multiple sequence aligners regarding their capability of detecting TFBSs in promoters with site turnovers.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusion</p>
               </st>
               <p>PSPE allows researchers for the first time to investigate TFBS replacement turnovers in promoters systematically. The assessment of alignment tools points out the limitations of current approaches to identify TFBSs in non-coding sequences, where turnover events of functional sites may happen frequently, and where we are interested in assessing the similarity on the functional level. PSPE is freely available at the authors' website.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010008">Evolution</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Transcription regulation is a central component in the control of gene expression. Identification of functional <it>cis</it>-elements in promoter regions, a key to understanding gene regulation, has turned out to be a difficult task thus far. With the increasing availability of genome sequences, phylogenetic footprinting appeared to offer a very promising approach for identifying <it>cis</it>-elements <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. One essential assumption of phylogenetic footprinting is sequence conservation of functionally homologous genes. While such an assumption has been frequently found to be true for protein encoding sequences, there is no straightforward relationship of conservation between sequence and function for non-protein-coding regulatory sequences <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>.</p>
         <p>Compared to protein-coding regions, transcriptional promoter regions are subject to much less stringent selection and have higher nucleotide substitution rates, so that short transcription factor binding sites can easily turn over and be replaced by new ones arising from random mutations <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>. In many cases, the function of a regulatory sequence may, however, remain well conserved despite substantial sequence changes. One of the best-studied examples is the <it>even-skipped </it>enhancer system <it>S2E </it>of <it>Drosophila </it>species, which is highly conserved at the functional level (for example, maintaining a high similarity of expression pattern) but substantially diverged at the sequence level. Such sequence divergence includes large insertions and deletions between different sites, substitutions within sites, and gains and losses of sites. Several experimental studies suggested that compensatory mutations in the <it>even-skipped </it>enhancer region are the key to maintaining the functionality of the enhancer in evolution <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Estimated transcription factor binding site (TFBS) turnover rates are as high as 32-40% between human and rodent species <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, and turnover can also occur at transcription start sites (TSSs) of orthologous genes <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, albeit at a lower frequency. The phenomenon of TFBS turnover in regulatory regions suggests that phylogenetic footprinting methods based on a simple trace of the evolution of nucleotides can be highly effective in some cases, but are unlikely to identify all functionally important elements in regulatory genomic sequences, particularly in distantly related species. In this sense, a major improvement in TFBS identification will rely on a better understanding of evolutionary mechanisms regarding TFBS turnover events.</p>
         <p>While TFBS turnover has been known for a long time, it did not become a widely studied topic until recently, when the availability of related genome sequences made it amenable to systematic studies <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. With our currently limited knowledge about their structure and functional constraints, it is much more challenging to study the evolution of regulatory sequences than that of protein-coding sequences. Most published experimental studies have been conducted on a gene-by-gene and element-by-element basis, and computational studies on real data are severely limited by the available functional site mapping data. In the absence of real biological data, computational simulation may provide the best way to study TFBS evolution and turnover in a systematic way. A pioneering simulation of TFBS evolution estimated the expected time for new binding sites to arise from point mutations in promoter regions, where binding sites were represented by simple consensus sequences, and promoters were evolved under a neutral evolution model <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. A recent study examined the expected time for a new site to evolve and become fixed in a population by positive selection, where the authors considered effective population size and used position weight matrices (PWMs) to model TFBSs <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. The study found that the existence and location of pre-sites of functional sites could be major factors determining the expected time and location of newly evolved sites, while the relative position of sites had little impact on the final location of new functional sites.</p>
         <p>The above simulation studies explicitly assume that the functions encoded in regulatory regions evolve and change with the change in sequences. There are, however, many cases like the evolution of the <it>even-skipped </it>enhancer mentioned above, in which the regulatory sequence changes but functions (that is, the resulting expression patterns) appear unchanged. Frequently, such genes are involved in crucial developmental processes and, therefore, subject to stringent functional constraints <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. Our study thus investigates how a promoter evolves under the neutral scenario of functional maintenance in 'status quo', that is, with little or no change in the presence and strength of functional elements. Specifically, we address the expected replacement turnover rate (RTR) of TFBSs in promoter sequences in relation to evolutionary distance, insertion/deletion (InDel) rate, and restricted translocation distance of TFBSs. In accordance with previous work, our study suggests that replacement turnover of TFBSs can happen quickly in evolution and varies significantly among different TFBSs, but can be predicted using simple mathematical models.</p>
         <p>TFBS turnover phenomena in promoter sequences raise an important question about the ability of current multiple sequence alignment (MSA) tools to identify TFBSs in comparative genomics studies. Comparative evaluations of alignment tools have been conducted previously, but usually in conjunction with a newly developed tool <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp> and with only a few attempts at a comprehensive or systematic evaluation of different tools <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp>. However, little has been done regarding a performance evaluation of MSA tools for the task of aligning non-coding genomic sequences, largely due to a lack of good benchmark datasets of real sequences. As a result, tool performance assessment on genomic sequences was often based on indirect measures, such as an alignment of putative conserved non-coding regions, functional sites <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, or exon regions <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>.</p>
         <p>Simulation provides an effective way to circumvent the problem of lack of data. Simulation data generated <it>in silico </it>make it possible to evaluate tool performance on direct measures of alignment accuracy. For example, one careful benchmarking study was based on simulated <it>Drosophila </it>non-coding sequences, in which the authors compared the accuracy, sensitivity and specificity of several tools for pair-wise alignment <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. A recent simulation study by the same group examined the limitations of several MSA tools for TFBS identification and divergence distance estimation in aligning non-coding sequences, where TFBSs may be gained or lost in neutral evolution <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. However, these evaluation studies implicitly assumed a strong correlation between conservation at the functional and sequence level, and assessed tools on their ability to align homologous base pairs, that is, the alignment accuracy of bases evolved from the same site in the common ancestral sequences. In contrast to the situation for protein-coding sequences, however, many recent studies of non-coding sequence evolution suggest that frequently there is only a weak correlation between conservation at the functional level and sequence level among non-coding orthologous sequences <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B3">3</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B10">10</abbr></abbrgrp> (see Figure <figr fid="F1">1</figr> for an example of homology at the functional level and sequence level).</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Illustration of the difference between a sequence homology map and a functional homology map</p>
            </caption>
            <text>
               <p>Illustration of the difference between a sequence homology map and a functional homology map. <b>(a) </b>An ancestral promoter sequence with five functional sites. <b>(b) </b>Three unaligned descendent sequences derived from the ancestral promoter sequence. In the first descendent sequence, the old site <it>a </it>was functionally replaced by the new site <it>a' </it>because of evolutionary sequence changes. Similar replacement turnovers occurred at site <it>b </it>in the second and site <it>c </it>in the third descendent sequence, respectively. The three TFBS pairs <it>a</it>-<it>a'</it>, <it>b</it>-<it>b'</it>, and <it>c</it>-<it>c' </it>are homologous at the functional level but not at the sequence level. <b>(c) </b>Alignment of the three descendent sequences based on sequence base-pair homology. <b>(d) </b>Alignment of the three descendent sequences based on their homology at the functional level. The figure illustrates cases in which it is easier to identify functional elements <it>a</it>(<it>a'</it>), <it>b</it>(<it>b'</it>), and <it>c</it>(<it>c'</it>) and to predict gene functions from the homology map at the functional level rather than at the sequence level.</p>
            </text>
            <graphic file="gb-2007-8-10-r225-1"/>
         </fig>
         <p>Uncovering TFBSs in promoter sequences by cross-species comparison has so far been successful in some cases, but most approaches rely on alignments that are pre-computed on the whole genome. It is an open issue how appropriate these strategies are for non-coding alignments. Taking advantage of our Phylogenetic Simulation of Promoter Evolution (PSPE) tool, we assess the performance of commonly used MSA algorithms for aligning TFBSs in orthologous promoter sequences, where the function of a promoter (that is, an ensemble of binding sites under constraints) is maintained, but TFBS replacement turnovers are allowed to occur. In contrast to previous studies that assessed tools by their ability to align homologous bases, we focus on their ability to align functional sites that are homologous at the functional level but may not be homologous at the sequence level. To our knowledge, no such assessment of MSA tool performance from the viewpoint of functional homology, that is, alignment of functional elements in the presence of re-arrangements and turnovers, has been carried out. Our findings can thus serve as useful references for alignment tool selection in comparative genomics and provide insights for the improvement of non-coding multiple sequence alignment.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Simulation system</p>
            </st>
            <p>We designed a new computational system, PSPE, specifically to simulate the evolution of regulatory sequences, such as promoters. In contrast to other programs for sequence evolution simulation, which frequently use different evolutionary models for functional and non-functional sites, PSPE imposes a variety of functional constraints and validates at discrete intervals that these constraints are maintained. Such functional constraints include GC content, presence and strength of functional sites, location and copy number restrictions on functional sites, and space constraints between different functional sites. Depending on the specification of these constraints, turnover events are thus possible, as functional sites are not generally tied to a specific location in the sequence.</p>
            <p>PSPE reads a set of simulation parameters from a single configuration file (Figure <figr fid="F2">2</figr>). The root sequence for simulation can be provided by the user or generated by PSPE, according to user-specified length, a background Markov model, and functional constraints. PSPE can generate different random evolutionary trees by simulating evolution distances (branch length) with an exponential model, and the number of descendent sequences (number of branches from a parent node) by a Poisson process. While binary trees are commonly used in phylogenetic studies, PSPE can generate different tree structures with either a fixed or a random number of branches from the root or internal node. Given a phylogenetic tree and a sequence at its root, PSPE can use one of many commonly used DNA substitution models as well as different InDel models to simulate sequence evolution, subject to defined functional constraints, such as GC content, functional site locations and interactions of functional sites. By default, PSPE reports the alignment of the simulated sequences, as well as the sequences themselves and the locations of functional sites in each sequence. PSPE also has the capability to simulate replicates from the same tree and same root sequence, which is essential for quantitative evolution simulations.</p>
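            <p>The random tree generation described above can be illustrated with a minimal sketch. It is not PSPE's implementation: the function names, the nested-dictionary tree layout, the depth limit, and the parameter values are all assumptions made for illustration. Only the two modelling choices come from the text, namely exponentially distributed branch lengths and a Poisson-distributed number of branches per parent node.</p>

```python
import math
import random

def sample_poisson(rng, mean):
    """Poisson variate via Knuth's multiplication method (adequate for small means)."""
    threshold = math.exp(-mean)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def random_tree(rng, depth, mean_children, mean_branch_length):
    """Build a tree as nested dicts; 'length' is the branch leading to the node."""
    # Branch length (evolutionary distance) drawn from an exponential model.
    node = {"length": rng.expovariate(1.0 / mean_branch_length), "children": []}
    if depth > 0:
        # Number of descendent branches drawn from a Poisson distribution.
        for _ in range(sample_poisson(rng, mean_children)):
            node["children"].append(
                random_tree(rng, depth - 1, mean_children, mean_branch_length))
    return node

def branch_lengths(node):
    """Collect all branch lengths in the tree (pre-order)."""
    lengths = [node["length"]]
    for child in node["children"]:
        lengths.extend(branch_lengths(child))
    return lengths

def height(node):
    """Number of levels in the tree, counting the root as level 1."""
    return 1 + max((height(c) for c in node["children"]), default=0)

# Hypothetical settings: at most 3 levels below the root, 2 children on average,
# mean branch length of 0.1 substitutions per site.
tree = random_tree(random.Random(0), depth=3, mean_children=2.0, mean_branch_length=0.1)
```

            <p>In this sketch the depth limit simply guarantees termination; a fixed number of branches per node, which the text also mentions, would replace the Poisson draw with a constant.</p>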
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>An example of a PSPE configuration file</p>
               </caption>
               <text>
                  <p>An example of a PSPE configuration file. In the configuration file, parameter names and their corresponding values are always separated by '='. The comment lines start with '#'.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-2"/>
            </fig>
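            <p>The configuration format described in the caption of Figure <figr fid="F2">2</figr> (names and values separated by '=', comment lines starting with '#') is simple enough to read with a few lines of code. The sketch below is an assumption-laden illustration rather than PSPE's own parser, and the parameter names in the example are hypothetical placeholders, not actual PSPE keys.</p>

```python
def parse_config(text):
    """Parse 'name = value' lines into a dict, skipping blanks and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"expected 'name = value', got: {raw!r}")
        # Split on the first '=' only, so values may themselves contain '='.
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params

# Hypothetical parameter names for illustration; values follow Table 1.
example = """
# sequence settings
seq_length = 500
substitution_model = HKY85
"""
config = parse_config(example)
```

            <p>All values are kept as strings here; converting them to numbers would require knowing the type of each parameter, which this sketch does not assume.</p>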
         </sec>
         <sec>
            <st>
               <p>TFBS replacement turnover rate estimation</p>
            </st>
            <p>In this study, a functional TFBS in a descendent sequence corresponds to the original TFBS if its sequence can be traced back to the TFBS sequence in the ancestor; otherwise, the TFBS is regarded as a new one. A TFBS replacement event is therefore defined as an event in which an original TFBS is replaced by a new TFBS of the same type through any two or more events (destruction of the old site and creation of the new one), including point mutations, insertions and deletions. The RTR is defined as the probability that a functional TFBS in an ancestral sequence is replaced by a newly evolved one in the descendent sequence. We estimate TFBS RTR as the proportion of descendent sequences in which the TFBS is replaced at least once in the evolution process from an ancestral sequence. For example, assuming that we simulate <it>M </it>different descendent sequences from the same ancestral sequence, and we observe replacement turnover of the TFBS in <it>m </it>descendent sequences, then the estimate of RTR is <it>m</it>/<it>M</it>. In the following, we report the mean RTR averaged over different ancestral sequences, that is:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-10-r225-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>R</m:mi>
                           <m:mover accent="true">
                              <m:mi>T</m:mi>
                              <m:mo>^</m:mo>
                           </m:mover>
                           <m:mi>R</m:mi>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mn>1</m:mn>
                              <m:mi>K</m:mi>
                           </m:mfrac>
                           <m:mstyle displaystyle="true">
                              <m:munderover>
                                 <m:mo>&#8721;</m:mo>
                                 <m:mrow>
                                    <m:mi>i</m:mi>
                                    <m:mo>=</m:mo>
                                    <m:mn>1</m:mn>
                                 </m:mrow>
                                 <m:mi>K</m:mi>
                              </m:munderover>
                              <m:mrow>
                                 <m:mfrac>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>m</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                    <m:mrow>
                                       <m:msub>
                                          <m:mi>M</m:mi>
                                          <m:mi>i</m:mi>
                                       </m:msub>
                                    </m:mrow>
                                 </m:mfrac>
                              </m:mrow>
                           </m:mstyle>
                        </m:mrow>
                        <m:annotation encoding="application/x-tex">R\hat{T}R = \frac{1}{K}\sum_{i=1}^{K}\frac{m_i}{M_i}</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>K </it>is the number of different ancestral sequences, <it>M</it><sub><it>i </it></sub>is the number of all descendent sequences of the <it>i</it><sup>th </sup>ancestral sequence, and <it>m</it><sub><it>i </it></sub>is the number of those descendent sequences in which the TFBS of interest has undergone replacement turnover. We also report the median values, as the distributions of RTRs are not necessarily approximately normal.</p>
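            <p>As a small worked illustration of the estimator above, the following sketch computes the per-ancestor estimates <it>m</it><sub><it>i</it></sub>/<it>M</it><sub><it>i</it></sub> and then their mean and median. The function names and the example counts are hypothetical and are not taken from the simulations reported here.</p>

```python
from statistics import mean, median

def rtr_per_ancestor(m, M):
    """RTR estimate for one ancestor: replaced descendants over all descendants."""
    if not M > 0:
        raise ValueError("each ancestor needs at least one descendent sequence")
    return m / M

def summarize_rtr(counts):
    """counts holds one (m_i, M_i) pair per ancestral sequence; returns (mean, median)."""
    rates = [rtr_per_ancestor(m, M) for m, M in counts]
    return mean(rates), median(rates)

# Hypothetical counts for K = 4 ancestors with M_i = 1000 descendants each.
example = [(80, 1000), (120, 1000), (95, 1000), (310, 1000)]
mean_rtr, median_rtr = summarize_rtr(example)
```

            <p>With these made-up counts, a single ancestor with a high turnover rate pulls the mean above the median, which is the kind of skew that motivates reporting both values.</p>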
            <p>Using PSPE for sequence evolution simulation, we are able to study the replacement turnover rate of functionally conserved TFBSs in the evolution process of promoter sequences. In a complicated evolution process, many different events can occur at a TFBS, including point mutation, deletion, insertion, translocation, duplication and replacement. Our study here focuses only on TFBS replacement turnover in a simple 'status quo' scenario, assuming that all TFBSs in the sequences are essential to maintain proper gene expression levels and are thus functionally conserved in all descendent sequences. All functionally conserved TFBSs are, however, allowed to be translocated to neighboring regions or replaced by newly evolved sites within a given restricted space. As ancestral sequences, we use either real or simulated human promoter sequences.</p>
            <p>As the main transcription factor for this study, we used the well-known cell-cycle regulator E2F, and investigated two additional factors, Myc and NF&#954;B, to validate our model for estimating TFBS replacement rates. Both E2F and Myc are important transcription regulators of cell cycle progression, DNA replication, and apoptosis <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>. In some cases, E2F and Myc form a complex to regulate gene expression in a combinatorial fashion <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>. NF&#954;B is a family of ubiquitously expressed transcription factors involved in both the onset and the resolution of inflammation. NF&#954;B is also widely believed to govern the expression of many genes for stress response, intercellular communications, cellular proliferation and apoptosis <abbrgrp><abbr bid="B36">36</abbr><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr></abbrgrp>. To simulate ancestral sequences containing binding sites of these transcription factors, we used their position weight matrix models in the JASPAR database <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Binding sites in real human promoters known to be regulated by E2F were based on computational prediction (see Materials and methods). The simulated background promoter sequences were generated from a third-order Markov model trained on 25,088 annotated human promoter sequences. We used the HKY85 model <abbrgrp><abbr bid="B40">40</abbr></abbrgrp> to simulate nucleotide substitution, a geometric distribution for the size of sequence InDel events, and a gamma distribution and invariant rate (&#915;+I) for modeling heterogeneity of substitution rates. The HKY85 model does not assume equal base frequencies and can account for the difference between transitions and transversions with one parameter. Sequence evolution was then additionally subject to diverse functional constraints related to the specific characteristics of transcriptional regulatory regions (Table <tblr tid="T1">1</tblr>). While many different factors may have significant impact on the RTR of a TFBS, we mainly focused on three important and interesting factors: evolutionary divergence distance, InDel rate, and restricted translocation distance.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>PSPE parameters for simulating sequence evolution</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Original ancestral sequences</p>
                     </c>
                     <c ca="left">
                        <p>Human non-coding region</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Sequence length</p>
                     </c>
                     <c ca="left">
                        <p>500 bp</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Base frequencies</p>
                     </c>
                     <c ca="left">
                        <p>A = 0.215, C = 0.287, G = 0.285, T = 0.214</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Substitution model</p>
                     </c>
                     <c ca="left">
                        <p>HKY85</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Transition:transversion ratio</p>
                     </c>
                     <c ca="left">
                        <p>20:1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Point substitution:InDel ratio</p>
                     </c>
                     <c ca="left">
                        <p>10:1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>InDel model</p>
                     </c>
                     <c ca="left">
                        <p>Geometric distribution (<it>p </it>= 0.5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Heterogeneity of substitution rate</p>
                     </c>
                     <c ca="left">
                        <p>Gamma (1.0) + Iota (0.1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Range of GC content</p>
                     </c>
                     <c ca="left">
                        <p>(45%, 70%)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Evolution distance per step</p>
                     </c>
                     <c ca="left">
                        <p>0.05 substitution per site</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
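            <p>To make the substitution and InDel settings in Table <tblr tid="T1">1</tblr> more concrete, the sketch below builds an unnormalized HKY85 rate matrix from the listed base frequencies and samples InDel lengths from a geometric distribution with <it>p </it>= 0.5. It is an illustration under stated assumptions, not PSPE code: the value of kappa is a hypothetical placeholder (how the 20:1 transition:transversion ratio in the table maps onto the HKY85 parameter depends on PSPE's convention, which is not given here), and the geometric distribution is assumed to start at length 1.</p>

```python
import random

BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine-purine or pyrimidine-pyrimidine changes between distinct bases."""
    return (a in PURINES) == (b in PURINES)

def hky85_rate_matrix(freqs, kappa):
    """Unnormalized HKY85 rates: kappa * pi_j for transitions, pi_j for transversions."""
    q = {}
    for a in BASES:
        row = {b: (kappa if is_transition(a, b) else 1.0) * freqs[b]
               for b in BASES if b != a}
        row[a] = -sum(row.values())  # diagonal makes each row sum to zero
        q[a] = row
    return q

def sample_indel_length(rng, p=0.5):
    """Geometric length on 1, 2, 3, ...: stop with probability p at each step."""
    length = 1
    while rng.random() >= p:
        length += 1
    return length

# Base frequencies from Table 1; kappa = 2.0 is a hypothetical value.
freqs = {"A": 0.215, "C": 0.287, "G": 0.285, "T": 0.214}
q = hky85_rate_matrix(freqs, kappa=2.0)
indel_lengths = [sample_indel_length(random.Random(s)) for s in range(100)]
```

            <p>The matrix is left unnormalized; a simulator would typically rescale it so that branch lengths are expressed in expected substitutions per site, the unit used for the evolution distance in the table.</p>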
         </sec>
         <sec>
            <st>
               <p>Evolution of individual binding sites</p>
            </st>
            <p>We first studied the effect of divergence distance on the RTR of E2F sites (Figure <figr fid="F3">3</figr>). With increasing evolutionary divergence, we expect the RTR of a TFBS to increase, so the question is how fast and in what pattern the RTR increases along with the divergence distance. To answer this question, we estimated the RTR of a TFBS within a new descendent sequence, evolved from an ancestral sequence at 15 different divergence distances from 0.01 to 5.0, measured by the number of substitutions per site (see Materials and methods). At each of the different distances, we simulated 1,000 ancestral sequences and 1,000 descendent sequences from each ancestral sequence. In the simulation, E2F binding sites in ancestral and descendent sequences were subject to the same functional constraints (Figure <figr fid="F3">3</figr>), such that each simulated sequence had one and only one functional E2F site. As a consequence, E2F replacement could occur only when the loss of the original functional site was accompanied by the creation of a new functional site. This requirement is likely to lead to conservative estimates of turnover rates.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>TFBSs used in the evolution simulation</p>
               </caption>
               <text>
                  <p>TFBSs used in the evolution simulation. PWMs of these TFBSs are taken from JASPAR [39], and their accession numbers are listed in the second column. The height of an individual letter in the motif logo represents the information content of each position in a motif. The motif logo plots were created by WebLogo [82]. The functional constraints on individual TFBSs used in the simulation are given.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-3"/>
            </fig>
            <p>Initial results showed that the RTR of E2F significantly increased as the divergence distance increased (Figure <figr fid="F4">4a</figr>). The change of RTR was faster at short divergence distances (number of substitutions per site &lt;1) than at large divergence distances (number of substitutions per site >3). Based on the assumption that the number of E2F replacement events during any evolution time interval follows a Poisson distribution, we further analyzed the relationship between RTR and sequence divergence distance. Assuming that replacement turnover events occur at a Poisson rate <it>&#955;</it>, the probability of no replacement in a time interval <it>t </it>measured by number of substitutions per site is:</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Exponential relationship between E2F replacement turnover rate and sequence divergence distance</p>
               </caption>
               <text>
                  <p>Exponential relationship between E2F replacement turnover rate and sequence divergence distance. The x-axis is the evolution divergence measured by the number of substitutions per site, and the y-axis is the RTR of an E2F site in a descendent sequence. The points are values observed from simulation, and lines are values predicted by the exponential model given in equation 2. <b>(a) </b>E2F replacement turnover rates observed in an evolution simulation starting from simulated ancestral promoter sequences, where &#955; is 0.0832 and 0.0724 for fitting the mean and median, respectively. <b>(b) </b>E2F replacement turnover rates observed in an evolution simulation starting from real human promoter sequences, where &#955; is 0.0833 and 0.0755 for fitting the mean and median, respectively.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-4"/>
            </fig>
            <p>
               <display-formula id="M1">
                  <m:math name="gb-2007-8-10-r225-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>Pr</m:mi>
                           <m:mo>&#8289;</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>N</m:mi>
                           <m:mo>=</m:mo>
                           <m:mn>0</m:mn>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>=</m:mo>
                           <m:mfrac>
                              <m:mrow>
                                 <m:msup>
                                    <m:mi>e</m:mi>
                                    <m:mrow>
                                       <m:mo>&#8722;</m:mo>
                                       <m:mi>&#955;</m:mi>
                                       <m:mi>t</m:mi>
                                    </m:mrow>
                                 </m:msup>
                                 <m:msup>
                                    <m:mrow>
                                       <m:mo stretchy="false">(</m:mo>
                                       <m:mi>&#955;</m:mi>
                                       <m:mi>t</m:mi>
                                       <m:mo stretchy="false">)</m:mo>
                                    </m:mrow>
                                    <m:mn>0</m:mn>
                                 </m:msup>
                              </m:mrow>
                              <m:mrow>
                                 <m:mn>0</m:mn>
                                 <m:mo>!</m:mo>
                              </m:mrow>
                           </m:mfrac>
                           <m:mo>=</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:mi>&#955;</m:mi>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaaciGGqbGaaiOCaiaacIcacaWGobGaeyypa0JaaGimaiaacMcacqGH9aqpjuaGdaWcaaqaaiaadwgadaahaaqabeaacqGHsisliiGacqWF7oaBcaWG0baaaiaacIcacqWF7oaBcaWG0bGaaiykamaaCaaabeqaaiaaicdaaaaabaGaaGimaiaacgcaaaGaeyypa0JaamyzamaaCaaabeqaaiabgkHiTiab=T7aSjaadshaaaaaaa@482F@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>Therefore, the probability of at least one replacement turnover, or expected RTR, of a TFBS in a time interval <it>t </it>is:</p>
            <p>
               <display-formula id="M2"><it>RTR </it>= Pr(<it>N </it>&#8805; 1) = 1 - Pr(<it>N </it>= 0) = 1 - <it>e</it><sup>-&#955;<it>t</it></sup></display-formula>
            </p>
            <p>which corresponds to the cumulative distribution function of an exponential distribution with mean 1/<it>&#955;</it>.</p>
            <p>We fitted the observed E2F RTR data with this exponential model and estimated the model parameter <it>&#955;</it>. This simple exponential model fitted well with the RTR of E2F observed in our simulation (Figure <figr fid="F4">4a</figr>), where the model parameter &#955; was 0.0832 and 0.0724 for fitting the mean and median of the observed RTR, respectively. In other words, under this model the average probability of a replacement turnover event of an E2F binding site was about 8% at a divergence distance of one substitution per site (1 - <it>e</it><sup>-0.0832 </sup>&#8776; 0.080), suggesting the potential of substantial E2F turnover.</p>
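            <p>One simple way to estimate &#955; from a set of (distance, RTR) points is sketched below. The fitting procedure used for Table 2 is not specified in this section, so the linearized least-squares approach here is an illustrative assumption, and the data points are synthetic values generated from equation 2 itself, not the observed RTRs. The function names <it>expected_rtr</it> and <it>fit_lambda</it> are hypothetical.</p>

```python
import math


def expected_rtr(lam, t):
    """Equation 2: probability of at least one replacement turnover by distance t."""
    return 1.0 - math.exp(-lam * t)


def fit_lambda(distances, rtrs):
    """Least-squares slope through the origin of -ln(1 - RTR) against t.

    Linearizing equation 2 gives -ln(1 - RTR) = lambda * t, so lambda is
    estimated as sum(t * y) / sum(t * t). This is an illustrative choice,
    not necessarily the procedure used to obtain the values in Table 2.
    """
    ys = [-math.log(1.0 - r) for r in rtrs]
    num = sum(t * y for t, y in zip(distances, ys))
    den = sum(t * t for t in distances)
    return num / den


if __name__ == "__main__":
    # Synthetic points generated from the model itself (not the observed data).
    distances = [0.01, 0.1, 0.5, 1.0, 2.0, 3.0, 5.0]
    synthetic = [expected_rtr(0.0832, t) for t in distances]
    print(f"recovered lambda = {fit_lambda(distances, synthetic):.4f}")
    print(f"model RTR at t = 1.0: {expected_rtr(0.0832, 1.0):.4f}")
```

            <p>The linearization weights large distances more heavily than a direct nonlinear fit on the RTR scale would, so the two approaches can give slightly different estimates on noisy data.</p>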
            <p>To verify the RTR of E2F estimated on simulated promoter sequences, we repeated the experiment using, as ancestral sequences, real promoter sequences of human genes known from wet-lab experiments to be under E2F regulation <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp>. Among 127 E2F-regulated genes confirmed by ChIP-chip experiments <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, we were able to select 11 genes, each having one and only one E2F binding site in the 500 base pairs (bp) upstream of its transcription start site (see Materials and methods; see Additional data file 1 for details of the 11 genes). Most of the 11 genes are well known to be under regulation of E2F, especially <it>CDC6</it>, for which the location of the E2F binding site and functional activity of E2F have been characterized <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr><abbr bid="B45">45</abbr></abbrgrp>. Real promoter sequences would presumably give us a more realistic estimate of the RTR of E2F sites than simulated background sequences. One potential difference is that real promoter sequences may contain remnants or 'ghosts' of previously functional binding sites accumulated during evolution; these could become functional again through a small number of sequence changes and would thus result in higher turnover rates.</p>
            <p>Starting with the real promoter sequences, we ran essentially the same simulation as for the simulated promoter sequences above (Table <tblr tid="T1">1</tblr>), with the minor difference of using a different restricted location of E2F sites for each promoter, as the actual E2F locations were different. We kept, however, the same restricted distance for translocation of E2F sites as in the simulated promoter sequences (50 bp centered on the ancestral site). Since we had a limited number of real promoters, we simulated 10,000 descendent sequences from each ancestral promoter instead of 1,000 descendents as above. The RTRs of E2F sites estimated in this way were highly consistent with those using simulated ancestral sequences across different divergence distances. Accordingly, the exponential model given in equation 2 fitted well with the observed RTRs (Figure <figr fid="F4">4b</figr>), where the model parameter &#955; was 0.0833 and 0.0755 for fitting mean and median values, respectively. Both &#955; values were indeed slightly higher than the corresponding ones starting from simulated ancestral sequences (Table <tblr tid="T2">2</tblr>), but such small differences may easily be caused by other factors (for example, different locations of E2F sites).</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Estimated exponential rates associated with replacement turnovers of different TFBSs</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>TFBS</p>
                     </c>
                     <c ca="left">
                        <p>Promoter</p>
                     </c>
                     <c ca="left">
                        <p>&#955;<sub>mean</sub></p>
                     </c>
                     <c ca="left">
                        <p>&#955;<sub>median</sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>E2F</p>
                     </c>
                     <c ca="left">
                        <p>Simulated</p>
                     </c>
                     <c ca="left">
                        <p>0.0832</p>
                     </c>
                     <c ca="left">
                        <p>0.0724</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>E2F</p>
                     </c>
                     <c ca="left">
                        <p>Real</p>
                     </c>
                     <c ca="left">
                        <p>0.0833</p>
                     </c>
                     <c ca="left">
                        <p>0.0756</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MycMax</p>
                     </c>
                     <c ca="left">
                        <p>Simulated</p>
                     </c>
                     <c ca="left">
                        <p>0.2200</p>
                     </c>
                     <c ca="left">
                        <p>0.2293</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NF&#954;B</p>
                     </c>
                     <c ca="left">
                        <p>Simulated</p>
                     </c>
                     <c ca="left">
                        <p>0.1032</p>
                     </c>
                     <c ca="left">
                        <p>0.0918</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The probability of replacement turnover in evolution can be predicted by an exponential cumulative distribution function of divergence distance: <it>RTR </it>= 1 - <it>Exp </it>(-<it>&#955; </it>&#215; <it>d</it>). <it>&#955;</it><sub><it>mean </it></sub>and <it>&#955;</it><sub><it>median </it></sub>are estimated rates for mean and median values, respectively.</p>
               </tblfn>
            </tbl>
            <p>To validate the good fit of estimated turnover rates with a simple exponential model, we performed similar independent simulation studies for the additional TFBSs of Myc and NF&#954;B. Myc and NF&#954;B have palindromic binding sites with lengths of 11 and 10 bases, respectively. Myc sites have more conserved positions in the center region, consisting of mixed A/T and G/C nucleotides, whereas NF&#954;B has highly conserved positions at the two sides, consisting of mostly G/C nucleotides (Figure <figr fid="F3">3</figr>). Overall, Myc sites are the most degenerate among the three TFBSs. These differences in information content and sequence composition may lead to different RTRs. It was instructive to see how these factors affected the RTR, and whether the exponential model provided as good a fit for these other TFBSs. For each TFBS, we again simulated 1,000 ancestral promoter sequences, and for each ancestral promoter sequence, we simulated 1,000 descendent sequences at each of 15 divergence distances as above. We also used the same substitution and InDel models for the sequence evolution (Table <tblr tid="T1">1</tblr>). For the purpose of comparison, we imposed the same location and copy number constraints on both TFBSs as specified in Figure <figr fid="F3">3</figr>.</p>
            <p>Our results indicated that the RTR of Myc was consistently more than two times higher than that of NF&#954;B across all divergence distances (Figure <figr fid="F5">5</figr> and Table <tblr tid="T2">2</tblr>). For example, the observed RTRs for Myc and NF&#954;B were 0.219 and 0.083 at a divergence distance of 1.0, and 0.373 and 0.167 at a divergence distance of 2.0. These results suggested that differences in sequence composition had a significant impact on the RTR of a TFBS. In this case, the sequence composition of the NF&#954;B site, which is G/C rich at the two sides and A/T rich in the center, is more different from the background than that of Myc, for which A/T and G/C positions are almost uniformly distributed. Fitting the RTR data with our exponential model, we again observed a good fit for both TFBSs (see Table <tblr tid="T2">2</tblr> for the estimated model parameters <it>&#955;</it>).</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>RTRs of Myc and NF&#954;B in simulated promoter sequences</p>
               </caption>
               <text>
                  <p>RTRs of Myc and NF&#954;B in simulated promoter sequences. The x-axis denotes evolutionary divergence measured by the number of substitutions per site, and the y-axis denotes the RTR of a TFBS in a descendent sequence. The figure shows that the predicted values (lines) from the exponential model given in equation 2 fit well with observed RTR values (points) from an evolution simulation of <b>(a) </b>Myc and <b>(b) </b>NF&#954;B.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Turnover rates of regulatory modules: the Myc-E2F pair</p>
            </st>
            <p>Both Myc and E2F are important transcription factors in coordinating cell-cycle regulation, and partner together to regulate some common target genes <abbrgrp><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr></abbrgrp>. Because a restricted spacing between two TFBSs (for example, one required to enable an effective interaction) can limit the replacement turnover of each individual TFBS, we were interested in assessing how two sites can evolve together as a regulatory module. We studied the RTR of the Myc-E2F pair in a simple scenario in which there was one and only one Myc-E2F pair in a promoter sequence. For both E2F and Myc, we kept the location restriction relative to the TSS identical to that in the above studies on single sites, and studied their RTRs by simulations with and without a constraint of restricted space between them (Table <tblr tid="T3">3</tblr>). We performed simulations at the same divergence distances as for individual sites above.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Functional constraints placed on a Myc-E2F pair in promoter sequences</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>E2F location relative to TSS</p>
                     </c>
                     <c ca="center">
                        <p>[-50, -100]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Myc location relative to TSS</p>
                     </c>
                     <c ca="center">
                        <p>[-100, -150]</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Copy number of E2F</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Copy number of Myc</p>
                     </c>
                     <c ca="center">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>DNA strand of E2F site</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>DNA strand of Myc site</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Additional space constraint between Myc and E2F sites</p>
                     </c>
                     <c ca="center">
                        <p>[50, 60]</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>We calculated the observed RTRs of the Myc-E2F pair from the simulated sequences, and compared them to the expected ones assuming independent evolution of both sites. The expected RTR of both sites, defined as the probability of observing simultaneous replacement turnovers of both Myc and E2F, was estimated as the product of the individual RTRs from the simulation of single sites. The expected RTR of a single site, defined as the probability of observing a replacement turnover in only one site of the pair, was estimated from the above simulation of individual sites. Results showed that the expected RTRs were close to the observed ones in simulations without an additional space constraint between two TFBSs (Figure <figr fid="F6">6a,b</figr>), validating the independent evolution of both sites. For the simulation with additional space constraints between the pair, the observed RTRs of both sites showed significant deviation from the predicted ones assuming independent evolution, although the expected and observed RTRs of single sites were still close (Figure <figr fid="F6">6d</figr>). The significantly lower RTRs of both sites indicate that the space constraint between two sites made it less likely for them to turn over simultaneously (Figure <figr fid="F6">6c</figr>).</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>RTR of a Myc-E2F pair</p>
               </caption>
               <text>
                  <p>RTR of a Myc-E2F pair. We calculated the observed RTRs of Myc-E2F from simulations with and without an additional space constraint between two TFBSs, and compared the observed and expected RTRs assuming independence. The fit-1 lines are expected values based on the mean turnover rate of individual TFBSs, and the fit-2 lines are expected values based on median turnover rate of individual TFBSs. Under simulation without space constraints between the sites, the expected RTRs are close to the observed ones in both cases: <b>(a) </b>replacement turnover occurred at both Myc and E2F sites; <b>(b) </b>replacement turnover occurred at only one of two sites. Under simulation with space constraint, the expected RTRs are higher than the observed ones when <b>(c) </b>replacement turnover occurred at both Myc and E2F sites, but are close to observed ones when <b>(d) </b>replacement turnover occurred at only one of the two sites. The models based on estimates of turnover for individual sites given in equations 3 and 4 fit the observed RTR data well in those cases where no dependency between sites exists.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-6"/>
            </fig>
            <p>The small difference between the observed RTRs of the Myc-E2F pair and the expected ones assuming independence of individual TFBSs suggested that it was reasonable to describe the independent evolution of two sites within a simple predictive model. Based on this assumption, we thus described the RTR of a given TFBS pair by:</p>
            <p>
               <display-formula id="M3">
                  <m:math name="gb-2007-8-10-r225-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>R</m:mi>
                           <m:mi>T</m:mi>
                           <m:msub>
                              <m:mi>R</m:mi>
                              <m:mrow>
                                 <m:mi>p</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:mi>i</m:mi>
                                 <m:mi>r</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mn>1</m:mn>
                                 </m:msub>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#215;</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mn>2</m:mn>
                                 </m:msub>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaacaWGsbGaamivaiaadkfadaWgaaWcbaGaamiCaiaadggacaWGPbGaamOCaaqabaGccqGH9aqpcaGGOaGaaGymaiabgkHiTiaadwgadaahaaWcbeqaaiabgkHiTGGaciab=T7aSnaaBaaameaacaaIXaaabeaaliaadshaaaGccaGGPaGaey41aqRaaiikaiaaigdacqGHsislcaWGLbWaaWbaaSqabeaacqGHsislcqWF7oaBdaWgaaadbaGaaGOmaaqabaWccaWG0baaaOGaaiykaaaa@4B3E@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>where <it>&#955;</it><sub>1 </sub>and <it>&#955;</it><sub>2 </sub>are the expected Poisson rates of replacement turnover events for TFBS 1 (E2F) and TFBS 2 (Myc). Similarly, the probability of a replacement turnover of one and only one of two TFBSs can be modeled by:</p>
            <p>
               <display-formula id="M4">
                  <m:math name="gb-2007-8-10-r225-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mi>R</m:mi>
                           <m:mi>T</m:mi>
                           <m:msub>
                              <m:mi>R</m:mi>
                              <m:mrow>
                                 <m:mi>o</m:mi>
                                 <m:mi>n</m:mi>
                                 <m:mi>e</m:mi>
                                 <m:mo>_</m:mo>
                                 <m:mi>i</m:mi>
                                 <m:mi>n</m:mi>
                                 <m:mo>_</m:mo>
                                 <m:mi>p</m:mi>
                                 <m:mi>a</m:mi>
                                 <m:mi>i</m:mi>
                                 <m:mi>r</m:mi>
                              </m:mrow>
                           </m:msub>
                           <m:mo>=</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mn>1</m:mn>
                                 </m:msub>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#215;</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mn>2</m:mn>
                                 </m:msub>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                           <m:mo>+</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mn>1</m:mn>
                                 </m:msub>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                           <m:mo>&#215;</m:mo>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mn>1</m:mn>
                           <m:mo>&#8722;</m:mo>
                           <m:msup>
                              <m:mi>e</m:mi>
                              <m:mrow>
                                 <m:mo>&#8722;</m:mo>
                                 <m:msub>
                                    <m:mi>&#955;</m:mi>
                                    <m:mn>2</m:mn>
                                 </m:msub>
                                 <m:mi>t</m:mi>
                              </m:mrow>
                           </m:msup>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                        <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaacaWGsbGaamivaiaadkfadaWgaaWcbaGaam4Baiaad6gacaWGLbGaai4xaiaadMgacaWGUbGaai4xaiaadchacaWGHbGaamyAaiaadkhaaeqaaOGaeyypa0JaaiikaiaaigdacqGHsislcaWGLbWaaWbaaSqabeaacqGHsisliiGacqWF7oaBdaWgaaadbaGaaGymaaqabaWccaWG0baaaOGaaiykaiabgEna0kaadwgadaahaaWcbeqaaiabgkHiTiab=T7aSnaaBaaameaacaaIYaaabeaaliaadshaaaGccqGHRaWkcaWGLbWaaWbaaSqabeaacqGHsislcqWF7oaBdaWgaaadbaGaaGymaaqabaWccaWG0baaaOGaey41aqRaaiikaiaaigdacqGHsislcaWGLbWaaWbaaSqabeaacqGHsislcqWF7oaBdaWgaaadbaGaaGOmaaqabaWccaWG0baaaOGaaiykaaaa@6002@</m:annotation>
                     </m:semantics>
                  </m:math>
               </display-formula>
            </p>
            <p>We fitted the observed RTR data with both models 3 and 4. Both models fitted well with data as shown in Figure <figr fid="F6">6a,b,d</figr>, validating our assumption for the independent evolution of TFBSs. However, as the RTRs for the Myc-E2F pair in Figure <figr fid="F6">6c</figr> show, the simple models began to deviate from the simulations in more complex scenarios including dependencies between sites.</p>
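            <p>Equations 3 and 4 can be evaluated directly from the single-site rates, as in the minimal sketch below. The function names are hypothetical, and the rates are the &#955;<sub>mean</sub> values for E2F and Myc on simulated promoters from Table 2; the independence of the two sites is the assumption being illustrated, so the sketch does not apply to the spacing-constrained case in Figure 6c.</p>

```python
import math


def p_turnover(lam, t):
    """Single-site RTR from equation 2."""
    return 1.0 - math.exp(-lam * t)


def rtr_pair(lam1, lam2, t):
    """Equation 3: both sites turn over, assuming independence."""
    return p_turnover(lam1, t) * p_turnover(lam2, t)


def rtr_one_in_pair(lam1, lam2, t):
    """Equation 4: exactly one of the two sites turns over."""
    p1, p2 = p_turnover(lam1, t), p_turnover(lam2, t)
    return p1 * (1.0 - p2) + (1.0 - p1) * p2


if __name__ == "__main__":
    lam_e2f, lam_myc = 0.0832, 0.2200  # lambda_mean values from Table 2
    for t in (0.5, 1.0, 2.0, 5.0):
        both = rtr_pair(lam_e2f, lam_myc, t)
        one = rtr_one_in_pair(lam_e2f, lam_myc, t)
        print(f"t = {t}: both = {both:.4f}, exactly one = {one:.4f}")
```

            <p>A quick consistency check on these two expressions is that, together with the probability of no turnover at either site, exp(-(&#955;<sub>1 </sub>+ &#955;<sub>2</sub>)<it>t</it>), they sum to one at every distance.</p>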
         </sec>
         <sec>
            <st>
               <p>TFBS conservation between human and mouse</p>
            </st>
            <p>Because of the moderate divergence distance between mammalian genomes, such as those of human and mouse, there is a strong interest in comparative studies of their genomes as an important way to infer gene function and gene regulation as well as their evolutionary mechanisms. While it is relatively easy to compare the coding sequences of human and mouse orthologous genes, it remains a difficult task to compare their promoter sequences, largely because they are more divergent than coding sequences. One pioneering comparative genomics study estimated that a fraction as high as 32-40% of the human functional TFBSs may not be functional in rodents, suggesting a high turnover rate of TFBSs <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. A recent study estimated that the divergence distances of human and mouse from the last common ancestor are 0.1187 and 0.3987 substitutions per site, respectively <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. Another study estimated the total divergence distance of human and mouse at about 0.8 substitutions per site <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>. Based on these two estimates, we here set the divergence distances of human and mouse from their last common ancestor to be 0.2 and 0.6, respectively, in terms of the number of substitutions per site in neutrally evolving regions. In this study, we simulated TFBS evolution of human and mouse from their last common ancestral species in the hope of shedding some light on the evolution of their TFBSs. Using the same three TFBSs as above, we estimated RTRs of individual TFBSs in human and mouse orthologous sequences at different InDel rates as well as at different restricted translocation distances.</p>
            <sec>
               <st>
                  <p>Effect of InDel rate variation</p>
               </st>
               <p>We again simulated 1,000 ancestral promoter sequences and evolved 1,000 pairs of human and mouse descendent sequences from each ancestral sequence, but this time varied the ratio of the InDel rate to the substitution rate from 0 (that is, no InDels at all) to 0.2 (one InDel per five substitution events) in ten steps. Except for the InDel rate, we used the same models and parameters as given in Table <tblr tid="T1">1</tblr>. We performed three independent simulations, one each for the TFBSs of E2F, Myc and NF&#954;B. The evolution of individual TFBSs was under the same functional constraints as above (Figure <figr fid="F3">3</figr>).</p>
               <p>Instead of calculating the TFBS RTRs from their common ancestral sequences, we estimated the probability of observing replacement turnovers of individual TFBSs in at least one species, which we defined as the RTR between human and mouse. We found that at zero or very low InDel rates, the RTRs of Myc and NF&#954;B between human and mouse were almost zero, whereas E2F had a low RTR (Figure <figr fid="F7">7</figr>). As expected, the RTRs of all TFBSs increased as the InDel rate increased. The RTR of NF&#954;B, however, was almost one order of magnitude smaller than that of either E2F or Myc, indicating a substantial effect of the nucleotide composition of different TFBSs. Our analysis suggested that the TFBS RTR between human and mouse could be approximated by an exponential function of the InDel rate given by:</p>
               <fig id="F7">
                  <title>
                     <p>Figure 7</p>
                  </title>
                  <caption>
                     <p>Effect of different InDel rates on TFBS RTR</p>
                  </caption>
                  <text>
                     <p>Effect of different InDel rates on TFBS RTR. The x-axis denotes the InDel rate, measured as the number of InDel events per substitution event, and the y-axis shows the RTR of a TFBS in a descendent sequence. The figure shows that the exponential model given in equation 5 fits the RTR values observed in simulation well for all three TFBSs: <b>(a) </b>E2F, <b>(b) </b>Myc, and <b>(c) </b>NF&#954;B.</p>
                  </text>
                  <graphic file="gb-2007-8-10-r225-7"/>
               </fig>
               <p>
                  <display-formula id="M5"><it>Rate </it>= -<it>a </it>+ <it>b </it>&#215; <it>e</it><sup>1.5&#947;</sup></display-formula>
               </p>
               <p>where <it>a </it>and <it>b </it>are parameters, and &#947; is the InDel rate. Therefore, at a zero InDel rate (&#947; = 0), the base RTR is (<it>b </it>- <it>a</it>), which cannot be less than zero, implying that <it>b </it>must be greater than or equal to <it>a</it>. We found that this model fitted the RTR data of all three TFBSs well, regardless of whether the mean or the median value of the RTR was used (Figure <figr fid="F7">7</figr>). Estimates of model parameters for the individual TFBSs are given in Table <tblr tid="T4">4</tblr>.</p>
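               <p>As a rough illustration, equation 5 can be evaluated with a few lines of Python. This is only a sketch and not part of PSPE: the function and variable names are hypothetical, and it assumes that the <it>a </it>column of Table 4 lists the constant term with its sign already applied (that is, -<it>a</it>), which is the reading under which the base RTR at &#947; = 0 is close to zero as described above.</p>

```python
import math

# Constant and scale terms copied from Table 4 (model for mean).
# Assumption: the first value is the constant term with its sign already
# applied, so the rate is computed as constant + b * exp(1.5 * gamma).
TABLE4_MEAN = {
    "E2F": (-0.216, 0.226),
    "Myc": (-0.252, 0.252),
    "NFkB": (-0.072, 0.072),
}


def rtr_from_indel_rate(constant, b, gamma):
    """Exponential model of equation 5 with a signed constant term."""
    return constant + b * math.exp(1.5 * gamma)


for name, (constant, b) in TABLE4_MEAN.items():
    base = rtr_from_indel_rate(constant, b, 0.0)   # no InDels
    top = rtr_from_indel_rate(constant, b, 0.2)    # highest simulated InDel rate
    print(f"{name}: RTR(0) = {base:.3f}, RTR(0.2) = {top:.3f}")
```

               <p>Under this reading, the value returned at &#947; = 0 is simply the sum of the two tabulated terms, so a TFBS whose two terms cancel has a base RTR of zero.</p>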
               <tbl id="T4">
                  <title>
                     <p>Table 4</p>
                  </title>
                  <caption>
                     <p>Estimated parameter values for the exponential model of RTR and InDel rate</p>
                  </caption>
                  <tblbdy cols="5">
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c cspan="2" ca="center">
                           <p>Model for mean</p>
                        </c>
                        <c cspan="2" ca="center">
                           <p>Model for median</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c cspan="2">
                           <hr/>
                        </c>
                        <c cspan="2">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>TFBS name</p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>a</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>b</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>a</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>b</it>
                           </p>
                        </c>
                     </r>
                     <r>
                        <c cspan="5">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>E2F</p>
                        </c>
                        <c ca="center">
                           <p>-0.216</p>
                        </c>
                        <c ca="center">
                           <p>0.226</p>
                        </c>
                        <c ca="center">
                           <p>-0.181</p>
                        </c>
                        <c ca="center">
                           <p>0.184</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Myc</p>
                        </c>
                        <c ca="center">
                           <p>-0.252</p>
                        </c>
                        <c ca="center">
                           <p>0.252</p>
                        </c>
                        <c ca="center">
                           <p>-0.265</p>
                        </c>
                        <c ca="center">
                           <p>0.265</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>NF&#954;B</p>
                        </c>
                        <c ca="center">
                           <p>-0.072</p>
                        </c>
                        <c ca="center">
                           <p>0.072</p>
                        </c>
                        <c ca="center">
                           <p>-0.035</p>
                        </c>
                        <c ca="center">
                           <p>0.035</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Simulation results suggested that the TFBS RTR can be modeled by an exponential function of InDel rates given in equation 5. The values for parameters <it>a </it>and <it>b </it>were estimated from observed mean and median values of RTRs at different InDel rates.</p>
                  </tblfn>
               </tbl>
            </sec>
            <sec>
               <st>
                  <p>Influence of restricted translocation distance</p>
               </st>
               <p>TFBSs often have a preferred location relative to the TSS, but many TFBSs can move within a limited distance while maintaining their regulatory function. Such a restricted translocation distance relative to the TSS may have an important impact on TFBS evolution. In a final simulation, we studied how the RTR of a TFBS between human and mouse was affected by its restricted translocation distance.</p>
               <p>We simulated TFBS evolution under 10 different restricted translocation distances, ranging from 0 to 300 bp from the original location of a TFBS in the ancestral sequences, with 20 bp set as the minimum distance of a TFBS to the TSS. For each maximal translocation distance, we simulated 1,000 ancestral promoter sequences and 1,000 pairs of descendent human and mouse sequences from each ancestral sequence using the models given in Table <tblr tid="T1">1</tblr>. We performed a separate simulation for each of the same three TFBSs, and estimated the RTR between human and mouse as defined above. The RTR between human and mouse increased approximately linearly with the size of the restricted translocation range (Figure <figr fid="F8">8</figr>). The means of the RTR could therefore be fitted well with a linear model given by:</p>
               <fig id="F8">
                  <title>
                     <p>Figure 8</p>
                  </title>
                  <caption>
                     <p>Effect of restricted translocation distance on TFBS RTR</p>
                  </caption>
                  <text>
                     <p>Effect of restricted translocation distance on TFBS RTR. The x-axis is the restricted translocation distance relative to the original binding site in the ancestral sequence, and the y-axis is the RTR of TFBSs. The points are the RTR observed in simulations, and lines are values predicted by the model given in equation 6: <b>(a) </b>E2F, <b>(b) </b>Myc, and <b>(c) </b>NF&#954;B.</p>
                  </text>
                  <graphic file="gb-2007-8-10-r225-8"/>
               </fig>
               <p>
                  <display-formula id="M6">
                     <m:math name="gb-2007-8-10-r225-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
                        <m:semantics>
                           <m:mrow>
                              <m:mi>R</m:mi>
                              <m:mi>T</m:mi>
                              <m:mi>R</m:mi>
                              <m:mo>=</m:mo>
                              <m:mi>a</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>c</m:mi>
                              <m:mn>1</m:mn>
                              <m:msqrt>
                                 <m:mi>&#952;</m:mi>
                              </m:msqrt>
                              <m:mo>&#215;</m:mo>
                              <m:mi>c</m:mi>
                              <m:mn>2</m:mn>
                              <m:msqrt>
                                 <m:mi>&#952;</m:mi>
                              </m:msqrt>
                              <m:mo>=</m:mo>
                              <m:mi>a</m:mi>
                              <m:mo>+</m:mo>
                              <m:mi>c</m:mi>
                              <m:mi>&#952;</m:mi>
                           </m:mrow>
                           <m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2Caerbhv2BYDwAHbqedmvETj2BSbqee0evGueE0jxyaibaiKI8=vI8GiVeY=Pipec8Eeeu0xXdbba9frFj0xb9Lqpepeea0xd9q8qiYRWxGi6xij=hbbc9s8aq0=yqpe0xbbG8A8frFve9Fve9Fj0dmeaabaqaciaacaGaaeqabaqabeGadaaakeaacaWGsbGaamivaiaadkfacqGH9aqpcaWGHbGaey4kaSIaam4yaiaaigdadaGcaaqaaGGaciab=H7aXbWcbeaakiabgEna0kaadogacaaIYaWaaOaaaeaacqWF4oqCaSqabaGccqGH9aqpcaWGHbGaey4kaSIaam4yaiab=H7aXbaa@4415@</m:annotation>
                        </m:semantics>
                     </m:math>
                  </display-formula>
               </p>
               <p>where <it>a</it>, <it>c1</it>, <it>c2 </it>and <it>c </it>are model parameters, <it>c </it>is the product of <it>c1 </it>and <it>c2</it>, and &#952; is the restricted translocation distance of a TFBS. In this model, <it>c1 </it>and <it>c2 </it>are associated with the evolutionary distances of species one and two from their last common ancestral species. Therefore, the TFBS RTR in a single species is a linear function of the square root of its restricted translocation distance. Interestingly, while the median RTRs for E2F could also be fitted quite well with this model (Figure <figr fid="F6">6a</figr>), the fit for Myc and NF&#954;B was poorer, hinting at the strong effects that different motifs can have on some of the promoter features studied here.</p>
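               <p>The relationship in equation 6 can be checked numerically with a short Python sketch. The parameter values below are hypothetical and chosen only to exercise the code, the even spacing of the ten distances is an assumption, and the least-squares routine is a generic one rather than the fitting procedure used in this study.</p>

```python
import math


def rtr_pair(a, c1, c2, theta):
    """Equation 6: a + (c1 * sqrt(theta)) * (c2 * sqrt(theta))."""
    return a + (c1 * math.sqrt(theta)) * (c2 * math.sqrt(theta))


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical parameter values (not estimates from the simulations).
A, C1, C2 = 0.01, 0.02, 0.03
# Ten distances from 0 to 300 bp; even spacing is an assumption.
thetas = [300 * i / 9 for i in range(10)]
rtrs = [rtr_pair(A, C1, C2, t) for t in thetas]

intercept, slope = fit_line(thetas, rtrs)
print(f"a = {intercept:.4f}, c = {slope:.6f}, c1 * c2 = {C1 * C2:.6f}")
```

               <p>Because the two square-root terms multiply to &#952;, the fitted slope recovers the product of <it>c1 </it>and <it>c2 </it>and the fitted intercept recovers <it>a</it>; the individual values of <it>c1 </it>and <it>c2 </it>cannot be separated from a fit to the paired RTR alone.</p>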
            </sec>
            <sec>
               <st>
                  <p>Impact of transition/transversion ratio</p>
               </st>
               <p>To better simulate sequences of closely related species, which generally have a higher ratio of transition to transversion substitution rates than distantly related species, we used a relatively large transition to transversion ratio (20:1) in all the above simulations. This large ratio made sense in our case, as we simulated sequence evolution in a stepwise fashion with a small divergence distance (0.05 substitutions per site) at each step. To check whether a large change in the transition to transversion ratio would have a significant impact on RTRs, we also ran all the above simulations at a much smaller ratio of 4:1. We used the Wilcoxon rank sum test to check whether the difference between the means of the resulting RTRs was significantly different from zero (data not shown). We found no statistically significant differences in our results (Bonferroni-corrected significance level of <it>P </it>&#8804; 0.05). The results suggested that our observed replacement turnovers were slow processes relative to nucleotide substitutions.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Evaluation of alignment tools</p>
            </st>
            <p>In addition to the theoretical studies regarding turnover rates, the PSPE simulator can be used to assess the impact of the turnover phenomenon on practical applications in comparative genomics. In the following, we looked specifically at the problem of identifying functional binding sites in multiple sequence alignments. Most current alignment tools are based on the assumption that the functional sites in orthologous sequences are homologous in sequence space, that is, that they can be traced back to the same position in the ancestral genome. Replacement turnover events of functional sites in promoter sequences, however, make this assumption somewhat unrealistic, which could consequently limit the performance of a tool for aligning non-coding sequences. Our evaluation aimed to: compare different multiple sequence alignment tools for their robustness to violation of this assumption; and investigate the impact of increasing the number of species on tool performance.</p>
            <p>We evaluated a set of representative MSA tools for their performance in detecting TFBSs in several sets of orthologous sequences, generated from an underlying phylogenetic tree of five mammalian genomes (Figure <figr fid="F9">9</figr>). The rationale for using the mammalian tree topology was to achieve a realistic assessment of TFBS detection accuracy and to allow for a fair comparison between different tools. First, in most comparative genomics studies, the species being compared have different divergence distances from their last common ancestor. Second, it is frequently assumed that an MSA tool, especially one based on a progressive approach, should work better when it aligns more closely related species first and adds more distantly related species at later stages. We used evolutionary distances that were recently inferred from coding regions <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>, but evaluated the tree at different scale factors, as it is not generally known how well these distances reflect the actual substitution rates in non-coding regions. We extended the simulation to large divergence distances to test the notion that conserved sites should be readily picked up when the surrounding sequence has sufficiently diverged. To assess the validity of our observations, we also evaluated tool performance with additional benchmark datasets, generated from a phylogenetic tree with a star topology in which all descendent sequences had the same evolutionary distance from their last common ancestral sequence. The evaluation results were consistent with those reported below (see Additional data file 2 for details).</p>
            <fig id="F9">
               <title>
                  <p>Figure 9</p>
               </title>
               <caption>
                  <p>Phylogenetic tree of five mammalian genomes</p>
               </caption>
               <text>
                  <p>Phylogenetic tree of five mammalian genomes. The evolutionary distances shown in the tree were recently inferred from the coding region of orthologous genes [46]. In our simulation, we used the tree scaled at eight different levels relative to the evolutionary distances shown.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-9"/>
            </fig>
            <p>We scaled the mammalian phylogenetic tree at eight different levels from 0.25 to 5, relative to the actual distances, and generated a benchmark promoter dataset at each scale level (defined as the divergence scale coefficient), where each dataset contained 1,000 replicates of orthologous promoter sequences of the five species. Sequences were simulated under the HKY85 nucleotide substitution model with gamma-distributed rates and invariant sites (&#915;+I) for modeling substitution rate heterogeneity (Table <tblr tid="T5">5</tblr>). In each dataset, each sequence contained exactly one functional binding site for each of six TFBS types: Pax6, TP53, IRF2, PPARG, ROAZ, and YY1E2F. YY1E2F is a composite TFBS consisting of YY1 and E2F binding sites that reportedly interact with each other in cell cycle gene regulation <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>. Binding sites were subject to a set of functional constraints (Table <tblr tid="T6">6</tblr>) that were set to allow for turnover within a restricted distance while keeping the overall order of the binding sites unchanged. Simulation allowed us to quantify the amount of turnover: how many non-aligned functional sites were due to turnover compared to 'simple' misalignments, and whether some tools would in fact be able to align functional sites despite turnover.</p>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Simulation parameters used by PSPE for generating benchmark promoter sequences</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Evolution distance per step</p>
                     </c>
                     <c ca="left">
                        <p>0.05 substitutions per site</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Length of root sequences</p>
                     </c>
                     <c ca="left">
                        <p>3,000 bp</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Background sequence model</p>
                     </c>
                     <c ca="left">
                        <p>Third-order Markov model</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Base frequencies</p>
                     </c>
                     <c ca="left">
                        <p>A = 0.258, C = 0.242, G = 0.242, T = 0.258</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Substitution model</p>
                     </c>
                     <c ca="left">
                        <p>HKY85</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Transition:transversion ratio</p>
                     </c>
                     <c ca="left">
                        <p>20:1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Rate heterogeneity</p>
                     </c>
                     <c ca="left">
                        <p>Gamma (1.0) + Iota (0.1)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Range of GC content</p>
                     </c>
                     <c ca="left">
                        <p>(0.45, 0.55)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Gap model</p>
                     </c>
                     <c ca="left">
                        <p>Negative binomial distribution (1, 0.5)</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>Functional TFBS constraints used in the promoter simulation</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p>Name</p>
                     </c>
                     <c ca="left">
                        <p>Accession no.</p>
                     </c>
                     <c ca="center">
                        <p>Length (bp)</p>
                     </c>
                     <c ca="center">
                        <p>Strand</p>
                     </c>
                     <c ca="center">
                        <p>Location (min, max)</p>
                     </c>
                     <c ca="center">
                        <p>Copy no. (min, max)</p>
                     </c>
                     <c ca="center">
                        <p>Cutoff</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>YY1E2F</p>
                     </c>
                     <c ca="left">
                        <p>MA0095 (YY1)</p>
                        <p>MA0024 (E2F)</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                     <c ca="center">
                        <p>(20, 30)</p>
                     </c>
                     <c ca="center">
                        <p>(1, 1)</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Pax6</p>
                     </c>
                     <c ca="left">
                        <p>MA0069</p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                     <c ca="center">
                        <p>(50, 70)</p>
                     </c>
                     <c ca="center">
                        <p>(1, 1)</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TP53</p>
                     </c>
                     <c ca="left">
                        <p>MA0106</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                     <c ca="center">
                        <p>(360, 400)</p>
                     </c>
                     <c ca="center">
                        <p>(1, 1)</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>IRF2</p>
                     </c>
                     <c ca="left">
                        <p>MA0051</p>
                     </c>
                     <c ca="center">
                        <p>18</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                     <c ca="center">
                        <p>(420, 480)</p>
                     </c>
                     <c ca="center">
                        <p>(1, 1)</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PPARG</p>
                     </c>
                     <c ca="left">
                        <p>MA0066</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                     <c ca="center">
                        <p>(2000, 2080)</p>
                     </c>
                     <c ca="center">
                        <p>(1, 1)</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>ROAZ</p>
                     </c>
                     <c ca="left">
                        <p>MA0116</p>
                     </c>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>+</p>
                     </c>
                     <c ca="center">
                        <p>(2100, 2200)</p>
                     </c>
                     <c ca="center">
                        <p>(1, 1)</p>
                     </c>
                     <c ca="center">
                        <p>0.90</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The accession numbers in the second column are from the JASPAR database [39]. 'Location' refers to the restriction on the upstream minimum and maximum distances to transcription start site. YY1E2F is a composite TFBS created by joining the YY1 and E2F sites.</p>
               </tblfn>
            </tbl>
            <p>We used this dataset to assess the performance of five widely used MSA tools: CLUSTALW <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>, DIALIGN <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>, AVID/MAVID <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B51">51</abbr></abbrgrp>, LAGAN/MLAGAN <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>, and MUSCLE <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Among the five tools, AVID/MAVID is the fastest alignment tool and uses exactly matching words as alignment seeds to speed up the alignment process, albeit at the expense of lower alignment accuracy. As an improvement, both DIALIGN and LAGAN/MLAGAN adopt non-exact word matching for finding alignment seeds, which can improve their ability to detect degenerate functional sites. DIALIGN identifies alignment seeds by finding consistent sequence segments of a fixed length between sequences, while LAGAN/MLAGAN locates alignment seeds by chaining together neighboring similar words. Both CLUSTALW and MUSCLE are primarily based on the dynamic programming algorithm. MUSCLE, however, has made significant improvements over CLUSTALW by employing anchoring techniques and a progressive refinement approach. The performance was measured as TFBS detection accuracy, defined as the proportion of nucleotides in functionally homologous TFBSs that were correctly aligned. The detection accuracy reported here is the average value over 1,000 replicates at each divergence scale level.</p>
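            <p>To make the accuracy measure concrete, the following Python sketch scores one pair of aligned sequences. It is an illustration under stated assumptions rather than the scoring code used in this study: it assumes a pairwise comparison in which the <it>k</it>-th nucleotide of a site in one sequence is homologous to the <it>k</it>-th nucleotide of the corresponding site in the other, and all function names, sequences and coordinates are hypothetical.</p>

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Map each ungapped index of row_a to the ungapped index of row_b
    that shares its alignment column (columns with a gap are skipped)."""
    pairs = {}
    i = j = 0
    for ch_a, ch_b in zip(row_a, row_b):
        if ch_a != gap and ch_b != gap:
            pairs[i] = j
        if ch_a != gap:
            i += 1
        if ch_b != gap:
            j += 1
    return pairs


def detection_accuracy(row_a, row_b, start_a, start_b, length):
    """Fraction of site nucleotides in sequence A that are aligned to their
    homologous nucleotide in sequence B (0-based ungapped coordinates)."""
    pairs = aligned_pairs(row_a, row_b)
    hits = sum(1 for k in range(length) if pairs.get(start_a + k) == start_b + k)
    return hits / length


# Toy example: a 4-nt site "TTGA" starting at index 2 in A and index 1 in B.
print(detection_accuracy("AATTGACC-", "-ATTGACCC", 2, 1, 4))  # site fully aligned
print(detection_accuracy("AATT-GACC", "-ATTGACCC", 2, 1, 4))  # site split by a gap
print(detection_accuracy("AATTGACC", "ATTGACCC", 2, 1, 4))    # site shifted by one column
```

            <p>In this sketch, a value for a multiple alignment would be obtained by averaging such pairwise scores over species pairs, TFBSs and replicates; how the averaging was weighted here is not restated, so that step is left out.</p>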
            <p>For the two-species alignment (human and baboon), all five tools showed high TFBS detection accuracies, with no significant differences among them (Figure <figr fid="F10">10a(1)</figr>). When adding more distant species, such as mouse, to the alignment, we found that the TFBS detection accuracies of all tools decreased dramatically, especially those of MAVID and CLUSTALW (Figure <figr fid="F10">10b(1),c(1),d(1)</figr>). We also observed marked differences in performance between tools for alignments of three or more species. Overall, MUSCLE had the highest detection accuracy among all tools across all divergence scale coefficients; MAVID performed slightly worse than all other tools; and CLUSTALW, DIALIGN and MLAGAN showed similar performance, although their relative order varied with the number of species and the divergence scale coefficient. As expected, the TFBS detection accuracy decreased for all tools as the divergence scale coefficient increased. PSPE also allowed us to consider only the set of sites that had not turned over, and the relative performance of the tools was unchanged (Figure <figr fid="F10">10a(2),b(2),c(2),d(2)</figr>). With increasing distance, a large fraction of sites had turned over, but many of those traced back to the same ancestral nucleotides in several descendants, due to turnover before a branch in the tree or convergent evolution. These sites should thus be aligned and were counted as positive in at least some of the pairwise comparisons on which our metric is based, even if they were not in the location of the original TFBS (see Additional data file 2 for more evaluations of turnover sites).</p>
            <fig id="F10">
               <title>
                  <p>Figure 10</p>
               </title>
               <caption>
                  <p>Average TFBS detection accuracy of five alignment tools</p>
               </caption>
               <text>
                  <p>Average TFBS detection accuracy of five alignment tools. The y-axis shows the TFBS detection accuracy average of six TFBSs, and the x-axis is the divergence scale coefficient of the mammalian phylogenetic tree (Figure 9). SimuALN stands for the simulated alignment and its measure indicates the proportion of TFBS nucleotides not subject to replacement turnover in descendent sequences, and thus aligned in simulated alignments. Plots in the left panel show the overall detection accuracy of all functional TFBSs, while those in the right panel show the detection accuracy on the subset of TFBSs that had not turned over. Note that insertion and deletion events may affect parts of a binding site (these are still included in the evaluation), and that SimuALN consequently does not reach a level of one in the right panels. <b>(a) </b>Two species alignments of human and baboon. <b>(b) </b>Three species alignments of human, baboon and mouse. <b>(c) </b>Four species alignments of human, baboon, mouse, and dog. <b>(d) </b>Five species alignment.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-10"/>
            </fig>
            <p>The ability of a tool to detect the presence of a common TFBS varied among different TFBSs, depending on TFBS base composition, length, and restricted translocation distance, as well as the divergence scale coefficient of the phylogenetic tree. For example, Figure <figr fid="F11">11</figr> shows that detection accuracies differed significantly among TFBSs in the alignments of the five species. In addition, the same figure shows that all tools had higher detection accuracies for TFBSs with low RTRs, such as YY1E2F and Pax6, than those with high RTRs, such as IRF2 and ROAZ. While MUSCLE showed a better performance than all other tools, CLUSTALW as the oldest tool performed slightly better than DIALIGN, MAVID, and MLAGAN in at least some cases (YY1E2F and ROAZ). Additionally, for YY1E2F, Pax6 and TP53, MUSCLE showed higher TFBS detection accuracies than the baseline of SimuALN, suggesting its capability of correctly aligning at least some TFBSs subject to turnover, that is, homologous only at the functional level. At large divergence scale coefficients, however, no tool seemed to perform well in detecting ROAZ.</p>
            <fig id="F11">
               <title>
                  <p>Figure 11</p>
               </title>
               <caption>
                  <p>Detection accuracy of individual TFBSs on five-way mammalian alignments</p>
               </caption>
               <text>
                  <p>Detection accuracy of individual TFBSs on five-way mammalian alignments. All five tools perform better at detecting YY1E2F and Pax6, which have low RTRs and short restricted distances for translocation, than IRF2 and ROAZ, which have high RTRs and long restricted distances for translocation. MUSCLE shows an overall better performance than the other four tools. MLAGAN performs better than DIALIGN on YY1E2F, Pax6, PPARG and ROAZ, while DIALIGN shows a better performance than MLAGAN on TP53 and PPARG, which have a long restricted distance for translocation but a relatively low RTR.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-11"/>
            </fig>
            <p>When looking at the performance of each tool individually (Figure <figr fid="F12">12</figr>), we found that the TFBS detection accuracies of all tools decreased when adding one or more distant species to the human/baboon alignment. For alignments from three to five species, the TFBS detection accuracies of DIALIGN and MUSCLE showed little change, those of CLUSTALW and MLAGAN had a noticeable change and that of MAVID markedly decreased, especially at large divergence scale coefficients. We also compared tool performance again with respect to overall alignment sensitivity and TFBS sensitivity. We found that in terms of alignment sensitivity, MUSCLE and CLUSTALW had slightly better overall performance than the other three (data not shown). The ranks according to TFBS sensitivity were also in the same order as those according to detection accuracies, and this was also true if we considered non-turnover sites only (Figure <figr fid="F13">13</figr>).</p>
            <fig id="F12">
               <title>
                  <p>Figure 12</p>
               </title>
               <caption>
                  <p>Effects of the number of aligned mammalian species on the TFBS detection accuracy</p>
               </caption>
               <text>
                  <p>Effects of the number of aligned mammalian species on the TFBS detection accuracy. Each panel shows the performance of a tool in aligning a different number of species. Human and baboon were used for the two species alignment, mouse was added for the three species alignment, and all five species except cow were used for the four species alignment. While all tools have almost the same performance for aligning the two closely related species human and baboon, MUSCLE and DIALIGN performed better than other tools in maintaining or improving performance when adding more species to the alignment.</p>
               </text>
               <graphic file="gb-2007-8-10-r225-12"/>
            </fig>
            <fig id="F13">
               <title>
                  <p>Figure 13</p>
               </title>
               <caption>
                  <p>The average TFBS sensitivity of five tools in aligning TFBS in five mammalian species</p>
               </caption>
               <text>
                  <p>The average TFBS sensitivity of five tools in aligning TFBS in five mammalian species. <b>(a) </b>The average TFBS sensitivity of all functional TFBSs. <b>(b) </b>The average TFBS sensitivity with the subset of non-turnover sites among all TFBSs. The relative order of TFBS sensitivity for the five tools is almost the same as the order of their TFBS detection accuracy (Figure 10d).</p>
               </text>
               <graphic file="gb-2007-8-10-r225-13"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>In the process of evolution, selection may act directly on regulatory functions but only indirectly on gene sequences, which is supported by the experimental observations that some orthologous genes with highly conserved expression patterns have substantial divergence in their promoter sequence <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. That means that functional conservation does not necessitate conservation on the sequence level. Neutral sites in promoter sequences may be free to change, and newly evolved functional sites can readily replace old ones. It is important, therefore, to understand the evolutionary mechanisms of regulatory regions in order to improve computational methods that are developed to analyze them. However, it is difficult to systematically investigate non-protein-coding evolution on real sequence data because the history of evolutionary events shaping them is largely unknown, and the map of functional sites in regulatory sequences is often incomplete and inaccurate. In many cases, there is no simple way to distinguish a site newly evolved in a replacement turnover event from one created by simple translocation of an old site. Computational simulation seems to be an effective alternative to study TFBS evolution in this case. Simulators allow us to investigate evolutionary events such as replacement turnovers of TFBS, which may significantly limit the effectiveness of phylogenetic footprinting for regulatory region identification, in an explicit way. Here, we describe a new sequence simulator to investigate the effect of different functional constraints on turnover rates, and to create a framework to evaluate multiple sequence alignment algorithms regarding their ability to detect functional elements in the presence of turnover events.</p>
         <sec>
            <st>
               <p>Simulation of TFBS turnover</p>
            </st>
            <p>Our simulator PSPE is designed specifically for studying the evolution of functional sites in regulatory sequences. PSPE is not only able to use one of many common models of nucleotide substitution, but it can also apply different InDel models important for regulatory sequence simulation. In contrast to other simulators, PSPE imposes a variety of functional constraints instead of sequence constraints. Such functional constraints include GC content, presence of functional sites, strength of the binding sites, location and copy number restrictions on functional sites, and space constraints between different functional sites. All these features enable PSPE to simulate evolution of promoter sequences more realistically than other simulation programs.</p>
            <p>Consistent with previous simulation studies <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B14">14</abbr></abbrgrp>, our results show that TFBS turnover can occur rapidly in promoter evolution. For example, replacement turnover events can occur at a Poisson rate as high as 0.083 for the highly constrained E2F sites even if we only allow for a small translocation distance of 50 nucleotides, and the rate is even higher for the less constrained sites of Myc (0.22) and NF&#954;B (0.103). Furthermore, these parameters may be relatively conservative considering that we used stringent matrix score cutoffs to avoid false hits, highly restricted locations for functional sites, a relatively low rate for transversions, and the requirement of the presence of exactly one functional site throughout. However, a high turnover rate of TFBSs can frequently be detrimental to an organism, and highly increased turnover rates may not be observed in practice, even for degenerate sites. This is supported by an additional simulation study we carried out using a lower cutoff threshold of 0.85 for functional sites, in which promoters with Myc sites had a lower RTR despite the higher chance of creating a new site at the lower cutoff. This was mainly due to our restriction of allowing only one site to be present in the promoters (see Additional data file 1 for details). Therefore, TFBS replacement turnovers in real sequences may happen more frequently than we estimated, but there is an upper limit of turnover rate for each individual TFBS imposed by the resulting changes in fitness.</p>
            <p>Altogether, our study suggests that the TFBS RTR of a functional site between different species does not depend only on the base composition of the site and the divergence distances between species, but also on location constraints, neighboring functional sites, the InDel rate, and the GC content. While not discussed in detail, a simulation using lower GC contents showed a consistently higher or lower RTR depending on the TFBS, suggesting that the high GC content in promoter regions near the TSS is affecting the turnover rates of important functional sites (Additional data file 1). Consequently, the RTR varies not only among different functional sites and different species, but also among different instances of the same functional site upstream of different genes.</p>
            <p>While we attempted to choose realistic model parameters and biologically meaningful functional constraints in our simulations, our estimates are certainly biased by the assumptions behind the chosen constraints, and may be substantially different from the real ones. Furthermore, the TFBS and evolution models themselves represent simplified versions of the underlying biological processes, and other factors, such as the number of replicates used in the simulation, can add some additional variation as well. We realize that the weight matrices used here as models of functional sites may not be as adequate for modeling positional dependencies as other more advanced motif models <abbrgrp><abbr bid="B52">52</abbr><abbr bid="B53">53</abbr></abbrgrp>; however, PWMs are a valid model for many biological motifs, are available in open-access databases, and are computationally more efficient than other advanced models. Computational efficiency is an important factor in simulation studies that are as large as this one.</p>
            <p>Simple evolutionary changes within regulatory regions, such as turnover events affecting individual sites only, can be modeled effectively by Poisson events. We found good agreement with this model for a variety of binding sites and conditions, such as different translocation distances. In theory, one could derive closed-form solutions for the probability of these events, based on the sequence composition of the region and the composition and degeneracy of a binding site. However, with an increasing number of restrictions and dependencies of sites in complex regulatory modules, such derivations become increasingly cumbersome. Figure <figr fid="F6">6c</figr> showed that these simple models begin to deviate as soon as we address the conservation on the module level instead of individual sites only.</p>
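            <p>As a rough illustration of the Poisson view described above, the following Python sketch computes the probability of zero or of at least one replacement turnover for a single site. It assumes, purely for illustration, that a reported rate applies per unit of divergence distance; the function names and this scaling are hypothetical and are not taken from PSPE.</p>
            <p>
```python
import math


def prob_no_turnover(rate, distance=1.0):
    """Poisson probability of zero events: exp(-rate * distance)."""
    return math.exp(-rate * distance)


def prob_at_least_one_turnover(rate, distance=1.0):
    """Complement of the zero-event probability."""
    return 1.0 - prob_no_turnover(rate, distance)


def expected_turnovers(rate, distance=1.0):
    """Mean of a Poisson count with intensity rate * distance."""
    # Assumption: the rate scales linearly with the divergence distance.
    return rate * distance
```
            </p>
            <p>Under these assumptions, calling prob_at_least_one_turnover with the rates quoted in this section (for example, 0.083 and 0.22) gives a quick comparison of how differently constrained sites would behave under a purely Poisson model.</p>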
            <p>One can easily think of a large number of additional parameters and configurations of functional sites that we did not explore. A tool such as PSPE will allow researchers to explore empirically a wider range of restrictions and complex configurations of regulatory regions in an efficient manner. Enhancers come in many different flavors, from highly restricted 'enhanceosomes' corresponding to ultra-conserved elements, to highly flexible 'billboard' enhancers allowing for many drastic sequence changes without apparent functional consequence <abbrgrp><abbr bid="B54">54</abbr></abbrgrp>. PSPE is available to the public and we anticipate that it will be a beneficial tool for evolutionary biologists to explore the specific characteristics and evolutionary space of particular regulatory systems. Future extensions may include an adaptation for RNA regulatory regions, including specific modeling of compensatory mutations in RNA secondary structure, incorporating transposable elements, and neighbor-dependent substitution models.</p>
         </sec>
         <sec>
            <st>
               <p>Assessment of MSA tools</p>
            </st>
            <p>During evolution, natural selection forces impose different functional constraints on protein coding and regulatory regions. The phenomenon of frequent TFBS turnovers in regulatory regions may partially explain why comparative genomics analysis, the most powerful approach so far, has met with only limited success in identifying functional sites despite the increasing availability of whole genome sequences. TFBS turnovers may also be responsible for the weak relationship between sequence conservation and functional conservation in promoter sequences, which makes the straightforward tracing of nucleotide evolution between divergent orthologous sequences meaningless with respect to their function. Our strategy of defining conservation on the level of functional constraints such as matrix score cutoffs is similar to a recent model, which defines conservation on the level of conserved binding energy <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. In this sense, functional homology maps of regulatory regions, where mapped elements correspond to functionally equivalent sites, can be more important than strict sequence homology.</p>
            <p>While many alignment tools have been developed so far, it is difficult to systematically evaluate and compare these tools, especially regarding their performance in aligning non-coding sequences, for which we have a limited understanding of evolutionary constraints. Studies that rigorously assess alignment tools (for example, <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>) can serve as useful reference for making more informed decisions about which tool to use for which task, and can also provide important insights or suggestions for improvement of existing algorithms. Most published evaluations of alignment algorithms were based on alignment sensitivity, specificity, and accuracy, and did not address replacement turnover of functional sites in evolution. The evaluation reported here is different: instead of trying to systematically assess all different performance aspects, we focus on one particular scenario, the capability of accurately aligning conserved TFBS in promoter sequences. Specifically, our evaluation was based on two aspects: the capability of aligning functionally homologous TFBS in promoter sequences in which TFBS replacement turnovers are allowed to occur; and the capability of increasing TFBS detection power with an increase in the number of homologous aligned species.</p>
            <p>The five tools selected for our evaluation represent many existing tools with different underlying algorithms. These differences were clearly reflected in the success of aligning TFBSs, which ranked MUSCLE at the top, AVID/MAVID at the bottom, and others in between. We purposefully chose transcription factors with long binding sites, and required strong conservation of orthologous sites (that is, a high matrix score threshold for each site). Furthermore, while our choice of constraints allowed for turnover events, it did not allow for a shuffling of sites, which none of the programs can take into account. Yet, our results suggest that the ability of existing tools to detect functionally homologous elements decreases with increasing replacement turnover rates of functional sites or, related, the sequence divergence distance. An increased divergence of the non-functional parts of the sequence thus does not necessarily help to locate individual functional binding sites, even if the sites are highly conserved and 15-20 bp long.</p>
            <p>It is often reported that an increase in the number of species may significantly increase the power for functional site identification in comparative genomics analyses <abbrgrp><abbr bid="B56">56</abbr></abbrgrp>. On the contrary, our evaluation results show that we should be extremely cautious at this point in assuming that this is a general property of many functional DNA regions and/or tools to analyze them. With the exception of DIALIGN, the TFBS detection accuracy of all tools was either decreased or relatively unchanged in most cases. This is in fact not surprising when one takes a closer look at the approaches used for multiple sequence alignment. CLUSTALW, MAVID, and MLAGAN all use the same progressive approach for aligning multiple sequences, in which intermediate alignments from the early stages are not allowed to change in later stages. That means that the mistakes that happen in an early stage of alignment will be propagated and cannot be corrected at a later stage. Since a tool based on the progressive approach can only accumulate more mistakes when aligning more sequences, it is conceivable that its performance decreases as the number of species increases. MUSCLE employs an improved progressive approach that allows changes in the alignment of sub-groups in a recursive refinement process, which explains why MUSCLE did not show a significant decrease in performance as the number of species increased. It is conceivable, however, that the particular choice of species, and the order in which they are presented to a phylogenetic aligner, may significantly change the accuracy of these approaches. DIALIGN is the only tool surveyed here that does not use the progressive approach. Instead, it assembles the whole alignment by greedily finding all consistent segments of significant similarity from all sequences <abbrgrp><abbr bid="B57">57</abbr></abbrgrp>, which allows DIALIGN to take advantage of the information from additional species. While these features of DIALIGN are interesting, there is still much room for improvement as its overall performance is no better than that of MUSCLE.</p>
            <p>We want to stress that the tools in this study were not specifically developed for the alignment of non-coding regions. In fact, some design principles may be counterproductive for this task: whole genome alignments are built to provide fast comparative maps and are certainly able to detect coding conservation. The progressive aligners in our evaluation are meant to provide the phylogenetic history, that is, to compute an accurate alignment of bases that are derived from the same nucleotide in the ancestral genome. Yet, there is no doubt that many researchers currently use these tools in studies concerning gene regulatory sequences, and we hope that this evaluation provides clues about what to expect if they are used in this way. We aimed to include a representative subset of tools fast enough to perform extensive comparisons. We do not expect this to have introduced a systematic bias, but of course some recently developed aligners (for example, TBA <abbrgrp><abbr bid="B58">58</abbr></abbrgrp>, Prank <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>, or Probcons <abbrgrp><abbr bid="B60">60</abbr></abbrgrp>) may perform differently from our selected set.</p>
            <p>The objective and systematic evaluation of alignment tools is a challenging task, in particular for an assessment on non-coding sequences whose actual functional and evolutionary mechanisms remain largely unknown. Since we simulated data under a set of specific conditions that are unlikely to represent all actual scenarios, one should carefully interpret our comparison results. For example, our study did not consider the ability of a tool to deal with very large insertions and deletions because our simulated data contained few large insertions or deletions. Furthermore, we were very conservative in our constraints, and, for example, allowed for turnover, but not for a shuffling of sites. Our results are therefore a rather optimistic estimate, and performance on real promoters with shorter sites that do not preserve their order can be expected to be significantly worse. We are also aware that the criteria for tool performance can be different in a different study depending on its objectives. Therefore, our results may not be applicable for some studies, such as the estimation of divergence distances between species. For such cases, the recent evaluation by Pollard <it>et al</it>. <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> may be a better reference.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>TFBS replacement turnover is an important phenomenon in the process of promoter evolution, and providing a framework to address it systematically is critical for our understanding of the mechanisms driving promoter evolution. We introduced the new simulation system PSPE, designed specifically for regulatory sequences, and allowing for functional site turnover events. PSPE is freely available at the authors' websites <abbrgrp><abbr bid="B61">61</abbr><abbr bid="B62">62</abbr></abbrgrp>. Applying PSPE in a large-scale simulation, we found that replacement turnovers could happen rapidly in promoter evolution. We also investigated different factors besides the divergence distance that significantly affect turnover rates, and describe the relationships between the RTR and different factors in simple mathematical models. Our study adds to the increasing evidence that it is important and advantageous to trace homology on the functional rather than on the sequence base-pair level in cross-species comparisons of regulatory sequences.</p>
         <p>PSPE also provides a flexible system to generate appropriate standard test sets for alignment or motif finding algorithms, and we presented first results of this application. To our knowledge, our evaluation of MSA tools is the first one to assess their ability to detect TFBSs that are homologous on a functional level. Our evaluation of five widely used MSA tools suggests that the turnover of functional sites poses a challenge for alignment tools, even for the simplified case where the functional sites remain co-linear in orthologous sequences. While all MSA tools under consideration, especially MUSCLE, performed well in aligning functional sites at short or moderate divergence distances, they appeared to lack sufficient capability to align functional sites that have high RTRs in divergent sequences. In addition, our study suggests that the widely used progressive approach for MSA is counterproductive for the multiple alignments of homologous non-coding sequences, and that MUSCLE's improved progressive approach and DIALIGN's segment assembling approach are better suited for non-coding MSA. Some recent approaches show promise in dealing with the specific challenges of non-coding alignments, for example, by using available models of TFBS to 'anchor' alignments <abbrgrp><abbr bid="B63">63</abbr></abbrgrp>. However, this still leaves us with a number of open issues on the way towards computational tools that will help us to elucidate the structure and evolution of regulatory regions.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <sec>
            <st>
               <p>Background model of ancestral sequences</p>
            </st>
            <p>To generate biologically relevant ancestral sequences, we used a third-order Markov model to generate background sequences of ancestral promoters. We trained the background Markov model on a large real dataset of regulatory sequences extracted from the NCBI human RefSeq database (build 35). The dataset consists of 25,088 human promoter sequences, each spanning a region of 500 bp immediately upstream of the transcription start site. The base frequencies of the four nucleotides were also estimated on this dataset.</p>
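            <p>A background model of this kind can be sketched in a few lines of Python. The sketch below counts which base follows each 3-mer context in a set of training sequences and then samples a new sequence from those counts. The function names, the pseudocount, and the choice of starting context are illustrative assumptions rather than details of PSPE.</p>
            <p>
```python
import random
from collections import Counter, defaultdict


def train_markov(seqs, order=3):
    """Count next-base occurrences for every context of length `order`."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def sample_sequence(counts, length, order=3, pseudocount=1.0,
                    alphabet="ACGT", rng=None):
    """Sample one sequence; unseen transitions get only the pseudocount."""
    rng = rng or random.Random()
    # Assumption: start from a context seen in training, chosen uniformly.
    contexts = sorted(counts)
    if contexts:
        out = list(rng.choice(contexts))
    else:
        out = [rng.choice(alphabet) for _ in range(order)]
    while length > len(out):
        context = "".join(out[-order:])
        seen = counts.get(context, Counter())
        # Weight each base by its observed count plus the pseudocount.
        weights = [seen[b] + pseudocount for b in alphabet]
        out.append(rng.choices(alphabet, weights=weights)[0])
    return "".join(out[:length])
```
            </p>
            <p>In this sketch, a 500 bp background sequence would be drawn with sample_sequence(train_markov(promoters), 500), where promoters is a hypothetical list of training sequences. Passing a seeded random.Random instance as rng makes the draw reproducible.</p>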
         </sec>
         <sec>
            <st>
               <p>Selection of E2F regulated ancestral genes</p>
            </st>
            <p>We obtained 127 experimentally confirmed E2F regulated genes from a previous publication <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>. We removed the genes for which we could not extract promoter sequences from NCBI, and for the remaining genes extracted 500 bp long promoter sequences upstream of their annotated transcription start sites. We then identified potential E2F sites in each promoter sequence using the PWM model. We removed those genes that had either zero or more than one E2F binding site based on the cutoff score of 0.92 given in Figure <figr fid="F3">3</figr>. The remaining 11 genes are given in Additional data file 1 and were used as ancestral sequences for our simulation study.</p>
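            <p>The filtering step above can be sketched as follows. The score mirrors the normalized PWM score defined in the motif model section, but the position weights are passed in as an optional argument and default to one, because this sketch does not reproduce their definition. Scanning only the forward strand, the data layout (a list of per-position frequency dictionaries), and all function names are likewise assumptions made for illustration.</p>
            <p>
```python
def pwm_score(site, pwm, theta=None):
    """Normalized PWM score: weighted observed sum over weighted maximum."""
    theta = theta or [1.0] * len(pwm)
    num = sum(t * col.get(b, 0.0) for t, col, b in zip(theta, pwm, site))
    den = sum(t * max(col.values()) for t, col in zip(theta, pwm))
    return num / den


def count_sites(seq, pwm, cutoff=0.92, theta=None):
    """Count forward-strand windows whose score reaches the cutoff."""
    w = len(pwm)
    return sum(
        1
        for i in range(len(seq) - w + 1)
        if pwm_score(seq[i:i + w], pwm, theta) >= cutoff
    )


def single_site_promoters(promoters, pwm, cutoff=0.92, theta=None):
    """Keep only promoters that contain exactly one site above the cutoff."""
    return {
        name: seq
        for name, seq in promoters.items()
        if count_sites(seq, pwm, cutoff, theta) == 1
    }
```
            </p>
            <p>Here promoters is a hypothetical dictionary mapping gene names to promoter sequences. Promoters with no site or with several sites above the cutoff are dropped, which corresponds to the selection rule described above.</p>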
         </sec>
         <sec>
            <st>
               <p>Motif model of TFBSs</p>
            </st>
            <p>We used the PWM, a generic and widely used model for DNA motifs, to represent functional TFBSs. The PWM is generally given by a matrix with frequencies (or weights) of the four nucleotides at each position. While there are several different methods to calculate a motif score, we used a scoring function similar to the one proposed by <abbrgrp><abbr bid="B64">64</abbr></abbrgrp> and defined by:</p>
            <p>
               <display-formula>
                  <m:math name="gb-2007-8-10-r225-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
                     <m:semantics>
                        <m:mrow>
                           <m:mtable columnalign="left">
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mi>S</m:mi>
                                       <m:mi>c</m:mi>
                                       <m:mi>o</m:mi>
                                       <m:mi>r</m:mi>
                                       <m:mi>e</m:mi>
                                       <m:mo>=</m:mo>
                                       <m:mfrac>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>i</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>w</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>&#952;</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>&#215;</m:mo>
                                                   <m:msubsup>
                                                      <m:mi>f</m:mi>
                                                      <m:mi>i</m:mi>
                                                      <m:mi>b</m:mi>
                                                   </m:msubsup>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                          <m:mrow>
                                             <m:mstyle displaystyle="true">
                                                <m:msubsup>
                                                   <m:mo>&#8721;</m:mo>
                                                   <m:mrow>
                                                      <m:mi>i</m:mi>
                                                      <m:mo>=</m:mo>
                                                      <m:mn>1</m:mn>
                                                   </m:mrow>
                                                   <m:mi>w</m:mi>
                                                </m:msubsup>
                                                <m:mrow>
                                                   <m:msub>
                                                      <m:mi>&#952;</m:mi>
                                                      <m:mi>i</m:mi>
                                                   </m:msub>
                                                   <m:mo>&#215;</m:mo>
                                                   <m:msub>
                                                      <m:mrow>
                                                         <m:mi>max</m:mi>
                                                         <m:mo>&#8289;</m:mo>
                                                      </m:mrow>
                                                      <m:mrow>
                                                         <m:mi>b</m:mi>
                                                         <m:mo>&#8712;</m:mo>
                                                         <m:mo stretchy="false">(</m:mo>
                                                         <m:mi>A</m:mi>
                                                         <m:mo>,</m:mo>
                                                         <m:mi>C</m:mi>
                                                         <m:mo>,</m:mo>
                                                         <m:mi>G</m:mi>
                                                         <m:mo>,</m:mo>
                                                         <m:mi>T</m:mi>
                                                         <m:mo stretchy="false">)</m:mo>
                                                      </m:mrow>
                                                   </m:msub>
                                                   <m:msubsup>
                                                      <m:mi>f</m:mi>
                                                      <m:mi>i</m:mi>
                                                      <m:mi>b</m:mi>
                                                   </m:msubsup>
                                                </m:mrow>
                                             </m:mstyle>
                                          </m:mrow>
                                       </m:mfrac>
                                    </m:mrow>
                                 </m:mtd>
                              </m:mtr>
                              <m:mtr columnalign="left">
                                 <m:mtd columnalign="left">
                                    <m:mrow>
                                       <m:mtext>where&#160;</m:mtext>
                                       <m:msub>
                                          <m:mi>&#952;</m:mi>
                                          <m:mtext>i</m:mtext>
                                       </m:msub>
                                       <m:mo>=</m:mo>
                                       <m:mn>1</m:mn>
                                       <m:mo>+</m:mo>
                                       <m:mi>ln</m:mi>
                                       <m:mo>&#8289;</m:mo>
                                       <m:mn>4</m:mn>
                                       <m:mo>&#215;</m:mo>
                                       <m:mstyle displaystyle="true">
                                          <m:munder>
                                             <m:mo>&#8721;</m:mo>
                                             <m:mi>b</m:mi>
                                          </m:munder>
                                          <m:mrow>
                                 