<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2002-3-12-research0086</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Assessing the impact of comparative genomic sequence data on the functional annotation of the <it>Drosophila </it>genome</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Bergman</snm>
               <mi>M</mi>
               <fnm>Casey</fnm>
               <insr iid="I1"/>
               <insr iid="I9"/>
            </au>
            <au id="A2">
               <snm>Pfeiffer</snm>
               <mi>D</mi>
               <fnm>Barret</fnm>
               <insr iid="I1"/>
               <insr iid="I9"/>
            </au>
            <au id="A3">
               <snm>Rinc&#243;n-Limas</snm>
               <mi>E</mi>
               <fnm>Diego</fnm>
               <insr iid="I2"/>
               <insr iid="I6"/>
            </au>
            <au id="A4">
               <snm>Hoskins</snm>
               <mi>A</mi>
               <fnm>Roger</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A5">
               <snm>Gnirke</snm>
               <fnm>Andreas</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A6">
               <snm>Mungall</snm>
               <mi>J</mi>
               <fnm>Chris</fnm>
               <insr iid="I4"/>
            </au>
            <au id="A7">
               <snm>Wang</snm>
               <mi>M</mi>
               <fnm>Adrienne</fnm>
               <insr iid="I1"/>
               <insr iid="I7"/>
            </au>
            <au id="A8">
               <snm>Kronmiller</snm>
               <fnm>Brent</fnm>
               <insr iid="I1"/>
               <insr iid="I8"/>
            </au>
            <au id="A9">
               <snm>Pacleb</snm>
               <fnm>Joanne</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A10">
               <snm>Park</snm>
               <fnm>Soo</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A11">
               <snm>Stapleton</snm>
               <fnm>Mark</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A12">
               <snm>Wan</snm>
               <fnm>Kenneth</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A13">
               <snm>George</snm>
               <mi>A</mi>
               <fnm>Reed</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A14">
               <snm>de Jong</snm>
               <mi>J</mi>
               <fnm>Pieter</fnm>
               <insr iid="I5"/>
            </au>
            <au id="A15">
               <snm>Botas</snm>
               <fnm>Juan</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A16">
               <snm>Rubin</snm>
               <mi>M</mi>
               <fnm>Gerald</fnm>
               <insr iid="I1"/>
               <insr iid="I4"/>
            </au>
            <au id="A17" ca="yes">
               <snm>Celniker</snm>
               <mi>E</mi>
               <fnm>Susan</fnm>
               <insr iid="I1"/>
               <email>celniker@bdgp.lbl.gov</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Berkeley <it>Drosophila </it>Genome Project, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA</p>
            </ins>
            <ins id="I3">
               <p>Exelixis Inc., South San Francisco, CA 94080, USA</p>
            </ins>
            <ins id="I4">
               <p>Howard Hughes Medical Institute, Department of Molecular and Cellular Biology, University of California, Berkeley, CA 94720, USA</p>
            </ins>
            <ins id="I5">
               <p>Children's Hospital and Research Center at Oakland, Oakland, CA 94609, USA</p>
            </ins>
            <ins id="I6">
               <p>Current address: Departamento de Biologia Molecular, Universidad Autonoma de Tamaulipas-UAMRA, Reynosa, CP 88740, Mexico</p>
            </ins>
            <ins id="I7">
               <p>Current address: Department of Physiology, University of California, San Francisco, CA 94143, USA</p>
            </ins>
            <ins id="I8">
               <p>Current address: Department of Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA</p>
            </ins>
            <ins id="I9">
               <p>These authors contributed equally to this work</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2002</pubdate>
         <volume>3</volume>
         <issue>12</issue>
         <fpage>research0086.1</fpage>
         <lpage>0086.20</lpage>
         <url>http://genomebiology.com/2002/3/12/research/0086</url>
         <note>This article is part of a series of refereed research articles from the Berkeley <it>Drosophila</it> Genome Project, FlyBase and colleagues, describing Release 3 of the <it>Drosophila</it> genome, which are freely available at <url>http://genomebiology.com/drosophila/</url>.</note>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2002-3-12-research0086</pubid>
               <pubid idtype="pmpid">12537575</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>8</day>
               <month>10</month>
               <year>2002</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>25</day>
               <month>11</month>
               <year>2002</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>5</day>
               <month>12</month>
               <year>2002</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>30</day>
               <month>12</month>
               <year>2002</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2002</year>
         <collab>Bergman et al., licensee BioMed Central Ltd</collab>
      </cpyrt>
      <shorttitle>
         <p>Assessing the impact of comparative genomic sequence data on the functional annotation of the <it>Drosophila </it>genome</p>
      </shorttitle>
      <shortabs>
         <p>Conservation was analyzed in eight genomic regions (<it>apterous, even-skipped, fushi tarazu, twist</it>, and <it>Rhodopsins 1, 2, 3</it> and <it>4</it>) from four <it>Drosophila </it>species (<it>D. erecta</it>, <it>D. pseudoobscura, D. willistoni</it>, and <it>D. littoralis</it>), covering more than 500 kb of the <it>D. melanogaster </it>genome. All <it>D. melanogaster </it>genes (and 78-82% of coding exons) identified in divergent species such as <it>D. pseudoobscura </it>show evidence of functional constraint. Addition of a third species can reveal functional constraint in otherwise non-significant pairwise exon comparisons.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We analyzed conservation in eight genomic regions (<it>apterous, even-skipped, fushi tarazu, twist</it>, and <it>Rhodopsins 1, 2, 3</it> and <it>4</it>) from four <it>Drosophila </it>species (<it>D. erecta</it>, <it>D. pseudoobscura, D. willistoni</it>, and <it>D. littoralis</it>), covering more than 500 kb of the <it>D. melanogaster </it>genome. All <it>D. melanogaster </it>genes (and 78-82% of coding exons) identified in divergent species such as <it>D. pseudoobscura </it>show evidence of functional constraint. Addition of a third species can reveal functional constraint in otherwise non-significant pairwise exon comparisons. Microsynteny is largely conserved, with rearrangement breakpoints, novel transposable element insertions, and gene transpositions occurring in similar numbers. Rates of amino-acid substitution are higher in uncharacterized genes than in genes that have previously been studied. Conserved non-coding sequences (CNCSs) tend to be spatially clustered with conserved spacing between CNCSs, and clusters of CNCSs can be used to predict enhancer sequences.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Our results provide the basis for choosing species whose genome sequences would be most useful in aiding the functional annotation of coding and <it>cis</it>-regulatory sequences in <it>Drosophila</it>. Furthermore, this work shows how decoding the spatial organization of conserved sequences, such as the clustering of CNCSs, can complement efforts to annotate eukaryotic genomes on the basis of sequence conservation alone.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010008">Evolution</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010015">Model organisms</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The functional annotation of metazoan genome sequences represents one of the greatest challenges in modern biological research. For example, even with structural constraints imposed by the genetic code to guide algorithm design, the identification of all protein-coding genes in a metazoan genome remains an unsolved computational problem. The identification of functional non-coding sequences, such as untranslated regions (UTRs), genes for non-protein-coding RNAs, and <it>cis</it>-regulatory elements, poses an even more difficult problem for comprehensive genome annotation, as the rules governing their structure and function remain more elusive. Despite these difficulties, it is increasingly clear that comparative genomic approaches will substantially aid efforts to annotate these and other important sequence features. With whole-genome sequence data quickly becoming available for several organisms, it is important to determine which species comparisons and features of genome evolution will be most useful for comparative genome annotation.</p>
         <p>The genus <it>Drosophila </it>offers a well-characterized evolutionary genetic system for developing and testing methods for comparative genome annotation. From the seminal population-genetic and phylogenetic studies of Dobzhansky and co-workers [<abbr bid="B1">1</abbr>], and the classification of taxonomic relationships in the genus by Patterson, Stone and others [<abbr bid="B2">2</abbr>], <it>Drosophila </it>has long served as a model system for developing and testing evolutionary principles at the morphological and cytological levels. The genus <it>Drosophila </it>has also served as a proving ground for developing and testing evolutionary principles at the protein [<abbr bid="B3">3</abbr>] and DNA sequence levels [<abbr bid="B4">4</abbr>]. In addition, for over a decade and a half, comparative sequence analysis has had an important role in the functional analysis of genes and <it>cis</it>-regulatory sequences in <it>Drosophila </it>(see, for example, [<abbr bid="B5">5</abbr>,<abbr bid="B6">6</abbr>]). This history of research has culminated in a rich understanding of the pattern and process of molecular evolution in the genus <it>Drosophila </it>[<abbr bid="B7">7</abbr>]. With the complete sequencing of the euchromatic portion of the <it>Drosophila melanogaster </it>genome [<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>], this prior knowledge can be applied to the task of comparative genome annotation.</p>
         <p>We have undertaken a pilot study to assess the impact of large-scale comparative genomic sequence data on the functional annotation of the <it>Drosophila </it>genome. Our goals are to identify the species whose genome sequences would be most useful in annotating the <it>D. melanogaster </it>genome, and to identify features of genome evolution that can assist the annotation of protein-coding genes and the non-coding <it>cis</it>-regulatory sequences controlling their transcription. The lessons learned from this study have implications for efforts to annotate the entire <it>D. melanogaster </it>genome using comparative sequence data from the forthcoming <it>D. pseudoobscura </it>genome [<abbr bid="B10">10</abbr>] as well as the recently completed <it>Anopheles gambiae </it>genome [<abbr bid="B11">11</abbr>]. Beyond the initial analyses presented here, these data also serve as materials for the further study of molecular evolutionary processes in <it>Drosophila </it>and the calibration of comparative sequence analysis tools.</p>
         <p>Here, we report the isolation and analysis of genomic sequences from eight candidate regions representing both gene-rich and gene-poor regions of the <it>Drosophila </it>genome, totaling over 1.25 megabases (Mb) of DNA sequence. These regions were isolated from fosmid libraries of four divergent <it>Drosophila </it>species (<it>D. erecta, D. pseudoobscura, D. willistoni</it>, and <it>D. littoralis</it>) chosen to cover a range of divergence times (6-15, 46, 53 and 61-65 million years, respectively) from the reference species, <it>D. melanogaster </it>[<abbr bid="B7">7</abbr>]. Using the annotation pipeline and curation tools described in accompanying papers [<abbr bid="B12">12</abbr>,<abbr bid="B13">13</abbr>,<abbr bid="B14">14</abbr>], we predicted the coding sequence content of these sequences for subsequent comparative analyses. Our results indicate that the majority of coding sequences predicted in <it>D. melanogaster </it>can be identified in divergent <it>Drosophila </it>species and show evidence of functional constraint. Microsynteny is generally maintained at the scale of individual fosmid clones, and the few rearrangement breakpoints, transposable elements and gene transpositions can readily be identified. Analysis of coding sequence evolution suggests that uncharacterized genes, which we will refer to as 'predicted' genes, tend to have a higher rate of protein evolution than 'known' genes - those genes that have been selected for experimental study and thus are more likely to have easily discerned functions. Analysis of non-coding sequence evolution reveals that levels of conservation vary with divergence time, and that conserved non-coding sequences (CNCSs) exhibit a striking pattern of spatial clustering in <it>Drosophila</it>. Using transgenic reporter assays we show that CNCS clusters can be used to accurately predict a developmentally regulated enhancer in the <it>apterous </it>(<it>ap</it>) region. 
We discuss the implications of our results for comparative approaches to protein-coding and <it>cis</it>-regulatory sequence prediction in the genus <it>Drosophila</it>.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Isolation and sequencing of genomic regions from divergent <it>Drosophila</it> species</p>
            </st>
            <p>On the basis of genome size considerations and the desire to investigate a range of divergence times in the genus <it>Drosophila</it>, we constructed fosmid libraries (approximately 40-kb inserts) for <it>D. erecta, D. pseudoobscura, D. willistoni </it>and <it>D. littoralis </it>(Figure <figr fid="F1">1</figr>). <it>D. littoralis </it>is closely related to the well-studied species, <it>D. virilis</it>, but has been reported to have less dispersed repetitive DNA than <it>D. virilis </it>(Kevin White, personal communication). We designed degenerate PCR primers for a set of eight well-characterized genes (<it>apterous (ap), even-skipped (eve), fushi tarazu (ftz), twist (twi)</it>, and <it>Rhodopsins 1, 2, 3</it> and <it>4</it> (<it>Rh1, Rh2, Rh3 </it>and <it>Rh4</it>)) to obtain species-specific sequence-tagged sites (STSs) that were subsequently used for hybridization to gridded fosmid filters (see Materials and methods). Positive clones from the library screen were verified by PCR and restriction mapped to choose the longest clone containing the candidate gene and its regulatory regions.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Phylogenetic relationships of the five <it>Drosophila </it>species studied in this paper and the outgroup species, the mosquito <it>Anopheles gambiae</it></p>
               </caption>
               <text>
                  <p>Phylogenetic relationships of the five <it>Drosophila </it>species studied in this paper and the outgroup species, the mosquito <it>Anopheles gambiae</it>. The topology of this tree is based on the accepted relationships of these six species; the divergence times from <it>D. melanogaster </it>are approximately 6-15, 46, 53, 61-65, and 250 million years for <it>D. erecta</it>, <it>D. pseudoobscura</it>, <it>D. willistoni</it>, <it>D. littoralis </it>and <it>A. gambiae</it>, respectively [<abbr bid="B7">7</abbr>,<abbr bid="B84">84</abbr>]. <it>D. melanogaster</it>, <it>D. erecta</it>, <it>D. pseudoobscura </it>and <it>D. willistoni </it>belong to the subgenus <it>Sophophora </it>and <it>D. littoralis </it>belongs to the subgenus <it>Drosophila</it>. Rearrangements are indicated by double-headed arrows below each branch and gene transpositions are indicated by triangles above each branch. Rearrangements are inferred to have occurred on the lineages leading to (a) the ancestor of the <it>D. melanogaster/D. erecta eve </it>region, (b) the <it>D. pseudoobscura Rh1 </it>region, the <it>D. willistoni </it>(c) <it>eve</it>, (d) <it>Rh1</it>, and (e) <it>Rh3 </it>regions, and (f) the <it>D. littoralis ftz </it>region. Gene transpositions are inferred to have occurred for the (1) <it>CG13029 </it>and (2) <it>CG12133 </it>genes in the ancestor of the <it>D. melanogaster/D. erecta </it>lineage, (3) the <it>CG5245</it>-like gene in the <it>D. pseudoobscura </it>lineage, (4) the <it>CG8319</it>-like gene in the <it>D. willistoni </it>lineage, (5) the <it>CG2222</it>-like gene in the <it>D. willistoni </it>lineage, and (6) the <it>Rh4 </it>gene in the <it>D. littoralis </it>lineage. We note that the event classified as a rearrangement involving the <it>D. pseudoobscura CG31155 </it>gene at the end of the <it>Rh1 </it>clone may be a gene transposition as this gene is a partial gene spanning the edge of the clone. In addition, we note that the rearrangement involving the <it>D. littoralis ftz </it>gene may have occurred on the branch leading to the ancestor of the Sophophoran species since, although the orientation of <it>ftz </it>with respect to <it>Antp </it>is ambiguous in <it>A. gambiae </it>([<abbr bid="B85">85</abbr>,<abbr bid="B86">86</abbr>] and data not shown), it shares a similar configuration to <it>D. littoralis </it>in the outgroup, <it>Tribolium castaneum </it>[<abbr bid="B87">87</abbr>].</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-1"/>
            </fig>
            <p>In the initial design of this project, comparative sequence data were to be collected from a <it>D. virilis </it>P1 library [<abbr bid="B15">15</abbr>]. Using a PCR-based plate-pool screening strategy, we isolated a P1 clone from this library containing an 83.2-kb insert from the <it>ap </it>region of <it>D. virilis</it>. Sequencing of this clone revealed long stretches of repetitive DNA, which complicated both assembly and comparative analyses. In addition, the insert size of the <it>D. virilis </it>P1 library (approximately 60-80 kb) was greater than necessary for comparative analysis of single gene regions. This clone was used to guide transgenic reporter analysis (see below), but has not been included in the other analyses reported here.</p>
            <p>In total, 30 fosmid clones, which sum to 1,257,069 bp, were isolated and sequenced using methods described in [<abbr bid="B9">9</abbr>]. All clones were finished to an estimated error rate of fewer than 0.17 errors per 10 kb, with an average estimated error rate of 0.03 errors per 10 kb. The lengths of fosmids sequenced for the eight candidate regions are shown in Table <tblr tid="T1">1</tblr>. Though we were able to obtain species-specific STSs for the <it>D. willistoni twi </it>gene, we were not able to obtain clones for this region from the <it>D. willistoni </it>fosmid library. We were also not able to obtain a species-specific probe for <it>D. willistoni ftz</it>, nor could we obtain any <it>D. willistoni ftz </it>clones using probes from other non-melanogaster species. Also shown in Table <tblr tid="T1">1</tblr> are the lengths and locations of <it>D. melanogaster </it>genomic regions corresponding to the union of the Release 3 sequences homologous to all four non-melanogaster species. The union of sequences from all non-melanogaster species for the eight candidate regions covers 494.6 kb of the <it>D. melanogaster </it>genome; an additional 65.3 kb of <it>D. melanogaster </it>genomic sequence was sampled owing to rearrangements in non-melanogaster species. Thus the 1.25 Mb of comparative data presented here span over 0.5 Mb of coding and non-coding sequences of the <it>D. melanogaster </it>genome.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Summary of candidate gene regions and lengths of sequences analyzed in this study</p>
               </caption>
               <tblbdy cols="8">
                  <r>
                     <c ca="left">
                        <p>Region</p>
                     </c>
                     <c ca="center">
                        <p>Arm</p>
                     </c>
                     <c ca="left">
                        <p>Cytological location</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. melanogaster</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. erecta</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. pseudoobscura</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. willistoni</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. littoralis</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="8">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>Rh1</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3R</p>
                     </c>
                     <c ca="left">
                        <p>92B3-6</p>
                     </c>
                     <c ca="center">
                        <p>54,450</p>
                     </c>
                     <c ca="center">
                        <p>38,418</p>
                     </c>
                     <c ca="center">
                        <p>45,873</p>
                     </c>
                     <c ca="center">
                        <p>43,804</p>
                     </c>
                     <c ca="center">
                        <p>35,983</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>Rh2</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3R</p>
                     </c>
                     <c ca="left">
                        <p>91D3-5</p>
                     </c>
                     <c ca="center">
                        <p>58,172</p>
                     </c>
                     <c ca="center">
                        <p>43,599</p>
                     </c>
                     <c ca="center">
                        <p>42,336</p>
                     </c>
                     <c ca="center">
                        <p>35,954</p>
                     </c>
                     <c ca="center">
                        <p>43,945</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>Rh3</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3R</p>
                     </c>
                     <c ca="left">
                        <p>92C3-D1</p>
                     </c>
                     <c ca="center">
                        <p>83,394</p>
                     </c>
                     <c ca="center">
                        <p>43,180</p>
                     </c>
                     <c ca="center">
                        <p>42,117</p>
                     </c>
                     <c ca="center">
                        <p>41,651</p>
                     </c>
                     <c ca="center">
                        <p>45,428</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>Rh4</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3L</p>
                     </c>
                     <c ca="left">
                        <p>73D1-6</p>
                     </c>
                     <c ca="center">
                        <p>53,470</p>
                     </c>
                     <c ca="center">
                        <p>41,352</p>
                     </c>
                     <c ca="center">
                        <p>44,117</p>
                     </c>
                     <c ca="center">
                        <p>36,325</p>
                     </c>
                     <c ca="center">
                        <p>44,255</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>ap</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2R</p>
                     </c>
                     <c ca="left">
                        <p>41F8</p>
                     </c>
                     <c ca="center">
                        <p>50,314</p>
                     </c>
                     <c ca="center">
                        <p>37,077</p>
                     </c>
                     <c ca="center">
                        <p>38,050</p>
                     </c>
                     <c ca="center">
                        <p>40,487</p>
                     </c>
                     <c ca="center">
                        <p>39,016</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>eve</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2R</p>
                     </c>
                     <c ca="left">
                        <p>47C6-D4</p>
                     </c>
                     <c ca="center">
                        <p>46,587</p>
                     </c>
                     <c ca="center">
                        <p>45,909</p>
                     </c>
                     <c ca="center">
                        <p>44,139</p>
                     </c>
                     <c ca="center">
                        <p>38,059</p>
                     </c>
                     <c ca="center">
                        <p>43,320</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>ftz</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>3R</p>
                     </c>
                     <c ca="left">
                        <p>84A5-B2</p>
                     </c>
                     <c ca="center">
                        <p>66,214</p>
                     </c>
                     <c ca="center">
                        <p>44,340</p>
                     </c>
                     <c ca="center">
                        <p>42,627</p>
                     </c>
                     <c ca="center">
                        <p>NA</p>
                     </c>
                     <c ca="center">
                        <p>43,155</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>twi</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>2R</p>
                     </c>
                     <c ca="left">
                        <p>59C1-3</p>
                     </c>
                     <c ca="center">
                        <p>82,029</p>
                     </c>
                     <c ca="center">
                        <p>43,101</p>
                     </c>
                     <c ca="center">
                        <p>43,025</p>
                     </c>
                     <c ca="center">
                        <p>NA</p>
                     </c>
                     <c ca="center">
                        <p>46,427</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>494,630*</p>
                     </c>
                     <c ca="center">
                        <p>336,976</p>
                     </c>
                     <c ca="center">
                        <p>342,284</p>
                     </c>
                     <c ca="center">
                        <p>236,280</p>
                     </c>
                     <c ca="center">
                        <p>341,529</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Cytological locations are for sequences in <it>D. melanogaster</it>. The <it>D. willistoni ftz </it>and <it>twi </it>regions (NA) were not isolated in our library screen. All fosmid clones sequenced have estimated error rates of fewer than 0.17 errors/10 kb. *An additional 65.3 kb of sequence was surveyed from other regions of the <it>D. melanogaster </it>genome as a result of rearrangements (see text for details).</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Comparative annotation of coding sequences</p>
            </st>
            <p>The 30 non-melanogaster fosmid (and the <it>D. virilis ap </it>P1) sequences were processed computationally with the pipeline used to re-annotate the <it>D. melanogaster </it>genome [<abbr bid="B12">12</abbr>]. The only major modification to this pipeline was the addition of a tier of evidence containing the results of TBLASTN searches of all Release 3 <it>D. melanogaster </it>peptides [<abbr bid="B14">14</abbr>] against non-melanogaster sequences. Predicted coding sequences were manually verified and refined using the Apollo annotation tool [<abbr bid="B13">13</abbr>]. As no expressed sequence tag (EST) information exists to annotate transcribed non-coding sequences (such as UTRs) for the four non-melanogaster species, we annotated only protein-coding gene and exon models. Thus, in keeping with other gene-prediction studies (for example [<abbr bid="B16">16</abbr>]), we use the terms gene and exon to refer to the translated components of genes and exons.</p>
            <p>In the 30 fosmids, we predict a total of 164 protein-coding genes in non-melanogaster species (53 in <it>D. erecta</it>, 41 in <it>D. pseudoobscura</it>, 39 in <it>D. willistoni</it>, 31 in <it>D. littoralis</it>) that form orthologous clusters with 81 <it>D. melanogaster </it>genes. Of the 81 genes, 30 are 'known' genes that have been functionally characterized in some way by the community of <it>Drosophila </it>researchers; the remaining 51 genes are 'predicted' genes based only on the evidence in the Release 3 annotations ([<abbr bid="B14">14</abbr>] and see Supplementary Table 1 in the Additional data files section). Of the 164 genes predicted in non-melanogaster species, 133 (81%) are full length; the remaining 31 (19%) are partial coding sequences that span the edge of the sequenced genomic clone. In non-melanogaster species, we predict 495 coding exons (148 in <it>D. erecta</it>, 133 in <it>D. pseudoobscura</it>, 111 in <it>D. willistoni</it>, 103 in <it>D. littoralis</it>) that form orthologous clusters with 264 <it>D. melanogaster </it>coding exons. On average, there are approximately two non-melanogaster species sampled per orthologous gene and coding exon cluster. Fifteen genes (10 complete) and 39 coding exons were sequenced in all four non-melanogaster species.</p>
            <p>Qualitatively, our data reveal that the majority of <it>D. melanogaster </it>Release 3 gene models are highly conserved in divergent <it>Drosophila </it>species. This made it possible to automatically identify orthologous genes in non-melanogaster species using TBLASTN results in conjunction with Genie [<abbr bid="B17">17</abbr>] and/or GENSCAN [<abbr bid="B18">18</abbr>] predictions to improve intron-exon boundaries and identify small/divergent exons in Apollo. In the few discrepant cases where no clear ortholog could be unambiguously identified (such as the four closely related members of the <it>Rhodopsin </it>gene family), we used the conserved microsyntenic gene orders maintained in these species to resolve orthologs (see below). With the exception of the retrotransposition events discussed below, the intron-exon structure of gene models is highly conserved as well: only one case of intron gain was observed, in <it>D. littoralis Rh2</it>, as has been reported previously for <it>Rh2 </it>in the closely related species, <it>D. virilis </it>[<abbr bid="B19">19</abbr>]. For a small class of genes (<it>BcDNA:LD21213, Gr59a, Gr59b, CG9895, CG10887, CG17186, CG4733</it>), orthologs could be identified in divergent species, but amino-acid sequences could not be reliably aligned with the <it>D. melanogaster </it>gene model. In addition, orthologs of four genes (<it>CG13029, CG14294, CG12133, CG12378</it>) could not be identified in non-melanogaster species except in <it>D. erecta</it>, the species most closely related to <it>D. melanogaster</it>. The absence of these genes is not simply due to insufficient sampling, since in these cases both 5' and 3' neighboring genes could be identified in more divergent species (see Figure <figr fid="F2">2</figr>, for example). These may represent genes overpredicted in both <it>D. erecta </it>and <it>D. melanogaster</it>, lineage-specific genes which evolved before the divergence of <it>D. melanogaster </it>from <it>D. erecta</it>, or genes which have transposed from (or to) other locations in the genomes of the more divergent species.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>VISTA plot of genome organization and sequence conservation in the <it>Drosophila eve </it>region</p>
               </caption>
               <text>
                  <p>VISTA plot of genome organization and sequence conservation in the <it>Drosophila eve </it>region. Sequences were aligned using AVID, and conserved sequences were visualized using default parameters of VISTA. From top to bottom are pairwise comparisons between <it>D. melanogaster </it>and <it>D. erecta </it>(mel-ere), <it>D. pseudoobscura </it>(mel-pse), <it>D. willistoni </it>(mel-wil) and <it>D. littoralis </it>(mel-lit), respectively. In each panel, conserved segments from 50-100% are plotted, with the midline indicating 75% identity; regions with no midline represent sequences not sampled in a pairwise comparison. Double bars crossing a midline represent rearrangement breakpoints. The location and orientation of coding sequences are indicated by arrows; purple boxes represent coding exons and light-blue boxes represent functionally characterized <it>cis</it>-regulatory sequences [<abbr bid="B50">50</abbr>,<abbr bid="B88">88</abbr>,<abbr bid="B89">89</abbr>,<abbr bid="B90">90</abbr>]; pink regions represent uncharacterized CNCSs. Suffixes on gene names (for example, <it>TER94-RA</it>) indicate the particular transcript displayed for genes with multiple transcripts. Note that the predicted gene <it>CG12133 </it>is restricted to the <it>D. melanogaster</it>/<it>D. erecta </it>lineage but is absent in <it>D. pseudoobscura</it>, although both flanking genes are present.</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-2"/>
            </fig>
            <p>We used an evolutionary genetic approach, the K<sub>a</sub>/K<sub>s </sub>test, to assess the accuracy of these gene and exon predictions [<abbr bid="B20">20</abbr>]. This test relies on the assumption that functionally constrained protein-coding sequences should exhibit significantly lower rates of evolution in amino-acid-encoding nucleotide sites (typically first and second positions in a codon) relative to silent sites (typically third positions in a codon). Quantitatively, this leads to the prediction that the ratio of the average rate of amino-acid substitution per site (K<sub>a</sub>) relative to the average rate of silent substitution per site (K<sub>s</sub>) for functionally constrained coding sequences should be significantly less than 1 [<abbr bid="B21">21</abbr>]. Genes or exons which have a K<sub>a</sub>/K<sub>s </sub>&#8776; 1 are inferred to evolve in the absence of functional constraint; genes or exons which have a K<sub>a</sub>/K<sub>s </sub>> 1 are inferred to evolve under the influence of positive selection. The significance of a K<sub>a</sub>/K<sub>s </sub>ratio can be determined by a likelihood ratio test of the probabilities of the data under the alternative hypotheses of functional constraint relative to no constraint [<abbr bid="B22">22</abbr>]. Genes or coding exons with a K<sub>a</sub>/K<sub>s </sub>ratio significantly less than 1 'pass' the K<sub>a</sub>/K<sub>s </sub>test; genes or coding exons with a K<sub>a</sub>/K<sub>s </sub>ratio not significantly less than 1 'fail' the K<sub>a</sub>/K<sub>s </sub>test. The power of this test to detect functional constraint is influenced both by evolutionary distance and sequence length [<abbr bid="B20">20</abbr>]; thus we analyzed both genes and coding exons in pairwise comparison with all four non-melanogaster species.</p>
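            <p>The pass/fail rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis pipeline itself: the function name, the 0.05 cutoff and the use of a chi-square approximation with one degree of freedom are assumptions made for the example, and the two log-likelihoods are taken as given (for instance, from fits with the ratio estimated freely and with the ratio fixed at 1).</p>

```python
import math


def ka_ks_test(lnl_free, lnl_fixed, omega_hat, alpha=0.05):
    """Likelihood ratio test of Ka/Ks = 1 against a freely estimated Ka/Ks.

    lnl_free  -- log-likelihood with Ka/Ks estimated from the data
    lnl_fixed -- log-likelihood with Ka/Ks fixed at 1 (no constraint)
    omega_hat -- the estimated Ka/Ks ratio
    alpha     -- significance cutoff (hypothetical; chosen for illustration)
    """
    # Twice the log-likelihood difference; clamp small negative rounding noise.
    statistic = max(0.0, 2.0 * (lnl_free - lnl_fixed))
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    # 'Pass' requires a ratio below 1 that is also significantly different from 1.
    passes = (1.0 > omega_hat) and (alpha > p_value)
    return statistic, p_value, passes
```

            <p>Under these assumptions, log-likelihoods of -1000.0 (ratio estimated) and -1010.0 (ratio fixed at 1) with an estimated ratio of 0.1 give a test statistic of 20 and a 'pass', whereas a log-likelihood difference of only 0.5 does not reach significance and is scored as a 'fail'.</p>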
            <p>All pairwise gene-level comparisons studied here exhibited K<sub>a</sub>/K<sub>s </sub>ratios less than one (see Supplementary Table 1 in the Additional data files section). Of these K<sub>a</sub>/K<sub>s </sub>ratios, 155 of 164 (94.5%) were significantly less than 1, indicating that the vast majority of genes in our sample show evidence of functional constraint. All nine pairwise comparisons that fail the K<sub>a</sub>/K<sub>s </sub>test at the gene level were <it>D. melanogaster-D. erecta </it>comparisons, and eight out of nine involved predicted genes (Supplementary Table 1). Genomic sequences for six of the nine genes which fail the K<sub>a</sub>/K<sub>s </sub>test at the gene level were sampled in more divergent species: four of these six genes could be identified in more divergent species (<it>Lmpt, CG10887, CG14292</it>, and <it>CG4468</it>), whereas two could not (<it>CG12378 </it>and <it>CG14294</it>), indicating that genes conserved in divergent species can fail gene-level K<sub>a</sub>/K<sub>s </sub>tests in comparisons among closely related species like <it>D. erecta</it>. Of the four genes identified only in <it>D. melanogaster </it>and <it>D. erecta </it>and not in more distantly related species, two pass (<it>CG12133 </it>and <it>CG13029</it>) and two fail (<it>CG12378 </it>and <it>CG14294</it>) the gene-level K<sub>a</sub>/K<sub>s </sub>test. We note that of these four genes, the two genes that pass (<it>CG12133 </it>and <it>CG13029</it>) have multiple exons, whereas the two genes that fail (<it>CG12378 </it>and <it>CG14294</it>) have only a single exon. This result indicates that at least some of the genes found only in <it>D. melanogaster </it>and <it>D. erecta </it>are likely to be real genes under functional constraint.</p>
            <p>Though the majority of pairwise exon-level comparisons have K<sub>a</sub>/K<sub>s </sub>ratios less than one (Figure <figr fid="F3">3</figr>), a much lower proportion of pairwise comparisons at the exon level pass the K<sub>a</sub>/K<sub>s </sub>test. In total, 71.9% (356/495) of pairwise comparisons at the exon level pass the K<sub>a</sub>/K<sub>s </sub>test: 54.0% (80/148) for <it>D. erecta</it>; 78.9% (105/133) for <it>D. pseudoobscura</it>; 81.1% (90/111) for <it>D. willistoni</it>; and 79.6% (82/103) for <it>D. littoralis</it>. Coding exons from known and predicted genes pass the K<sub>a</sub>/K<sub>s </sub>test at similar rates: overall (72.2% known versus 71.1% predicted), <it>D. erecta </it>(56.6% known versus 50.0% predicted), <it>D. pseudoobscura </it>(80.4% known versus 75.6% predicted), <it>D. willistoni </it>(81.5% known versus 82.1% predicted), <it>D. littoralis </it>(78.0% known versus 80.6% predicted). The majority of exons that fail the K<sub>a</sub>/K<sub>s </sub>test still have K<sub>a</sub>/K<sub>s </sub>ratios less than 1; only six non-significant pairwise exon comparisons (one in <it>D. erecta</it>, one in <it>D. pseudoobscura</it>, two in <it>D. willistoni</it>, and two in <it>D. littoralis</it>) have K<sub>a</sub>/K<sub>s </sub>ratios greater than 1 (Figure <figr fid="F3">3</figr>). As with gene-level comparisons, the most closely related species, <it>D. erecta</it>, fails the highest proportion of exon-level K<sub>a</sub>/K<sub>s </sub>tests. In contrast to gene-level comparisons, there is no tendency for exons from predicted genes to fail K<sub>a</sub>/K<sub>s </sub>tests relative to exons from known genes.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Frequency distribution of K<sub>a</sub>/K<sub>s </sub>ratios for pairwise exon-level comparisons between <it>D. melanogaster </it>and either <it>D. erecta</it>, <it>D. pseudoobscura</it>, <it>D. willistoni</it>, or <it>D. littoralis</it></p>
               </caption>
               <text>
                  <p>Frequency distribution of K<sub>a</sub>/K<sub>s </sub>ratios for pairwise exon-level comparisons between <it>D. melanogaster </it>and either <it>D. erecta</it>, <it>D. pseudoobscura</it>, <it>D. willistoni</it>, or <it>D. littoralis</it>. K<sub>a</sub>/K<sub>s </sub>ratios were estimated using the codeml program of PAML 3.12 using runmode = -2.</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-3"/>
            </fig>
            <p>Pairwise comparisons that do not pass the K<sub>a</sub>/K<sub>s </sub>test could result from misannotated exons or an insufficient amount of divergence to resolve differential rates of amino acid and silent site evolution. Failure to pass exon-level K<sub>a</sub>/K<sub>s </sub>tests because of insufficient divergence is a function of divergence time and exon length [<abbr bid="B20">20</abbr>]. Our results suggest that both factors contribute to non-significant exon-level K<sub>a</sub>/K<sub>s </sub>tests between species in the genus <it>Drosophila</it>. The fact that the most closely related species, <it>D. erecta</it>, fails the highest proportion of gene- and exon-level K<sub>a</sub>/K<sub>s </sub>tests indicates that insufficient divergence time contributes to non-significant comparisons. Exon length is also a factor, as there is a tendency for shorter exons to fail K<sub>a</sub>/K<sub>s </sub>tests in our data. For example, the average length of all exons failing the K<sub>a</sub>/K<sub>s </sub>test in comparisons between <it>D. melanogaster </it>and <it>D. pseudoobscura </it>is 22.1 codons, and the average length of all exons passing the K<sub>a</sub>/K<sub>s </sub>test is 152.1 codons. Similar results are obtained for pairwise comparisons involving <it>D. melanogaster </it>and <it>D. erecta, D. willistoni </it>or <it>D. littoralis</it>, and for both known and predicted genes (data not shown).</p>
            <p>To determine if insufficient divergence time is the major cause of non-significant exon-level K<sub>a</sub>/K<sub>s </sub>tests, we performed multi-species exon-level K<sub>a</sub>/K<sub>s </sub>tests that capitalize on a greater total amount of divergence for a given exon [<abbr bid="B20">20</abbr>]. The question addressed by this analysis is: does the addition of a third species to <it>D. melanogaster-D. pseudoobscura </it>pairwise comparisons increase the proportion of exons that pass exon-level K<sub>a</sub>/K<sub>s </sub>tests? For this analysis, we analyzed exons that failed pairwise tests between <it>D. melanogaster </it>and <it>D. pseudoobscura </it>using triplets involving <it>D. melanogaster, D. pseudoobscura </it>and one other non-melanogaster species. Using the same cutoffs as for the pairwise exon-level analyses and a guide tree based on Figure <figr fid="F1">1</figr>, we tested 16, 14 and 13 exons which did not show evidence of functional constraint between <it>D. melanogaster </it>and <it>D. pseudoobscura</it>, for which we have sequence data available in <it>D. erecta, D. willistoni </it>and <it>D. littoralis</it>, respectively. Only 2 of the 16 (12.5%) non-significant exon-level <it>D. melanogaster-D. pseudoobscura </it>comparisons pass the K<sub>a</sub>/K<sub>s </sub>test when <it>D. erecta </it>is included as a third species, whereas 6 of 14 (42.9%) and 6 of 13 (46.2%) pass when <it>D. willistoni </it>and <it>D. littoralis </it>are included as a third species, respectively. These results demonstrate that multiple comparisons among divergent species can reveal functional constraint acting on coding exons that cannot otherwise be detected in pairwise comparisons.</p>
            <p>Finally, as a preliminary assessment of the relative utility of <it>A. gambiae </it>genome sequences for comparative gene prediction in <it>Drosophila</it>, we attempted to identify homologs in <it>A. gambiae </it>of the 81 genes for which we have comparative sequence data in <it>Drosophila</it>. For 21 of the 81 genes in our study (25.9%) we were not able to obtain a clear homolog (defined as a high-scoring pair (HSP) with an expected (E) value less than 10<sup>-20 </sup>and greater than or equal to 30% identity over 100 amino acids using default parameters of TBLASTN) in the <it>A. gambiae </it>mapped scaffold sequences; 11 of these 21 genes did not yield any HSPs at all. These results are compatible with a recent whole-genome analysis showing that 18.6% of <it>D. melanogaster </it>genes have no clear homolog in <it>A. gambiae </it>[<abbr bid="B23">23</abbr>]. No clear homolog could be identified in the <it>A. gambiae </it>genome sequences for three of 30 (10.0%) known genes in our dataset, whereas the proportion was more than three times higher for predicted genes: 18 of 51 (35.3%) had no clear homolog in <it>A. gambiae</it>. Five <it>D. melanogaster </it>genes - the four members of the <it>Rhodopsin </it>gene family and <it>CG5245 </it>- have multiple HSPs in the <it>A. gambiae </it>genome sequences. We were able to resolve orthology for <it>Rh4 </it>only, as the <it>sina </it>gene in <it>A. gambiae </it>is contained within the <it>Rh4 </it>gene as in <it>D. melanogaster </it>and other species in the subgenus <it>Sophophora</it>.</p>
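            <p>The 'clear homolog' criterion used in the preceding paragraph can be written as a simple filter over HSPs. The sketch below is illustrative only: the function names and the tuple layout are hypothetical, "over 100 amino acids" is interpreted here as an aligned length of at least 100 residues, and treating a gene with more than one qualifying HSP as a separate class is an assumption made for the example.</p>

```python
def is_clear_homolog(e_value, percent_identity, aligned_residues):
    """Apply the thresholds stated in the text to a single HSP."""
    return (
        1e-20 > e_value               # E value strictly below 10^-20
        and percent_identity >= 30.0  # at least 30% identity
        and aligned_residues >= 100   # assumed reading of "over 100 amino acids"
    )


def classify_gene(hsps):
    """Classify one query gene from its HSPs.

    hsps -- list of (e_value, percent_identity, aligned_residues) tuples
            (a hypothetical layout chosen for this example)
    """
    if not hsps:
        return "no HSP"
    clear = [hsp for hsp in hsps if is_clear_homolog(*hsp)]
    if not clear:
        return "no clear homolog"
    if len(clear) == 1:
        return "single clear homolog"
    return "multiple clear homologs"
```

            <p>In this sketch, a gene whose only HSP has an E value of 10<sup>-35</sup>, 42% identity and 250 aligned residues is classified as having a single clear homolog, whereas a gene whose best HSP has an E value of 10<sup>-8</sup> is classified as having no clear homolog.</p>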
         </sec>
         <sec>
            <st>
               <p>Rearrangement and transposition of genomic sequences</p>
            </st>
            <p>Using the gene predictions discussed above as orthologous markers, we addressed the question of whether the microsyntenic relationships in the <it>D. melanogaster </it>genomic sequence surveyed are conserved in non-melanogaster species. In general, our data indicate that the microsyntenic order of coding and non-coding sequences is highly conserved in the genus <it>Drosophila </it>at the scale of individual fosmids (approximately 40 kb). Our data provide evidence for only six genomic rearrangements in these sequences occurring in the phylogeny of these five species, one each in the lineages leading to the <it>D. littoralis ftz, D. pseudoobscura Rh1, D. willistoni eve, Rh1</it>, and <it>Rh3 </it>regions, as well as in the ancestor of the <it>D. melanogaster/D. erecta eve </it>region (see Figure <figr fid="F1">1</figr>). All of these unique events occurred in non-coding intergenic regions and none of the rearrangement breakpoints is associated with detectable transposable element sequences (see also [<abbr bid="B24">24</abbr>]). Although it is difficult to estimate the length distribution of microsyntenic regions in <it>Drosophila </it>from our data, it is clear that very small microsyntenic regions can be delimited in the <it>Drosophila </it>genome through multiple species comparisons. For example, the two independent rearrangements in the vicinity of the <it>eve </it>locus reduce this microsyntenic region to an approximately 20-kb interval of the <it>D. melanogaster </it>genome containing only three neighboring genes (<it>Adam, CG12134 </it>and <it>eve</it>) and their flanking non-coding sequences (Figure <figr fid="F2">2</figr>).</p>
            <p>We can directly confirm the nature of one rearrangement (<it>D. littoralis ftz</it>) as a paracentric micro-inversion since both breakpoints are contained within a single fosmid clone. In this case, a small (approximately 14 kb) region containing the <it>ftz </it>coding sequence and flanking non-coding DNA is inverted between the <it>Antp </it>and <it>Scr </it>genes relative to <it>D. melanogaster</it>. Maier <it>et al</it>. [<abbr bid="B25">25</abbr>] provide hybridization data for a similar rearrangement in the <it>ftz </it>locus of <it>D. hydei</it>, another member of the subgenus <it>Drosophila</it>. It is likely that the other rearrangement breakpoints we observe also result from paracentric inversions, the predominant form of genome rearrangement in <it>Drosophila </it>[<abbr bid="B26">26</abbr>]. Consistent with this is the fact that rearranged sequences can be inferred to come from the same chromosome arm. At least two other breakpoints (in the <it>D. willistoni Rh1 </it>and <it>Rh3 </it>regions) also have probably arisen from micro-inversions, as in both cases only two genes are inferred to have switched order locally on the chromosome.</p>
            <p>We also identified eight examples of novel genetic elements in non-melanogaster species, seven of which occur in intergenic regions (Figure <figr fid="F1">1</figr>). Four of these cases involve the insertion of novel transposable element sequences: full length <it>Bari-1</it>-like elements in both the <it>D. pseudoobscura Rh1 </it>region and the <it>D. willistoni Rh3 </it>region, a partial <it>I</it>-like element in the <it>D. willistoni Rh4 </it>region, and a partial <it>blastopia</it>-like element in the <it>D. littoralis Rh3 </it>region. Identification of <it>Bari-1</it>-like transposon sequences in <it>D. pseudoobscura </it>and <it>D. willistoni </it>is consistent with previous observations [<abbr bid="B27">27</abbr>]; <it>I</it>-like elements have been shown to exist in the melanogaster and obscura species [<abbr bid="B28">28</abbr>], but this is the first report of <it>I</it>-like elements in the willistoni group. The other four cases arise from gene transposition: a homolog of the <it>D. melanogaster </it>X-chromosome gene <it>CG2222 </it>in the <it>D. willistoni eve </it>region; a homolog of the <it>D. melanogaster </it>3R-chromosome gene <it>CG5245 </it>in the <it>D. pseudoobscura Rh1 </it>region; a homolog of the <it>D. melanogaster </it>3R-chromosome gene <it>CG8319 </it>in the <it>D. willistoni Rh1 </it>region; and the <it>Rh4 </it>gene in <it>D. littoralis </it>(see below). The <it>CG5245</it>-like gene in <it>D. pseudoobscura </it>and the <it>CG8319</it>-like gene in <it>D. willistoni </it>are both located in the same intergenic region between the <it>Arc42 </it>and <it>PK92B </it>genes, but result from independent events since they involve different ancestral sequences and occur on opposite strands in this intergenic region. This result suggests the possibility of hotspots for gene transposition in the <it>Drosophila </it>genome.</p>
            <p>At least one novel gene, the <it>CG2222</it>-like gene in <it>D. willistoni</it>, is likely to have originated from a retrotransposition event as this gene lacks introns while its closest homolog, found on a different chromosome arm in the <it>D. melanogaster </it>genome, has two introns. Another striking example of retrotransposition involves the <it>D. littoralis Rh4 </it>gene and illustrates the fact that functionally important genes can undergo dramatic changes in location and gene structure during genome evolution [<abbr bid="B29">29</abbr>]. This gene maintains its microsyntenic relationship with neighboring genes in the 72D2-3 region of the <it>D. melanogaster </it>genome in Sophophoran species, but has retrotransposed into the intron of another gene, <it>CG10967</it>, in a region of the <it>D. littoralis </it>genome that corresponds to the 69E1-2 region of the <it>D. melanogaster </it>genome. As a result, genes contained in the intron of the Sophophoran <it>Rh4 </it>(<it>sina</it>, <it>CG13030 </it>and <it>CG13029</it>) have been lost. Cytological evidence for transposition of <it>Rh4 </it>exists for the closely related species <it>D. virilis </it>and the more distantly related species <it>D. repleta </it>[<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>].</p>
            <p>In contrast to the stability of microsyntenic gene order in the genus <it>Drosophila</it>, we found that the genes sampled here are scattered widely throughout the <it>Anopheles </it>genome. For example, of the 55 <it>Drosophila </it>genes that had a single clear homolog in <it>Anopheles</it>, 27 are located on <it>D. melanogaster </it>chromosome arm 2R. Of these 27 genes, ten, five, six and six are located on <it>A. gambiae </it>chromosome arms 2L, 2R, 3L and 3R, respectively. These results are consistent with previous reports comparing the locations of genes in <it>D. melanogaster </it>with <it>A. gambiae</it>, which indicate that extensive genome rearrangement has occurred since the divergence of these two lineages [<abbr bid="B23">23</abbr>,<abbr bid="B31">31</abbr>,<abbr bid="B32">32</abbr>]. Some <it>D. melanogaster </it>genes in our sample do maintain microsyntenic relationships in <it>A. gambiae</it>, such as the <it>Rh4 </it>and <it>sina </it>genes. In this case, conservation of microsynteny is most probably maintained because of the nested relationship of these genes, and this configuration in the outgroup <it>Anopheles </it>supports the scenario that transposition of <it>Rh4 </it>occurred at some point in the lineage leading to the <it>Drosophila </it>subgenus (see above, and [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>]).</p>
         </sec>
         <sec>
            <st>
               <p>Patterns of coding sequence evolution</p>
            </st>
            <p>In addition to providing a useful resource for studying comparative gene prediction and genome rearrangements, our data confirm and extend emerging trends in <it>Drosophila </it>coding sequence evolution. Table <tblr tid="T2">2</tblr> summarizes the average rates of amino-acid and silent site substitution for all, known and predicted genes in our dataset. These data show that predicted genes tend to have a higher rate of amino-acid substitution than known genes in the genus <it>Drosophila</it>. This trend is significant for the three most closely related pairwise comparisons (<it>D. melanogaster </it>versus <it>D. erecta, D. pseudoobscura </it>or <it>D. willistoni</it>) but non-significant in the comparison involving the most distantly related species (<it>D. melanogaster </it>versus <it>D. littoralis</it>). No significant differences were detected in the rates of silent site substitution between known and predicted genes in any pairwise comparison, although predicted genes in <it>D. pseudoobscura, D. willistoni </it>and <it>D. littoralis </it>tend to show elevated rates of silent site substitution.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Rates of amino-acid (K<sub>a</sub>) and silent (K<sub>s</sub>) substitution in <it>Drosophila </it>genes</p>
               </caption>
               <tblbdy cols="12">
                  <r>
                     <c ca="left">
                        <p>Species</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>All genes</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Known genes</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Predicted genes</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p><it>p</it>-value</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>K<sub>a</sub></p>
                     </c>
                     <c ca="center">
                        <p>K<sub>s</sub></p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>N</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>K<sub>a</sub></p>
                     </c>
                     <c ca="center">
                        <p>K<sub>s</sub></p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>N</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>K<sub>a</sub></p>
                     </c>
                     <c ca="center">
                        <p>K<sub>s</sub></p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>N</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>K<sub>a</sub></p>
                     </c>
                     <c ca="center">
                        <p>K<sub>s</sub></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. erecta</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.057</p>
                     </c>
                     <c ca="center">
                        <p>0.357</p>
                     </c>
                     <c ca="center">
                        <p>53</p>
                     </c>
                     <c ca="center">
                        <p>0.042</p>
                     </c>
                     <c ca="center">
                        <p>0.366</p>
                     </c>
                     <c ca="center">
                        <p>25</p>
                     </c>
                     <c ca="center">
                        <p>0.071</p>
                     </c>
                     <c ca="center">
                        <p>0.349</p>
                     </c>
                     <c ca="center">
                        <p>28</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.0001</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.1299</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. pseudoobscura</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.146</p>
                     </c>
                     <c ca="center">
                        <p>2.313</p>
                     </c>
                     <c ca="center">
                        <p>41</p>
                     </c>
                     <c ca="center">
                        <p>0.071</p>
                     </c>
                     <c ca="center">
                        <p>1.830</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>0.199</p>
                     </c>
                     <c ca="center">
                        <p>2.655</p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.0009</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.0262</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. willistoni</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.220</p>
                     </c>
                     <c ca="center">
                        <p>2.627</p>
                     </c>
                     <c ca="center">
                        <p>39</p>
                     </c>
                     <c ca="center">
                        <p>0.089</p>
                     </c>
                     <c ca="center">
                        <p>2.225</p>
                     </c>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>0.302</p>
                     </c>
                     <c ca="center">
                        <p>2.878</p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.0001</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.0735</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. littoralis</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.170</p>
                     </c>
                     <c ca="center">
                        <p>2.166</p>
                     </c>
                     <c ca="center">
                        <p>31</p>
                     </c>
                     <c ca="center">
                        <p>0.126</p>
                     </c>
                     <c ca="center">
                        <p>1.923</p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>0.206</p>
                     </c>
                     <c ca="center">
                        <p>2.366</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>0.1315</p>
                     </c>
                     <c ca="center">
                        <p>0.6058</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Rates of substitution per site between <it>D. melanogaster </it>and <it>D. erecta, D. pseudoobscura</it>, <it>D. willistoni</it>, or <it>D. littoralis </it>are estimated using the method of Yang and Nielsen [<abbr bid="B81">81</abbr>]. Shown are the average rates of substitution per site (and sample sizes) of all, known, or predicted genes. <it>p</it>-values are the results of Mann-Whitney U-tests for differences in the distribution of K<sub>a </sub>and K<sub>s </sub>values between known and predicted genes for a given pairwise comparison. Values in bold represent significant differences in rates of evolution between known and predicted genes at the 0.006 (= 0.05/8) level.</p>
               </tblfn>
            </tbl>
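            <p>The threshold quoted in the footnote to Table 2 divides the nominal 0.05 level by the eight tests reported (four pairwise comparisons, each tested for K<sub>a </sub>and K<sub>s</sub>). The following is a minimal illustrative sketch of that arithmetic applied to the <it>p</it>-values listed in Table 2. It is not the authors' code, and the names P_VALUES, corrected_threshold and significant_tests are assumptions introduced here for illustration.</p>

```python
from operator import le  # le(a, b) is true when a is less than or equal to b

# p-values copied from Table 2 as (Ka, Ks) pairs; all names are illustrative.
P_VALUES = {
    "D. erecta": (0.0001, 0.1299),
    "D. pseudoobscura": (0.0009, 0.0262),
    "D. willistoni": (0.0001, 0.0735),
    "D. littoralis": (0.1315, 0.6058),
}


def corrected_threshold(alpha, n_tests):
    """Divide the nominal alpha evenly among n_tests comparisons."""
    return alpha / n_tests


def significant_tests(p_values, alpha=0.05):
    """Return (species, measure) pairs that pass the corrected threshold."""
    n_tests = 2 * len(p_values)  # one Ka test and one Ks test per species
    threshold = corrected_threshold(alpha, n_tests)
    hits = []
    for species, (p_ka, p_ks) in p_values.items():
        for measure, p in (("Ka", p_ka), ("Ks", p_ks)):
            if le(p, threshold):
                hits.append((species, measure))
    return hits


if __name__ == "__main__":
    for species, measure in significant_tests(P_VALUES):
        print(species, measure)
```

            <p>Under these assumptions, the tests that pass the corrected threshold are the K<sub>a </sub>comparisons for <it>D. erecta</it>, <it>D. pseudoobscura </it>and <it>D. willistoni</it>, which correspond to the values shown in bold in Table 2.</p>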
            <p>Contrary to expectation, average rates of amino-acid substitution are highest in comparisons between <it>D. melanogaster </it>and <it>D. willistoni</it>, not <it>D. melanogaster </it>and <it>D. littoralis </it>(Table <tblr tid="T2">2</tblr>, Figure <figr fid="F1">1</figr>). The overall increased rate of amino-acid substitution for genes in the <it>D. willistoni </it>lineage is caused by an increased rate of amino-acid substitution in predicted genes. For known genes, average rates of amino-acid substitution are consistent with the accepted phylogenetic relationships of these species: <it>D. erecta </it>is most closely related to <it>D. melanogaster</it>, followed by <it>D. pseudoobscura, D. willistoni </it>and <it>D. littoralis</it>, respectively. Average rates of silent site substitution also do not show a pattern consistent with the accepted phylogeny of these species (Table <tblr tid="T2">2</tblr>, Figure <figr fid="F1">1</figr>). This is because, for comparisons between <it>D. melanogaster </it>and either <it>D. pseudoobscura, D. willistoni </it>or <it>D. littoralis</it>, average rates of silent site substitution exceed one substitution per site, indicating that silent sites are 'saturated' in these comparisons. Even so, the rate of silent site substitution also appears to be elevated in the <it>D. willistoni </it>lineage. These results are unlikely to be simply a consequence of an incorrect phylogeny, since the phylogenetic relationships of these species are well established [<abbr bid="B7">7</abbr>].</p>
            <p>Our estimate of the average rate of amino-acid substitution per site in known genes between <it>D. melanogaster </it>and <it>D. pseudoobscura </it>(0.071) is nearly the same as previous estimates (0.076) using a different sample of known genes and estimation procedure [<abbr bid="B33">33</abbr>]. In addition, our estimate of the average rate of amino-acid substitution for predicted genes between <it>D. melanogaster </it>and <it>D. erecta </it>(0.071) is similar to that estimated using different methods for a sample of rapidly evolving genes between <it>D. melanogaster </it>and <it>D. yakuba </it>(0.067) [<abbr bid="B34">34</abbr>], a species approximately as divergent from <it>D. melanogaster </it>as <it>D. erecta </it>[<abbr bid="B35">35</abbr>]. Thus the categorical and lineage effects we detect are unlikely to be artifacts of our data or methods. The cause(s) of the increased rate of amino-acid substitution in predicted genes in the <it>D. willistoni </it>lineage remain to be clarified, but are most probably related to increased rates of protein evolution detected previously in the <it>D. saltans </it>lineage [<abbr bid="B36">36</abbr>], which have been explained by a shift in base composition in the common ancestor of the <it>D. saltans </it>and <it>D. willistoni </it>groups (see below, and [<abbr bid="B37">37</abbr>]).</p>
            <p>In <it>D. melanogaster</it>, it is well established that coding sequences have a higher GC content, relative to genomic averages, due to the preferential use of codons ending in C or G [<abbr bid="B38">38</abbr>,<abbr bid="B39">39</abbr>]. This pattern holds in the closely related species <it>D. erecta</it>, as well as in the more distantly related species <it>D. pseudoobscura </it>and <it>D. littoralis </it>(Table <tblr tid="T3">3</tblr>). In contrast, our data show that <it>D. willistoni </it>coding sequences have a higher frequency of AT (53%) base-pairs than GC (47%) base-pairs. This shift in base usage in <it>D. willistoni </it>coding sequences is apparent at the dinucleotide level as well, predominantly affecting those dinucleotides that exclusively contain AT or GC. Non-coding sequences of all non-melanogaster species are AT-rich, as in <it>D. melanogaster </it>[<abbr bid="B40">40</abbr>]; slight shifts towards higher AT frequency are observed in the non-coding sequences of the <it>D. willistoni </it>lineage (Table <tblr tid="T3">3</tblr>).</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Mono- and dinucleotide frequencies of coding and non-coding sequences in <it>Drosophila </it>species</p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="left">
                        <p>Mononucleotide</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. melanogaster</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. erecta</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. pseudoobscura</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. willistoni</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. littoralis</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Coding</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A = T</p>
                     </c>
                     <c ca="center">
                        <p>0.231</p>
                     </c>
                     <c ca="center">
                        <p>0.222</p>
                     </c>
                     <c ca="center">
                        <p>0.220</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.265</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.224</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>G = C</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.269</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.278</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.280</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.235</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.276</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Non-coding</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A = T</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.300</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.295</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.281</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.324</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.305</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>G = C</p>
                     </c>
                     <c ca="center">
                        <p>0.200</p>
                     </c>
                     <c ca="center">
                        <p>0.205</p>
                     </c>
                     <c ca="center">
                        <p>0.219</p>
                     </c>
                     <c ca="center">
                        <p>0.176</p>
                     </c>
                     <c ca="center">
                        <p>0.195</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Dinucleotide</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. melanogaster</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. erecta</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. pseudoobscura</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. willistoni</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. littoralis</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Coding</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TA</p>
                     </c>
                     <c ca="center">
                        <p>0.032</p>
                     </c>
                     <c ca="center">
                        <p>0.028</p>
                     </c>
                     <c ca="center">
                        <p>0.027</p>
                     </c>
                     <c ca="center">
                        <p>0.046</p>
                     </c>
                     <c ca="center">
                        <p>0.030</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AT</p>
                     </c>
                     <c ca="center">
                        <p>0.057</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                     <c ca="center">
                        <p>0.056</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.080</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.058</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AA = TT</p>
                     </c>
                     <c ca="center">
                        <p>0.057</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.049</p>
                     </c>
                     <c ca="center">
                        <p>0.076</p>
                     </c>
                     <c ca="center">
                        <p>0.056</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AC = GT</p>
                     </c>
                     <c ca="center">
                        <p>0.054</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                     <c ca="center">
                        <p>0.051</p>
                     </c>
                     <c ca="center">
                        <p>0.051</p>
                     </c>
                     <c ca="center">
                        <p>0.051</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AG = CT</p>
                     </c>
                     <c ca="center">
                        <p>0.063</p>
                     </c>
                     <c ca="center">
                        <p>0.064</p>
                     </c>
                     <c ca="center">
                        <p>0.064</p>
                     </c>
                     <c ca="center">
                        <p>0.057</p>
                     </c>
                     <c ca="center">
                        <p>0.059</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GA = TC</p>
                     </c>
                     <c ca="center">
                        <p>0.067</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>0.067</p>
                     </c>
                     <c ca="center">
                        <p>0.063</p>
                     </c>
                     <c ca="center">
                        <p>0.055</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CA = TG</p>
                     </c>
                     <c ca="center">
                        <p>0.075</p>
                     </c>
                     <c ca="center">
                        <p>0.074</p>
                     </c>
                     <c ca="center">
                        <p>0.077</p>
                     </c>
                     <c ca="center">
                        <p>0.078</p>
                     </c>
                     <c ca="center">
                        <p>0.082</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CG</p>
                     </c>
                     <c ca="center">
                        <p>0.064</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>0.046</p>
                     </c>
                     <c ca="center">
                        <p>0.073</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GC</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.081</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.085</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.091</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.067</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.109</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CC = GG</p>
                     </c>
                     <c ca="center">
                        <p>0.067</p>
                     </c>
                     <c ca="center">
                        <p>0.072</p>
                     </c>
                     <c ca="center">
                        <p>0.071</p>
                     </c>
                     <c ca="center">
                        <p>0.054</p>
                     </c>
                     <c ca="center">
                        <p>0.061</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Non-coding</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TA</p>
                     </c>
                     <c ca="center">
                        <p>0.069</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>0.058</p>
                     </c>
                     <c ca="center">
                        <p>0.080</p>
                     </c>
                     <c ca="center">
                        <p>0.075</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AT</p>
                     </c>
                     <c ca="center">
                        <p>0.086</p>
                     </c>
                     <c ca="center">
                        <p>0.084</p>
                     </c>
                     <c ca="center">
                        <p>0.077</p>
                     </c>
                     <c ca="center">
                        <p>0.094</p>
                     </c>
                     <c ca="center">
                        <p>0.089</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AA = TT</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.110</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.106</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.094</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.126</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.111</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AC = GT</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AG = CT</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                     <c ca="center">
                        <p>0.058</p>
                     </c>
                     <c ca="center">
                        <p>0.051</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GA = TC</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                     <c ca="center">
                        <p>0.053</p>
                     </c>
                     <c ca="center">
                        <p>0.059</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.048</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CA = TG</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>0.068</p>
                     </c>
                     <c ca="center">
                        <p>0.070</p>
                     </c>
                     <c ca="center">
                        <p>0.066</p>
                     </c>
                     <c ca="center">
                        <p>0.071</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CG</p>
                     </c>
                     <c ca="center">
                        <p>0.037</p>
                     </c>
                     <c ca="center">
                        <p>0.040</p>
                     </c>
                     <c ca="center">
                        <p>0.039</p>
                     </c>
                     <c ca="center">
                        <p>0.025</p>
                     </c>
                     <c ca="center">
                        <p>0.037</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GC</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.056</p>
                     </c>
                     <c ca="center">
                        <p>0.057</p>
                     </c>
                     <c ca="center">
                        <p>0.038</p>
                     </c>
                     <c ca="center">
                        <p>0.058</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CC = GG</p>
                     </c>
                     <c ca="center">
                        <p>0.043</p>
                     </c>
                     <c ca="center">
                        <p>0.044</p>
                     </c>
                     <c ca="center">
                        <p>0.052</p>
                     </c>
                     <c ca="center">
                        <p>0.033</p>
                     </c>
                     <c ca="center">
                        <p>0.035</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Values for <it>D. melanogaster </it>are genome-wide averages based on Release 3 sequences/annotations [<abbr bid="B9">9</abbr>,<abbr bid="B14">14</abbr>] and include unmapped scaffolds derived from heterochromatic regions (see [<abbr bid="B83">83</abbr>]). Values in bold indicate the most frequently used mono- or dinucleotide. Frequencies of complementary mono- and dinucleotides were averaged to account for the double-stranded nature of DNA.</p>
               </tblfn>
            </tbl>
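            <p>The footnote to Table 3 states that frequencies of complementary mono- and dinucleotides were averaged. As a minimal sketch of one way such strand-averaged frequencies could be computed, assuming overlapping words are counted on a single strand and each word is then averaged with its reverse complement, consider the Python example below. The function names are hypothetical and this is not the authors' implementation.</p>

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Return the reverse complement of a DNA word, e.g. AC gives GT."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def strand_averaged_frequencies(sequence, k):
    """Frequencies of overlapping k-mers, each averaged with its reverse complement."""
    seq = sequence.upper()
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip words containing ambiguous bases such as N.
    words = [w for w in words if set(w).issubset(COMPLEMENT)]
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Include reverse complements that were never observed on this strand.
    keys = set(counts) | {reverse_complement(w) for w in counts}
    return {
        w: (counts[w] + counts[reverse_complement(w)]) / (2 * total)
        for w in keys
    }


if __name__ == "__main__":
    print(strand_averaged_frequencies("AACG", 1))
    print(strand_averaged_frequencies("AACG", 2))
```

            <p>With this convention, A and T share one value and G and C share another, and self-complementary dinucleotides (TA, AT, CG, GC) keep their single-strand frequency, which is consistent with the row labels used in Table 3.</p>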
            <p>The shift in base usage in the <it>D. willistoni </it>lineage is also detected in the pattern of synonymous codon usage (see Supplementary Table 2 in the additional data files). Previous analyses of a limited number of coding sequences revealed a shift away from preferred C-ending codons used in the <it>D. melanogaster </it>lineage, towards T-ending codons in the <it>D. willistoni </it>lineage [<abbr bid="B7">7</abbr>,<abbr bid="B41">41</abbr>]. Our data indicate that this trend holds for a much larger sample of genes (see Supplementary Table 2 in the additional data files). For 10 of the 18 amino acids with more than one codon (Arg, Asn, His, Ile, Leu, Lys, Phe, Pro, Thr, Tyr), the most frequently used codon in <it>D. willistoni </it>differs from that in <it>D. melanogaster</it>. In all 10 of these changes in synonymous codon usage, <it>D. willistoni </it>most frequently uses an A- or T-ending (or beginning, for example, Leu) codon where <it>D. melanogaster </it>uses a G- or C-ending (or beginning) codon, supporting a trend identified originally using only the <it>Adh </it>coding sequence [<abbr bid="B41">41</abbr>]. The most frequently used codon differs between <it>D. melanogaster </it>and <it>D. erecta, D. pseudoobscura </it>and <it>D. littoralis </it>for only two (Asp, Ser), one (Asn) and four (Asn, Ile, Pro, Thr) amino acids, respectively.</p>
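            <p>The comparison of most frequently used codons described above can be illustrated with a short Python sketch. It assumes in-frame coding sequences and the standard genetic code, resolves ties by whichever codon was seen first, and uses hypothetical function names. It is offered only as an illustration, not as the procedure used to produce Supplementary Table 2.</p>

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def most_frequent_codons(coding_sequences):
    """Map each amino acid that has synonymous codons to its most used codon."""
    usage = defaultdict(Counter)
    for cds in coding_sequences:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            aa = GENETIC_CODE.get(codon)
            # Skip ambiguous codons, stop codons, and single-codon Met and Trp.
            if aa is None or aa in "*MW":
                continue
            usage[aa][codon] += 1
    return {aa: counts.most_common(1)[0][0] for aa, counts in usage.items()}


def differing_amino_acids(species_a, species_b):
    """Amino acids whose most frequent codon differs between two species."""
    shared = set(species_a).intersection(species_b)
    return sorted(aa for aa in shared if species_a[aa] != species_b[aa])


if __name__ == "__main__":
    # Toy sequences, invented purely to exercise the functions.
    first = most_frequent_codons(["TTCTTCTTTAAC"])
    second = most_frequent_codons(["TTTTTTTTCAAC"])
    print(differing_amino_acids(first, second))
```

            <p>In this sketch the 18 amino acids with more than one codon are obtained by excluding Met and Trp, so the length of the list returned by differing_amino_acids is directly comparable to counts such as the 10 of 18 reported above.</p>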
         </sec>
         <sec>
            <st>
               <p>Patterns of non-coding sequence evolution</p>
            </st>
            <p>Our data also provide an opportunity to study basic features of non-coding conservation in <it>Drosophila</it>, which remain largely unexplored. As shown in Figures <figr fid="F2">2</figr> and <figr fid="F4">4</figr>, a substantial proportion of non-coding sequences are conserved in <it>Drosophila</it>, especially in pairwise comparisons between <it>D. melanogaster </it>and <it>D. erecta</it>. Levels of conservation appear to plateau in more divergent comparisons, with a tendency for <it>D. pseudoobscura </it>to show higher levels of non-coding conservation relative to <it>D. willistoni </it>or <it>D. littoralis </it>in pairwise comparisons with <it>D. melanogaster</it>. Few, if any, non-coding sequences are conserved between <it>D. melanogaster </it>and <it>A. gambiae </it>(Figure <figr fid="F4">4</figr>, see also [<abbr bid="B23">23</abbr>]). There is also regional variation in levels of non-coding conservation in the <it>Drosophila </it>genome, as illustrated by contrasting conservation between <it>D. melanogaster </it>and <it>D. erecta</it>, for example, in the <it>eve </it>(Figure <figr fid="F2">2</figr>) and <it>ap </it>(Figure <figr fid="F4">4</figr>) regions.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>VISTA plot of genome organization and sequence conservation in the <it>Drosophila ap </it>region</p>
               </caption>
               <text>
                  <p>VISTA plot of genome organization and sequence conservation in the <it>Drosophila ap </it>region. From top to bottom are pairwise comparisons between <it>D. melanogaster </it>and <it>D. erecta </it>(mel-ere), <it>D. pseudoobscura </it>(mel-pse), <it>D. virilis </it>(mel-vir) and <it>A. gambiae </it>(mel-ano), respectively. Features of this plot are as in Figure <figr fid="F3">3</figr>. Shown are five CNCS clusters corresponding to the muscle enhancer [<abbr bid="B91">91</abbr>], the brain-specific enhancer empirically verified in this study (Figure <figr fid="F8">8</figr>), and three predicted enhancers labeled CNCS clusters 1, 2 and 3. Note that the <it>HB </it>transposable element in the region 5' to <it>ap </it>is located between CNCS clusters and is not conserved between species.</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-4"/>
            </fig>
            <p>To estimate levels of sequence conservation in non-coding regions and to contrast patterns of coding with non-coding conservation, we aligned genomic sequences using the AVID alignment tool [<abbr bid="B42">42</abbr>]. AVID is a global alignment tool that works by recursively finding co-linear 'anchors' of maximal sequence identity; therefore, locally inverted or transposed sequences that might be conserved will not be included in our analysis. Conserved non-coding sequences (CNCSs), defined as windows of 10 bp or greater with 90% or greater nucleotide identity, were identified in pairwise alignments using the VISTA program [<abbr bid="B43">43</abbr>]. These parameters were chosen to identify short, highly conserved sequences found in <it>Drosophila </it>non-coding regions [<abbr bid="B44">44</abbr>]. We used <it>D. melanogaster </it>as the reference species in pairwise comparisons with non-melanogaster species, and Release 3 annotations [<abbr bid="B14">14</abbr>] exported from GadFly in VISTA format to classify conserved segments as either coding or non-coding. Transcribed and nontranscribed non-coding sequences were analyzed together, since previous results showed similar patterns of conservation for intergenic and intronic sequences in <it>Drosophila </it>[<abbr bid="B44">44</abbr>].</p>
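            <p>The window criterion for conserved segments can be illustrated with a minimal sketch. This is not the VISTA implementation: it assumes a gapped pairwise alignment supplied as two equal-length strings, treats gap columns as mismatches, and merges overlapping qualifying windows into segments. The function name and its parameters are hypothetical.</p>

```python
def conserved_segments(ref, other, window=10, min_identity_pct=90):
    """Return half-open (start, end) alignment-column intervals covered by at
    least one window of `window` columns whose identity meets the threshold."""
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    n = len(ref)
    # 1 where the column holds an identical, non-gap base pair, else 0
    match = [int(a == b and a != "-") for a, b in zip(ref.upper(), other.upper())]
    covered = [False] * n
    if n >= window:
        running = sum(match[:window])  # identities in the first window
        for start in range(n - window + 1):
            if start > 0:
                # slide the window one column to the right
                running += match[start + window - 1] - match[start - 1]
            # integer comparison avoids floating-point rounding at the threshold
            if running * 100 >= min_identity_pct * window:
                for i in range(start, start + window):
                    covered[i] = True
    # merge runs of covered columns into segments
    segments = []
    seg_start = None
    for i, flag in enumerate(covered + [False]):
        if flag and seg_start is None:
            seg_start = i
        elif not flag and seg_start is not None:
            segments.append((seg_start, i))
            seg_start = None
    return segments


# Toy alignment: 12 identical columns followed by 8 mismatches
toy_segments = conserved_segments("A" * 12 + "C" * 8, "A" * 12 + "G" * 8)
```

            <p>In this sketch, coordinates are alignment columns rather than <it>D. melanogaster </it>positions. Because any 10-column window with at least nine identities qualifies, a reported segment can include a flanking mismatch, as in the toy alignment, where the segment spans columns 0 to 12.</p>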
            <p>The results of this analysis are shown in Table <tblr tid="T4">4</tblr>, which contrasts features of conservation in both coding and non-coding sequences by species. For all species analyzed, coding regions have a higher proportion of sequences that meet our definition of conservation relative to non-coding sequences. In addition, the median segment length surpassing our criterion for conservation is longer for coding sequences relative to non-coding sequences for all species analyzed. These results are expected, as coding sequences are on average thought to experience more intense purifying selection than non-coding sequences [<abbr bid="B21">21</abbr>]. In contrast, the average percent identity of conserved segments is higher for non-coding sequences than coding sequences. This is probably a result of silent site substitution in otherwise functionally constrained coding sequences.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Estimates of pairwise sequence conservation in coding and non-coding regions between <it>D. melanogaster </it>and <it>D. erecta</it>, <it>D. pseudoobscura, D. willistoni </it>or <it>D. littoralis</it></p>
               </caption>
               <tblbdy cols="6">
                  <r>
                     <c ca="left">
                        <p>Species</p>
                     </c>
                     <c ca="center">
                        <p>Number of bp surveyed</p>
                     </c>
                     <c ca="center">
                        <p>Number of bp conserved</p>
                     </c>
                     <c ca="center">
                        <p>% conserved (bp)</p>
                     </c>
                     <c ca="center">
                        <p>Median length of conserved segment</p>
                     </c>
                     <c ca="center">
                        <p>Average % identity of conserved segment</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Coding</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. erecta</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>63,655</p>
                     </c>
                     <c ca="center">
                        <p>60,327</p>
                     </c>
                     <c ca="center">
                        <p>94%</p>
                     </c>
                     <c ca="center">
                        <p>39</p>
                     </c>
                     <c ca="center">
                        <p>93%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. pseudoobscura</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>46,626</p>
                     </c>
                     <c ca="center">
                        <p>26,978</p>
                     </c>
                     <c ca="center">
                        <p>61%</p>
                     </c>
                     <c ca="center">
                        <p>20</p>
                     </c>
                     <c ca="center">
                        <p>91%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. willistoni</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>42,224</p>
                     </c>
                     <c ca="center">
                        <p>18,774</p>
                     </c>
                     <c ca="center">
                        <p>45%</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>91%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. littoralis</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>19,717</p>
                     </c>
                     <c ca="center">
                        <p>10,997</p>
                     </c>
                     <c ca="center">
                        <p>63%</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>92%</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Non-coding</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. erecta</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>272,366</p>
                     </c>
                     <c ca="center">
                        <p>186,895</p>
                     </c>
                     <c ca="center">
                        <p>69%</p>
                     </c>
                     <c ca="center">
                        <p>24</p>
                     </c>
                     <c ca="center">
                        <p>94%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. pseudoobscura</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>276,731</p>
                     </c>
                     <c ca="center">
                        <p>77,391</p>
                     </c>
                     <c ca="center">
                        <p>28%</p>
                     </c>
                     <c ca="center">
                        <p>17</p>
                     </c>
                     <c ca="center">
                        <p>95%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. willistoni</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>174,421</p>
                     </c>
                     <c ca="center">
                        <p>19,700</p>
                     </c>
                     <c ca="center">
                        <p>13%</p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>95%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. littoralis</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>138,866</p>
                     </c>
                     <c ca="center">
                        <p>24,633</p>
                     </c>
                     <c ca="center">
                        <p>18%</p>
                     </c>
                     <c ca="center">
                        <p>15</p>
                     </c>
                     <c ca="center">
                        <p>95%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>D. virilis</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>114,015</p>
                     </c>
                     <c ca="center">
                        <p>30,564</p>
                     </c>
                     <c ca="center">
                        <p>27%</p>
                     </c>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>95%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p><it>D. virilis</it> [<abbr bid="B44">44</abbr>]</p>
                     </c>
                     <c ca="center">
                        <p>114,015</p>
                     </c>
                     <c ca="center">
                        <p>29,915</p>
                     </c>
                     <c ca="center">
                        <p>26%</p>
                     </c>
                     <c ca="center">
                        <p>19</p>
                     </c>
                     <c ca="center">
                        <p>93%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Microsyntenic regions were globally aligned using AVID and conserved sequences greater than or equal to 10 bp and 90% identity were identified using VISTA. Sequences were classified as coding or non-coding using Release 3 annotations [<abbr bid="B14">14</abbr>] exported from GadFly in VISTA format. Shown for comparison are a re-analysis of conservation between <it>D. melanogaster </it>and <it>D. virilis </it>using the current methods, as well as previous results, for a sample of non-coding regions published in [<abbr bid="B44">44</abbr>].</p>
               </tblfn>
            </tbl>
            <p>Analysis of levels of conservation by species shows that the increased rate of amino-acid sequence evolution in the <it>D. willistoni </it>lineage detected above may reflect a more widespread phenomenon in the genome of this species. As shown in Table <tblr tid="T4">4</tblr>, <it>D. willistoni </it>shows unexpectedly low levels of both non-coding and coding conservation, given the accepted phylogeny of the species. These data show that the increased rate of evolution in the <it>D. willistoni </it>lineage is not restricted to coding sequences, rendering coding-sequence-based interpretations of the unusual patterns of molecular evolution in this lineage less tenable (see, for example [<abbr bid="B7">7</abbr>,<abbr bid="B41">41</abbr>]). Together with the changes in base composition in both coding and non-coding sequences noted above, the increased rate of evolution in both coding and non-coding sequences detected in the <it>D. willistoni </it>lineage suggests a genome-wide effect, possibly resulting from a change in mutation pressure or a change in population size at some time during the history of this lineage (see also [<abbr bid="B37">37</abbr>]).</p>
            <p>Despite the lineage effect in levels of conservation in the <it>D. willistoni </it>genome, the median length of conserved coding or non-coding segments generally decreases with increasing divergence time as expected (Table <tblr tid="T4">4</tblr>). However, the average percent identity of conserved coding or non-coding segments identified does not decrease with increasing divergence time. Finally, the ratio of the proportion of coding sequence that is conserved to the proportion of non-coding sequence that is conserved increases with increasing divergence time: it is 1.36 for comparisons with <it>D. erecta</it>, but increases to 2.21 for comparisons involving <it>D. pseudoobscura </it>and approximately 3.5 for comparisons involving <it>D. willistoni </it>or <it>D. littoralis</it>.</p>
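            <p>As a worked check of these ratios, the sketch below divides the percentage of coding bp conserved by the percentage of non-coding bp conserved, using the rounded percentages printed in Table 4. This reading of the ratio is an assumption, supported by the fact that it reproduces the value of 1.36 for <it>D. erecta</it>; the variable and function names are illustrative.</p>

```python
# Percent of surveyed bp conserved, as printed (rounded) in Table 4
percent_conserved = {
    "D. erecta": {"coding": 94, "noncoding": 69},
    "D. pseudoobscura": {"coding": 61, "noncoding": 28},
    "D. willistoni": {"coding": 45, "noncoding": 13},
    "D. littoralis": {"coding": 63, "noncoding": 18},
}


def coding_to_noncoding_ratio(entry):
    """Ratio of percent conserved in coding versus non-coding sequence."""
    return entry["coding"] / entry["noncoding"]


ratios = {
    species: round(coding_to_noncoding_ratio(entry), 2)
    for species, entry in percent_conserved.items()
}
# ratios: D. erecta 1.36, D. pseudoobscura 2.18, D. willistoni 3.46, D. littoralis 3.5
```

            <p>With rounded percentages the <it>D. pseudoobscura </it>ratio comes out at 2.18 rather than the 2.21 given in the text; the small difference presumably reflects the use of unrounded values in the original calculation.</p>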
            <p>Changes in the median CNCS length reflect changes in the overall distribution of CNCS lengths in pairwise comparisons between <it>D. melanogaster </it>and either <it>D. erecta, D. pseudoobscura, D. willistoni</it>, or <it>D. littoralis </it>(Figure <figr fid="F5">5</figr>). These data quantitatively describe the pattern of non-coding conservation shown in Figures <figr fid="F2">2</figr> and <figr fid="F4">4</figr>: CNCS lengths become shorter with increasing divergence but plateau to approximately the same length in the most distant comparisons. The stability of this distribution at more extreme evolutionary distances is apparently insensitive to changes in the proportion of non-coding DNA that is conserved (compare <it>D. willistoni </it>and <it>D. littoralis</it>). Shown for comparison is the distribution of CNCS lengths between <it>D. melanogaster </it>and <it>D. virilis </it>from [<abbr bid="B44">44</abbr>], as well as a re-analysis of these data using the current methods. Differences between the present and previous results for the <it>D. virilis </it>data show the effect of different methods for detecting CNCSs. The differences observed in the distribution of CNCS lengths between the closely related species <it>D. virilis </it>and <it>D. littoralis </it>using the AVID-VISTA method reflect the fact that the <it>D. virilis </it>data were obtained from non-coding regions with known or suspected <it>cis</it>-regulatory function, whereas the data here represent a more random sampling of non-coding regions in the <it>Drosophila </it>genome.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Frequency distribution of CNCS lengths in <it>Drosophila </it>species</p>
               </caption>
               <text>
                  <p>Frequency distribution of CNCS lengths in <it>Drosophila </it>species. The distributions of CNCS lengths are shown for comparisons between <it>D. melanogaster </it>and either <it>D. erecta</it>, <it>D. pseudoobscura</it>, <it>D. willistoni </it>or <it>D. littoralis</it>. CNCSs of 10 bp or greater with 90% or greater nucleotide identity were identified using VISTA. Also shown for comparison is a re-analysis of the length distribution of CNCSs between <it>D. melanogaster </it>and <it>D. virilis </it>using the current methods, as well as previous results for a sample of non-coding regions published in [<abbr bid="B44">44</abbr>].</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-5"/>
            </fig>
            <p>Conservation of non-coding sequences is typically interpreted as evidence of functional constraint, and this assumption underlies most phylogenetic footprinting methods. This assumption was questioned by Clark [<abbr bid="B45">45</abbr>], who proposed an alternative hypothesis for non-coding conservation based on heterogeneity in mutation rates (that is, mutational cold spots). To distinguish between these alternatives we studied the spatial distribution and conservation of spacing between CNCSs in the <it>Drosophila </it>genome. Under a simple mutational cold-spot hypothesis, CNCSs should occur randomly in non-coding DNA and the lengths of 'spacer intervals' between CNCSs should be exponentially distributed [<abbr bid="B46">46</abbr>,<abbr bid="B47">47</abbr>]. In addition, there should be no tendency for the spacing between mutational cold spots to remain conserved between divergent <it>Drosophila </it>species, given the rapid rate of DNA loss in unconstrained sequences in the <it>Drosophila </it>genome [<abbr bid="B48">48</abbr>,<abbr bid="B49">49</abbr>].</p>
            <p>As shown in Figure <figr fid="F6">6</figr> for non-coding comparisons between <it>D. melanogaster </it>and <it>D. pseudoobscura</it>, the frequency distribution of spacer interval lengths between CNCSs in the <it>D. melanogaster </it>genome differs significantly from the exponential distribution. The deviation from expectation results from an excess of short and long spacer intervals, indicating that CNCSs are clustered in the <it>Drosophila </it>genome. Non-random spacing of CNCSs is also observed in other pairwise species comparisons in the genus <it>Drosophila </it>([<abbr bid="B46">46</abbr>] and data not shown). In addition, the lengths of homologous spacer intervals are highly correlated across species (Figure <figr fid="F7">7</figr>). This correlation is unlikely to be an artifact of alignment, as the AVID method first aligns regions of local similarity before generating a global alignment. Moreover, similar results have been obtained using non-global alignment methods [<abbr bid="B46">46</abbr>]. These results suggest that spacer interval sequences between CNCSs (and therefore CNCSs themselves) are functionally constrained, and provide evidence against the hypothesis that CNCSs are simply mutational cold spots.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Frequency distribution of spacer interval lengths separating CNCSs between <it>D. melanogaster </it>and <it>D. pseudoobscura</it></p>
               </caption>
               <text>
                  <p>Frequency distribution of spacer interval lengths separating CNCSs between <it>D. melanogaster </it>and <it>D. pseudoobscura</it>. Plotted is a histogram of the length in <it>D. melanogaster </it>of 'nonconserved' spacer interval sequences between CNCSs identified using VISTA (10-bp window, 90% identity). Spacer intervals separating a CNCS and a conserved coding segment, or between two conserved coding segments were omitted from this analysis. Note that only spacer interval lengths less than 250 bp are displayed for clarity. Solid lines represent the expectation under an exponential distribution with rate parameter &#955; = 0.0165, estimated as the inverse of the mean spacer interval length. The null hypothesis that spacer interval lengths are exponentially distributed can be rejected (&#967;<sup>2 </sup>= 2,040.1, df = 30, <it>p </it>&lt; 10<sup>-6</sup>), indicating that <it>Drosophila </it>CNCSs are non-randomly spaced.</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-6"/>
            </fig>
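            <p>The comparison of spacer interval lengths with an exponential expectation can be sketched as follows. The rate parameter is estimated as the inverse of the mean spacer length (a rate of 0.0165 corresponds to a mean of roughly 61 bp), and a chi-square statistic is computed over length bins. The bin width, the number of bins, the pooled tail bin and the degrees-of-freedom convention are assumptions made for illustration and are not taken from the analysis reported here.</p>

```python
import math


def exponential_chi_square(lengths, bin_width=10, n_bins=5):
    """Chi-square statistic comparing binned spacer lengths to an exponential
    distribution whose rate is the inverse of the mean length."""
    n = len(lengths)
    rate = n / sum(lengths)  # lambda = 1 / mean spacer length
    # observed counts: n_bins regular bins plus one pooled tail bin
    observed = [0] * (n_bins + 1)
    for x in lengths:
        observed[min(int(x // bin_width), n_bins)] += 1
    # expected counts from the exponential cumulative distribution
    expected = []
    for k in range(n_bins):
        lo, hi = k * bin_width, (k + 1) * bin_width
        expected.append(n * (math.exp(-rate * lo) - math.exp(-rate * hi)))
    expected.append(n * math.exp(-rate * n_bins * bin_width))  # tail bin
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # assumed convention: df = bins - 1 - one estimated parameter
    df = len(expected) - 2
    return rate, chi2, df


# Toy data with an excess of short and long intervals (clustered elements)
toy_rate, toy_chi2, toy_df = exponential_chi_square([5] * 50 + [200] * 50)
```

            <p>In the toy data, spacers are either very short or very long, which mimics clustering and yields a large chi-square value; lengths drawn from a single exponential distribution would give a much smaller statistic.</p>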
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Correlation of spacer interval lengths separating CNCSs between <it>D. melanogaster </it>and <it>D. pseudoobscura</it></p>
               </caption>
               <text>
                  <p>Correlation of spacer interval lengths separating CNCSs between <it>D. melanogaster </it>and <it>D. pseudoobscura</it>. Each point represents the log<sub>10</sub>-transformed lengths for a homologous pair of spacer intervals. Spacer intervals separating a CNCS and a conserved coding segment, or between two conserved coding segments were omitted from this analysis. The solid line represents perfect spacer interval length conservation; the dashed lines represent order of magnitude size changes in spacer interval length between these two species. The correlation coefficient for homologous spacer interval lengths is <it>r </it>= 0.85 (<it>p </it>&lt; 0.01).</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-7"/>
            </fig>
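            <p>The correlation of homologous spacer interval lengths can be illustrated with a short sketch that computes the Pearson correlation coefficient of log<sub>10</sub>-transformed lengths. The input format (two parallel lists of positive lengths, one per species) and the function name are assumptions; the sketch does not reproduce the significance test.</p>

```python
import math


def log10_length_correlation(lengths_a, lengths_b):
    """Pearson correlation of log10-transformed homologous spacer lengths.
    All lengths must be positive, since log10(0) is undefined."""
    if len(lengths_a) != len(lengths_b) or len(lengths_a) in (0, 1):
        raise ValueError("need two equal-length lists with at least two pairs")
    xs = [math.log10(v) for v in lengths_a]
    ys = [math.log10(v) for v in lengths_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy example: every spacer is exactly twice as long in the second species
toy_r = log10_length_correlation([10, 100, 1000], [20, 200, 2000])
```

            <p>A uniform proportional change in length, as in the toy example, shifts all points parallel to the line of perfect conservation on the log scale and leaves the correlation at 1.</p>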
            <p>Clusters of CNCSs are readily apparent in VISTA plots of complex gene regions with known <it>cis</it>-regulatory function (Figures <figr fid="F2">2</figr> and <figr fid="F4">4</figr>). In addition, there is a strong tendency for known <it>cis</it>-regulatory elements to overlap clusters of CNCSs. For example, discrete enhancers that control embryonic expression of <it>eve </it>are contained within discrete CNCS clusters in the region 5' to <it>eve </it>(Figure <figr fid="F2">2</figr>). In contrast, discrete CNCS clusters are not observed in the region 3' to <it>eve </it>where enhancers overlap one another [<abbr bid="B50">50</abbr>,<abbr bid="B51">51</abbr>]. The correspondence of CNCS clusters and functional enhancers is observed in other regions of the <it>Drosophila </it>genome, such as the discrete muscle-specific enhancer in the fourth intron of <it>ap </it>(Figure <figr fid="F4">4</figr>). The inexact correspondence between enhancer sequences and CNCS clusters is perhaps not unexpected as enhancers are typically defined as the minimal sequence sufficient to recapitulate native expression in a reporter gene assay. Nevertheless, this pattern suggests a functional relationship between <it>cis</it>-regulatory elements and discrete CNCS clusters.</p>
            <p>To test the hypothesis that CNCS clusters can predict the location of <it>cis</it>-regulatory elements in the <it>Drosophila </it>genome, we carried out <it>P</it>-element-mediated reporter gene analysis of genomic sequences corresponding to a CNCS cluster in the fourth intron of <it>ap</it>. This CNCS cluster is apparent in pairwise comparisons between <it>D. melanogaster </it>and <it>D. pseudoobscura </it>as well as between <it>D. melanogaster </it>and <it>D. virilis </it>(Figure <figr fid="F4">4</figr>). <it>ap </it>encodes a LIM-homeobox transcription factor and is expressed in many tissues in <it>Drosophila</it>, including embryonic expression in the developing brain [<abbr bid="B52">52</abbr>,<abbr bid="B53">53</abbr>]. As shown in Figure <figr fid="F8">8</figr>, the <it>D. melanogaster </it>genomic sequences corresponding to the CNCS cluster in <it>ap </it>intron 4 drive reporter gene expression in the <it>Drosophila </it>embryo in a specific pattern that recapitulates native <it>ap </it>expression in the developing brain. In addition, when introduced into the genome of <it>D. melanogaster</it>, the homologous fragment from the <it>D. virilis </it>genome also drives reporter gene expression in the same pattern, indicating that the expression pattern resulting from this enhancer has been conserved since the divergence of these two species. Experiments to test the function of CNCS clusters 1, 2, and 3 in the <it>ap </it>region are currently underway.</p>
            <fig id="F8">
               <title>
                  <p>Figure 8</p>
               </title>
               <caption>
                  <p>Reporter gene expression driven by genomic sequences corresponding to the CNCS cluster in <it>ap </it>intron 4</p>
               </caption>
               <text>
                  <p>Reporter gene expression driven by genomic sequences corresponding to the CNCS cluster in <it>ap </it>intron 4. Specific expression in the embryonic brain is driven by both <b>(a) </b><it>D. melanogaster </it>and <b>(b) </b><it>D. virilis </it>sequences, indicating that the function of this enhancer has been conserved in these two species.</p>
               </text>
               <graphic file="gb-2002-3-12-research0086-8"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Prospects for comparative gene prediction in <it>Drosophila</it></p>
            </st>
            <p>Although great progress has been made towards understanding the protein-coding component of eukaryotic genome sequences [<abbr bid="B54">54</abbr>,<abbr bid="B55">55</abbr>,<abbr bid="B56">56</abbr>,<abbr bid="B57">57</abbr>], comprehensive genome annotation is far from complete in any metazoan. State-of-the-art statistical and remote-homology gene-prediction methods are successful at identifying the location of exons in unannotated genomic DNA, but are often quite poor at predicting the details of gene structure, necessitating human curation [<abbr bid="B14">14</abbr>]. One of the most useful sources of information for accurately predicting complex gene structures is EST/cDNA data [<abbr bid="B58">58</abbr>]. Predicting the structure of genes for which no EST/cDNA data exist will require alternative approaches, such as comparative gene modeling among divergent species with conserved proteomes in the same group of organisms.</p>
            <p>The results of our K<sub>a</sub>/K<sub>s </sub>analyses presented here give preliminary insight into the prospects of comparative gene modeling using large-scale sequence data in the genus <it>Drosophila</it>. From our findings, we expect that structural details of most Release 3 coding sequences can be verified and improved using pairwise sequence data between divergent species like <it>D. melanogaster </it>and <it>D. pseudoobscura</it>. Our results also indicate that, although it may not be necessary for many genes in the <it>D. melanogaster </it>genome, <it>de novo </it>comparative gene prediction between these species will find the vast majority of as-yet unidentified genes lacking EST/cDNA data. It is important to note that we do not expect to detect all coding exons (especially short exons [<abbr bid="B20">20</abbr>]) in pairwise comparisons, highlighting the added value of multiple species data for comparative exon prediction. In addition, important details of gene models will prove difficult to predict using only comparative data, as amino-acid divergence (especially insertions or deletions) can obscure intron-exon boundaries and other details of gene structure. Moreover, there is inherent uncertainty in the 'correct' gene structure developed from comparative data alone, since two divergent sequences are simultaneously being modeled. Finally, the comparative annotation of UTR sequences awaits the development of methods that accurately predict the non-coding components of gene models.</p>
            <p>The patterns of protein-coding sequence evolution detected in our data have important implications for comparative gene prediction. Most notably, the trend we detect for predicted genes to show an increased rate of amino-acid substitution relative to known genes is important, as it may reflect differences in functional constraint or quality of gene models between the two classes of genes. For at least three reasons, we believe that the elevated rate of amino-acid substitution in predicted genes is not a result of poor-quality gene models in this class of genes. First, many of the genes in the predicted class have EST/cDNA data (see Supplementary Table 1), so the details of these gene models are likely to be correct. Second, estimates of K<sub>a </sub>(and K<sub>s</sub>) are based on aligned sequences; thus gross inaccuracies in gene models that would create gaps in the alignment are excluded from estimates of evolutionary rates. Third, differential rates in these two classes of genes may be expected, as a high proportion of known genes were selected for study because their mutational inactivation resulted in an obvious phenotype. Thus we favor the interpretation that increased rates of amino-acid substitution in predicted genes result from lower levels of functional constraint.</p>
            <p>If this interpretation is correct, our results confirm those of Schmid and co-workers [<abbr bid="B34">34</abbr>,<abbr bid="B59">59</abbr>] who have shown that a large fraction of randomly sampled coding sequences and orphan genes are rapidly evolving in the genus <it>Drosophila</it>. Our results are also consistent with those of Ashburner <it>et al</it>. [<abbr bid="B60">60</abbr>] who show that genes with known mutant phenotypes in <it>D. melanogaster </it>are more likely to have a conserved homolog in GenBank relative to predicted genes with no known phenotype. Similarly, Zdobnov <it>et al</it>. [<abbr bid="B23">23</abbr>] show that <it>D. melanogaster </it>orphan genes tend to exhibit lower levels of conservation in pairwise comparison with <it>A. gambiae</it>. Finally, the interpretation that known and predicted genes differ in their levels of functional constraint is supported by the fact that increased rates of protein evolution in <it>D. willistoni </it>affect predicted genes more strongly than known genes (Table <tblr tid="T2">2</tblr>). Together these results suggest that there is a large class of functional protein-coding sequences evolving under weak selective constraint in the <it>Drosophila </it>genome [<abbr bid="B34">34</abbr>]. Rates of evolution for this class of genes may be too fast to allow the identification of homologs from extremely divergent species (such as <it>Anopheles</it>) for comparative gene prediction, but slow enough to use comparative data within the genus <it>Drosophila</it>.</p>
         </sec>
         <sec>
            <st>
               <p>Rearrangement, transposition and genome annotation</p>
            </st>
            <p>Genome rearrangement in <it>Drosophila </it>typically occurs through paracentric inversion, allowing the homolo