<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-5-131</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Research article</dochead>
      <bibl>
         <title>
            <p>What can we learn from noncoding regions of similarity between genomes?</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Down</snm>
               <mi>A</mi>
               <fnm>Thomas</fnm>
               <insr iid="I1"/>
               <email>td2@sanger.ac.uk</email>
            </au>
            <au id="A2">
               <snm>Hubbard</snm>
               <mi>JP</mi>
               <fnm>Tim</fnm>
               <insr iid="I1"/>
               <email>th@sanger.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK</p>
            </ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2004</pubdate>
         <volume>5</volume>
         <issue>1</issue>
         <fpage>131</fpage>
         <url>http://www.biomedcentral.com/1471-2105/5/131</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15369604</pubid>
               <pubid idtype="doi">10.1186/1471-2105-5-131</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>04</day>
               <month>12</month>
               <year>2003</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>15</day>
               <month>9</month>
               <year>2004</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>15</day>
               <month>9</month>
               <year>2004</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2004</year>
         <collab>Down and Hubbard; licensee BioMed Central Ltd.</collab>
         <note>This is an open-access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome.</p>
            </sec>
            <sec>
               <st>
                  <p>Methods</p>
               </st>
               <p>Here we used a motif-oriented machine learning method based on the Relevance Vector Machine algorithm to extract the strongest signal from a set of non-coding conserved sequences.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We successfully fitted models to reflect the non-coding sequences, and showed that the results were quite consistent for repeated training runs. Using the learned models to scan genomic sequence, we found that they often made predictions close to the start of annotated genes. We compared this method with other published promoter-prediction systems, and showed that the set of promoters which are detected by this method is substantially similar to that detected by existing methods.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>The results presented here indicate that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome. They also lend support to the belief that there exists a substantial subset of promoter regions which share several common features including, but not restricted to, a relative abundance of CpG dinucleotides. This subset is detectable by a variety of distinct computational methods.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Since the publication of draft sequences for the human <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> and mouse <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> genomes, several groups have run large-scale comparisons of the sequences to detect regions of conserved sequence. An initial survey of these was published along with the draft mouse genome <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, with additional comparisons appearing since then <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Briefly, protein coding genes are &#8211; as we might expect &#8211; among the most strongly conserved regions, but homologous sequences can be found throughout the genome. In total, it is possible to align up to 40% of the mouse genome to human sequence <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, but it seems likely that at least some of this is just random "comparative noise" &#8211; regions of sequence which serve no particular purpose but which, purely by chance, have not yet accumulated enough mutations to make their evolutionary relationship unrecognisable. However, it is widely accepted that some of the noncoding-but-similar regions, especially those with the highest levels of sequence identity between the two species, are preferentially conserved because they perform some important function. It has been estimated that around 5% of the genome is under purifying selection <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>, indicating that mutations in these regions have deleterious effects: a strong suggestion of some important function.</p>
         <p>Here, we apply the Eponine Windowed Sequence (EWS) sequence analysis method, which uses a Relevance Vector Machine (RVM) <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> to extract a minimal set of short motifs which are able to discriminate between two sets of sequences: in this case, a positive set of conserved non-coding sequences and a negative set of randomly picked non-coding sequences. The EWS model is an adaptation of the Eponine Anchored Sequence (EAS) model, first applied for transcription start site prediction in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and subsequently used to predict a range of additional biological features including translation start sites and transcription termination sites [A. Ramadass, unpublished]. While EAS is designed to classify individual points in a sequence &#8211; a feature which allows the model to predict precise locations for features such as transcription start sites &#8211; EWS classifies complete blocks (windows) of sequence. The basis functions (inputs) of the RVM are sums of position-weight matrix scores <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> across the whole window.</p>
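         <p>The window-level basis function described above can be illustrated with a minimal sketch. It assumes that a weight matrix is stored as a list of per-column base probabilities and that the score of a word is the product of its column probabilities; the function names, the toy matrix and this scoring convention are illustrative assumptions, not the Eponine implementation.</p>

```python
def pwm_score(pwm, word):
    """Score one word against a weight matrix (list of per-column dicts).

    Assumption: the score is the product of per-column base probabilities.
    """
    score = 1.0
    for column, base in zip(pwm, word):
        score *= column.get(base, 0.0)
    return score


def window_basis_value(pwm, window):
    """Sum the matrix score over every offset at which it fits in the window."""
    width = len(pwm)
    offsets = range(len(window) - width + 1)
    return sum(pwm_score(pwm, window[i:i + width]) for i in offsets)


# Toy 2-column matrix favouring "cg"; the values are illustrative only.
toy_pwm = [
    {"a": 0.05, "c": 0.85, "g": 0.05, "t": 0.05},
    {"a": 0.05, "c": 0.05, "g": 0.85, "t": 0.05},
]

value = window_basis_value(toy_pwm, "acgt")
```

         <p>Because the score is summed over all offsets, the resulting value does not depend on where in the window a motif occurs, which matches the window-level (rather than point-level) classification performed by EWS.</p>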
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>We considered a set of alignments made by the blastz program <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> between release NCBI33 of the human genome and release NCBIM30 of the mouse genome. Since unprocessed blastz aligns around 40% of human sequence to the mouse genome, we chose to focus on the 'tight' alignments. These are a subset of alignments which are rescored and thresholded using a set of parameters given in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, and cover only around 5.6% of the human genome &#8211; a proportion much closer to the fraction of bases thought to be under purifying selection <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>.</p>
         <p>In total, the tight blastz set contained 787173 blocks of sequence with high-scoring alignments between the two genomes. We considered only those blocks assigned to human chromosome 6, a 170 Mb chromosome which has recently undergone manual annotation of gene structures and other features <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. This chromosome included 44105 (5.6%) of the total alignments. These varied in length from 34 to 9382 bases, with a length distribution skewed towards relatively short alignments, as shown in figure <figr fid="F1">1</figr>.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Blastz alignments between human chromosome 6 and the mouse genome</p>
            </caption>
            <text>
               <p><b>Blastz alignments between human chromosome 6 and the mouse genome. </b>Histogram showing number of alignments covering human sequences of various lengths.</p>
            </text>
            <graphic file="1471-2105-5-131-1"/>
         </fig>
         <p>Since we were interested in non-coding features of the genome, we ignored all regions where an alignment overlapped an annotated gene structure. This removed 20.8% of aligned bases. It is possible that some genes, and especially pseudogenes, have been missed by the annotation process, so we also removed portions covered by <it>ab initio </it>gene predictions from the Genscan program <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. This eliminated an additional 4.3% of aligned bases. Finally, repetitive sequence elements annotated by the programs RepeatMasker <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> and trf <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> (5.9%) were removed from the working set. The remainder of the aligned regions were split into non-overlapping 200 base windows, ignoring any portions shorter than 200 bases. This gave a set of 13925 sequences which are well-conserved between human and mouse &#8211; and therefore likely to be functional &#8211; but which are very unlikely to be part of the protein-coding repertoire. These formed the positive training set for our machine learning strategy.</p>
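         <p>The windowing step can be sketched as follows. The sketch assumes half-open (start, end) coordinates and that windows are tiled from the left edge of each remaining region; neither detail is stated in the text, and the function name is hypothetical.</p>

```python
def split_into_windows(start, end, size=200):
    """Tile a region with non-overlapping windows of `size` bases.

    Assumption: coordinates are half-open and tiling begins at `start`.
    Any trailing remainder shorter than `size` is dropped.
    """
    count = (end - start) // size
    return [(start + i * size, start + (i + 1) * size) for i in range(count)]


windows = split_into_windows(0, 450)
```

         <p>In this example a 450-base region yields two windows and the final 50 bases are ignored; a region shorter than 200 bases yields none.</p>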
         <p>A negative training set of equal size was prepared by picking 200-base windows at random from the non-coding, non-repetitive portions of chromosome 6, using the same criteria to define repeats and coding sequence. While it is probable that this set also included some functional sequences, we would expect them to be represented at a substantially lower level than in the conserved set.</p>
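         <p>One way to draw such a negative set is sketched below. It assumes that the permitted (non-coding, non-repetitive) portions are available as half-open (start, end) intervals, and it weights each interval by its number of valid window start positions so that every start is equally likely. The function name, the seed parameter and the fact that sampled windows may overlap one another are assumptions of this sketch, not details taken from the text.</p>

```python
import random


def sample_windows(regions, count, size=200, seed=0):
    """Pick `count` random windows of `size` bases from the permitted regions."""
    rng = random.Random(seed)
    # Only regions long enough to hold a full window can be used.
    usable = [(s, e) for s, e in regions if e - s >= size]
    # Weight each region by its number of possible window start positions.
    weights = [e - s - size + 1 for s, e in usable]
    windows = []
    for s, e in rng.choices(usable, weights=weights, k=count):
        offset = rng.randint(s, e - size)
        windows.append((offset, offset + size))
    return windows


negative_set = sample_windows([(0, 150), (1000, 2000)], count=5)
```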
         <p>These two sets of sequence were presented to the Eponine Windowed Sequence machine learning system, as described in the methods section. Randomly chosen 5-base words were used as seed motifs, and three independent training runs were performed, each for 2000 cycles. The set of motifs used in model 1 is shown in table <tblr tid="T1">1</tblr>.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Motifs used in EWS homology model 1. The entries in this table show consensus sequences of the weight matrices used in the model (note that it is possible for two distinct weight matrices to have the same consensus sequence). Motifs are listed in both forwards and reverse-complement orientation, and the two sections of the table indicate whether that motif is given a positive or negative weight in the learned linear model.</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c cspan="2" ca="left">
                     <p>Positive</p>
                  </c>
                  <c cspan="2" ca="left">
                     <p>Negative</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Forward</p>
                  </c>
                  <c ca="left">
                     <p>Reverse</p>
                  </c>
                  <c ca="left">
                     <p>Forward</p>
                  </c>
                  <c ca="left">
                     <p>Reverse</p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>gtca</p>
                  </c>
                  <c ca="left">
                     <p>tgac</p>
                  </c>
                  <c ca="left">
                     <p>tacgt</p>
                  </c>
                  <c ca="left">
                     <p>acgta</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>tattg</p>
                  </c>
                  <c ca="left">
                     <p>caata</p>
                  </c>
                  <c ca="left">
                     <p>gggca</p>
                  </c>
                  <c ca="left">
                     <p>tgccc</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>tgcca</p>
                  </c>
                  <c ca="left">
                     <p>tggca</p>
                  </c>
                  <c ca="left">
                     <p>gtca</p>
                  </c>
                  <c ca="left">
                     <p>tgac</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ggca</p>
                  </c>
                  <c ca="left">
                     <p>tgcc</p>
                  </c>
                  <c ca="left">
                     <p>acaat</p>
                  </c>
                  <c ca="left">
                     <p>attgt</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>tacgt</p>
                  </c>
                  <c ca="left">
                     <p>acgta</p>
                  </c>
                  <c ca="left">
                     <p>gggc</p>
                  </c>
                  <c ca="left">
                     <p>gcccc</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>gtact</p>
                  </c>
                  <c ca="left">
                     <p>agtac</p>
                  </c>
                  <c ca="left">
                     <p>tact</p>
                  </c>
                  <c ca="left">
                     <p>agta</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>taac</p>
                  </c>
                  <c ca="left">
                     <p>gtta</p>
                  </c>
                  <c ca="left">
                     <p>cctcc</p>
                  </c>
                  <c ca="left">
                     <p>ggagg</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ttt</p>
                  </c>
                  <c ca="left">
                     <p>aaa</p>
                  </c>
                  <c ca="left">
                     <p>ggca</p>
                  </c>
                  <c ca="left">
                     <p>tgcc</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>acaat</p>
                  </c>
                  <c ca="left">
                     <p>attgt</p>
                  </c>
                  <c ca="left">
                     <p>tattg</p>
                  </c>
                  <c ca="left">
                     <p>caata</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>caatt</p>
                  </c>
                  <c ca="left">
                     <p>aattg</p>
                  </c>
                  <c ca="left">
                     <p>tattg</p>
                  </c>
                  <c ca="left">
                     <p>caata</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>cagc</p>
                  </c>
                  <c ca="left">
                     <p>gctg</p>
                  </c>
                  <c ca="left">
                     <p>aaatt</p>
                  </c>
                  <c ca="left">
                     <p>aattt</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>cag</p>
                  </c>
                  <c ca="left">
                     <p>ctg</p>
                  </c>
                  <c ca="left">
                     <p>caat</p>
                  </c>
                  <c ca="left">
                     <p>attg</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>cggat</p>
                  </c>
                  <c ca="left">
                     <p>atccg</p>
                  </c>
                  <c ca="left">
                     <p>gtat</p>
                  </c>
                  <c ca="left">
                     <p>atac</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>aaatt</p>
                  </c>
                  <c ca="left">
                     <p>aattt</p>
                  </c>
                  <c ca="left">
                     <p>ccagg</p>
                  </c>
                  <c ca="left">
                     <p>cctgg</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>gctcg</p>
                  </c>
                  <c ca="left">
                     <p>cgagc</p>
                  </c>
                  <c ca="left">
                     <p>catg</p>
                  </c>
                  <c ca="left">
                     <p>catg</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ggc</p>
                  </c>
                  <c ca="left">
                     <p>gcc</p>
                  </c>
                  <c ca="left">
                     <p>act</p>
                  </c>
                  <c ca="left">
                     <p>agt</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>taagg</p>
                  </c>
                  <c ca="left">
                     <p>cctta</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>aaaaa</p>
                  </c>
                  <c ca="left">
                     <p>ttttt</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
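         <p>The Forward and Reverse columns of table 1 are related by reverse complementation. A minimal sketch of that operation, using a hypothetical helper name and assuming lower-case a, c, g and t only:</p>

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(motif):
    """Complement each base, then reverse the resulting string."""
    return motif.translate(COMPLEMENT)[::-1]


example = reverse_complement("gtca")
```

         <p>A palindromic consensus such as catg is its own reverse complement, which is why it appears identically in both columns.</p>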
         <p>While the exact set of motifs used in the model varied somewhat from run to run, testing pairs of models on non-overlapping windows from a 1 Mb region of human chromosome 22 and plotting the scores showed that the model outputs were highly correlated (<it>e.g</it>. figure <figr fid="F2">2</figr>). We calculated the Pearson correlation coefficient for all pairs, and in all cases this was greater than 0.96. From this strong correlation, we concluded that any variations in the model were simply the result of the trainer picking one representative from a group of motifs which provide similar information.</p>
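         <p>For reference, the Pearson correlation coefficient between the scores of two models on the same windows can be computed as in the sketch below. The function name and the toy score lists are illustrative and are not taken from the study.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative scores for two models on the same three windows.
r = pearson([0.1, 0.5, 0.9], [0.2, 0.6, 0.8])
```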
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Correlation of model scores</p>
            </caption>
            <text>
               <p><b>Correlation of model scores. </b>Scatter plot showing the scores of EWS models 1 and 2 on a set of human sequences.</p>
            </text>
            <graphic file="1471-2105-5-131-2"/>
         </fig>
         <p>We scanned genomic sequences using these models at a range of thresholds, and examined the results on the Ensembl genome browser <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> using a Distributed Annotation System <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> server. Visual inspection showed that many of the highest-scoring regions were localised near the start of genes. This prompted us to look at the distribution of high-scoring sequences with respect to the starts of a set of well-annotated genes. We considered the GD_mRNA genes from version 2.3 of the human chromosome 22 annotation. These are confidently annotated genes with experimental evidence as described in <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, which confirms at least the approximate location of the ends of the transcripts, and are independent from the chromosome 6 training data. Figure <figr fid="F3">3</figr> shows the density of predictions with EWS scores &#8805; 0.90 relative to the annotated 5' ends of these genes. This shows a strong peak of predictions close to the annotated starts, demonstrating that the model is predicting some sequences commonly located around the transcription start site of genes. Combining this observation with the fact that the model was trained from conserved (and therefore presumed functional) sequences, we believe that it is detecting signals found in the promoter regions of genes.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Localisation of predictions</p>
            </caption>
            <text>
               <p><b>Localisation of predictions. </b>Density of predictions from one of the homology models around known gene starts on human chromosome 22.</p>
            </text>
            <graphic file="1471-2105-5-131-3"/>
         </fig>
         <p>Evaluation of promoter-prediction methods on a large scale is a difficult exercise, since there are no large pieces of genomic sequence for which we can be certain we know the complete set of transcribed regions, and even in the case of well-known genes we often do not know the precise location at which transcription begins. In <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, we developed a pseudochromosome, derived from release 2.3 of the chromosome 22 annotation. As described above, this includes a subset of 284 experimentally verified gene structures. The pseudochromosome was constructed to include these genes while omitting all other annotated genes (which could be substantially truncated). We considered predictions (groups of one or more overlapping windows which all have scores greater than some chosen threshold) to be correct if they lie within 2 kb of an annotated gene start, and false otherwise. Plotting accuracy (proportion of predictions which are correct) against coverage (proportion of transcript starts which are detected by one of the correct predictions) gives a Receiver Operating Characteristic (ROC) curve. Using this criterion, a totally random set of predictions would be given an accuracy of around 0.07. ROC curves are plotted for the three independently trained models in figure <figr fid="F4">4</figr>. Firstly, this shows that predictive performance for all three models is rather similar. It also shows that they can function as accurate promoter predictors, with accuracy rising to a plateau of around 0.7, much higher than expected for random predictions.</p>
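         <p>The accuracy and coverage definitions above can be sketched as follows. The sketch assumes that each prediction is a closed (start, end) interval, that distance is measured from the nearest edge of the interval to the annotated gene start, and that strand is ignored; these details and the function names are assumptions rather than a description of the evaluation code actually used.</p>

```python
def distance_to_interval(position, start, end):
    """Distance from a point to a closed interval; zero if the point is inside."""
    return max(start - position, 0, position - end)


def accuracy_and_coverage(predictions, gene_starts, max_distance=2000):
    """Return (accuracy, coverage) for a set of predicted intervals."""
    correct = 0
    detected = set()
    for start, end in predictions:
        hits = [g for g in gene_starts
                if max_distance >= distance_to_interval(g, start, end)]
        if hits:
            correct += 1
            detected.update(hits)
    accuracy = correct / len(predictions) if predictions else 0.0
    coverage = len(detected) / len(gene_starts) if gene_starts else 0.0
    return accuracy, coverage


# Toy data: one prediction near a gene start, one far from any gene start.
result = accuracy_and_coverage([(1000, 1200), (50000, 50200)], [2500, 90000])
```

         <p>Repeating this calculation over a range of score thresholds gives one accuracy and coverage pair per threshold, which is how a curve of the kind shown in figure 4 is traced out.</p>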
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Accuracy and coverage of TSS prediction</p>
            </caption>
            <text>
               <p><b>Accuracy and coverage of TSS prediction. </b>Plots of accuracy vs. coverage at a range of score thresholds (ROC curves) for three independently trained homology models.</p>
            </text>
            <graphic file="1471-2105-5-131-4"/>
         </fig>
         <p>We picked model 1 for further study. Using a score threshold of 0.91, this gives an accuracy of 0.68 and a coverage of 0.31. We compared the set of genes correctly detected by this model to two other methods: firstly, the EponineTSS predictor described in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, and secondly, the published results from the PromoterInspector program <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. PromoterInspector results were mapped to pseudochromosome coordinates using the procedure described in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Figure <figr fid="F5">5</figr> shows how the set of promoters detected by these three distinct methods overlaps. There are clearly strong correlations between all three methods. In particular, at this threshold the EWS homology model detects 98 promoters which were found by at least one of the other methods, but only 4 novel promoters.</p>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Comparison of TSS prediction methods</p>
            </caption>
            <text>
               <p><b>Comparison of TSS prediction methods. </b>Sets of pseudochromosome promoters correctly predicted by three different prediction methods: EponineTSS [6] with a score threshold of 0.999, PromoterInspector (labelled "Pro'spector"), and the homology-EWS model 1 with a score threshold of 0.91 ("Homol_1").</p>
            </text>
            <graphic file="1471-2105-5-131-5"/>
         </fig>
         <p>We investigated the robustness of the signal learned by this process by retraining models with a variety of seed word sizes, from 2 to 6 bases. During training, motifs can be trimmed to lengths shorter than that of the seed words (down to a minimum of 2 bases) but can never grow longer than the seed word size. When evaluated on the pseudochromosome, the resulting models always showed a preference for regions around gene starts, regardless of word length, as shown in figure <figr fid="F6">6</figr>. However, the accuracy was reduced when using short seed words &#8211; particularly words of length 2. The best accuracy was seen for a seed word length of 5, and decreased somewhat for words of length 6.</p>
         <fig id="F6">
            <title>
               <p>Figure 6</p>
            </title>
            <caption>
               <p>Effect of seed-word size on learning</p>
            </caption>
            <text>
               <p><b>Effect of seed-word size on learning. </b>Accuracy vs. coverage plots for models trained using seed word lengths of 2 to 6 bases.</p>
            </text>
            <graphic file="1471-2105-5-131-6"/>
         </fig>
         <p>This suggests that a large fraction (but not all) of the information learned by these models can be encoded in dinucleotide frequencies. It is well known that many transcription start sites are close to regions of relatively high CpG dinucleotide composition (CpG islands) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. To investigate the contribution that CpG dinucleotides make to our models, we deleted all CpG dinucleotides from the training data, then re-evaluated the resulting models on the pseudochromosome (also with CpG dinucleotides removed), as shown in figure <figr fid="F7">7</figr>. Perhaps not surprisingly, dinucleotide models now show very little tendency to detect gene starts. However, as the word size increases, the preference for gene starts gradually increases, until a seed size of 6 gives an accuracy comparable to that seen when CpG dinucleotides are included, although the maximum coverage before accuracy begins to drop rapidly is somewhat lower. Broadly similar results are seen if CpG dinucleotides are randomly replaced with other dinucleotides.</p>
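         <p>The CpG deletion step can be sketched as below. Deleting a CpG can bring a flanking c and g together and so create a new CpG (for example in ccgg); the text does not say how such cases were handled, so this sketch assumes that deletion is repeated until no CpG remains. The function name is hypothetical, and the sequence is lower-cased for simplicity.</p>

```python
def remove_cpg(sequence):
    """Delete every "cg" dinucleotide, repeating until none remain.

    Assumption: CpGs newly formed at deletion junctions are also removed.
    """
    seq = sequence.lower()
    while "cg" in seq:
        seq = seq.replace("cg", "")
    return seq


stripped = remove_cpg("aacgcgtt")
```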
         <fig id="F7">
            <title>
               <p>Figure 7</p>
            </title>
            <caption>
               <p>Effect of excluding CpG dinucleotides</p>
            </caption>
            <text>
               <p><b>Effect of excluding CpG dinucleotides. </b>Accuracy vs. coverage plots for models trained using a range of seed-word sizes, with all CpG dinucleotides removed from both training and test data.</p>
            </text>
            <graphic file="1471-2105-5-131-7"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>We have shown here that, when presented with a set of non-coding sequences which are strongly conserved between human and mouse, a simple motif-oriented machine learning system consistently builds models which are able to detect a substantial fraction of human promoter regions with good accuracy. This strongly suggests that this promoter signal represents the most widely used motif-based signal in functional non-coding sequence. While the model learned here can clearly be applied for the purpose of genome-wide promoter annotation, in practice existing methods offer better coverage and (in the case of the EponineTSS predictor) predictions for the precise location of the transcription start site.</p>
         <p>It is interesting that the promoter model learned by this technique detected substantially the same set of promoters as found by the EponineTSS and PromoterInspector methods. It has previously been remarked that these two methods detect similar sets <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, but this could perhaps be explained by the fact that both methods were initially derived from similar sets of known promoter sequences (in both cases, training data was extracted from the EPD database <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>). In the case of the homology models described here, there is no connection with EPD, or any similar set of known promoters: the training data was picked purely on the basis of its high similarity to corresponding portions of the mouse genome. These results therefore support the alternative view that there is a particular 'easily detected' subclass of promoter sequences.</p>
         <p>One distinct group of promoters, which previous results show may correspond to this easily detected family, is the set of promoters associated with CpG islands <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. However, while a number of the motifs listed in table <tblr tid="T1">1</tblr> are G/C rich and/or contain the CpG dinucleotide, by no means all of the motifs match this description, and indeed one motif containing CpG has a negative weight in the linear model &#8211; its presence in a sequence will reduce the model's output score &#8211; while some A/T rich motifs have positive weights. We therefore believe that the signals detected here are significantly more complex than a simple over-representation of CpG dinucleotides. Experiments with smaller seed-word sizes support this assumption: while dinucleotide-based models were also able to predict promoter regions, the accuracy was lower than for models including longer motifs. Finally, we show that while the predictive capacity of dinucleotide models is largely eliminated once CpG dinucleotides are removed from the sequence, models including longer words are still able to make correct promoter predictions in many cases. So while CpG dinucleotides are an important contribution to the promoter signal, they are clearly not the only component.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>Genomic sequence and annotation</p>
            </st>
            <p>Human genome sequence release NCBI33 and mouse genome release NCBIM30 were extracted from Ensembl databases <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, which also contained gene predictions from Genscan <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> and repeat data from RepeatMasker <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> and trf <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Curated annotation of gene structures on human chromosome 6 was obtained from the Vega database <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Vega and Ensembl data were extracted directly from the SQL databases using the BioJava toolkit with biojava-ensembl extensions <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Genome alignments</p>
            </st>
            <p>Human-mouse genome alignments were generated by the blastz alignment program. These were subsequently re-scored and filtered to give a 'tight' set of high-confidence alignments, as described in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. We downloaded the tight alignment set from the UCSC genome website <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Pseudochromosome for testing promoter-finding methods</p>
            </st>
            <p>A 16.3 Mb pseudochromosome sequence was produced based on version 2.3 of the curated annotation for human chromosome 22. This includes all the experimentally-validated gene structures and their upstream regions, while omitting regions containing genes that are predicted but not fully verified. In the case of a pair of divergent genes where one has been verified and the second has not, their shared upstream region was cut at the midpoint. More information about pseudochromosome construction is given in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Eponine Windowed Sequence learning</p>
            </st>
            <p>The Eponine Windowed Sequence (EWS) model is designed by analogy to the Eponine Anchored Sequence model first described in <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, but rather than targeting individual points in the sequence, it is designed to classify small regions or windows of a sequence, based purely on their own sequence content.</p>
            <p>The EWS model uses the Relevance Vector Machine <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> algorithm to drive the training process. Relevance Vector Machines solve classification and regression problems by building Generalised Linear Models (GLMs) as weighted sums of a "working set" of basis functions. During the training process, those basis functions which are not informative are given weights close to zero and eventually discarded from the working set. To explore very large sets of possible basis functions, it is possible to add extra basis functions during the course of the training process <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
            <p>The "sensors" of the EWS model are DNA position-weight matrices <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, which make convenient models of short sequence motifs. When using weight matrices to analyse sequence windows, we sum the weight matrix probability scores for all possible positions within the sequence. Normalising for the length of the sequence being inspected and the size of the PWM, the basis functions of the model take the form:</p>
            <p>
               <graphic file="1471-2105-5-131-i1.gif"/>
            </p>
            <p>where <it>W</it>(<it>s</it>) is the probability that sequence <it>s </it>was emitted by weight matrix <it>W</it>, |<it>S</it>| is the sequence length, |<it>W</it>| is the weight matrix length, and <graphic file="1471-2105-5-131-i2.gif"/> denotes a subsequence from <it>i </it>to <it>j</it>.</p>
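            <p>As an illustration only, the windowed basis function described above can be sketched in Python. This is not the Eponine implementation: the function names and the dictionary-based matrix representation are hypothetical, and the sketch assumes that the normalisation is a plain mean of the weight matrix probabilities over the |S| - |W| + 1 possible start positions, which may differ in detail from the published formula.</p>

```python
from typing import Dict, List

# One weight matrix column: emission probability for each base (hypothetical layout).
Column = Dict[str, float]


def pwm_probability(pwm: List[Column], site: str) -> float:
    """Probability that `site` was emitted by `pwm`: product of per-column probabilities."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p


def windowed_score(pwm: List[Column], seq: str) -> float:
    """Mean weight matrix probability over every start position in the window.

    Assumption: normalising "for the length of the sequence and the size of
    the PWM" is taken to mean dividing by len(seq) - len(pwm) + 1.
    """
    n_positions = len(seq) - len(pwm) + 1
    if n_positions >= 1:
        total = sum(
            pwm_probability(pwm, seq[i:i + len(pwm)]) for i in range(n_positions)
        )
        return total / n_positions
    # The window is shorter than the matrix, so no placement is possible.
    return 0.0
```

            <p>Under this assumption, a matrix that matches only the word AC scores 2/3 on the window ACAC, because two of the three possible placements match exactly.</p>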
            <p>An initial set of basis functions is proposed by taking all possible DNA motifs of a specified length (typically 5) and generating weight matrices which preferentially recognise these motifs. As the relevance vector machine trainer removes non-informative basis functions from the working set, they are replaced by applying one of the following sampling strategies to a basis function picked randomly from the working set:</p>
            <p>&#8226; Generate a new weight matrix in which each column is a sample from a Dirichlet distribution with its mode equal to the weights in the corresponding column of the parent weight matrix.</p>
            <p>&#8226; Generate a new weight matrix one column shorter than the parent by removing either the first or the last column.</p>
            <p>By using these sampling rules, the trainer is able to explore motif space. The process of generating candidate motifs using these rules then selecting the most informative using the RVM can be seen as a form of genetic algorithm.</p>
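            <p>The seeding step and the two sampling rules above can be sketched as follows. This is a minimal illustration rather than the Eponine code: all function names are hypothetical, and the seed probability of 0.85 and the Dirichlet concentration of 50 are assumed values chosen for the example. The Dirichlet parameters are set to 1 + c * p for each base, which places the mode of the distribution at the parent column p.</p>

```python
import random
from typing import Dict, List, Optional

BASES = "ACGT"
Column = Dict[str, float]


def seed_pwm(motif: str, preferred: float = 0.85) -> List[Column]:
    """Weight matrix that preferentially recognises `motif` (0.85 is an assumed value)."""
    rest = (1.0 - preferred) / (len(BASES) - 1)
    return [{b: (preferred if b == base else rest) for b in BASES} for base in motif]


def resample_column(column: Column, concentration: float, rng: random.Random) -> Column:
    """Draw a column from a Dirichlet whose mode is the parent column.

    With alpha_b = 1 + concentration * p_b the mode is (alpha_b - 1) / concentration = p_b.
    A Dirichlet draw is obtained by normalising independent gamma variates.
    """
    draws = {b: rng.gammavariate(1.0 + concentration * column[b], 1.0) for b in BASES}
    total = sum(draws.values())
    return {b: draws[b] / total for b in BASES}


def resample_pwm(parent: List[Column], concentration: float = 50.0,
                 rng: Optional[random.Random] = None) -> List[Column]:
    """First sampling rule: perturb every column of the parent matrix."""
    rng = rng or random.Random()
    return [resample_column(col, concentration, rng) for col in parent]


def trim_pwm(parent: List[Column], rng: Optional[random.Random] = None) -> List[Column]:
    """Second sampling rule: drop either the first or the last column."""
    if len(parent) == 1:
        raise ValueError("cannot shorten a single-column weight matrix")
    rng = rng or random.Random()
    return parent[1:] if rng.random() > 0.5 else parent[:-1]
```

            <p>In this sketch a larger concentration keeps the sampled matrix closer to its parent, while a smaller one explores motif space more aggressively; how the actual trainer balances the two is not specified here.</p>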
         </sec>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>TD and TH conceived and designed this study, and analysed results. TD implemented the Eponine machine learning system and drafted the manuscript. All authors read and approved the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>Chromosome 22 annotation data version 2.3 were produced by the Chromosome 22 Annotation Group at the Sanger Institute and were obtained from the World Wide Web at http://www.sanger.ac.uk/HGP/Chr22 (Dunham <it>et al</it>. unpublished data). TD would like to thank the Wellcome Trust for funding.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Initial sequencing and analysis of the human genome</p>
            </title>
            <aug>
               <au>
                  <cnm>The Genome International Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2001</pubdate>
            <volume>409</volume>
            <fpage>860</fpage>
            <lpage>921</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35057062</pubid>
                  <pubid idtype="pmpid" link="fulltext">11237011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Initial sequencing and comparative analysis of the mouse genome</p>
            </title>
            <aug>
               <au>
                  <cnm>The Mouse Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>420</volume>
            <fpage>520</fpage>
            <lpage>562</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature01262</pubid>
                  <pubid idtype="pmpid" link="fulltext">12466850</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Computational comparison of two mouse draft genomes and the human golden path</p>
            </title>
            <aug>
               <au>
                  <snm>Xuan</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2002</pubdate>
            <volume>4</volume>
            <fpage>R1</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">151282</pubid>
                  <pubid idtype="pmpid" link="fulltext">12537546</pubid>
                  <pubid idtype="doi">10.1186/gb-2002-4-1-r1</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Human-Mouse Alignments with BLASTZ</p>
            </title>
            <aug>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kent</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Smit</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Baertsch</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hardison</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Genome Res.</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>103</fpage>
            <lpage>107</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430961</pubid>
                  <pubid idtype="pmpid" link="fulltext">12529312</pubid>
                  <pubid idtype="doi">10.1101/gr.809403</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Sparse Bayesian learning and the relevance vector machine</p>
            </title>
            <aug>
               <au>
                  <snm>Tipping</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Journal of Machine Learning Research</source>
            <pubdate>2000</pubdate>
            <volume>1</volume>
            <fpage>211</fpage>
            <lpage>244</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1162/15324430152748236</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Computational Detection and Location of Transcription Start Sites in Mammalian Genomic DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Down</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Genome Res.</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <fpage>652</fpage>
            <lpage>658</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1101/gr.216102</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1990</pubdate>
            <volume>212</volume>
            <fpage>563</fpage>
            <lpage>578</lpage>
            <xrefbib>
               <pubid idtype="pmpid">2329577</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>The DNA sequence and analysis of human chromosome 6</p>
            </title>
            <aug>
               <au>
                  <snm>Mungall</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Palmer</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sims</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ashurst</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wilming</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Horton</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hunt</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Scott</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Gilbert</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Clamp</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bethel</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Milne</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ainscough</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Almeida</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ambrose</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Andrews</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ashwell</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Babbage</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bagguley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bailey</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Banerjee</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Barlow</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Bates</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Beare</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Beasley</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Beasley</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Bird</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Blakey</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bray-Allen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Brook</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Burford</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Burrill</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Burton</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Carder</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Carter</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Chapman</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Clee</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Clegg</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Cobley</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Collier</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Collins</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Colman</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Corby</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Coville</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Culley</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Dhami</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Davies</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dunn</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Earthrowl</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ellington</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Faulkner</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Francis</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Frankish</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Frankland</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>French</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Garnett</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ghori</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gilby</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Gillson</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Glithero</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Grafham</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Grant</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gribble</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Griffiths</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Griffiths</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hall</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Halls</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hammond</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Harley</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hart</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Heath</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Heathcott</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Holmes</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Howden</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Howe</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Howell</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Huckle</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Humphray</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Humphries</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Hunt</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Joy</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kay</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Keenan</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kimberley</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>King</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Laird</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Langford</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lawlor</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Leongamornlert</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Leversha</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lloyd</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lloyd</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Loveland</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lovell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Mashreghi-Mohammadi</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Maslen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Matthews</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>McCann</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>McLaren</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>McLay</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>McMurray</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Moore</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Mullikin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Niblett</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Nickerson</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Novik</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Oliver</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Overton-Larty</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Parker</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Patel</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Pearce</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Peck</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Phillimore</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Phillips</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Plumb</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Porter</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ramsey</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Ranby</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rice</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ross</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Searle</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sehra</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Sheridan</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Skuce</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Spraggon</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Squares</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Steward</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sycamore</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Tamlyn-Hall</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Tester</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Theaker</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Thomas</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Thorpe</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Tracey</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Tromans</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Tubby</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Wall</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Wallis</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>West</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>White</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Whitehead</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Whittaker</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Wild</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Willey</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Wilmer</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>JM</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Wray</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Wyatt</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Young</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Younger</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bentley</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Coulson</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sulston</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dunham</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>J</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Beck</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2003</pubdate>
            <volume>425</volume>
            <fpage>805</fpage>
            <lpage>811</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature02055</pubid>
                  <pubid idtype="pmpid" link="fulltext">14574404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Prediction of complete gene structures in human genomic DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Burge</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>J. Mol. Biol.</source>
            <pubdate>1997</pubdate>
            <volume>268</volume>
            <fpage>78</fpage>
            <lpage>94</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1997.0951</pubid>
                  <pubid idtype="pmpid" link="fulltext">9149143</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>RepeatMasker</p>
            </title>
            <aug>
               <au>
                  <snm>Smit</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <url>http://ftp.genome.washington.edu/RM/RepeatMasker.html</url>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Tandem repeats finder: a program to analyze DNA sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res.</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>573</fpage>
            <lpage>580</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">148217</pubid>
                  <pubid idtype="pmpid" link="fulltext">9862982</pubid>
                  <pubid idtype="doi">10.1093/nar/27.2.573</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The Ensembl genome database project</p>
            </title>
            <aug>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Barker</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Cameron</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Cox</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Cuff</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Curwen</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Down</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Eyras</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Gilbert</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hammond</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Huminiecki</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kasprzyk</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lehvaslaiho</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Lijnzaad</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Melsopp</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Mongin</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Pettett</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Pocock</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Potter</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Rust</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Schmidt</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Searle</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Slater</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Spooner</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Stabenau</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Stalker</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Stupka</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Ureta-Vidal</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Vastrik</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Clamp</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res.</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>38</fpage>
            <lpage>41</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/30.1.38</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Distributed Annotation System</p>
            </title>
            <url>http://www.biodas.org/</url>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Reevaluating Human Gene Annotation: A Second-Generation Analysis of Chromosome 22</p>
            </title>
            <aug>
               <au>
                  <snm>Collins</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Goward</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Cole</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Smink</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Huckle</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Knowles</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Beare</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Dunham</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Genome Res.</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>27</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">430954</pubid>
                  <pubid idtype="pmpid" link="fulltext">12529303</pubid>
                  <pubid idtype="doi">10.1101/gr.695703</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>First pass annotation of promoters on human chromosome 22</p>
            </title>
            <aug>
               <au>
                  <snm>Scherf</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Klingenhoff</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Frech</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Quandt</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Grote</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Frisch</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gailus-Durner</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Seidel</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Brack-Werner</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Werner</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Genome Res.</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>333</fpage>
            <lpage>340</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.154601</pubid>
                  <pubid idtype="pmpid" link="fulltext">11230158</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>CpG islands and genes</p>
            </title>
            <aug>
               <au>
                  <snm>Cross</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Bird</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Curr. Opin. Genet. Dev.</source>
            <pubdate>1995</pubdate>
            <volume>5</volume>
            <fpage>309</fpage>
            <lpage>314</lpage>
         </bibl>
         <bibl id="B17">
            <title>
               <p>The Eukaryotic Promoter Database (EPD)</p>
            </title>
            <aug>
               <au>
                  <snm>Perier</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Praz</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Junier</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Bonnard</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bucher</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res.</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>302</fpage>
            <lpage>303</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/28.1.302</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Vega Genome Browser</p>
            </title>
            <url>http://vega.sanger.ac.uk/</url>
         </bibl>
         <bibl id="B19">
            <title>
               <p>BioJava</p>
            </title>
            <url>http://www.biojava.org/</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>UCSC Genome Bioinformatics</p>
            </title>
            <url>http://genome.cse.ucsc.edu/</url>
         </bibl>
      </refgrp>
   </bm>
</art>
