<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>gb-2005-6-2-r18</ui>
	<ji>GBJ</ji>
	<fm>
		<dochead>Method</dochead>
		<bibl>
			<title>
				<p>Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Elemento</snm>
					<fnm>Olivier</fnm>
					<insr iid="I1"/>
					<email>elemento@princeton.edu</email>
				</au>
				<au id="A2" ca="yes">
					<snm>Tavazoie</snm>
					<fnm>Saeed</fnm>
					<insr iid="I1"/>
					<email>tavazoie@molbio.princeton.edu</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA</p>
				</ins>
			</insg>
			<source>Genome Biology</source>
			<issn>1465-6906</issn>
			<pubdate>2005</pubdate>
			<volume>6</volume>
			<issue>2</issue>
			<fpage>R18</fpage>
			<url>http://genomebiology.com/2005/6/2/R18</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">15693947</pubid><pubid idtype="doi">10.1186/gb-2005-6-2-r18</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>1</day>
					<month>9</month>
					<year>2004</year>
				</date>
			</rec>
			<revrec>
				<date>
					<day>29</day>
					<month>10</month>
					<year>2004</year>
				</date>
			</revrec>
			<acc>
				<date>
					<day>3</day>
					<month>12</month>
					<year>2004</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>26</day>
					<month>1</month>
					<year>2005</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2005</year>
			<collab>Elemento and Tavazoie; licensee BioMed Central Ltd.</collab>
			<note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<shorttitle>
			<p>Genome-wide discovery of conserved regulatory elements</p>
		</shorttitle>
		<shortabs>
			<p>The authors describe a powerful approach for discovering globally conserved regulatory elements between two genomes that does not require alignments. Its application to pairs of yeasts, worms, flies and mammals yields a large number of known and novel putative regulatory elements, many of which show surprising conservation across large phylogenetic distances.</p>
		</shortabs>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<p>We describe a powerful new approach for discovering globally conserved regulatory elements between two genomes. The method is fast, simple and comprehensive, and does not require alignments. Its application to pairs of yeasts, worms, flies and mammals yields a large number of known and novel putative regulatory elements. Many of these are validated by independent biological observations, have spatial and/or orientation biases, are co-conserved with other elements and show surprising conservation across large phylogenetic distances.</p>
			</sec>
		</abs>
	</fm>
	<meta>
		<classifications>
			<classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010008">Evolution</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
		</classifications>
	</meta>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>One of the major challenges facing biology is to reconstruct the entire network of protein-DNA interactions within living cells. A large fraction of protein-DNA interactions corresponds to transcriptional regulators binding DNA in the neighborhood of protein-coding and RNA genes. By interacting with RNA polymerase or recruiting chromatin-modifying machinery, transcriptional regulators increase or decrease the transcription rate of these genes. Transcriptional regulators bind specific DNA sequences upstream, within or downstream of the genes they regulate, and a large number of experimental and computational studies are aimed at locating these sites and understanding their functions (for example <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>). The increasing availability of whole-genome sequences provides unprecedented opportunities for identifying binding sites and studying their evolution. The strong conservation of functional elements (binding sites, protein-coding genes, noncoding RNAs, and so on) across even distantly related species should make it possible to predict these functional elements and prioritize them for experimental validation. The few large-scale comparative genomics approaches for finding transcriptional regulatory elements have so far relied mostly on detecting locally conserved motifs within global alignments of orthologous upstream sequences <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Although very powerful and straightforward, these approaches cannot be used when upstream regions are very divergent or have undergone genomic rearrangements. For example, aligning the mouse and puffer fish orthologous upstream regions would be very difficult, because of the great reduction that the puffer fish intergenic regions have undergone <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>. 
Global alignments also cannot be used when the positions of regulatory elements within functionally conserved promoter regions have been scrambled, for example through genomic rearrangements. Moreover, global alignment-based approaches often generate an overwhelming number of predictions because of the basal conservation between the genomes under study. To reduce the number of predictions, multiple global alignments of upstream sequences from several related species have been used, yielding many new candidate binding sites <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. However, multiple (more than two) closely related genome sequences are not always available; in addition, by focusing only on regulatory elements that are conserved between several genomes, these approaches might miss elements that are conserved only in more local areas of the phylogenetic tree.</p>
			<p>Here we describe a simple and efficient comparative approach for finding short noncoding DNA sequences that are globally conserved between two genomes, independently of their specific location within their respective promoter regions. Our method, which we call FastCompare, is based on a principle that we have termed 'network-level conservation' <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, according to which the wiring of transcriptional regulatory networks should be largely conserved between two closely related genomes.</p>
			<p>Our previous attempts at using network-level conservation relied on Gibbs sampling to find candidate regulatory elements <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. However, Gibbs sampling and related algorithms are not fully appropriate in this context, because of the low density of actual binding sites in pairs of orthologous upstream regions. Moreover, these algorithms are non-deterministic, relatively slow, and rely on sequence sampling, which makes them likely to miss many regulatory elements. While our previous approach was successful at predicting a large fraction of functional regulatory elements in the relatively small yeast genome, analyzing larger and more complex metazoan genomes requires faster and more exhaustive algorithms. Here, we use a faster, simpler and more comprehensive approach for detecting conserved and probably functional regulatory elements using the network-level conservation principle. FastCompare allows comprehensive exploration of the conserved - but not aligned - motifs between two genomes, while retaining a linear time complexity. We apply our approach to a large number of species, including yeasts, worms, flies and mammals, and describe some of the most conserved known and unknown regulatory elements within these genomes. We also show how this approach may help reconstruct part of the transcriptional network and reveal some of its associated constraints. Finally, we show that a large number of predicted motifs are conserved within and across different phylogenetic groups.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>In the following sections, pairs of closely related species are termed phylogenetic groups. We applied FastCompare to the following four phylogenetic groups: yeasts (<it>Saccharomyces cerevisiae </it>and <it>S. bayanus</it>), worms (<it>Caenorhabditis elegans </it>and <it>C. briggsae</it>), flies (<it>Drosophila melanogaster </it>and <it>D. pseudoobscura</it>) and mammals (<it>Homo sapiens </it>and <it>Mus musculus</it>). For each phylogenetic group, we describe some of the most interesting predicted regulatory elements, both known and novel. For each of these regulatory elements, we perform independent validation using gene expression data, chromatin immunoprecipitation (IP) data, known motifs and data from several biological databases (Gene Ontology (GO)/MIPS, TRANSFAC), and show that the most globally conserved predicted regulatory elements are strongly supported by these independent sources.</p>
			<sec>
				<st>
					<p>Yeasts</p>
				</st>
				<p>The average nucleotide identity between <it>S. cerevisiae </it>and <it>S. bayanus </it>upstream regions is approximately 62% <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> (similar to the identity between human and mouse upstream regions) and the divergence time is estimated at between 5 and 20 million years <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. The number of ortholog pairs between <it>S. cerevisiae </it>and <it>S. bayanus </it>is 4,358 (see Materials and methods). We chose to analyze 1 kb-long upstream regions, because most of the known transcription factor binding sites in <it>S. cerevisiae </it>are located within this range <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. Using FastCompare, we calculated a conservation score for all possible 7-, 8- and 9-mers on the corresponding 8.6 megabase-pairs (Mbp) of sequence and sorted each list separately according to conservation score (see Figure <figr fid="F1">1</figr>; the raw sorted lists are available on our website <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>). On a typical desktop PC, this analysis took approximately 5 minutes (for example, the entire set (8,170) of 7-mers was processed in 35 seconds).</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Overview of the FastCompare approach</p>
					</caption>
					<text>
						<p>Overview of the FastCompare approach. <b>(a) </b>Determination of orthologous pairs of ORFs, and extraction of the associated upstream regions (data not shown). <b>(b) </b>For each <it>k</it>-mer (here CACGTGA), determination of the sets of ORFs that contain it in their upstream regions, in each species separately. The conservation score (hypergeometric <it>p</it>-values to assess the overlap between both sets) is then calculated. <b>(c) </b>Ranking of all <it>k</it>-mers on the basis of their conservation scores.</p>
					</text>
					<graphic file="gb-2005-6-2-r18-1"/>
				</fig>
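				<p>The scoring and ranking steps summarized in Figure 1b,c can be sketched in a few lines of Python. This is a minimal illustration and not the FastCompare implementation. It assumes that the conservation score is the negative base-10 logarithm of the hypergeometric tail probability of the observed overlap, that the two members of an ortholog pair share one identifier, and that only the given strand is scanned for exact matches. The function names kmer_gene_set, conservation_score and rank_kmers are hypothetical.</p>

```python
from math import comb, log10


def kmer_gene_set(upstream, kmer):
    """Return the ids of genes whose upstream sequence contains kmer."""
    return {gene for gene, seq in upstream.items() if kmer in seq}


def conservation_score(upstream_a, upstream_b, kmer):
    """-log10 of the hypergeometric tail probability of the observed overlap.

    Assumption: both dictionaries map a shared ortholog id to an upstream
    sequence, one dictionary per species.
    """
    genes = set(upstream_a).intersection(upstream_b)
    total = len(genes)
    set_a = kmer_gene_set(upstream_a, kmer).intersection(genes)
    set_b = kmer_gene_set(upstream_b, kmer).intersection(genes)
    n_a, n_b = len(set_a), len(set_b)
    overlap = len(set_a.intersection(set_b))
    # Number of ways to draw n_b genes out of total so that the overlap
    # with the n_a marked genes is at least the observed one.
    tail = sum(comb(n_a, i) * comb(total - n_a, n_b - i)
               for i in range(overlap, min(n_a, n_b) + 1))
    # Work on exact integers and take logarithms last to avoid underflow.
    return log10(comb(total, n_b)) - log10(tail)


def rank_kmers(upstream_a, upstream_b, kmers):
    """Sort k-mers from the highest to the lowest conservation score."""
    return sorted(kmers,
                  key=lambda k: conservation_score(upstream_a, upstream_b, k),
                  reverse=True)
```

				<p>With upstream_a and upstream_b as dictionaries mapping ortholog identifiers to upstream sequences, rank_kmers(upstream_a, upstream_b, kmers) returns the candidate <it>k</it>-mers ordered from most to least conserved under these assumptions. Because each <it>k</it>-mer is scored from set sizes alone, no alignment of the upstream regions is needed.</p>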
				<sec>
					<st>
						<p>Distribution of conservation scores</p>
					</st>
					<p>As described in Materials and methods, conservation scores are calculated for all <it>k</it>-mers (with fixed <it>k</it>), and are relative measures of network-level conservation for these <it>k</it>-mers (the higher the conservation score, the more conserved the corresponding <it>k</it>-mer). We first describe the distribution of conservation scores for all 7-mers. As shown in Figure <figr fid="F2">2</figr>, the distribution of conservation scores has a very long tail and many 7-mers on the tail correspond to well known regulatory elements in <it>S. cerevisiae </it>(see below for a detailed description of these sites). To verify that such high conservation scores could not be obtained by chance, we generated randomized sequences as described in Materials and methods and re-ran FastCompare on these sequences. The corresponding distribution of conservation scores, also shown in Figure <figr fid="F2">2</figr>, indicates that the high conservation scores corresponding to known regulatory elements are extremely unlikely to arise by chance.</p>
					<fig id="F2">
						<title>
							<p>Figure 2</p>
						</title>
						<caption>
							<p>Distributions of conservation scores for actual (red) and randomized (black) data obtained when applying FastCompare to <it>S. cerevisiae </it>and <it>S. bayanus</it></p>
						</caption>
						<text>
							<p>Distributions of conservation scores for actual (red) and randomized (black) data obtained when applying FastCompare to <it>S. cerevisiae </it>and <it>S. bayanus</it>. Both distributions were constructed using bin sizes of 5. The top portion of the figure is not shown for the purpose of presentation. The distributions show that high conservation scores are unlikely to be obtained from randomized data. Also, a large number of 7-mers on the tail of the distribution correspond to experimentally verified transcription-factor-binding sites in yeast.</p>
						</text>
						<graphic file="gb-2005-6-2-r18-2"/>
					</fig>
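					<p>A randomized control of this kind can be sketched as follows. The randomization procedure actually used is described in Materials and methods; the sketch instead assumes the simplest possible scheme, in which the nucleotides of each upstream sequence are permuted independently, so it should be read as an illustration only. The bin size of 5 is taken from the legend of Figure 2, and the function names shuffle_sequences and histogram are hypothetical.</p>

```python
import random
from collections import Counter


def shuffle_sequences(upstream, rng):
    """Permute the nucleotides of every sequence independently.

    Assumption: this simple shuffle stands in for the randomization
    procedure of the paper, which is not reproduced here.
    """
    shuffled = {}
    for gene, seq in upstream.items():
        letters = list(seq)
        rng.shuffle(letters)
        shuffled[gene] = "".join(letters)
    return shuffled


def histogram(scores, bin_size=5):
    """Count scores per bin; each key is the lower edge of its bin."""
    return Counter(int(score // bin_size) * bin_size for score in scores)
```

					<p>Scoring every 7-mer once on the real sequences and once on the shuffled ones, then passing both score lists to histogram, gives two distributions that can be compared in the manner of Figure 2.</p>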
				</sec>
				<sec>
					<st>
						<p>Validation using independent biological data</p>
					</st>
					<p>We used various independent sources of biological data to demonstrate that <it>k</it>-mers with the highest conservation scores are likely to be functional. For a given <it>k</it>-mer, we define the 'conserved set' as the set of ORFs corresponding to the overlap between the two sets of orthologous ORFs containing at least one exact match to the <it>k</it>-mer in their upstream regions (see Materials and methods). We found that conserved sets defined for the highest-scoring 7-mers are significantly enriched with genes whose upstream regions contain occurrences of known motifs in yeast (Figure <figr fid="F3">3a</figr>), significantly enriched with genes whose upstream regions were shown to be bound by known transcription factors <it>in vivo </it>(Figure <figr fid="F3">3b</figr>), and significantly enriched in at least one MIPS functional category (Figure <figr fid="F3">3c</figr>). We also show that the number of 7-mers found upstream of over- or underexpressed genes in at least one microarray condition increases with the conservation score (Figure <figr fid="F3">3d</figr>) and that the number of 7-mers matching at least one TRANSFAC consensus also increases with the conservation score (Figure <figr fid="F3">3e</figr>). Altogether, these data provide strong and independent evidence that our method identifies functional yeast regulatory elements by giving them a high conservation score.</p>
					<fig id="F3">
						<title>
							<p>Figure 3</p>
						</title>
						<caption>
							<p>Proportions of 7-mers supported by different types of independent biological data</p>
						</caption>
						<text>
							<p>Proportions of 7-mers supported by different types of independent biological data (<b>(a) </b>known motifs, <b>(b) </b>chromatin-IP, <b>(c) </b>functional enrichment, <b>(d) </b>under/overexpression, <b>(e) </b>TRANSFAC; windows of size 100 were used to construct the figures, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to <it>S. cerevisiae </it>and <it>S. bayanus</it>. Panels (a-e) strongly indicate that the frequency of support increases with conservation score as calculated by FastCompare.</p>
						</text>
						<graphic file="gb-2005-6-2-r18-3"/>
					</fig>
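					<p>Curves such as those in Figure 3 amount to computing, for successive windows of the ranked list, the fraction of 7-mers that are supported by a given data source. A minimal sketch is given below. It assumes non-overlapping windows (the windowing scheme is specified in Materials and methods and may differ), and the function name support_by_rank is hypothetical.</p>

```python
def support_by_rank(supported, window=100):
    """Fraction of supported k-mers in consecutive windows of a ranked list.

    supported: list of booleans ordered by decreasing conservation score,
    True when the k-mer is backed by the independent data source.
    Assumption: windows do not overlap; the last one may be shorter.
    """
    fractions = []
    for start in range(0, len(supported), window):
        chunk = supported[start:start + window]
        fractions.append(sum(chunk) / len(chunk))
    return fractions
```

					<p>A list of fractions that decreases from the first window onwards corresponds to the trend described here, in which support becomes less frequent as the conservation score drops.</p>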
					<p>Closer examination of Figure <figr fid="F3">3a-d</figr> shows that the 400 highest-scoring 7-mers are most strongly supported by independent data. We therefore retain them for further analysis; when possible, we replace them with 8-mers and 9-mers that have higher conservation scores, and we also add the high-scoring 8-mers and 9-mers that have no high-scoring substrings, as described in Materials and methods. This processing yields 398 <it>k</it>-mers (<it>k </it>= 7, 8 and 9).</p>
					<p>Then, for each of these 398 <it>k</it>-mers, we determine the optimal window within the initial 1 kb which maximizes the conservation score (see Materials and methods); we then re-evaluate the functionality of each of the 398 <it>k</it>-mers with the independent biological information described above, using the new conserved sets. The full information for the 398 <it>k</it>-mers is available at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
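					<p>The window search can be illustrated with an exhaustive scan. The sketch below is not the procedure of Materials and methods: it assumes that window boundaries lie on a 100 bp grid (suggested by the window values reported in Table 1), that match positions are given as distances from the ATG, and that the score is the negative base-10 logarithm of the hypergeometric tail probability. The names overlap_score, genes_hit and best_window are hypothetical.</p>

```python
from math import comb, log10


def overlap_score(total, n_a, n_b, overlap):
    """-log10 of the hypergeometric probability of an overlap this large."""
    tail = sum(comb(n_a, i) * comb(total - n_a, n_b - i)
               for i in range(overlap, min(n_a, n_b) + 1))
    return log10(comb(total, n_b)) - log10(tail)


def genes_hit(positions, start, end):
    """Genes with one or more match positions inside [start; end]."""
    return {gene for gene, pos in positions.items()
            if any(p >= start and end >= p for p in pos)}


def best_window(pos_a, pos_b, total, length=1000, step=100):
    """Scan every [start; end] window on a grid of step bp; keep the best.

    pos_a, pos_b: ortholog id mapped to the list of match positions of one
    k-mer in each species (assumed to be distances from the ATG).
    total: number of ortholog pairs.
    Returns (score, start, end); ties keep the first window encountered.
    """
    best = None
    for start in range(0, length, step):
        for end in range(start + step, length + 1, step):
            set_a = genes_hit(pos_a, start, end)
            set_b = genes_hit(pos_b, start, end)
            score = overlap_score(total, len(set_a), len(set_b),
                                  len(set_a.intersection(set_b)))
            if best is None or score > best[0]:
                best = (score, start, end)
    return best
```

					<p>With a 1 kb region and a 100 bp grid the scan evaluates 55 windows per <it>k</it>-mer, so under these assumptions the extra cost for 398 <it>k</it>-mers remains small. The genes hit in both species within the returned window form the new conserved set used for re-evaluation.</p>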
				</sec>
				<sec>
					<st>
						<p>Known regulatory elements</p>
					</st>
					<p>Using known transcription factor binding site motifs, genome-wide <it>in vivo </it>binding data, functional annotation and literature searches, we found at least 27 different known transcription factor binding sites among the 398 highest-scoring <it>k</it>-mers. These regulatory elements, along with their support from independent biological data, are shown in Table <tblr tid="T1">1</tblr>. Some of the best-known binding sites are represented several times within the 398 top-scoring <it>k</it>-mers, in the form of slightly distinct or overlapping sequences (see <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>). Note also that we use very stringent criteria for identifying known binding sites among our predictions. When we matched our predictions to the known motifs published in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> (regular expressions), we predicted 42 out of 53 known motifs (Kellis <it>et al</it>. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> predict exactly the same number of motifs, and essentially the same motifs, but using multiple alignments of four yeast genomes).</p>
					<tbl id="T1">
						<title>
							<p>Table 1</p>
						</title>
						<caption>
							<p>Known regulatory elements obtained when applying FastCompare to <it>S. cerevisiae </it>and <it>S. bayanus</it></p>
						</caption>
						<tblbdy cols="10">
							<r>
								<c ca="left">
									<p>Name</p>
								</c>
								<c ca="left">
									<p>Sequence</p>
								</c>
								<c ca="center">
									<p>Rank</p>
								</c>
								<c ca="center">
									<p>D<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>W<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>U/C</p>
								</c>
								<c ca="center">
									<p>Motif</p>
								</c>
								<c ca="center">
									<p>ChIP</p>
								</c>
								<c ca="center">
									<p>Experiment</p>
								</c>
								<c ca="left">
									<p>Best MIPS enrichment</p>
								</c>
							</r>
							<r>
								<c cspan="10">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Bas1</p>
								</c>
								<c ca="left">
									<p>AAGAGTCA</p>
								</c>
								<c ca="center">
									<p>159</p>
								</c>
								<c ca="center">
									<p>307</p>
								</c>
								<c ca="center">
									<p>[0;500]</p>
								</c>
								<c ca="center">
									<p>1.24</p>
								</c>
								<c ca="center">
									<p>BAS1</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2(1/1)</p>
								</c>
								<c ca="left">
									<p>Amino acid metabolism (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Cbf1</p>
								</c>
								<c ca="left">
									<p>CACGTGA</p>
								</c>
								<c ca="center">
									<p>3</p>
								</c>
								<c ca="center">
									<p>368</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.70</p>
								</c>
								<c ca="center">
									<p>CBF1</p>
								</c>
								<c ca="center">
									<p>CBF1</p>
								</c>
								<c ca="center">
									<p>6(3/3)</p>
								</c>
								<c ca="left">
									<p>Amino acid metabolism (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Ecm22/Upc2</p>
								</c>
								<c ca="left">
									<p>TAAACGA</p>
								</c>
								<c ca="center">
									<p>59</p>
								</c>
								<c ca="center">
									<p>362</p>
								</c>
								<c ca="center">
									<p>[100;500]</p>
								</c>
								<c ca="center">
									<p>1.36</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>11(9/2)</p>
								</c>
								<c ca="left">
									<p>Lipid, fatty-acid and isoprenoid biosynthesis (<it>p </it>&lt; 10<sup>-8</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Fkh1/2</p>
								</c>
								<c ca="left">
									<p>TAAACAAA</p>
								</c>
								<c ca="center">
									<p>88</p>
								</c>
								<c ca="center">
									<p>353</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.73</p>
								</c>
								<c ca="center">
									<p>FKH1</p>
								</c>
								<c ca="center">
									<p>FKH2</p>
								</c>
								<c ca="center">
									<p>2(1/1)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Gcn4</p>
								</c>
								<c ca="left">
									<p>TGACTCA</p>
								</c>
								<c ca="center">
									<p>160</p>
								</c>
								<c ca="center">
									<p>323.5</p>
								</c>
								<c ca="center">
									<p>[0;400]</p>
								</c>
								<c ca="center">
									<p>1.02</p>
								</c>
								<c ca="center">
									<p>GCN4</p>
								</c>
								<c ca="center">
									<p>GCN4</p>
								</c>
								<c ca="center">
									<p>102(76/26)</p>
								</c>
								<c ca="left">
									<p>Amino acid biosynthesis (<it>p </it>&lt; 10<sup>-29</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Gcr1</p>
								</c>
								<c ca="left">
									<p>TGGAAGC</p>
								</c>
								<c ca="center">
									<p>260</p>
								</c>
								<c ca="center">
									<p>663</p>
								</c>
								<c ca="center">
									<p>[600;1000]</p>
								</c>
								<c ca="center">
									<p>1.24</p>
								</c>
								<c ca="center">
									<p>GCR1</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>4(4/0)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Gis1</p>
								</c>
								<c ca="left">
									<p>AAGGGAT</p>
								</c>
								<c ca="center">
									<p>207</p>
								</c>
								<c ca="center">
									<p>402.5</p>
								</c>
								<c ca="center">
									<p>[100;800]</p>
								</c>
								<c ca="center">
									<p>1.31</p>
								</c>
								<c ca="center">
									<p>GIS1</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1(1/0)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Hap4</p>
								</c>
								<c ca="left">
									<p>CCAATCA</p>
								</c>
								<c ca="center">
									<p>114</p>
								</c>
								<c ca="center">
									<p>540</p>
								</c>
								<c ca="center">
									<p>[100;700]</p>
								</c>
								<c ca="center">
									<p>0.83</p>
								</c>
								<c ca="center">
									<p>HAP4</p>
								</c>
								<c ca="center">
									<p>HAP4</p>
								</c>
								<c ca="center">
									<p>3(2/1)</p>
								</c>
								<c ca="left">
									<p>Respiration (<it>p </it>&lt; 10<sup>-15</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Ino4</p>
								</c>
								<c ca="left">
									<p>CATGTGA</p>
								</c>
								<c ca="center">
									<p>177</p>
								</c>
								<c ca="center">
									<p>454</p>
								</c>
								<c ca="center">
									<p>[100;1000]</p>
								</c>
								<c ca="center">
									<p>1.24</p>
								</c>
								<c ca="center">
									<p>INO4</p>
								</c>
								<c ca="center">
									<p>INO4</p>
								</c>
								<c ca="center">
									<p>1(0/1)</p>
								</c>
								<c ca="left">
									<p>Lipid, fatty-acid and isoprenoid metabolism (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Mbp1</p>
								</c>
								<c ca="left">
									<p>ACGCGTC</p>
								</c>
								<c ca="center">
									<p>23</p>
								</c>
								<c ca="center">
									<p>225</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>3.25</p>
								</c>
								<c ca="center">
									<p>MBP1</p>
								</c>
								<c ca="center">
									<p>MBP1</p>
								</c>
								<c ca="center">
									<p>29(18/11)</p>
								</c>
								<c ca="left">
									<p>DNA synthesis and replication (<it>p </it>&lt; 10<sup>-11</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Met31</p>
								</c>
								<c ca="left">
									<p>TGTGGCG</p>
								</c>
								<c ca="center">
									<p>302</p>
								</c>
								<c ca="center">
									<p>424</p>
								</c>
								<c ca="center">
									<p>[100;1000]</p>
								</c>
								<c ca="center">
									<p>1.35</p>
								</c>
								<c ca="center">
									<p>MET31</p>
								</c>
								<c ca="center">
									<p>MET31</p>
								</c>
								<c ca="center">
									<p>4(4/0)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Met4</p>
								</c>
								<c ca="left">
									<p>CTGTGGC</p>
								</c>
								<c ca="center">
									<p>362</p>
								</c>
								<c ca="center">
									<p>500</p>
								</c>
								<c ca="center">
									<p>[100;800]</p>
								</c>
								<c ca="center">
									<p>1.08</p>
								</c>
								<c ca="center">
									<p>MET4</p>
								</c>
								<c ca="center">
									<p>MET4</p>
								</c>
								<c ca="center">
									<p>1(1/0)</p>
								</c>
								<c ca="left">
									<p>Amino acid metabolism (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Msn2/4</p>
								</c>
								<c ca="left">
									<p>AAAGGGG</p>
								</c>
								<c ca="center">
									<p>49</p>
								</c>
								<c ca="center">
									<p>332</p>
								</c>
								<c ca="center">
									<p>[0;500]</p>
								</c>
								<c ca="center">
									<p>1.92</p>
								</c>
								<c ca="center">
									<p>MSN2/4</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>105(93/12)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Gln3</p>
								</c>
								<c ca="left">
									<p>GATAAGA</p>
								</c>
								<c ca="center">
									<p>143</p>
								</c>
								<c ca="center">
									<p>434</p>
								</c>
								<c ca="center">
									<p>[0;900]</p>
								</c>
								<c ca="center">
									<p>1.23</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>7(7/0)</p>
								</c>
								<c ca="left">
									<p>Nitrogen and sulfur metabolism (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>PAC</p>
								</c>
								<c ca="left">
									<p>GCGATGAG</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="center">
									<p>164.5</p>
								</c>
								<c ca="center">
									<p>[0;400]</p>
								</c>
								<c ca="center">
									<p>6.77</p>
								</c>
								<c ca="center">
									<p>PAC</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>141(28/113)</p>
								</c>
								<c ca="left">
									<p>rRNA transcription (<it>p </it>&lt; 10<sup>-10</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Pdr3</p>
								</c>
								<c ca="left">
									<p>CCGCGGA</p>
								</c>
								<c ca="center">
									<p>357</p>
								</c>
								<c ca="center">
									<p>378</p>
								</c>
								<c ca="center">
									<p>[0;500]</p>
								</c>
								<c ca="center">
									<p>2.34</p>
								</c>
								<c ca="center">
									<p>PDR3</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>18(15/3)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Rap1</p>
								</c>
								<c ca="left">
									<p>TGGGTGT</p>
								</c>
								<c ca="center">
									<p>110</p>
								</c>
								<c ca="center">
									<p>498.5</p>
								</c>
								<c ca="center">
									<p>[100;900]</p>
								</c>
								<c ca="center">
									<p>1.19</p>
								</c>
								<c ca="center">
									<p>RAP1</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>13(1/12)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Reb1</p>
								</c>
								<c ca="left">
									<p>CGGGTAA</p>
								</c>
								<c ca="center">
									<p>1</p>
								</c>
								<c ca="center">
									<p>213</p>
								</c>
								<c ca="center">
									<p>[0;1000]</p>
								</c>
								<c ca="center">
									<p>6.48</p>
								</c>
								<c ca="center">
									<p>REB1</p>
								</c>
								<c ca="center">
									<p>REB1</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Rox1</p>
								</c>
								<c ca="left">
									<p>AACAATAG</p>
								</c>
								<c ca="center">
									<p>77</p>
								</c>
								<c ca="center">
									<p>288.5</p>
								</c>
								<c ca="center">
									<p>[0;500]</p>
								</c>
								<c ca="center">
									<p>2.05</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1(0/1)*</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Rpn4</p>
								</c>
								<c ca="left">
									<p>TTTGCCACC</p>
								</c>
								<c ca="center">
									<p>20</p>
								</c>
								<c ca="center">
									<p>175.5</p>
								</c>
								<c ca="center">
									<p>[0;800]</p>
								</c>
								<c ca="center">
									<p>2.01</p>
								</c>
								<c ca="center">
									<p>RPN4</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>10(10/0)</p>
								</c>
								<c ca="left">
									<p>Cytoplasmic and nuclear degradation (<it>p </it>&lt; 10<sup>-31</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>RRPE</p>
								</c>
								<c ca="left">
									<p>AAAAATTTT</p>
								</c>
								<c ca="center">
									<p>2</p>
								</c>
								<c ca="center">
									<p>188</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>3.04</p>
								</c>
								<c ca="center">
									<p>RRPE</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>167(31/136)</p>
								</c>
								<c ca="left">
									<p>rRNA transcription (<it>p </it>&lt; 10<sup>-16</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Ste12</p>
								</c>
								<c ca="left">
									<p>TGAAACA</p>
								</c>
								<c ca="center">
									<p>282</p>
								</c>
								<c ca="center">
									<p>477</p>
								</c>
								<c ca="center">
									<p>[100;1000]</p>
								</c>
								<c ca="center">
									<p>1.15</p>
								</c>
								<c ca="center">
									<p>STE12</p>
								</c>
								<c ca="center">
									<p>STE12</p>
								</c>
								<c ca="center">
									<p>5(3/2)</p>
								</c>
								<c ca="left">
									<p>fungal cell differentiation (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Sum1/Ndt80</p>
								</c>
								<c ca="left">
									<p>TGACACA</p>
								</c>
								<c ca="center">
									<p>51</p>
								</c>
								<c ca="center">
									<p>385</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>1.32</p>
								</c>
								<c ca="center">
									<p>SUM1</p>
								</c>
								<c ca="center">
									<p>SUM1</p>
								</c>
								<c ca="center">
									<p>1(1/0)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Swi4</p>
								</c>
								<c ca="left">
									<p>CGCGAAA</p>
								</c>
								<c ca="center">
									<p>19</p>
								</c>
								<c ca="center">
									<p>261</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>3.25</p>
								</c>
								<c ca="center">
									<p>SWI4</p>
								</c>
								<c ca="center">
									<p>SWI4</p>
								</c>
								<c ca="center">
									<p>39(22/17)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TATA</p>
								</c>
								<c ca="left">
									<p>TATATAA</p>
								</c>
								<c ca="center">
									<p>18</p>
								</c>
								<c ca="center">
									<p>291</p>
								</c>
								<c ca="center">
									<p>[100;700]</p>
								</c>
								<c ca="center">
									<p>4.70</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>49(40/9)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Ume6</p>
								</c>
								<c ca="left">
									<p>TAGCCGCC</p>
								</c>
								<c ca="center">
									<p>6</p>
								</c>
								<c ca="center">
									<p>457.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>3.92</p>
								</c>
								<c ca="center">
									<p>UME6</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Meiosis (<it>p </it>&lt; 10<sup>-7</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>Xbp1</p>
								</c>
								<c ca="left">
									<p>CCTCGAG</p>
								</c>
								<c ca="center">
									<p>219</p>
								</c>
								<c ca="center">
									<p>348</p>
								</c>
								<c ca="center">
									<p>[0;700]</p>
								</c>
								<c ca="center">
									<p>2.41</p>
								</c>
								<c ca="center">
									<p>XBP1</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>40(34/6)</p>
								</c>
								<c ca="left">
									<p>-</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>For each known regulatory element, we show the best <it>k</it>-mer, its rank within the set of 398 highest-scoring <it>k</it>-mers, the median distance to ATG (for occurrences upstream of genes within the conserved set), the optimal window, the corrected ratio of upstream/coding bias, the best known motif (see Materials and methods), the best chromatin IP (ChIP) enrichment (see Materials and methods), the total (upregulated/downregulated) number of microarray conditions in which the <it>k</it>-mer was found (see Materials and methods), and the best MIPS enrichment. *This sequence was the most significantly over-represented 8-mer in the upstream regions of genes that were downregulated upon overexpression of the <it>Rox1 </it>gene (a known repressor of hypoxia-induced genes under aerobic conditions [95]), as part of a series of microarray experiments measuring <it>S. cerevisiae </it>transcriptional response to various stresses [96].</p>
						</tblfn>
					</tbl>
					<p>Among the 27 different known regulatory elements returned by FastCompare, several (Swi4, Mbp1, Sum1/Ndt80, Fkh1/2) are involved in regulating the yeast cell cycle. The other known sites are also involved in fundamental biological processes in yeast: amino-acid metabolism (Cbf1, Gcn4), meiosis (Ume6), rRNA transcription (PAC and RRPE), proteolytic degradation (Rpn4), stress response (Msn2/Msn4) and general activation/repression (Rap1, Reb1). As described in Materials and methods, our approach also handles gapped motifs. Thus, the binding sites for Abf1, a chromatin reorganizing transcription factor (CGTNNNNNNTGA), and Mcm1, a factor involved in cell-cycle regulation and pheromone response (CCCNNNNNGGA), were also identified as very high-scoring patterns and strongly supported by independent information (known motifs and chromatin immunoprecipitation).</p>
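					<p>To make the notion of a gapped motif concrete, the following minimal Python sketch scans a sequence for a pattern such as CGTNNNNNNTGA on both strands. It is an illustration only and not the FastCompare implementation; the function names and the toy sequence are hypothetical, and N is assumed to stand for any of the four bases.</p>
					<p><![CDATA[
```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]

def pattern_to_regex(pattern):
    """Compile a motif in which N means 'any base' into a regex.

    A lookahead is used so that overlapping matches are all reported.
    """
    body = pattern.upper().replace("N", "[ACGT]")
    return re.compile("(?=(" + body + "))")

def find_gapped(pattern, sequence):
    """Return sorted (start, strand) hits of a gapped pattern in a sequence.

    Strand '-' means that the reverse complement of the pattern matched.
    """
    sequence = sequence.upper()
    hits = set()
    for strand, pat in (("+", pattern), ("-", reverse_complement(pattern))):
        for match in pattern_to_regex(pat).finditer(sequence):
            hits.add((match.start(), strand))
    return sorted(hits)

# Toy upstream sequence, invented for illustration only.
toy = "AAACGTACGTACTGATTT"
print(find_gapped("CGTNNNNNNTGA", toy))
```
]]></p>
					<p>Scanning both strands reflects the fact that a binding site can occur in either orientation relative to the gene; whether a given analysis should merge the two strands is a separate design choice.</p>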
					<p>When we used the same independent biological data to evaluate the 400 highest-scoring 7-mers obtained on randomized data, we found only three known binding sites (RRPE, FKH1 and BAS1).</p>
					<p>Several known binding sites are not found among the 398 top-scoring <it>k</it>-mers, perhaps because their transcriptional network has undergone extensive rewiring since the speciation of the two yeasts, or because the corresponding transcription factors regulate few genes. In some cases, the presence of several known sites (clearly identified in terms of independent data) among the full set of 7-mers argues in favor of the rewiring hypothesis. For example, the binding site for the Rcs1 transcription factor, TGCACCC, only appears at the 1,883rd position within the list of ranked 7-mers. Despite its lack of conservation, this site is strongly backed by independent biological information: it is identified as a known motif, it is found in 33 microarray conditions, and its conserved set is significantly enriched in genes annotated with homeostasis of metal ions (<it>p </it>&lt; 10<sup>-5</sup>), which is the known function for Rcs1 <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. Similarly, the known binding sites for the Ace2/Swi5 and Hsf1 transcription factors were clearly identified (in terms of independent data) within the complete list of 7-mers, but not among the 398 highest scoring <it>k</it>-mers.</p>
				</sec>
				<sec>
					<st>
						<p>Positional constraints</p>
					</st>
					<p>It is now known that functional regulatory elements can be positionally constrained, relative to other regulatory elements or to the start of transcription <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. To assess whether some of the predicted regulatory elements are positionally constrained in yeast, we calculated the median distance to ATG for the conserved sets of each of the 398 <it>k</it>-mers. Independently, we built the distribution of median distances to ATG for all 7-mers as described in Materials and methods (the distribution is shown in Figure <figr fid="F4">4</figr>) and found <it>d</it><sub>0.025 </sub>= 350 and <it>d</it><sub>0.975 </sub>= 680. In other words, a median distance to ATG below 350 bp, or above 680 bp, should each arise by chance with a probability of only 2.5%. Among the 398 most conserved <it>k</it>-mers, more than a fifth (86) have their median distance below 350 (<it>p </it>&lt; 10<sup>-52</sup>), while only seven have a median distance greater than 680. A closer examination reveals that a few known sites are particularly constrained. For example, the binding sites for Reb1, PAC, TATA, Swi4, Rpn4, RRPE and Mbp1 are found to be situated relatively close to the start of translation, with a median distance to ATG between 150 and 300 bp. Some of these constraints were also found to be good predictors of gene expression in a recent study <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> (for RPN4, PAC and RRPE, for example). In contrast, binding sites for Met4, Ume6, Hap4, Rap1, Ino4 and Ste12 are found to be situated at a greater median distance, between 400 and 500 bp from ATG.</p>
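					<p>The significance of observing 86 of 398 <it>k</it>-mers below the 2.5% threshold can be checked with a short calculation. The sketch below assumes a simple binomial model in which each <it>k</it>-mer independently falls below <it>d</it><sub>0.025 </sub>with probability 0.025; this model and the function names are our assumptions for illustration, and the exact test used for the reported value is the one described in Materials and methods.</p>
					<p><![CDATA[
```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of exactly k successes."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def binom_upper_tail(k_obs, n, p):
    """P(X >= k_obs) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(k_obs, n + 1))

n_kmers = 398        # k-mers retained in the yeast comparison
tail_prob = 0.025    # chance that a random median falls below d_0.025
n_close = 86         # k-mers with a median distance below 350 bp

# Assumption: independent k-mers, so the count is binomially distributed.
expected = n_kmers * tail_prob
p_value = binom_upper_tail(n_close, n_kmers, tail_prob)
print("expected by chance: %.2f" % expected)
print("log10 P(X >= %d) = %.1f" % (n_close, math.log10(p_value)))
```
]]></p>
					<p>Under this model about ten <it>k</it>-mers would be expected below the threshold by chance, and the upper-tail probability for 86 falls below 10<sup>-52</sup>. Working in log space avoids numerical underflow in the individual terms.</p>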
					<fig id="F4">
						<title>
							<p>Figure 4</p>
						</title>
						<caption>
							<p>Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to <it>S. cerevisiae </it>and <it>S. bayanus</it></p>
						</caption>
						<text>
							<p>Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to <it>S. cerevisiae </it>and <it>S. bayanus</it>. For each 7-mer, a median distance to ATG was calculated using the positions of matches upstream of <it>S. cerevisiae </it>genes within the conserved set for this 7-mer. The 8,170 median distances were then binned into 20-bp bins, and the resulting histogram was smoothed using a normal kernel. The median distances for several known binding sites in <it>S. cerevisiae </it>are also indicated (see Table 1).</p>
						</text>
						<graphic file="gb-2005-6-2-r18-4"/>
					</fig>
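					<p>The binning and smoothing used for Figure <figr fid="F4">4</figr> can be sketched as follows. The 20-bp bin width is taken from the legend; the kernel bandwidth (<it>sigma_bins</it>), the renormalization at the edges and the function names are assumptions made for this illustration, as the legend does not specify them. The example values are a few of the median distances listed in Table 1.</p>
					<p><![CDATA[
```python
import math
from collections import Counter

def bin_counts(values, bin_width=20):
    """Count values per bin; bin i covers [i*bin_width, (i+1)*bin_width)."""
    counts = Counter(int(v // bin_width) for v in values)
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]

def gaussian_smooth(counts, sigma_bins=2.0):
    """Smooth a histogram with a normal kernel.

    Assumption: the bandwidth is expressed in bins, and weights are
    renormalized at each position so that a flat histogram stays flat.
    """
    smoothed = []
    for i in range(len(counts)):
        weights = [math.exp(-0.5 * ((i - j) / sigma_bins) ** 2)
                   for j in range(len(counts))]
        total = sum(weights)
        smoothed.append(sum(w * c for w, c in zip(weights, counts)) / total)
    return smoothed

# Median distances to ATG for a few known sites (values from Table 1).
medians = [213, 188, 175.5, 261, 291, 477, 498.5, 457.5]
hist = bin_counts(medians)
curve = gaussian_smooth(hist)
print(len(hist), max(curve))
```
]]></p>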
				</sec>
				<sec>
					<st>
						<p>Novel predicted regulatory elements</p>
					</st>
					<p>We found many novel motifs among our highest-scoring predictions. For example, we found two strongly conserved motifs, AGGGTAA (rank 17) and TGTAAATA (rank 31), which are situated relatively close to ATG (with a median distance to ATG of 349 and 378.5 bp, respectively) and occur more often in upstream regions than in coding regions (with ratios of 1.95 and 1.83, respectively). Interestingly, TGTAAATA also has a statistically significant 5' to 3' orientation bias (binomial <it>p</it>-value &lt; 10<sup>-7</sup>). However, neither of the two putative sites is supported by independent biological data. Additional expression data may help define their biological role. Other sites, such as CAGCCGC or GCGCCGC, are found upstream of over- or underexpressed genes in many microarray conditions (15 and 6, respectively). While these two sites are similar to the canonical Ume6-binding site, the latter was not found in any microarray conditions (as none of the microarray experiments we used is related to meiosis, the biological process in which Ume6 is known to be involved), suggesting that the two sites are bound by other factors.</p>
				</sec>
				<sec>
					<st>
						<p>Comparing closer and more distant yeast species</p>
					</st>
					<p>We repeated the same analysis on distinct pairs of yeast species other than <it>S. cerevisiae</it>/<it>S. bayanus</it>. We first compared <it>S. cerevisiae </it>and <it>S. paradoxus </it>(a much closer relative of <it>S. cerevisiae</it>) and found 15 of the 27 known motifs we obtained when comparing <it>S. cerevisiae </it>and <it>S. bayanus </it>(results are available at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>). We also compared <it>S. cerevisiae </it>with <it>S. castellii</it>, which is a more distant relative within the <it>Saccharomyces </it>phylogenetic group. <it>S. castellii </it>is interesting in that its upstream regions cannot be globally aligned with those of <it>S. cerevisiae</it>, because of extensive sequence divergence <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Here too we found 15 of the 27 known motifs found in the <it>S. cerevisiae</it>/<it>S. bayanus </it>comparison (results at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>), although they were different from the <it>S. cerevisiae/S. paradoxus </it>conserved motifs. Interesting similarities and differences in conservation were revealed when comparing the known motifs discovered in each comparison. For example, the PAC, RRPE and Mbp1 motifs were found within the highest-scoring <it>k</it>-mers in all three comparisons, hinting at the conserved role of the corresponding proteins. However, the Reb1-binding site, which was found to be highly conserved between <it>S. cerevisiae </it>and <it>S. bayanus </it>(rank 1), is much less conserved between <it>S. cerevisiae </it>and <it>S. castellii </it>(rank 230). This argues for extensive rewiring in the Reb1 transcriptional network in the lineage that led to <it>S. castellii</it>.</p>
				</sec>
				<sec>
					<st>
						<p>Motif interactions</p>
					</st>
					<p>To discover interactions between regulatory elements, we searched for co-conservation of pairs of high-scoring predicted regulatory elements, as described in Materials and methods. Not surprisingly, the most conserved interaction is between RRPE (AAAAATTTT) and PAC (CTCATCGC), with a median distance <it>D </it>= 22 bp <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B13">13</abbr></abbrgrp>. We also find that the Cbf1-binding site (CACGTGA) is strongly co-conserved with the Met4-binding site (CTGTGGC), and that these two sites are separated by a short distance (<it>D </it>= 44.5) in <it>S. cerevisiae</it>. Indeed, it has been shown that the binding of Cbf1 in the vicinity of a very similar sequence (AAACTGTG) enhances the DNA-binding affinity of a Met4-Met28-Met31 complex for this sequence <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, and that the median distance between the above Cbf1 and Met4 sites is small <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>.</p>
					<p>Many of the predicted interactions have not yet been experimentally studied. For example, we found that the highest-scoring Reb1 motif (CGGGTAA) is significantly co-conserved with both the highest-scoring RRPE motif (AAAAATTTT) and the highest-scoring PAC motif (CTCATCGC), with a short median distance between the two sites in both cases (<it>D </it>= 38 and <it>D </it>= 63.5, respectively). The Reb1/RRPE interaction was also discovered independently as a good predictor of expression <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. We also found that Reb1 interacts with the Cbf1 motif (CACGTGA), also at a short median distance (<it>D </it>= 30). An interesting interaction between RRPE and an unknown motif, TGAAGAA, displays a conserved set strongly enriched in translation (<it>p </it>&lt; 10<sup>-11</sup>), while RRPE alone is more strongly enriched in rRNA transcription (<it>p </it>&lt; 10<sup>-14</sup>). The full sorted list of interactions is available at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
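					<p>A median distance <it>D </it>between two motifs can be computed along the following lines. This is a hedged sketch rather than the procedure of Materials and methods: it assumes exact matches on a single strand, start-to-start distances, and the closest pair of occurrences when a motif occurs several times in one upstream region. The function names and the toy sequences are hypothetical.</p>
					<p><![CDATA[
```python
from statistics import median

def positions(motif, seq):
    """Start positions of all (possibly overlapping) exact matches."""
    found, start = [], seq.find(motif)
    while start != -1:
        found.append(start)
        start = seq.find(motif, start + 1)
    return found

def pair_median_distance(motif_a, motif_b, upstream_seqs):
    """Median, over sequences containing both motifs, of the closest
    start-to-start distance between an occurrence of each motif.

    Returns None when no sequence contains both motifs.
    """
    closest = []
    for seq in upstream_seqs:
        pos_a, pos_b = positions(motif_a, seq), positions(motif_b, seq)
        if pos_a and pos_b:
            closest.append(min(abs(a - b) for a in pos_a for b in pos_b))
    return median(closest) if closest else None

RRPE, PAC = "AAAAATTTT", "CTCATCGC"
# Toy upstream regions, invented for illustration only.
toy_seqs = [
    RRPE + "G" * 13 + PAC,   # both motifs present
    PAC + "G" * 20 + RRPE,   # both present, opposite order
    RRPE + "G" + PAC,        # both present, adjacent
    "GGGG" + RRPE + "GGGG",  # RRPE only, ignored
]
print(pair_median_distance(RRPE, PAC, toy_seqs))
```
]]></p>
					<p>Using the median rather than the mean keeps the summary robust to a few upstream regions in which the two sites happen to lie far apart.</p>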
				</sec>
			</sec>
			<sec>
				<st>
					<p>Worms</p>
				</st>
				<p>In contrast to yeast, relatively little is known about <it>cis</it>-regulatory sequences in <it>C. elegans</it>. There is a dramatically greater complexity of transcriptional regulation in multicellular organisms. Indeed, transcription factors in multicellular organisms regulate cohorts of genes in different tissues and at different times during development <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. <it>C. elegans </it>promoter regions often contain many domains of activation/repression and, as a result, are much larger than those in yeast.</p>
				<p>We applied FastCompare to the genomes of <it>C. elegans </it>and <it>C. briggsae</it>, two worms that diverged about 50-120 million years ago <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The two species share 13,046 orthologous open reading frames (ORFs), and here we have considered only the 2,000-bp upstream regions. It takes approximately 11 minutes for FastCompare to process the corresponding 50 Mbp of sequences and calculate a conservation score for all 7-, 8- and 9-mers on a typical desktop PC.</p>
				<sec>
					<st>
						<p>Validations</p>
					</st>
					<p>The distribution of conservation scores for all 7-mers shows that high conservation scores are unlikely to be obtained by chance (Figure <figr fid="F5">5a</figr>). As shown in Figure <figr fid="F5">5a</figr>, many known regulatory elements fall on the tail of the distribution. We then used functional categories, over- or underexpression, and TRANSFAC motifs to assess the ability of FastCompare to predict functional regulatory elements. Figure <figr fid="F5">5b-d</figr> shows that support for the highest-scoring <it>k</it>-mers by functional enrichment, expression and TRANSFAC strongly increases with conservation score. We have only retained the 400 highest-scoring 7-mers, which are particularly well supported by independent biological information as shown in Figure <figr fid="F5">5b,c</figr>. Starting from these 400 highest-scoring 7-mers, we obtain 437 <it>k</it>-mers (<it>k </it>= 7, 8 or 9) using the procedure described in Materials and methods.</p>
					<fig id="F5">
						<title>
							<p>Figure 5</p>
						</title>
						<caption>
							<p>Validation of the conservation scores obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it></p>
						</caption>
						<text>
							<p>Validation of the conservation scores obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it>. <b>(a) </b>Distributions of conservation scores for actual (red) and randomized (black) data, showing that high conservation scores are unlikely to be obtained by chance. Conservation scores for some known regulatory elements are also indicated. Both distributions were constructed using bin sizes of 5, and the top portion of the figure is not shown for the purpose of presentation. <b>(b-d) </b>Proportion of 7-mers supported by different types of independent biological data (using windows of size 100, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it>. (b-d) indicate that the frequency of support increases with conservation score as calculated by FastCompare.</p>
						</text>
						<graphic file="gb-2005-6-2-r18-5"/>
					</fig>
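					<p>The curves in Figure <figr fid="F5">5b-d</figr> summarize, for successive windows of ranked 7-mers, the proportion supported by independent data. A minimal sketch of that summary is given below; it assumes consecutive, non-overlapping windows of 100 ranks (the legend gives only the window size), and the function name and the toy support flags are hypothetical.</p>
					<p><![CDATA[
```python
def support_by_window(supported, window=100):
    """Fraction of supported k-mers in consecutive rank windows.

    `supported` is a list of booleans ordered by conservation-score rank
    (best first). Assumption: windows do not overlap.
    """
    fractions = []
    for start in range(0, len(supported), window):
        chunk = supported[start:start + window]
        fractions.append(sum(chunk) / len(chunk))
    return fractions

# Toy flags, invented for illustration: support is denser at the top ranks.
flags = [True] * 50 + [False] * 50 + [False] * 100
print(support_by_window(flags))
```
]]></p>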
				</sec>
				<sec>
					<st>
						<p>Known regulatory elements</p>
					</st>
					<p>As shown in Table <tblr tid="T2">2</tblr>, at least 15 distinct known binding sites in <it>C. elegans </it>and other metazoan organisms were identified among the 437 predicted regulatory elements.</p>
					<tbl id="T2">
						<title>
							<p>Table 2</p>
						</title>
						<caption>
							<p>Known regulatory elements obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it></p>
						</caption>
						<tblbdy cols="9">
							<r>
								<c ca="left">
									<p>Sequence</p>
								</c>
								<c ca="center">
									<p>Rank</p>
								</c>
								<c ca="center">
									<p>D<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>W<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>Orientation</p>
								</c>
								<c ca="center">
									<p>U/C</p>
								</c>
								<c ca="center">
									<p>Experiment</p>
								</c>
								<c ca="center">
									<p>TRANSFAC</p>
								</c>
								<c ca="left">
									<p>Comments</p>
								</c>
							</r>
							<r>
								<c cspan="9">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGATAAG</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="center">
									<p>746</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>&#8592; (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
								<c ca="center">
									<p>1.67</p>
								</c>
								<c ca="center">
									<p>103(56/47)</p>
								</c>
								<c ca="center">
									<p>GATA-1, GATA-2</p>
								</c>
								<c ca="left">
									<p>Known GATA factor</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AATCGAT</p>
								</c>
								<c ca="center">
									<p>6</p>
								</c>
								<c ca="center">
									<p>865.5</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.00</p>
								</c>
								<c ca="center">
									<p>14(2/12)</p>
								</c>
								<c ca="center">
									<p>CDP, Clox</p>
								</c>
								<c ca="left">
									<p>Similar to DRE, embryonic development (<it>p </it>&lt; 10<sup>-8</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGACTCAT</p>
								</c>
								<c ca="center">
									<p>8</p>
								</c>
								<c ca="center">
									<p>708</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>&#8594; (<it>p </it>&lt; 10<sup>-4</sup>)</p>
								</c>
								<c ca="center">
									<p>1.40</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>AP-1, GCN4, NF-E2</p>
								</c>
								<c ca="left">
									<p>Known AP-1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GTGTTTGC</p>
								</c>
								<c ca="center">
									<p>9</p>
								</c>
								<c ca="center">
									<p>383.5</p>
								</c>
								<c ca="center">
									<p>[0;800]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.44</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known forkhead-related activator 4</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CACGTGG</p>
								</c>
								<c ca="center">
									<p>16</p>
								</c>
								<c ca="center">
									<p>935</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.73</p>
								</c>
								<c ca="center">
									<p>12(9/3)</p>
								</c>
								<c ca="center">
									<p>Myc/Max, PHO4, USF</p>
								</c>
								<c ca="left">
									<p>Known Myc-Max site in <it>Drosophila</it></p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AAGGTCA</p>
								</c>
								<c ca="center">
									<p>22</p>
								</c>
								<c ca="center">
									<p>882</p>
								</c>
								<c ca="center">
									<p>[0;1400]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.52</p>
								</c>
								<c ca="center">
									<p>35(16/19)</p>
								</c>
								<c ca="center">
									<p>ER, HNF-4</p>
								</c>
								<c ca="left">
									<p>Known HRE</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGACGTC</p>
								</c>
								<c ca="center">
									<p>32</p>
								</c>
								<c ca="center">
									<p>858</p>
								</c>
								<c ca="center">
									<p>[0;1700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.94</p>
								</c>
								<c ca="center">
									<p>1(1/0)</p>
								</c>
								<c ca="center">
									<p>CREB, ATF</p>
								</c>
								<c ca="left">
									<p>Known CREB site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGTCATCA</p>
								</c>
								<c ca="center">
									<p>42</p>
								</c>
								<c ca="center">
									<p>879</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.80</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>Skn-1</p>
								</c>
								<c ca="left">
									<p>Known SKN-1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAGCTGG</p>
								</c>
								<c ca="center">
									<p>56</p>
								</c>
								<c ca="center">
									<p>1093</p>
								</c>
								<c ca="center">
									<p>[100;2000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.67</p>
								</c>
								<c ca="center">
									<p>5(2/3)</p>
								</c>
								<c ca="center">
									<p>AP-4, HEN-1</p>
								</c>
								<c ca="left">
									<p>Known AP-4 and MyoD/CeMyoD site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AGAGAGA</p>
								</c>
								<c ca="center">
									<p>57</p>
								</c>
								<c ca="center">
									<p>893</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>&#8594; (<it>p </it>&lt; 10<sup>-90</sup>)</p>
								</c>
								<c ca="center">
									<p>1.43</p>
								</c>
								<c ca="center">
									<p>4(2/2)</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known GAGA-factor site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GTAAACA</p>
								</c>
								<c ca="center">
									<p>79</p>
								</c>
								<c ca="center">
									<p>818</p>
								</c>
								<c ca="center">
									<p>[0;400]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.69</p>
								</c>
								<c ca="center">
									<p>28(28/0)</p>
								</c>
								<c ca="center">
									<p>Freac, SRY</p>
								</c>
								<c ca="left">
									<p>Known DAF-16 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CCCGCCC</p>
								</c>
								<c ca="center">
									<p>88</p>
								</c>
								<c ca="center">
									<p>535</p>
								</c>
								<c ca="center">
									<p>[0;1400]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.48</p>
								</c>
								<c ca="center">
									<p>1(0/1)</p>
								</c>
								<c ca="center">
									<p>Sp1, GC box</p>
								</c>
								<c ca="left">
									<p>Known Sp1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATCAATCA</p>
								</c>
								<c ca="center">
									<p>100</p>
								</c>
								<c ca="center">
									<p>911</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.93</p>
								</c>
								<c ca="center">
									<p>1(1/0)</p>
								</c>
								<c ca="center">
									<p>Pbx-1</p>
								</c>
								<c ca="left">
									<p>Known Pbx-1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAGGTGA</p>
								</c>
								<c ca="center">
									<p>111</p>
								</c>
								<c ca="center">
									<p>845</p>
								</c>
								<c ca="center">
									<p>[0;200]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.25</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>Lmo2, RAV1</p>
								</c>
								<c ca="left">
									<p>Known Snail site in <it>Drosophila</it></p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TTCGCGC</p>
								</c>
								<c ca="center">
									<p>148</p>
								</c>
								<c ca="center">
									<p>651.5</p>
								</c>
								<c ca="center">
									<p>[0;1200]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.7</p>
								</c>
								<c ca="center">
									<p>16(7/9)</p>
								</c>
								<c ca="center">
									<p>E2F</p>
								</c>
								<c ca="left">
									<p>Known E2F site, embryonic development (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>For each known regulatory element, we show the best <it>k</it>-mer, its rank within the set of 437 highest-scoring <it>k</it>-mers, the median distance to ATG (for occurrences upstream of genes within the conserved set), the optimal window, the orientation bias, the corrected ratio of upstream/coding bias, the total (upregulated/downregulated) number of microarray conditions in which the <it>k</it>-mer was found (see Materials and methods), TRANSFAC matches, and the best GO enrichment.</p>
						</tblfn>
					</tbl>
					<p>One of the most conserved sites is TGATAAG, the binding site for the GATA factors, a family of regulators controlling intestinal development (see <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> for a review). Another motif returned by FastCompare, GTGTTTGC, corresponds to the binding site for the forkhead-related activator-4 (Freac-4) <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. Note that this motif is also compatible with the PHA-4-binding site (published consensus: T[AG]TT[GT][AG][CT] <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>), present in the upstream regions of pharyngeal genes <abbrgrp><abbr bid="B20">20</abbr></abbrgrp> (PHA-4 is also a member of the forkhead family of transcription factors). FastCompare also returned TGTCATCA, the known binding site for the SKN-1 transcription factor (published consensus [AT][AT]T[AG]TCAT). In <it>C. elegans</it>, SKN-1 is known to initiate mesendodermal development by inducing expression of the GATA factors MED-1 and MED-2 (required for mesendodermal differentiation in the EMS lineage) <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
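					<p>Compatibility between a predicted <it>k</it>-mer and a published consensus written with bracketed alternatives can be checked mechanically. The sketch below is an illustration under our own assumptions, not the matching procedure of Materials and methods: it slides the <it>k</it>-mer (and its reverse complement) along the consensus without gaps and reports the longest overlap in which every aligned base is allowed. The function names are hypothetical.</p>
					<p><![CDATA[
```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def parse_consensus(consensus):
    """Split e.g. 'T[AG]TT' into one set of allowed bases per position."""
    tokens = re.findall(r"\[[ACGT]+\]|[ACGT]", consensus)
    return [set(tok.strip("[]")) for tok in tokens]

def best_overlap(kmer, consensus):
    """Longest gap-free overlap (in bp) in which every aligned k-mer base
    is allowed by the consensus, trying both strands of the k-mer."""
    sets = parse_consensus(consensus)
    best = 0
    for seq in (kmer, reverse_complement(kmer)):
        # offset = position of the first consensus column relative to seq[0]
        for offset in range(-len(sets) + 1, len(seq)):
            lo = max(0, offset)
            hi = min(len(seq), offset + len(sets))
            pairs = [(seq[i], sets[i - offset]) for i in range(lo, hi)]
            if all(base in allowed for base, allowed in pairs):
                best = max(best, len(pairs))
    return best

print(best_overlap("GTGTTTGC", "T[AG]TT[GT][AG][CT]"))  # PHA-4 consensus
print(best_overlap("TGTCATCA", "[AT][AT]T[AG]TCAT"))    # SKN-1 consensus
```
]]></p>
					<p>For the two pairs quoted above, this check finds that the full 7-bp PHA-4 consensus is contained in GTGTTTGC, while TGTCATCA shares a 6-bp overlap (TGTCAT) with the 3' end of the SKN-1 consensus.</p>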
					<p>The GAGA-factor binding site (AGAGAGA) was also found as a highly conserved pattern. GAGA repeats in upstream regions have been shown to be functional in <it>C. elegans </it>in at least two separate studies <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. At least one GAGA-binding protein has been identified in <it>D. melanogaster</it>, and is assumed to create nucleosome-free regions of DNA, thus allowing additional transcription factors to bind those regions <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. However, the ortholog of this protein has not yet been identified in <it>C. elegans </it><abbrgrp><abbr bid="B24">24</abbr></abbrgrp>.</p>
					<p>We also found CAGCTGG, a site known to be bound by the myogenic basic helix-loop-helix (bHLH) family of transcription factors (in worms, flies and mammals) and AP-4 transcription factors (in mammals) <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp> (published consensus CAGCTG <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr></abbrgrp>). The homolog of human AP-4 was found to be ubiquitously expressed in <it>D. melanogaster </it>and a <it>C. elegans </it>homolog has also been identified <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. FastCompare returned GTAAACA, the known binding site for the DAF-16 transcription factor (published consensus GTAAACA <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr></abbrgrp>). DAF-16, a FOXO-family transcription factor, was shown to influence the rate of aging of <it>C. elegans </it>in response to insulin/insulin-like growth factor-1 signaling <abbrgrp><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>.</p>
					<p>Searching for gapped motifs found few strongly conserved sites. However, when searching for 8-mers with a 5-bp gap, we found that TGGCNNNNNGCCA, the known binding site for nuclear factor I (NFI) <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, had a score comparable to those of the highest-scoring <it>k</it>-mers.</p>
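					<p>A gapped motif such as TGGCNNNNNGCCA can be searched for by treating each N as a wildcard. The sketch below is illustrative only: the helper names and the toy sequence are hypothetical, and it assumes that N stands for any of the four bases. It also checks that the NFI site is its own reverse complement.</p>

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(motif):
    """Reverse-complement a motif, leaving N wildcards as N."""
    return motif.translate(COMPLEMENT)[::-1]


def gapped_positions(motif, sequence):
    """Return start offsets of a gapped motif, where N matches any base."""
    pattern = re.compile("(?=" + motif.replace("N", "[ACGT]") + ")")
    return [m.start() for m in pattern.finditer(sequence)]


nfi = "TGGCNNNNNGCCA"
toy_upstream = "AATGGCATCGAGCCATT"  # hypothetical sequence, not real data

print(gapped_positions(nfi, toy_upstream))
print(reverse_complement(nfi) == nfi)
```

					<p>Because the NFI site is palindromic, a search on one strand already covers both orientations; for non-palindromic gapped motifs the reverse complement would have to be searched as well.</p>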
					<p>Several of the <it>C. elegans </it>sites returned by FastCompare and shown in Table <tblr tid="T2">2</tblr> are known to be functional transcription factor binding sites in other species. For example, TGACTCAT, identical to the AP-1-binding site <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, is known to be bound in yeast (by Gcn4), <it>Drosophila </it><abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, mouse and human (see <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> for a review).</p>
					<p>FastCompare also returned the CACGTGG motif, which is the binding site for the Myc/Max complex, a family of bHLH transcription factors <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. Among the top-scoring motifs in Table <tblr tid="T2">2</tblr>, we also found AAGGTCA, the hormone response element (HRE), bound by several transcription factors in human, mouse, fruit fly and silkworm (published consensus [CT]CAAGG[CT]C[AG] <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp>); TGACGTC, the cAMP response element (published consensus TGACGTCA <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>); CCCGCCC, the binding site for the mammalian Sp1 transcription factor (known consensus CCCCGCCCC); and ATCAATCA, the known binding site for the human proto-oncogene Pbx-1 <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>. A similar site, ATCAATTA, has been shown to be bound <it>in vitro </it>by the <it>Drosophila </it>homolog of Pbx-1, the extradenticle (exd) protein <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>. Moreover, CEH-20C was identified as the <it>C. elegans </it>homolog of both Pbx-1 and exd. Other known sites discovered by FastCompare include CAGGTGA, similar to the known binding site for the Snail protein, a transcription factor involved in dorso-ventral pattern formation in <it>Drosophila </it>(published consensus [AG][AT][AG]ACAGGTG[CT]AC <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>), and TTCGCGC, the known binding site for the E2F proteins, a family of transcription factors involved in regulating the cell cycle in <it>Drosophila </it>and mammals (published consensus TTTCGCGC <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>). An E2F homolog has been identified in <it>C. elegans </it>and recently shown to be involved in cell-cycle regulation <abbrgrp><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr></abbrgrp>.</p>
				</sec>
				<sec>
					<st>
						<p>Position and orientation biases</p>
					</st>
					<p>As in yeast, several of the known binding sites in <it>C. elegans </it>appear to be constrained in terms of position. Using the distribution of median distances for all 7-mers (see Materials and methods), we found <it>d</it><sub>0.025 </sub>= 690 and <it>d</it><sub>0.975 </sub>= 1,135. Among the 437 highest-scoring <it>k</it>-mers, 75 (about 17%) have a median distance below the lower threshold, a proportion that is much higher than the expected 2.5% (<it>p </it>&lt; 10<sup>-38</sup>). The binding sites for forkhead-related activator-4 (Freac-4), Sp1, E2F and AP-1 are particularly constrained (see Figure <figr fid="F6">6</figr>). We found only 21 <it>k</it>-mers with a median distance beyond the upper <it>d</it><sub>0.975 </sub>threshold. Interestingly, the most conserved <it>k</it>-mer among these 21, CCACCAGGA (rank 96), is found in the upstream regions of over- or underexpressed genes in 57 microarray conditions.</p>
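					<p>The comparison of 75 out of 437 <it>k</it>-mers with an expected proportion of 2.5% can be illustrated with an exact binomial tail. This is a sketch under the assumption that a one-sided binomial test is appropriate here; the test actually used is the one described in Materials and methods, and the function name is hypothetical.</p>

```python
from math import comb


def binomial_upper_tail(successes, trials, num, den):
    """Exact P(X >= successes) for X ~ Binomial(trials, num/den).

    The probability is passed as the fraction num/den so that the sum can
    be carried out in exact integer arithmetic before the final division.
    """
    favourable = sum(
        comb(trials, k) * num**k * (den - num) ** (trials - k)
        for k in range(successes, trials + 1)
    )
    return favourable / den**trials


# 75 of the 437 highest-scoring k-mers fall below d_0.025; 2.5% = 1/40.
p_value = binomial_upper_tail(75, 437, 1, 40)
print(f"expected count: {437 / 40:.1f}, observed: 75, p = {p_value:.2e}")
```

					<p>With an expected count of about 11 and an observed count of 75, the tail probability in this sketch is vanishingly small, which is the qualitative point made by the reported bound.</p>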
					<fig id="F6">
						<title>
							<p>Figure 6</p>
						</title>
						<caption>
							<p>Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it></p>
						</caption>
						<text>
							<p>Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it>. For each 7-mer, a median distance to ATG was calculated using the positions of matches upstream of <it>C. elegans </it>genes within the conserved set for this 7-mer. The 8,170 median distances were then binned into 20-bp bins, and the resulting histogram was smoothed using a normal kernel. The median distances for several known binding sites in <it>C. elegans </it>are also indicated.</p>
						</text>
						<graphic file="gb-2005-6-2-r18-6"/>
					</fig>
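					<p>The construction of the curve in Figure <figr fid="F6">6</figr> (20-bp bins followed by smoothing with a normal kernel) can be sketched as follows. The kernel width is not given in the legend, so the value used here (<it>sigma_bins</it> = 2) is an assumption, as are the function names and the toy distances.</p>

```python
import math
from collections import Counter


def binned_counts(distances, bin_width=20):
    """Count values per bin; bin i covers [i*bin_width, (i+1)*bin_width)."""
    return Counter(int(d // bin_width) for d in distances)


def gaussian_smooth(counts, n_bins, sigma_bins=2.0):
    """Smooth a histogram with a normalised Gaussian kernel (width in bins)."""
    radius = int(3 * sigma_bins)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (j / sigma_bins) ** 2) for j in offsets]
    total = sum(kernel)
    smoothed = []
    for i in range(n_bins):
        value = 0.0
        for offset, weight in zip(offsets, kernel):
            # Bins outside the histogram contribute nothing.
            value += weight / total * counts.get(i + offset, 0)
        smoothed.append(value)
    return smoothed


toy_medians = [10, 15, 30, 250]  # hypothetical median distances to ATG (bp)
counts = binned_counts(toy_medians)
curve = gaussian_smooth(counts, n_bins=20)
print(counts[0], counts[1], counts[12])
print(curve.index(max(curve)))
```

					<p>Because bins beyond the edges are treated as empty, the smoothed curve loses a little mass near position 0; a different edge treatment or bandwidth would shift the curve slightly but not the ranking of strongly constrained sites.</p>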
					<p>Note that for a few predicted elements (for example, CAGGTGA, rank 111), the median distance falls outside the optimal window because, for these elements, the median distance does not correspond to the peak of the distribution of distances to ATG. Hence, for these elements, the optimal window describes the positional bias better than the median distance does.</p>
					<p>Additional analysis reveals that several of the known binding sites discovered in this study are constrained in terms of orientation. For example, the binding site for the GATA-factor(s) (as shown in Table <tblr tid="T2">2</tblr>) is significantly more often found in the 3' to 5' orientation, relative to downstream genes. Probably the most interesting finding is that the GAGA repeats appear to be strongly oriented 3' to 5' relative to their downstream genes. Indeed, 2,375 of the 3,557 AGAGAGA sites (67%) are oriented 3' to 5', a proportion that is much larger than the expected 50% (<it>p </it>&lt; 10<sup>-90</sup>). This bias is confirmed by the fact that TCTCTCT alone (not taking into account its reverse complement) has a much higher conservation score (129.2) than AGAGAGA (34.3). We also found that several related motifs display a similar, albeit weaker, orientation bias, for example, GAAGAAG (<it>p </it>&lt; 10<sup>-16</sup>) and GGAGGAG (<it>p </it>&lt; 10<sup>-10</sup>). It is interesting that all the GAGA repeats found to be necessary for correct expression of the <it>ceh-24 </it>and <it>unc-54 </it>genes are in fact TCTC repeats <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>. The conserved sets for TCTCTCT or AGAGAGA were not found to be enriched in any GO category. Note that this orientation bias is not due to genes with the repeats in their upstream regions being predominantly located on one strand, as these genes are approximately equally distributed between the two strands (1,065/1,122, <it>p </it>= 0.89). Interestingly, conserved GAGA repeats in <it>D. melanogaster </it>were also found to be constrained in terms of orientation, but at a much lower significance (<it>p </it>&lt; 10<sup>-4</sup>, see below). Although it is possible that the TCTC repeats are bound at the 5' untranslated region (UTR) mRNA level, the positional distribution of the conserved AGAGAGA sites does not indicate a strong positional bias with respect to ATG (D<sub>ATG </sub>= 893).</p>
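					<p>The orientation analysis can be sketched in two steps: count a motif and its reverse complement separately in upstream sequences written on the strand of the downstream gene, then compare the split with the 50% expected in the absence of bias. The code below is illustrative only. The function names and toy sequences are hypothetical, and the use of a one-sided exact binomial test is an assumption.</p>

```python
from math import comb

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_overlapping(motif, seq):
    """Count occurrences of motif in seq, allowing overlaps."""
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


def orientation_counts(motif, upstream_seqs):
    """Return (forward, reverse) counts over gene-oriented upstream sequences."""
    rc = reverse_complement(motif)
    forward = sum(count_overlapping(motif, s) for s in upstream_seqs)
    reverse = sum(count_overlapping(rc, s) for s in upstream_seqs)
    return forward, reverse


def upper_tail_half(successes, trials):
    """Exact P(X >= successes) for X ~ Binomial(trials, 1/2)."""
    return sum(comb(trials, k) for k in range(successes, trials + 1)) / 2**trials


# Toy sequences (hypothetical): TCTCTCT is the reverse complement of AGAGAGA.
print(orientation_counts("AGAGAGA", ["TTTCTCTCTCTAA", "GGAGAGAGACC"]))

# Counts reported in the text: site orientation, then gene strand.
print(upper_tail_half(2375, 3557))
print(upper_tail_half(1065, 1065 + 1122))
```

					<p>Under this one-sided assumption the strand split of 1,065 versus 1,122 genes gives a large tail probability, in line with the statement that the genes are evenly distributed between strands, while the 2,375 versus 1,182 split of site orientations gives an extremely small one.</p>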
				</sec>
				<sec>
					<st>
						<p>Novel predicted regulatory elements</p>
					</st>
					<p>FastCompare also returned many novel motifs; some of the most interesting ones are shown in Table <tblr tid="T3">3</tblr>. The top-scoring motif, CTGCGTCT, belongs to this category. A longer version of that motif, TCTGCGTCTCT, was found in a recent study to be necessary for the expression of several ethanol-response genes <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>. However, the very high conservation of this site suggests a broader role. This site was not significantly associated with under- or overexpressed genes in any microarray condition (including the data from <abbrgrp><abbr bid="B47">47</abbr></abbrgrp>). Interestingly, the most conserved <it>k</it>-mer found in yeast, the binding site for the Reb1 protein, had the same property. Moreover, the CTGCGTCT site displays a relatively strong 5' to 3' orientation bias (<it>p </it>&lt; 10<sup>-10</sup>).</p>
					<tbl id="T3">
						<title>
							<p>Table 3</p>
						</title>
						<caption>
							<p>Novel predicted regulatory elements obtained when applying FastCompare to <it>C. elegans </it>and <it>C. briggsae</it></p>
						</caption>
						<tblbdy cols="8">
							<r>
								<c ca="left">
									<p>Sequence</p>
								</c>
								<c ca="center">
									<p>Rank</p>
								</c>
								<c ca="center">
									<p>D<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>W<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>Orientation</p>
								</c>
								<c ca="center">
									<p>U/C</p>
								</c>
								<c ca="center">
									<p>Experiment</p>
								</c>
								<c ca="left">
									<p>Comments</p>
								</c>
							</r>
							<r>
								<c cspan="8">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CTGCGTCT</p>
								</c>
								<c ca="center">
									<p>1</p>
								</c>
								<c ca="center">
									<p>635.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>&#8594; (<it>p </it>&lt; 10<sup>-10</sup>)</p>
								</c>
								<c ca="center">
									<p>2.70</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGACACTCC</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="center">
									<p>234</p>
								</c>
								<c ca="center">
									<p>[0;1500]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.49</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, positive regulation of growth (<it>p </it>&lt; 10<sup>-7</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CTCCGCCC</p>
								</c>
								<c ca="center">
									<p>14</p>
								</c>
								<c ca="center">
									<p>440</p>
								</c>
								<c ca="center">
									<p>[0;900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>3.51</p>
								</c>
								<c ca="center">
									<p>2(2/0)</p>
								</c>
								<c ca="left">
									<p>Unknown site, similar to Sp1</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGAGACC</p>
								</c>
								<c ca="center">
									<p>20</p>
								</c>
								<c ca="center">
									<p>738</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.34</p>
								</c>
								<c ca="center">
									<p>30(7/23)</p>
								</c>
								<c ca="left">
									<p>Unknown site, embryonic development (<it>p </it>&lt; 10<sup>-7</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGCGACGC</p>
								</c>
								<c ca="center">
									<p>23</p>
								</c>
								<c ca="center">
									<p>457</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.34</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATTTCGCAA</p>
								</c>
								<c ca="center">
									<p>29</p>
								</c>
								<c ca="center">
									<p>641</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.50</p>
								</c>
								<c ca="center">
									<p>1(0/1)</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGTAAATC</p>
								</c>
								<c ca="center">
									<p>31</p>
								</c>
								<c ca="center">
									<p>514</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.78</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TTGCGGAC</p>
								</c>
								<c ca="center">
									<p>39</p>
								</c>
								<c ca="center">
									<p>253</p>
								</c>
								<c ca="center">
									<p>[0;1700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.43</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATGATGCAA</p>
								</c>
								<c ca="center">
									<p>44</p>
								</c>
								<c ca="center">
									<p>600</p>
								</c>
								<c ca="center">
									<p>[0;1600]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.88</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGCGCTC</p>
								</c>
								<c ca="center">
									<p>46</p>
								</c>
								<c ca="center">
									<p>576</p>
								</c>
								<c ca="center">
									<p>[0;900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.73</p>
								</c>
								<c ca="center">
									<p>2(0/2)</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGGCGCC</p>
								</c>
								<c ca="center">
									<p>49</p>
								</c>
								<c ca="center">
									<p>770.5</p>
								</c>
								<c ca="center">
									<p>[0;1800]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.01</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown palindromic site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AACCGGTT</p>
								</c>
								<c ca="center">
									<p>50</p>
								</c>
								<c ca="center">
									<p>651</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.41</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown palindromic site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TAAAGGCGC</p>
								</c>
								<c ca="center">
									<p>61</p>
								</c>
								<c ca="center">
									<p>524</p>
								</c>
								<c ca="center">
									<p>[0;700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>8.67</p>
								</c>
								<c ca="center">
									<p>27(12/15)</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGCGCGC</p>
								</c>
								<c ca="center">
									<p>120</p>
								</c>
								<c ca="center">
									<p>455</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>5.40</p>
								</c>
								<c ca="center">
									<p>11(3/8)</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CTAATCC</p>
								</c>
								<c ca="center">
									<p>228</p>
								</c>
								<c ca="center">
									<p>934</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>&#8594; (<it>p </it>&lt; 10<sup>-7</sup>)</p>
								</c>
								<c ca="center">
									<p>1.20</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown homeodomain site, similar to Bicoid</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TACCGTA</p>
								</c>
								<c ca="center">
									<p>242</p>
								</c>
								<c ca="center">
									<p>975</p>
								</c>
								<c ca="center">
									<p>[0;500]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.23</p>
								</c>
								<c ca="center">
									<p>20(18/2)</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>The <it>k</it>-mers shown here were selected from the list of 437 highest-scoring <it>k</it>-mers based on their short median distance to ATG, short optimal window, significant orientation bias, strong over-representation ratio (U/C), presence in upstream regions of over/underexpressed genes in several microarray conditions, palindromicity or resemblance to known sites in other species.</p>
						</tblfn>
					</tbl>
					<p>Several of the other novel predicted regulatory elements in Table <tblr tid="T3">3</tblr> have interesting properties. For example, the fourth most-conserved <it>k</it>-mer, CGACACTCC, is one of the closest motifs to ATG, with a median distance of 234 bp, and its conserved set is strongly enriched in genes involved in positive regulation of growth (a biological process defined in GO as the increase in size or mass of all or part of the worm) (<it>p </it>&lt; 10<sup>-7</sup>). Another predicted regulatory element, CGAGACC (rank 20), is found upstream of downregulated genes in 23 microarray conditions. In particular, it is found upstream of downregulated genes in a study measuring gene-expression changes at several time points during worm aging <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>, in two distinct strains (<it>fer-15 </it>and <it>spe-9;fer-15</it>) and at similar time points (6, 9 and 10 days for <it>fer-15</it>, 9 and 11 days for <it>spe-9;fer-15</it>). In addition, the functional enrichment of its conserved set points to a potential role in embryonic development (<it>p </it>&lt; 10<sup>-7</sup>). Another strongly conserved and novel motif, CTCCGCCC (rank 14), was independently found upstream of almost all transcribed worm microRNA genes in a recent study <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>.</p>
				</sec>
				<sec>
					<st>
						<p>Motif interactions</p>
					</st>
					<p>We found many interactions between the most conserved <it>k</it>-mers identified at the previous stage. For example, the most conserved <it>k</it>-mer, TCTGCGTCT, is very often co-conserved with AGAGAGA. The high-scoring interaction between the DRE-like motif, AATCGAT, and the putative E2F-binding site, TTTTCGC, also appears interesting. Indeed, the conserved sets for the two <it>k</it>-mers are each significantly enriched in genes involved in embryonic development, according to GO (<it>p </it>&lt; 10<sup>-8 </sup>and <it>p </it>&lt; 10<sup>-7</sup>, respectively), and the conserved set of genes having both elements in their upstream regions is even more enriched in this GO category (<it>p </it>&lt; 10<sup>-9</sup>). TTTTCGC also seems to interact with the novel site CGACACTCC, and the corresponding conserved set is enriched in genes involved in modification-dependent protein catabolism (<it>p </it>&lt; 10<sup>-5</sup>). The full list of motif interactions is available at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
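					<p>One way to quantify whether two <it>k</it>-mers are co-conserved more often than expected is to test the overlap of their conserved sets against a hypergeometric null. The sketch below assumes this null model for illustration; the scoring of interactions actually used is the one described in Materials and methods. The gene identifiers, set sizes and function name are hypothetical.</p>

```python
from math import comb


def hypergeom_upper_tail(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b genes are drawn without replacement
    from `universe` genes, of which size_a belong to the first set."""
    total = comb(universe, size_b)
    hits = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return hits / total


# Toy conserved sets for two k-mers (hypothetical gene identifiers).
genes_a = {"g1", "g2", "g3", "g4", "g5"}
genes_b = {"g3", "g4", "g5", "g6"}
both = genes_a.intersection(genes_b)

p_overlap = hypergeom_upper_tail(len(both), len(genes_a), len(genes_b), 20)
print(sorted(both), round(p_overlap, 4))
```

					<p>The same tail probability could in principle be applied to the overlap between a conserved set and a GO category, which is why a joint conserved set can be more strongly enriched than either set alone.</p>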
				</sec>
			</sec>
			<sec>
				<st>
					<p>Flies</p>
				</st>
				<p>We applied FastCompare to the genomes of <it>D. melanogaster </it>and <it>D. pseudoobscura</it>, two species of <it>Drosophila </it>that diverged about 46 million years ago <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>. The number of orthologous ORFs between these two species is 11,306 and here we only consider 2,000-bp upstream regions. Using 5,000 bp instead produced similar results, but also produced additional putative binding sites (results are available at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>). It takes approximately 10 minutes for FastCompare to process the corresponding 45 Mbp of sequences and calculate a conservation score for all 7-mers, 8-mers and 9-mers on a typical desktop PC.</p>
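					<p>The enumeration step behind these timings, in which every <it>k</it>-mer is associated with the genes whose upstream regions contain it, can be sketched as follows. The sketch assumes that a <it>k</it>-mer and its reverse complement are treated as one element, which is consistent with the 8,170 distinct 7-mers reported for Figure <figr fid="F6">6</figr> (at most 4<sup>7</sup>/2 = 8,192). The function names and toy sequences are hypothetical, and this is not the FastCompare implementation.</p>

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def genes_per_kmer(upstream, k):
    """Map each canonical k-mer to the set of genes whose upstream region contains it."""
    index = defaultdict(set)
    for gene, seq in upstream.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer).issubset("ACGT"):
                index[canonical(kmer)].add(gene)
    return index


# Toy upstream regions (hypothetical): geneB carries the reverse complement of geneA.
toy = {"geneA": "TGATAAGC", "geneB": "GCTTATCA"}
index = genes_per_kmer(toy, 7)
print(sorted(index[canonical("TGATAAG")]))
print(len(index))
```

					<p>Building one such index per species in a single pass over the sequences, and then intersecting the gene sets through the ortholog mapping, avoids any alignment of the upstream regions themselves.</p>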
				<sec>
					<st>
						<p>Validations</p>
					</st>
					<p>The distribution of conservation scores shown in Figure <figr fid="F7">7a</figr>, for actual and randomized data, shows once again that the high conservation scores obtained with the real sequences are very unlikely to be achieved by chance. Also, as shown in Figure <figr fid="F7">7a</figr>, many known regulatory elements fall on the tail of the distribution.</p>
					<fig id="F7">
						<title>
							<p>Figure 7</p>
						</title>
						<caption>
							<p>Validation of the conservation scores obtained when applying FastCompare to <it>D. melanogaster </it>and <it>D. pseudoobscura</it></p>
						</caption>
						<text>
							<p>Validation of the conservation scores obtained when applying FastCompare to <it>D. melanogaster </it>and <it>D. pseudoobscura</it>. <b>(a) </b>Distributions of conservation scores for actual (red) and randomized (black) data, showing that high conservation scores are unlikely to be obtained from randomized data. Conservation scores for certain known regulatory elements are also indicated. Both distributions were constructed using bin sizes of 5, and the top portion of the figure is not shown for the purpose of presentation. <b>(b, c) </b>Proportion of 7-mers supported by different types of independent biological data (using windows of size 100, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to <it>D. melanogaster </it>and <it>D. pseudoobscura</it>. (b, c) strongly indicate that the frequency of support increases with conservation score as calculated by FastCompare.</p>
						</text>
						<graphic file="gb-2005-6-2-r18-7"/>
					</fig>
					<p>As for the yeast and worm genomes, we used functional annotations (GO), expression data and known TRANSFAC sites to evaluate the FastCompare predictions. Unfortunately, expression data are often available for only a subset of genes, and their analysis led to very few validations. However, Figure <figr fid="F7">7b,c</figr> clearly shows that functional enrichment of the conserved sets and TRANSFAC matches strongly correlate with conservation score. As with yeast and worms, we focused on the 400 highest-scoring 7-mers, which are particularly well supported by the functional enrichment analysis (see Figure <figr fid="F7">7b</figr>). The simple processing described in Materials and methods yielded 469 <it>k</it>-mers (<it>k </it>= 7, 8 or 9), which we further analyze below.</p>
				</sec>
				<sec>
					<st>
						<p>Known regulatory elements</p>
					</st>
					<p>As shown in Table <tblr tid="T4">4a</tblr>, we found at least 16 distinct known regulatory elements among the 469 highest-scoring <it>k</it>-mers. The most conserved element, AACAGCTG, is similar to the site known to be bound by AP-4 (mammals) and MyoD (worms, flies and mammals). One of the most interesting predictions is TATCGATA (rank 12); this palindromic motif, known as the DNA replication-related element (DRE), has been shown experimentally to be necessary for proper expression of several cell proliferation-related genes in <it>D. melanogaster </it><abbrgrp><abbr bid="B51">51</abbr></abbrgrp> and, more recently, of the genes encoding the TATA-binding protein (TBP) <abbrgrp><abbr bid="B52">52</abbr></abbrgrp> and catalase <abbrgrp><abbr bid="B53">53</abbr></abbrgrp> in the same organism. Interestingly, it is both the motif with the shortest median distance to ATG (D<sub>ATG </sub>= 168) and the most over-represented <it>k</it>-mer (among the 469 highest-scoring ones) in <it>D. melanogaster </it>upstream regions relative to exons, with a ratio of 5.39.</p>
					<tbl id="T4">
						<title>
							<p>Table 4</p>
						</title>
						<caption>
							<p>Known and novel predicted regulatory elements, obtained when applying FastCompare to <it>D. melanogaster </it>and <it>D. pseudoobscura</it></p>
						</caption>
						<tblbdy cols="8">
							<r>
								<c ca="left">
									<p>Sequence</p>
								</c>
								<c ca="center">
									<p>Rank</p>
								</c>
								<c ca="center">
									<p>D<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>W<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>Orientation</p>
								</c>
								<c ca="center">
									<p>U/C</p>
								</c>
								<c ca="center">
									<p>TRANSFAC</p>
								</c>
								<c ca="left">
									<p>Comments</p>
								</c>
							</r>
							<r>
								<c cspan="8">
									<hr/>
								</c>
							</r>
							<r>
								<c cspan="8" ca="left">
									<p><b>(a) </b>Known regulatory elements</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AACAGCTG</p>
								</c>
								<c ca="center">
									<p>1</p>
								</c>
								<c ca="center">
									<p>373</p>
								</c>
								<c ca="center">
									<p>[0;1800]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.64</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known AP-4/MyoD site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATTTGCATA</p>
								</c>
								<c ca="center">
									<p>3</p>
								</c>
								<c ca="center">
									<p>882</p>
								</c>
								<c ca="center">
									<p>[100;2000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>3.20</p>
								</c>
								<c ca="center">
									<p>Oct-1</p>
								</c>
								<c ca="left">
									<p>Known (mammalian) Oct-1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CACGTGC</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="center">
									<p>825.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.02</p>
								</c>
								<c ca="center">
									<p>Myc/Max, PHO4, USF</p>
								</c>
								<c ca="left">
									<p>Known Myc/Max site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATTTATGC</p>
								</c>
								<c ca="center">
									<p>6</p>
								</c>
								<c ca="center">
									<p>866</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>3.52</p>
								</c>
								<c ca="center">
									<p>CdxA</p>
								</c>
								<c ca="left">
									<p>Known CdxA site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGACGTCA</p>
								</c>
								<c ca="center">
									<p>9</p>
								</c>
								<c ca="center">
									<p>825</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.36</p>
								</c>
								<c ca="center">
									<p>CREB</p>
								</c>
								<c ca="left">
									<p>Known CREB site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGATAAG</p>
								</c>
								<c ca="center">
									<p>11</p>
								</c>
								<c ca="center">
									<p>760.5</p>
								</c>
								<c ca="center">
									<p>[0;1100]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.53</p>
								</c>
								<c ca="center">
									<p>GATA</p>
								</c>
								<c ca="left">
									<p>Known GATA site, carbohydrate metabolism (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TATCGATA</p>
								</c>
								<c ca="center">
									<p>12</p>
								</c>
								<c ca="center">
									<p>168</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>5.39</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known DRE site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TTTATGGC</p>
								</c>
								<c ca="center">
									<p>14</p>
								</c>
								<c ca="center">
									<p>978.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.82</p>
								</c>
								<c ca="center">
									<p>Abd-B</p>
								</c>
								<c ca="left">
									<p>Known Abd-B site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TAATTGA</p>
								</c>
								<c ca="center">
									<p>24</p>
								</c>
								<c ca="center">
									<p>907</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.58</p>
								</c>
								<c ca="center">
									<p>Ubx, Athb-1</p>
								</c>
								<c ca="left">
									<p>Known Antp site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GAGAGAG</p>
								</c>
								<c ca="center">
									<p>26</p>
								</c>
								<c ca="center">
									<p>705.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>&#8592; (<it>p </it>&lt; 10<sup>-4</sup>)</p>
								</c>
								<c ca="center">
									<p>1.87</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known GAGA site, morphogenesis (<it>p </it>&lt; 10<sup>-23</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAGGTGC</p>
								</c>
								<c ca="center">
									<p>33</p>
								</c>
								<c ca="center">
									<p>1020.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.83</p>
								</c>
								<c ca="center">
									<p>Sn</p>
								</c>
								<c ca="left">
									<p>Known Snail site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGACTCA</p>
								</c>
								<c ca="center">
									<p>46</p>
								</c>
								<c ca="center">
									<p>911</p>
								</c>
								<c ca="center">
									<p>[100;2000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.89</p>
								</c>
								<c ca="center">
									<p>AP-1, GCN4</p>
								</c>
								<c ca="left">
									<p>Known AP-1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATCAATCA</p>
								</c>
								<c ca="center">
									<p>51</p>
								</c>
								<c ca="center">
									<p>967</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.72</p>
								</c>
								<c ca="center">
									<p>Pbx-1</p>
								</c>
								<c ca="left">
									<p>Known Pbx-1 site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AAGGTCA</p>
								</c>
								<c ca="center">
									<p>93</p>
								</c>
								<c ca="center">
									<p>1015.5</p>
								</c>
								<c ca="center">
									<p>[400;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.16</p>
								</c>
								<c ca="center">
									<p>HNF-4, ER</p>
								</c>
								<c ca="left">
									<p>Known HRE</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AACATGTG</p>
								</c>
								<c ca="center">
									<p>105</p>
								</c>
								<c ca="center">
									<p>994</p>
								</c>
								<c ca="center">
									<p>[100;2000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.62</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known Twist site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GTAAACA</p>
								</c>
								<c ca="center">
									<p>147</p>
								</c>
								<c ca="center">
									<p>813</p>
								</c>
								<c ca="center">
									<p>[0;1200]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.54</p>
								</c>
								<c ca="center">
									<p>Freac, SRY</p>
								</c>
								<c ca="left">
									<p>Known DAF-16 site in <it>C. elegans</it></p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c cspan="8" ca="left">
									<p><b>(b) </b>Novel predicted regulatory elements</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ACACACAC</p>
								</c>
								<c ca="center">
									<p>2</p>
								</c>
								<c ca="center">
									<p>922.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>&#8594; (<it>p </it>&lt; 10<sup>-12</sup>)</p>
								</c>
								<c ca="center">
									<p>1.97</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, embryonic development (<it>p </it>&lt; 10<sup>-9</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAAGGAG</p>
								</c>
								<c ca="center">
									<p>13</p>
								</c>
								<c ca="center">
									<p>1091</p>
								</c>
								<c ca="center">
									<p>[200;2000]</p>
								</c>
								<c ca="center">
									<p>&#8592; (<it>p </it>&lt; 10<sup>-8</sup>)</p>
								</c>
								<c ca="center">
									<p>0.84</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GCACACAC</p>
								</c>
								<c ca="center">
									<p>29</p>
								</c>
								<c ca="center">
									<p>886</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.80</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, histogenesis (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAAGTTCA</p>
								</c>
								<c ca="center">
									<p>30</p>
								</c>
								<c ca="center">
									<p>920</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.23</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TAATTAA</p>
								</c>
								<c ca="center">
									<p>31</p>
								</c>
								<c ca="center">
									<p>871</p>
								</c>
								<c ca="center">
									<p>[500;2000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>3.07</p>
								</c>
								<c ca="center">
									<p>Ftz</p>
								</c>
								<c ca="left">
									<p>Unknown palindromic homeodomain-like site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAACAACA</p>
								</c>
								<c ca="center">
									<p>42</p>
								</c>
								<c ca="center">
									<p>968.5</p>
								</c>
								<c ca="center">
									<p>[200;2000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.22</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, regulation of transcription (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGGCGCC</p>
								</c>
								<c ca="center">
									<p>48</p>
								</c>
								<c ca="center">
									<p>951</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.84</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown palindromic site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CCTGTTGC</p>
								</c>
								<c ca="center">
									<p>111</p>
								</c>
								<c ca="center">
									<p>653</p>
								</c>
								<c ca="center">
									<p>[0;1800]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.90</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GTGTGACC</p>
								</c>
								<c ca="center">
									<p>112</p>
								</c>
								<c ca="center">
									<p>296</p>
								</c>
								<c ca="center">
									<p>[0;1900]</p>
								</c>
								<c ca="center">
									<p>&#8594; (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
								<c ca="center">
									<p>2.22</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CAGGTAG</p>
								</c>
								<c ca="center">
									<p>143</p>
								</c>
								<c ca="center">
									<p>924.5</p>
								</c>
								<c ca="center">
									<p>[0;1700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>0.94</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, cell fate commitment (<it>p </it>&lt; 10<sup>-8</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CACACGCA</p>
								</c>
								<c ca="center">
									<p>145</p>
								</c>
								<c ca="center">
									<p>968.5</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.49</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, cellular morphogenesis (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GTCAACAA</p>
								</c>
								<c ca="center">
									<p>169</p>
								</c>
								<c ca="center">
									<p>904</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.48</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site, similar to DAF-16</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>AAATGGCG</p>
								</c>
								<c ca="center">
									<p>205</p>
								</c>
								<c ca="center">
									<p>592</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.54</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TTGACCCA</p>
								</c>
								<c ca="center">
									<p>239</p>
								</c>
								<c ca="center">
									<p>860</p>
								</c>
								<c ca="center">
									<p>[0;1700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.60</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGACACAC</p>
								</c>
								<c ca="center">
									<p>273</p>
								</c>
								<c ca="center">
									<p>860</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.83</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGTCAAC</p>
								</c>
								<c ca="center">
									<p>281</p>
								</c>
								<c ca="center">
									<p>999</p>
								</c>
								<c ca="center">
									<p>[100;1900]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.55</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Unknown site</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p><b>(a) </b>For each known regulatory element, we show the best <it>k</it>-mer, its rank within the set of 469 highest scoring <it>k</it>-mers, the median distance to ATG (for occurrences upstream of genes within the conserved set), the optimal window, the orientation bias, the corrected ratio of upstream/coding bias, the total (up-regulated/down-regulated) number of microarray conditions in which the <it>k</it>-mer was found (see Method), TRANSFAC matches, and the best GO enrichment. <b>(b) </b>Novel predicted regulatory elements. <it>k</it>-mers shown here were selected from the list of 469 highest scoring <it>k</it>-mers based on their short median distance to ATG, short optimal window, significant orientation bias, strong over-representation ratio (U/C), presence in upstream regions of over- or underexpressed genes in several microarray conditions, palindromicity or resemblance to known sites in other species.</p>
						</tblfn>
					</tbl>
					<p>Several of the other predicted sites are known to be bound by <it>Drosophila </it>transcription factors involved in development. For example, FastCompare predicts TTTATGGC (rank 14) and TAATTGA (rank 24), the binding sites for two homeodomain transcription factors. The first site matches the TRANSFAC consensus binding site for Abd-B ([CG]NTTTATGGC), while the second site is the known consensus binding site for the Antennapedia (Antp) class of homeodomain proteins <abbrgrp><abbr bid="B54">54</abbr></abbrgrp> (TAATTGA matches the TRANSFAC consensus binding site for Ubx, a member of the Antp class). FastCompare also predicts ATTTATGC, a site matching the TRANSFAC consensus binding site for the chicken CdxA protein ([AC]TTTAT[AG]), the homolog of the Caudal protein in <it>D. melanogaster</it>. Also, FastCompare predicts CAGGTGC, the binding site for the Snail repressor/activator protein, a transcription factor required for proper mesodermal development <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>.</p>
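					<p>The consensus comparisons quoted above can be illustrated with a minimal sketch. It assumes that a <it>k</it>-mer matches a consensus when the shorter of the two can be aligned, without gaps, inside the longer one so that the allowed bases agree at every position. This matching rule and the function names <it>parse_consensus</it> and <it>matches</it> are illustrative assumptions, not the TRANSFAC comparison procedure used in this study.</p>
					<p><![CDATA[
```python
import re


def parse_consensus(pattern):
    """Turn a pattern such as '[CG]NTTTATGGC' into a list of allowed-base sets.

    Bracketed groups list alternatives; 'N' stands for any base.
    """
    tokens = re.findall(r"\[[ACGT]+\]|[ACGTN]", pattern)
    allowed = []
    for tok in tokens:
        if tok == "N":
            allowed.append(set("ACGT"))
        else:
            allowed.append(set(tok.strip("[]")))
    return allowed


def matches(kmer, consensus):
    """Return True if the shorter pattern fits somewhere inside the longer one.

    Assumption: only the given strand is tested and no gaps are allowed.
    """
    a = parse_consensus(kmer)
    b = parse_consensus(consensus)
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    for offset in range(len(long_) - len(short) + 1):
        # Two positions agree when their allowed-base sets intersect.
        if all(s & l for s, l in zip(short, long_[offset:])):
            return True
    return False


if __name__ == "__main__":
    print(matches("TTTATGGC", "[CG]NTTTATGGC"))   # Abd-B consensus from the text
    print(matches("ATTTATGC", "[AC]TTTAT[AG]"))   # CdxA consensus from the text
```
]]></p>
					<p>Under this rule, TTTATGGC fits the last eight positions of the Abd-B consensus, and the CdxA consensus fits the first seven positions of ATTTATGC. A fuller comparison would presumably also test the reverse complement.</p>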
					<p>FastCompare also predicts ATTTGCATA (rank 3) as one of the most conserved putative regulatory elements between the two flies. This site is the binding site for the POU-domain family of transcription factors, and it is probably bound by one or several of the three POU-domain transcription factors in <it>Drosophila</it>: DFR, PDM-1 and PDM-2. These three proteins are involved in different stages of <it>Drosophila </it>development: DFR is expressed in midline glia and in tracheal cells <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>, whereas the redundant PDM-1 and PDM-2 are essential for proper neuronal development <abbrgrp><abbr bid="B56">56</abbr></abbrgrp>.</p>
					<p>Many of the known motifs found when comparing the two <it>Drosophila </it>genomes were also found when analyzing the worm genomes. For example, GAGA repeats are found to be strongly conserved, slightly oriented 3' to 5' (<it>p </it>&lt; 10<sup>-4</sup>), and very significantly found upstream of genes involved in morphogenesis (<it>p </it>&lt; 10<sup>-23</sup>). GTAAACA (rank 147), the DAF-16-binding site in <it>C. elegans</it>, is also one of the most conserved sites between the two <it>Drosophila </it>genomes. This site is probably bound by dFOXO, the unique homolog of the <it>C. elegans </it>DAF-16 protein in <it>D. melanogaster </it><abbrgrp><abbr bid="B57">57</abbr></abbrgrp>.</p>
					<p>As for both previous phylogenetic groups (yeasts and worms), the median distances to ATG for the conserved elements show that some of the predicted regulatory elements are severely constrained in terms of position. Among the most constrained <it>k</it>-mers are the DRE site (TATCGATA, D<sub>ATG </sub>= 168) and the known AP-4/MyoD binding site (AACAGCTG, D<sub>ATG </sub>= 373). However, both the optimal windows and the median distances in Table <tblr tid="T4">4a</tblr> show that, compared to previously studied organisms, a smaller number of conserved regulatory elements are constrained. Using the distribution of median distances for all 7-mers, we find that <it>d</it><sub>0.025 </sub>= 798 and <it>d</it><sub>0.975 </sub>= 1,126. Among the 469 highest scoring <it>k</it>-mers, 45 fall below 798 (<it>p </it>&lt; 10<sup>-13</sup>) and 36 above 1,126 (<it>p </it>&lt; 10<sup>-8</sup>), once again suggesting weaker positional constraints than in yeasts and worms, at least when considering the first 2,000 bp of 5' upstream sequences.</p>
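					<p>The enrichment argument above can be sketched numerically. The sketch assumes a binomial null model in which each of the 469 <it>k</it>-mers independently falls below the 2.5% quantile (or above the 97.5% quantile) with probability 0.025. This section does not state which test was used, so the model and the helper names <it>count_outside</it> and <it>binomial_upper_tail</it> are assumptions made for illustration only.</p>
					<p><![CDATA[
```python
from math import comb


def count_outside(distances, low, high):
    """Count median distances strictly below `low` and strictly above `high`."""
    below = sum(1 for d in distances if d < low)
    above = sum(1 for d in distances if d > high)
    return below, above


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term.

    Assumed null model: each of the n k-mers lands in the tail
    independently with probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    n, p = 469, 0.025
    # Under the assumed null, about n * p k-mers are expected in each tail.
    print("expected per tail:", n * p)
    print("P(X >= 45):", binomial_upper_tail(n, 45, p))
    print("P(X >= 36):", binomial_upper_tail(n, 36, p))
```
]]></p>
					<p>Under this assumed model roughly 12 <it>k</it>-mers would be expected in each tail, so the observed counts of 45 and 36 both indicate an excess of positionally constrained elements; the printed tail probabilities can be compared with the bounds reported above.</p>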
				</sec>
				<sec>
					<st>
						<p>Novel predicted regulatory elements</p>
					</st>
					<p>FastCompare predicts many putative regulatory elements in <it>Drosophila </it>that to the best of our knowledge are unknown (Table <tblr tid="T4">4b</tblr>). One of these novel sites, CAGGTAG (rank 143), was found upstream of several genes that are activated before widespread activation of zygotic transcription (which begins during the 14th nuclear cycle), in several <it>Drosophila </it>species <abbrgrp><abbr bid="B58">58</abbr></abbrgrp>; it was also found to be necessary for the early expression of several of these genes (<it>Sxl </it>and <it>sisterlessB</it>) in a subsequent study (J.R. ten Bosch, J.A. Benavides and T.W. Cline, personal communication). It is interesting to see that this particular site is significantly conserved upstream of genes involved in cell fate commitment (<it>p </it>&lt; 10<sup>-8</sup>).</p>
					<p>Some of these sites, such as the palindromic TTAATTA (rank 31), are found much more often in upstream regions than in exons (with an over-representation ratio of 3.07). Others, such as ACACACAC, are found to be significantly enriched upstream of genes in known functional categories (embryonic development, <it>p </it>&lt; 10<sup>-9</sup>). The same site appears to be strongly oriented 5' to 3' (<it>p </it>&lt; 10<sup>-12</sup>). Others, such as GTGTGACC or AAATGGCG, appear to be located closer to ATG than most other sites (D<sub>ATG </sub>= 296 and 592, respectively).</p>
				</sec>
				<sec>
					<st>
						<p>Motif interactions</p>
					</st>
					<p>We found many potential interactions between the most conserved sites discovered by FastCompare. For example, the POU-domain-binding site ATTTGCATA was found to be strongly co-conserved with TAATTGA, the Antp-binding site, and with many other potential homeodomain sites, such as AATAAAT and TAATTAA. The CACA repeats were also found to be co-conserved with several different sites, and in some cases, the set of genes having both sites simultaneously conserved in their upstream regions (conserved sets) was found to be enriched in certain functional categories, for example, ACACACAC and GAGAGAG, regulation of transcription (<it>p </it>&lt; 10<sup>-12</sup>); ACACACAC and TAATTGC (an Antp variant site), embryonic development (<it>p </it>&lt; 10<sup>-5</sup>). The full list of interactions is available at <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
				</sec>
			</sec>
			<sec>
				<st>
					<p>Mammals</p>
				</st>
				<p>The much larger noncoding regions of mammalian genomes present significant challenges for computational motif discovery. Also, many repeat elements (for example, <it>Alu</it>) have colonized mammalian genomes and are likely to be conserved between closely related genomes. The distance between enhancers and the transcriptional start of the genes they regulate can be extremely large, reaching tens of kilobases. Finally, gene predictions and gene boundaries are still largely unverified experimentally for a large number of genes.</p>
				<p>We applied FastCompare to the genomes of <it>H. sapiens </it>and <it>M. musculus</it>, which diverged about 75 million years ago <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>. The number of orthologous ORFs between these two species is 15,983 and again, we have only considered 2,000-bp upstream regions. As in flies, using 5,000-bp instead produced similar results. It takes approximately 15 minutes for FastCompare to process the corresponding 60 Mbp of sequences and calculate a conservation score for all 7-mers, 8-mers and 9-mers on a typical desktop PC.</p>
				<sec>
					<st>
						<p>Validations</p>
					</st>
					<p>Unlike the other genomes considered so far, the output of FastCompare from the mammalian genomes is dominated by GC-rich sequences, probably corresponding to CpG islands (GC-rich regions known to be associated with the promoters of many genes). However, analysis of the FastCompare output yielded the same validations as for other species. Indeed, the distribution of conservation scores obtained on actual and randomized sequences shows that high conservation scores are very unlikely to be obtained by chance (Figure <figr fid="F8">8a</figr>). As with other species, many known regulatory elements are on the tail of the distribution (Figure <figr fid="F8">8a</figr>). Also, as shown in Figure <figr fid="F8">8b-d</figr>, more <it>k</it>-mers are found upstream of over- or underexpressed genes, more <it>k</it>-mers have their conserved set enriched with GO functional categories, and more <it>k</it>-mers match TRANSFAC consensus sites as the conservation score increases.</p>
					<fig id="F8">
						<title>
							<p>Figure 8</p>
						</title>
						<caption>
							<p>Validation of the conservation scores obtained when applying FastCompare to <it>H. sapiens </it>and <it>M. musculus</it></p>
						</caption>
						<text>
							<p>Validation of the conservation scores obtained when applying FastCompare to <it>H. sapiens </it>and <it>M. musculus</it>. <b>(a) </b>Distributions of conservation scores for actual and randomized data, showing that high conservation scores are unlikely to be obtained by chance. Conservation scores for some known regulatory elements are also indicated. Both distributions were constructed using bin sizes of 5, and the top portion of the figure is not shown for the purpose of presentation. <b>(b-d) </b>Proportion of 7-mers supported by different types of independent biological data (using windows of size 100, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to <it>H. sapiens </it>and <it>M. musculus</it>. (b-d) strongly indicate that the frequency of support increases with conservation score as calculated by FastCompare.</p>
						</text>
						<graphic file="gb-2005-6-2-r18-8"/>
					</fig>
					<p>We found that masking <it>Alu </it>repeats did not influence the output of FastCompare (data not shown). To overcome the overabundance of GC-rich sequences in the FastCompare output, we used longer <it>k</it>-mers as starting points, namely 8-mers instead of 7-mers. We started with the 600 highest-scoring 8-mers, and replaced each of these 8-mers by one of its substrings (7-mer) or one of its superstrings (9-mer) when that string had a higher conservation score. We then removed duplicates in the list and added the high-scoring 9-mers that had no substrings within the list. This procedure yielded 284 <it>k</it>-mers (<it>k </it>= 7, 8, 9). Subsequent validation was limited to this small set of high-scoring predictions.</p>
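					<p>The list-building procedure described above can be sketched as follows. Several details are not specified in the text and are filled in here as labeled assumptions: the highest-scoring candidate among an 8-mer, its two 7-mer substrings and its eight 9-mer superstrings is kept; ties favor the 8-mer; the cutoff that makes a 9-mer "high-scoring" is a free parameter (<it>min_score_9mer</it>); and reverse complements are ignored. The function names are hypothetical.</p>
					<p><![CDATA[
```python
def neighbors(kmer8):
    """Return the two 7-mer substrings and eight 9-mer superstrings of an 8-mer."""
    subs = [kmer8[:-1], kmer8[1:]]
    supers = [b + kmer8 for b in "ACGT"] + [kmer8 + b for b in "ACGT"]
    return subs + supers


def refine(scores, n_top=600, min_score_9mer=None):
    """Build a refined k-mer list from a dict mapping k-mer -> conservation score.

    Assumptions (not stated in the text): the best-scoring neighbor replaces
    the 8-mer, ties keep the 8-mer, and `min_score_9mer` defines which
    9-mers count as high-scoring in the final step.
    """
    top8 = sorted((k for k in scores if len(k) == 8),
                  key=scores.get, reverse=True)[:n_top]
    selected, seen = [], set()
    for kmer in top8:
        candidates = [kmer] + [c for c in neighbors(kmer) if c in scores]
        best = max(candidates, key=scores.get)  # first maximum wins on ties
        if best not in seen:                    # duplicate removal
            seen.add(best)
            selected.append(best)
    if min_score_9mer is not None:
        nine = sorted((k for k in scores if len(k) == 9),
                      key=scores.get, reverse=True)
        for kmer in nine:
            if scores[kmer] < min_score_9mer:
                break
            if kmer in seen:
                continue
            # All 7-mer and 8-mer substrings of this 9-mer.
            subs = {kmer[i:i + length]
                    for length in (7, 8)
                    for i in range(9 - length + 1)}
            if not subs & seen:
                seen.add(kmer)
                selected.append(kmer)
    return selected


if __name__ == "__main__":
    toy = {"AAAAAAAC": 10, "AAAAAAA": 12, "CCCCCCCG": 9,
           "GCCCCCCCG": 11, "TTTTTTTT": 5, "ACGTACGTA": 20}
    print(refine(toy, n_top=3, min_score_9mer=10))
```
]]></p>
					<p>The toy scores are invented for illustration and carry no biological meaning. With real conservation scores, the size of the resulting list depends on the assumed 9-mer cutoff.</p>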
				</sec>
				<sec>
					<st>
						<p>Known regulatory elements</p>
					</st>
					<p>As shown in Table <tblr tid="T5">5a</tblr>, we found 17 distinct known regulatory elements among the 284 highest-scoring <it>k</it>-mers. Among these are the well characterized sites for the Sp1, C/EBP, CREB and Myc/Max proteins or families of proteins. These four sites reside very close to ATG (their median distance to ATG is between 100 and 250 bp), suggesting that the four proteins (or families of proteins) may be involved in intimate interactions with the transcriptional complex. Sp1 is a ubiquitous transcription factor, involved in the basal expression of a large number of genes in mammals (see <abbrgrp><abbr bid="B60">60</abbr></abbrgrp> for review). The CCAAT/enhancer binding protein (C/EBP) has been implicated in the regulation of cell-specific gene expression mainly in hepatocytes, adipocytes and hematopoietic cells (see <abbrgrp><abbr bid="B61">61</abbr></abbrgrp> for review). Both Sp1 and C/EBP are constitutive transcription factors whose presence is necessary for significant induction of a large number of genes <abbrgrp><abbr bid="B62">62</abbr></abbrgrp>. The CRE-binding protein (CREB or CBP) is a transcription factor that binds cyclic AMP (cAMP) response elements (CREs) in the promoters of specific genes, and functions as a co-activator for a large number of other transcription factors (see <abbrgrp><abbr bid="B63">63</abbr></abbrgrp> for review). The Myc/Max heterodimer binds the CACGTG sequence, and also acts as a transcriptional activator (see <abbrgrp><abbr bid="B64">64</abbr></abbrgrp> for review).</p>
					<tbl id="T5">
						<title>
							<p>Table 5</p>
						</title>
						<caption>
							<p>Known and novel predicted regulatory elements, obtained when applying FastCompare to <it>H. sapiens </it>and <it>M. musculus</it></p>
						</caption>
						<tblbdy cols="9">
							<r>
								<c ca="left">
									<p>Sequence</p>
								</c>
								<c ca="center">
									<p>Rank</p>
								</c>
								<c ca="center">
									<p>D<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>W<sub>ATG</sub></p>
								</c>
								<c ca="center">
									<p>Orientation</p>
								</c>
								<c ca="center">
									<p>U/C</p>
								</c>
								<c ca="center">
									<p>Experiment</p>
								</c>
								<c ca="center">
									<p>TRANSFAC</p>
								</c>
								<c ca="left">
									<p>Comments</p>
								</c>
							</r>
							<r>
								<c cspan="9">
									<hr/>
								</c>
							</r>
							<r>
								<c cspan="9" ca="left">
									<p><b>(a) </b>Known regulatory sequences</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CCCGCCC</p>
								</c>
								<c ca="center">
									<p>1</p>
								</c>
								<c ca="center">
									<p>256</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.26</p>
								</c>
								<c ca="center">
									<p>8(7/1)</p>
								</c>
								<c ca="center">
									<p>Sp1, GC box</p>
								</c>
								<c ca="left">
									<p>Known Sp1 site, transcription from pol II promoter (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>GCCCCGCCC</p>
								</c>
								<c ca="center">
									<p>2</p>
								</c>
								<c ca="center">
									<p>165</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>4.64</p>
								</c>
								<c ca="center">
									<p>9(9/0)</p>
								</c>
								<c ca="center">
									<p>Sp1, GC box</p>
								</c>
								<c ca="left">
									<p>Known Sp1 site, variant from above</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CCGGAAG</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="center">
									<p>160.5</p>
								</c>
								<c ca="center">
									<p>[0;700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.37</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>Ets1, Elk1</p>
								</c>
								<c ca="left">
									<p>Known Ets site, RNA metabolism (<it>p </it>&lt; 10<sup>-6</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CACGTGAC</p>
								</c>
								<c ca="center">
									<p>18</p>
								</c>
								<c ca="center">
									<p>122.5</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>4.90</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>USF, GBP, SREBP-1</p>
								</c>
								<c ca="left">
									<p>Known Myc/Max site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TGACGTCA</p>
								</c>
								<c ca="center">
									<p>19</p>
								</c>
								<c ca="center">
									<p>107</p>
								</c>
								<c ca="center">
									<p>[0;1000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>4.24</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>CREB</p>
								</c>
								<c ca="left">
									<p>Known CREB site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGCATGCG</p>
								</c>
								<c ca="center">
									<p>24</p>
								</c>
								<c ca="center">
									<p>132</p>
								</c>
								<c ca="center">
									<p>[0;1600]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>4.26</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known palindromic octamer sequence (POS)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CCAATCAG</p>
								</c>
								<c ca="center">
									<p>37</p>
								</c>
								<c ca="center">
									<p>239</p>
								</c>
								<c ca="center">
									<p>[0;700]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.85</p>
								</c>
								<c ca="center">
									<p>4(0/4)</p>
								</c>
								<c ca="center">
									<p>NF-Y, CCAAT</p>
								</c>
								<c ca="left">
									<p>Known CAAT box and CCAAT enhancer binding protein site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CGGAAGTGA</p>
								</c>
								<c ca="center">
									<p>51</p>
								</c>
								<c ca="center">
									<p>94</p>
								</c>
								<c ca="center">
									<p>[0;1000]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>3.96</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>STAT3</p>
								</c>
								<c ca="left">
									<p>Known GA-binding protein (GABP) site</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CCGCCTC</p>
								</c>
								<c ca="center">
									<p>78</p>
								</c>
								<c ca="center">
									<p>632</p>
								</c>
								<c ca="center">
									<p>[0;500]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>4.26</p>
								</c>
								<c ca="center">
									<p>9(8/1)</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Known insulin response element</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CACGTGG</p>
								</c>
								<c ca="center">
									<p>82</p>
								</c>
								<c ca="center">
									<p>429.5</p>
								</c>
								<c ca="center">
									<p>[0;300]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>2.09</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>USF, Myc-Max</p>
								</c>
								<c ca="left">
									<p>Known Myc/Max site, different from above</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>TAATCCCAG</p>
								</c>
								<c ca="center">
									<p>119</p>
								</c>
								<c ca="center">
									<p>1258</p>
								</c>
								<c ca="center">
									<p>[100;2000]</p>
								</c>
								<c ca="center">
									<p>&#8592; (<it>p </it>&lt; 10<sup>-14</sup>)</p>
								</c>
								<c ca="center">
									<p>7.06</p>
								</c>
								<c ca="center">
									<p>3(1/2)</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="left">
									<p>Similar to Bicoid (<it>Drosophila</it>), RNA processing (<it>p </it>&lt; 10<sup>-5</sup>)</p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>CACCTGC</p>
								</c>
								<c ca="center">
									<p>227</p>
								</c>
								<c ca="center">
									<p>925</p>
								</c>
								<c ca="center">
									<p>[0;600]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.64</p>
								</c>
								<c ca="center">
									<p>1(1/0)</p>
								</c>
								<c ca="center">
									<p>E47, Lmo2</p>
								</c>
								<c ca="left">
									<p>Known ZEB site in vertebrates, Zfh-1 in <it>Drosophila</it></p>
								</c>
							</r>
							<r>
								<c ca="left">
									<p>ATTTGCAT</p>
								</c>
								<c ca="center">
									<p>234</p>
								</c>
								<c ca="center">
									<p>729</p>
								</c>
								<c ca="center">
									<p>[0;300]</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>1.95</p>
								</c>
								<c ca="center">
									<p>-</p>
								</c>
								<c ca="center">
									<p>Oct-1</p>
								</c>
								<c ca="left">
									<p>Known Oct-1 site