<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>gb-2006-7-6-r49</ui>
	<ji>GBJ</ji>
	<fm>
		<dochead>Method</dochead>
		<bibl>
			<title>
				<p>A steganalysis-based approach to comprehensive identification and characterization of functional regulatory elements</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Wang</snm>
					<fnm>Guandong</fnm>
					<insr iid="I1"/>
					<email>gw2@cse.wustl.edu</email>
				</au>
				<au id="A2" ca="yes">
					<snm>Zhang</snm>
					<fnm>Weixiong</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>zhang@cse.wustl.edu</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA</p>
				</ins>
				<ins id="I2">
					<p>Department of Genetics, Washington University, St. Louis, MO 63130, USA</p>
				</ins>
			</insg>
			<source>Genome Biology</source>
			<issn>1465-6906</issn>
			<pubdate>2006</pubdate>
			<volume>7</volume>
			<issue>6</issue>
			<fpage>R49</fpage>
			<url>http://genomebiology.com/2006/7/6/R49</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">16787547</pubid><pubid idtype="doi">10.1186/gb-2006-7-6-r49</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>3</day>
					<month>2</month>
					<year>2006</year>
				</date>
			</rec>
			<revrec>
				<date>
					<day>10</day>
					<month>4</month>
					<year>2006</year>
				</date>
			</revrec>
			<acc>
				<date>
					<day>17</day>
					<month>5</month>
					<year>2006</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>20</day>
					<month>6</month>
					<year>2006</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2006</year>
			<collab>Wang and Zhang; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<shorttitle>
			<p>Steganalysis-based <it>cis</it>-regulatory element identification</p>
		</shorttitle>
		<shortabs>
			<p>WordSpy, a novel, steganalysis-based approach for genome-wide motif-finding, is described and applied to yeast and <it>Arabidopsis </it>promoters, identifying cell-cycle motifs.</p>
		</shortabs>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<p>The comprehensive identification of <it>cis</it>-regulatory elements on a genome scale is a challenging problem. We develop a novel, steganalysis-based approach for genome-wide motif finding, called WordSpy, by viewing regulatory regions as a stegoscript with <it>cis</it>-elements embedded in 'background' sequences. We apply WordSpy to the promoters of cell-cycle-related genes of <it>Saccharomyces cerevisiae </it>and <it>Arabidopsis thaliana</it>, identifying all known cell-cycle motifs with high ranking. WordSpy can discover a complete set of <it>cis</it>-elements and facilitate the systematic study of regulatory networks.</p>
			</sec>
		</abs>
	</fm>
	<meta>
		<classifications>
			<classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
		</classifications>
	</meta>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>The comprehensive identification and characterization of short functional sequence elements has become increasingly important as we begin to elucidate transcriptional regulation on a large scale. Transcriptional regulation involves a complex molecular network. The interaction of transcription factors (TFs) and <it>cis</it>-acting DNA elements determines the expression levels of different genes under various environmental conditions <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Deciphering such a network means inferring regulatory rules that properly explain the expression of different genes in terms of the regulatory elements in their promoters and the presence of TFs <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. Therefore, a complete set of regulatory elements is essential for systematic analysis of transcriptional regulation networks on a genome-wide scale.</p>
			<p>The discovery of <it>cis</it>-regulatory elements in a genome has been a challenging problem for decades. Most widely applied approaches first cluster genes into small groups with similar expression profiles or similar biological functions, and then search for common short sequences (or motifs) in the regulatory regions of the genes in a group. This is based on the assumption that coexpressed genes are more likely to be co-regulated. Many efficient algorithms, including multiple local alignment-based <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>, word enumeration-based <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, and dictionary-based <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, have been developed to search for statistically significant motifs from a small number of sequences. Despite the success of these methods, this approach has noticeable limitations. Computational gene clustering is often inaccurate and subjective, in terms of what similarity measure to use and how many clusters to form. Importantly, many genes belonging to a common pathway may have similar expression patterns, but are not regulated by the same TFs. Furthermore, transcriptional regulation is combinatorial <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, in that a regulatory element needs to combine with various others to function under different conditions. This means that the same motif may appear in the promoters of genes that express or function differently. Therefore, clustering genes into small sets may split the genes containing a particular set of motifs into different clusters, which makes it difficult, if not impossible, to find all regulatory elements <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
			<p>In recent years, comparative genome analysis has been successfully applied to the discovery of regulatory motifs <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>. Taking advantage of sequence conservation in related species, this approach can effectively identify regulatory elements on a genome scale without any prior knowledge of co-regulation or gene function. This approach is limited in some situations, however. First, the species considered in a comparative analysis must be properly diversified evolutionarily. They must be evolutionarily separated long enough to allow nonfunctional elements to diverge. On the other hand, they must not be evolutionarily too far apart from one another so that functional elements remain conserved. For many applications, not many such genomes are available. Second and more important, there exist species-specific regulatory elements, which a comparative genomic method can hardly detect.</p>
			<p>In this paper we propose a novel genome-wide approach to comprehensively identify regulatory elements from a single genome. Instead of clustering genes into groups, we use all the genes of interest together - for instance, the genes related to a particular biological process such as the cell cycle or the genes responding to a particular stress condition. In this approach, we first search for statistically over-represented motifs as completely as possible. We then use additional information, such as the coherency of expression profiles of genes containing a motif and the specificity of a motif to target genes, in order to evaluate the biological relevance of the extracted motifs so as to find truly functional regulatory elements.</p>
			<p>We view this genome-wide motif-finding problem from a perspective of steganography and steganalysis. Steganography is a technique for concealing the existence of information by embedding the messages to be protected in a covertext to create a 'stegoscript' <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Steganalysis is the deciphering of a stegoscript by discovering the hidden message <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. In this approach, we consider the regulatory regions of a genome as though they constituted a stegoscript with over-represented words (that is, regulatory elements) embedded in a covertext (that is, 'background' genomic sequences). We then model the stegoscript with a statistical model - a hidden Markov model <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> - consisting of a dictionary of motifs and a grammar. We progressively learn a series of models that are most likely to have generated the script. The final model is then used to decipher the stegoscript as well as to extract over-represented motifs. On the basis of this novel viewpoint, we have developed an efficient genome-wide motif-finding algorithm called WordSpy that can discover a large number of motifs from a large collection of regulatory sequences. Note that our technical approach of using a dictionary is inspired by the work of Bussemaker <it>et al</it>. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, in which they introduced innovative ideas of segmenting sequences into words and building a dictionary of words from the sequences.</p>
			<p>Our WordSpy method has several salient properties. First of all, by statistically modeling the regulatory regions as stegoscripts, WordSpy aims to discover a complete set of significant motifs. Therefore, instead of being trapped by some pseudo-motifs, for example, over-represented repeats, WordSpy includes them in its model, making it less vulnerable to spurious motifs. Second, WordSpy combines word counting and statistical modeling. It applies word counting to efficiently detect high-frequency words. It then enhances the representation of words by position weight matrices (PWMs) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> to capture degenerate motifs. Third, WordSpy is able to detect discriminatory motifs that can be used to properly separate two sets of sequences. Finally, by incorporating gene-expression information and a genome-wide specificity analysis, we augment the basic algorithm in order to distinguish biologically relevant motifs from spurious ones, making the overall method practical for genome-wide identification of functional <it>cis</it>-regulatory elements, as we will demonstrate here.</p>
			<p>We will first evaluate the method with an English stegoscript and 645 cell-cycle-related genes of <it>Saccharomyces cerevisiae</it>. We will then apply it to identify cell-cycle-related motifs from more than 1,000 genes in the model plant <it>Arabidopsis thaliana</it>. Furthermore, we will apply WordSpy as a discriminative motif-finding algorithm by incorporating TF location information - that is, chromatin immunoprecipitation DNA binding microarray (ChIP-chip) data - and build a dictionary of motifs for each known TF of budding yeast. Finally, we compare WordSpy with a set of existing methods on a benchmark that includes 56 well-curated sets of sequences and motifs in four species <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>.</p>
		</sec>
		<sec>
			<st>
				<p>Results and discussion</p>
			</st>
			<sec>
				<st>
					<p>Stegoscripts and the statistical model</p>
				</st>
				<p>The regulatory regions of a genome encode transcriptional regulatory information using regulatory elements embedded in background sequences. We can thus view the regulatory regions of the genes of interest as a stegoscript, which conceals the secret messages (<it>cis</it>-elements) with some covertext (background sequences). The hidden secret messages are typically more conserved and statistically over-represented than those in the covertext. This is particularly true for genomic regulatory sequences, where a small number of TFs regulate a large number of genes <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, making functional <it>cis</it>-elements over-represented.</p>
				<p>Consider a set of regulatory sequences or a stegoscript <it>S </it>= (<it>S</it><sub>1</sub>,<it>S</it><sub>2</sub>,...,<it>S</it><sub><it>q</it></sub>) where <it>S</it><sub><it>i </it></sub>= (<it>S</it><sub><it>i</it>1</sub><it>S</it><sub><it>i</it>2</sub>...<graphic file="gb-2006-7-6-r49-i1.gif"/>) and <it>l</it><sub><it>i </it></sub>is the length of the <it>i</it>th (<it>i </it>= 1, 2,..., <it>q</it>) sequence. Deciphering the script means annotating the sequences with a series of substrings <it>&#967; </it>= (<it>x</it><sub>1</sub>,<it>x</it><sub>2</sub>,...,<it>x</it><sub><it>t</it></sub>), where <it>x</it><sub><it>j </it></sub>denotes the <it>j</it>th substring with length <it>l</it>(<it>x</it><sub><it>j</it></sub>), which can be a background word or a functional element. In general, a stegoscript is a product of a grammar, by which all possible scripts in the language can be generated by successively rewriting strings according to a set of rules. Therefore, we model the stegoscript statistically. The model captures regulatory motifs and background words in a dictionary, and specifies through a grammar how the motifs and words are used to form the stegoscript. Given the statistical model, <it>&#967; </it>is simply the optimal parse of <it>S </it>using the words in the dictionary.</p>
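				<p>As a minimal sketch of what an optimal parse means, the following Python fragment segments a string with a toy dictionary by dynamic programming, choosing the segmentation with the highest product of word probabilities. The function name best_parse, the toy dictionary, and its probabilities are hypothetical illustrations and not part of WordSpy, whose model also distinguishes motif and background states and allows degenerate words.</p>
				<p><![CDATA[
```python
import math

def best_parse(script, word_probs):
    """Return (log-probability, words) of the most probable segmentation.

    word_probs maps each dictionary word to an assumed probability. Every
    single letter of the script must be in the dictionary so a parse exists.
    """
    n = len(script)
    max_len = max(len(w) for w in word_probs)
    # best[i]: best log-probability of parsing script[:i]
    # back[i]: start position of the last word in that best parse
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = script[start:end]
            if word in word_probs and best[start] > -math.inf:
                score = best[start] + math.log(word_probs[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    # Trace back the chosen word boundaries from the end of the script.
    words = []
    pos = n
    while pos > 0:
        words.append(script[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return best[n], words

# Toy dictionary (assumed values): single bases play the role of background
# words, and ACGCGT stands in for a motif word.
toy_probs = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "ACGCGT": 0.2}
log_prob, parse = best_parse("TTACGCGTA", toy_probs)
```
]]></p>
				<p>In this toy case the parse covers the embedded six-letter word with a single dictionary entry instead of six single-base words, because one word contributes one probability factor where six single bases would contribute six.</p>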
				<p>Accurately capturing the transcriptional mechanism encoded in the regulatory regions requires a complicated grammar, which may not be computationally feasible. To reduce computational complexity, we assume that motifs are used independently. Therefore, we can use a stochastic regular grammar <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, which is equivalent to a hidden Markov model (HMM) <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. Figure <figr fid="F1">1</figr> illustrates the model. Beginning with a start symbol, a motif symbol <it>M </it>is produced with probability <it>P</it><sub><it>M</it></sub>, or a background symbol <it>B </it>is generated with probability <it>P</it><sub><it>B</it></sub>. From <it>M</it>, a degenerate motif <it>W</it><sub><it>i </it></sub>is produced, with probability <graphic file="gb-2006-7-6-r49-i2.gif"/>, from the motif subdictionary, and an exact word <it>w </it>is generated with probability <it>P</it>(<it>w</it>|<it>W</it><sub><it>i</it></sub>). The process for generating a background word from symbol <it>B </it>is similar. The generated word is then appended to the script that has been created so far and the process repeats until the whole script is created.</p>
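				<p>The generative process described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the WordSpy implementation: each degenerate word is represented as a list of per-position base distributions, and the names sample_word, generate_script, toy_motifs, and toy_background, as well as all probabilities, are hypothetical.</p>
				<p><![CDATA[
```python
import random

def sample_word(pwm, rng):
    """Draw an exact word from a degenerate word given as a list of
    per-position {base: probability} dictionaries."""
    return "".join(
        rng.choices(list(col.keys()), weights=list(col.values()))[0]
        for col in pwm
    )

def generate_script(length, p_motif, motifs, background, rng):
    """Append words until the script has at least `length` letters.

    motifs and background are lists of (selection probability, pwm) pairs.
    Returns the script and its hidden annotation as (label, word) pairs.
    """
    pieces, annotation = [], []
    total = 0
    while total < length:
        # Produce the motif symbol M with probability p_motif, else B.
        if rng.random() < p_motif:
            label, subdict = "M", motifs
        else:
            label, subdict = "B", background
        # Pick a degenerate word from the chosen subdictionary.
        weights = [p for p, _ in subdict]
        _, pwm = rng.choices(subdict, weights=weights)[0]
        # Emit an exact word and append it to the script.
        word = sample_word(pwm, rng)
        pieces.append(word)
        annotation.append((label, word))
        total += len(word)
    return "".join(pieces), annotation

# Toy model (assumed values): one single-base background word and one
# six-letter motif whose last position is degenerate (A or T).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_background = [(1.0, [uniform])]
toy_motifs = [(1.0, [{"A": 1.0}, {"C": 1.0}, {"G": 1.0},
                     {"C": 1.0}, {"G": 1.0}, {"A": 0.5, "T": 0.5}])]
```
]]></p>
				<p>Calling generate_script with a seeded random.Random instance yields a reproducible toy stegoscript together with its annotation, which is the information that deciphering tries to recover.</p>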
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>A hidden Markov model for deciphering stegoscripts</p>
					</caption>
					<text>
						<p>A hidden Markov model for deciphering stegoscripts. It consists of two submodels: the 'secret message model' for motifs and the 'covertext model' for background words. The blue boxes with dashed outlines each represent a word node, which is a combination of several position nodes. Node <it>W</it><sub><it>b </it></sub>is a single-base node and always belongs to the covertext model. States <it>S</it>, <it>B</it>, and <it>M </it>do not emit any letter.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-1"/>
				</fig>
				<p>We formally write the model as <it>G </it>= {&#936;, &#920;, <it>I</it>}, where &#936; = {<it>P</it><sub><it>B</it></sub>,<it>P</it><sub><it>M</it></sub>,<graphic file="gb-2006-7-6-r49-i3.gif"/>} is the set of transition probabilities, &#920; = {&#920;<sub><it>b</it></sub>, &#920;<sub>1</sub>, &#920;<sub>2</sub>,..., &#920;<sub><it>n</it></sub>} is a set of emission probabilities corresponding to the motifs and words in a dictionary <it>D </it>= {<it>W</it><sub><it>b</it></sub>,<it>W</it><sub>1</sub>,<it>W</it><sub>2</sub>,...,<it>W</it><sub><it>n</it></sub>}, and <it>I </it>= {<graphic file="gb-2006-7-6-r49-i4.gif"/>|<it>W</it><sub><it>i </it></sub>&#8712; <it>D</it>} is a set of indicators, where</p>
				<p>
					<graphic file="gb-2006-7-6-r49-i5.gif"/>
				</p>
				<p><it>W</it><sub><it>b </it></sub>is the only word in the model that has a single base. As we never consider a word of single base as a functional element, <it>W</it><sub><it>b </it></sub>is always a background word, that is, <graphic file="gb-2006-7-6-r49-i6.gif"/> is always set to 0.</p>
			</sec>
			<sec>
				<st>
					<p>The WordSpy algorithm</p>
				</st>
				<p>The central problem of deciphering a stegoscript is learning the statistical model with which the stegoscript was created. Assume that a stegoscript <it>S </it>was generated from an unknown model &#9001;<it>D</it>*, <it>G</it>*&#9002; of a dictionary <it>D</it>* and a grammar <it>G</it>*. With no prior knowledge of the true model, the maximum likelihood estimate, arg max<sub>&#9001;<it>D'</it>, <it>G'</it>&#9002; </sub><it>P</it>(<it>S</it>|&#9001;<it>D'</it>, <it>G'</it>&#9002;), is a good approximation of &#9001;<it>D</it>*, <it>G</it>*&#9002;. However, it is difficult to directly search for arg max<sub>&#9001;<it>D'</it>, <it>G'</it>&#9002; </sub><it>P</it>(<it>S</it>|&#9001;<it>D'</it>, <it>G'</it>&#9002;), as a large number of words need to be discovered and many unknown parameters need to be optimized. Therefore, we separate the learning process into two phases, 'word sampling' and 'model optimization', and adopt an incremental learning strategy to progressively capture short to long words and gradually build such a model (see Materials and methods).</p>
				<p>The procedure for learning the model and subsequently deciphering the regulatory sequences is shown in Figure <figr fid="F2">2</figr>. The overall algorithm starts with the simplest model &#9001;<it>D</it><sub>1</sub>, <it>G</it><sub>1</sub>&#9002; with only a background word <it>W</it><sub><it>b </it></sub>in <it>D</it><sub>1</sub>. At the <it>k</it>th iteration, the algorithm first runs word sampling to identify all over-represented words of length <it>k</it>. In this process, the algorithm scans the script <it>S </it>once to tabulate all the words of length <it>k </it>in <it>S </it>and their occurrences using a hash table. Every word in the table is then tested against the current best model <graphic file="gb-2006-7-6-r49-i7.gif"/> which contains over-represented motifs shorter than <it>k</it>. A word is considered over-represented if it occurs in <it>S </it>more often than expected by <graphic file="gb-2006-7-6-r49-i7.gif"/>. Furthermore, the newly discovered words will be examined (to separate background words) and clustered, if necessary, to form degenerate preliminary motifs. All new words and motifs will be merged with the current best dictionary <graphic file="gb-2006-7-6-r49-i8.gif"/> to form the next dictionary <it>D</it><sub><it>k</it></sub>. The model is retrofitted to accommodate the new words, leading to the next grammar, <it>G</it><sub><it>k</it></sub>. The new grammar <it>G</it><sub><it>k </it></sub>is then optimized to fit the script. The word statistics are recalculated in the model optimization step and the insignificant words are discarded. The process repeats until the model covers words up to a predefined maximum length.</p>
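				<p>A minimal sketch of the word-sampling step is given below. It tabulates all words of length <it>k </it>in one scan using a hash table, as described above, and then scores each word. The scoring is a deliberate simplification: the expected count is taken from an assumed background in which bases are drawn independently, with a binomial standard deviation, whereas WordSpy tests each word against its current best model (see 'Word sampling' in Materials and methods). The function names and the toy sequences are hypothetical.</p>
				<p><![CDATA[
```python
from collections import Counter
import math

def count_words(sequences, k):
    """Tabulate every word of length k in a single scan, using a hash table."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def overrepresented_words(sequences, k, z_threshold):
    """Return {word: z} for words observed more often than expected.

    Simplifying assumption: the expectation comes from independent base
    frequencies, not from a learned dictionary model.
    """
    total_bases = sum(len(s) for s in sequences)
    base_counts = Counter("".join(sequences))
    base_freq = {b: c / total_bases for b, c in base_counts.items()}
    # Number of positions at which a word of length k can start.
    positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    hits = {}
    for word, observed in count_words(sequences, k).items():
        p = math.prod(base_freq[b] for b in word)
        expected = positions * p
        std = math.sqrt(positions * p * (1.0 - p))
        if std > 0:
            z = (observed - expected) / std
            if z >= z_threshold:
                hits[word] = z
    return hits

# Toy input (assumed): three short sequences that share the word ACGCGT.
toy_sequences = ["ACGCGTAA", "TTACGCGT", "GACGCGTC"]
toy_counts = count_words(toy_sequences, 6)
toy_hits = overrepresented_words(toy_sequences, 6, 3.0)
```
]]></p>
				<p>On such a tiny input almost every observed word exceeds the threshold, so the scores are only meaningful on a large collection of sequences. The shared word still receives the highest score because it is the only word counted more than once.</p>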
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Components and flow diagram of WordSpy</p>
					</caption>
					<text>
						<p>Components and flow diagram of WordSpy. Starting with <it>k </it>= 1 and a grammar <it>G</it><sub>0 </sub>with a single word node <it>W</it><sub><it>b </it></sub>in background, the algorithm goes through the following steps, represented by the red numbers on the figure. 1. Model <it>G</it><sub><it>k</it>-1 </sub>is optimized to <graphic file="gb-2006-7-6-r49-i9.gif"/> which contains over-represented motifs shorter than <it>k</it>. 2. Use <graphic file="gb-2006-7-6-r49-i9.gif"/> as a base model to detect over-represented exact words of length <it>k</it>. 3. Choose over-represented words for word clustering. 4. Evaluate all the words. Select and add background words to the background model. On the basis of similarity, cluster the rest of the words to form degenerate preliminary motifs. 5. Add the preliminary motifs to the motif sub-dictionary and create a new grammar <it>G</it><sub><it>k</it></sub>. 6. Optimize <it>G</it><sub><it>k</it></sub>. 7. Apply optimized <graphic file="gb-2006-7-6-r49-i10.gif"/> to decipher the script and locate motifs.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-2"/>
				</fig>
				<p>The classification of real motifs and background words is important to the accuracy of the model. When no extra information is available, we resort to a word significance threshold to select putative motif words. We use the <it>Z</it>-score to quantify the over-representation of a word (see 'Word sampling' section in Materials and methods). If more information is available, such as gene-expression coherence in <it>G</it>-score and target gene specificity in <it>Z</it><sub><it>g</it></sub>-score (see 'Motif evaluation' section in Materials and methods), more accurate classification can be made.</p>
			</sec>
			<sec>
				<st>
					<p>Deciphering an English stegoscript</p>
				</st>
				<p>We evaluated the performance of WordSpy with a stegoscript of English text that contains the first ten chapters (approximately 112,000 letters) of the novel <it>Moby Dick </it>embedded within randomly generated covertext (approximately 156,000 letters). This stegoscript was created by Bussemaker <it>et al</it>. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. We ran WordSpy with different <it>Z</it>-score thresholds to find words up to length 15. WordSpy reached its best performance with a <it>Z</it>-score threshold of 6. With covertext removed, the deciphered text contains 16,522 words. Of the 18,930 words that appear at least twice in the original text, 13,435 (70.9%) are 100% matched to their corresponding deciphered words, and 15,529 (82%) overlap at least 50% with their corresponding deciphered words. Only 761 (4.6%) deciphered words overlap less than 50% with their counterparts in the original text. This result shows that WordSpy can accurately decipher the stegoscript and recover <it>Moby Dick </it>from the covertext with high specificity and sensitivity (see Additional data file 1 for a detailed analysis and more results).</p>
			</sec>
			<sec>
				<st>
					<p>Identifying yeast cell-cycle regulatory motifs</p>
				</st>
				<p>To evaluate the performance of WordSpy on biological sequences, we applied it to discover <it>cis</it>-regulatory elements of cell-cycle-related genes of <it>S. cerevisiae </it><abbrgrp><abbr bid="B19">19</abbr></abbrgrp>. To avoid bias, we first removed homologous genes using WU-BLAST with an E-value threshold of 10<sup>-12</sup>, resulting in 645 genes in the final set. The promoter sequences were retrieved using the RSA tools <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. We compared WordSpy with three other methods that can handle a large number of sequences: MobyDick <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, RSA-tools <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> and Weeder <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. We tuned the parameters of these programs to obtain their best possible performance. The <it>Z</it>-score threshold for WordSpy was set to 3. The whole-genome analysis of the specificity of the motifs, <it>Z</it><sub><it>g</it></sub>-scores, was performed with the promoters of all the genes in <it>S. cerevisiae</it>. We also used the yeast gene expression data collected in <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> to calculate the <it>G</it>-score for each motif. As shown in Table <tblr tid="T1">1</tblr>, all known cell-cycle-related <it>cis</it>-elements were identified with high ranking in either <it>Z</it><sub><it>g</it></sub>-score or <it>G</it>-score. In contrast, MobyDick failed to discover three of them, and RSA-tools and Weeder missed four of them.</p>
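				<p>One simple way to check whether a discovered word is compatible with a known consensus written in IUPAC nucleotide codes (for example, R stands for A or G) is sketched below. This is an illustrative criterion and not necessarily the one used to compile the comparison: the function names and the minimum overlap of five positions are assumptions, and both the word and its reverse complement are tried.</p>
				<p><![CDATA[
```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a word over the alphabet A, C, G, T."""
    return word.translate(COMPLEMENT)[::-1]

def overlap_matches(consensus, word, min_overlap):
    """True if the word agrees with the consensus at some alignment offset
    where at least min_overlap positions overlap."""
    for offset in range(-(len(word) - 1), len(consensus)):
        pairs = [
            (consensus[i + offset], word[i])
            for i in range(len(word))
            if 0 <= i + offset < len(consensus)
        ]
        if len(pairs) >= min_overlap and all(b in IUPAC[c] for c, b in pairs):
            return True
    return False

def matches_known_motif(consensus, word, min_overlap=5):
    """Check the word and its reverse complement against the consensus.
    The default min_overlap of 5 is an assumed, adjustable cut-off."""
    return (overlap_matches(consensus, word, min_overlap)
            or overlap_matches(consensus, reverse_complement(word), min_overlap))
```
]]></p>
				<p>Under this criterion, words such as CCAGC and AGCCAGC are compatible with the consensus RRCCAGCR, whereas ACGCGT is not. A stricter or looser cut-off would change which partial matches are accepted.</p>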
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Identified known motifs in the promoters of 645 yeast cell-cycle genes</p>
					</caption>
					<tblbdy cols="10">
						<r>
							<c ca="left">
								<p>Transcription factors</p>
							</c>
							<c ca="center">
								<p>Known motifs</p>
							</c>
							<c ca="center">
								<p>WordSpy</p>
							</c>
							<c ca="center">
								<p><it>Z</it>-score</p>
							</c>
							<c ca="center">
								<p><it>Z</it><sub><it>g</it></sub>-score</p>
							</c>
							<c ca="center">
								<p><it>G</it>-score</p>
							</c>
							<c ca="center">
								<p>Rank</p>
							</c>
							<c ca="center">
								<p>MobyDick</p>
							</c>
							<c ca="center">
								<p>RSA</p>
							</c>
							<c ca="center">
								<p>Weeder</p>
							</c>
						</r>
						<r>
							<c cspan="10">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Ace2, Swi5</p>
							</c>
							<c ca="left">
								<p>RRCCAGCR [19]</p>
							</c>
							<c ca="left">
								<p>CCAGC(-)</p>
							</c>
							<c ca="center">
								<p>5.4</p>
							</c>
							<c ca="center">
								<p>5.2</p>
							</c>
							<c ca="center">
								<p>0.0363</p>
							</c>
							<c ca="center">
								<p>8/3/29</p>
							</c>
							<c ca="left">
								<p>ACCCGGCTGG</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GCCAGC(+)</p>
							</c>
							<c ca="center">
								<p>5.3</p>
							</c>
							<c ca="center">
								<p>2.6</p>
							</c>
							<c ca="center">
								<p>0.0551</p>
							</c>
							<c ca="center">
								<p>36/4/58</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AGCCAGC(+)</p>
							</c>
							<c ca="center">
								<p>4.6</p>
							</c>
							<c ca="center">
								<p>2.5</p>
							</c>
							<c ca="center">
								<p>0.0688</p>
							</c>
							<c ca="center">
								<p>75/13/199</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CCAGCAAA(-)</p>
							</c>
							<c ca="center">
								<p>4.3</p>
							</c>
							<c ca="center">
								<p>3.5</p>
							</c>
							<c ca="center">
								<p>0.113</p>
							</c>
							<c ca="center">
								<p>107/51/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CCAGCAAG(-)</p>
							</c>
							<c ca="center">
								<p>3.9</p>
							</c>
							<c ca="center">
								<p>2.9</p>
							</c>
							<c ca="center">
								<p>0.0976</p>
							</c>
							<c ca="center">
								<p>185/67/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GCCAGCAA(-)</p>
							</c>
							<c ca="center">
								<p>3.9</p>
							</c>
							<c ca="center">
								<p>3.4</p>
							</c>
							<c ca="center">
								<p>0.1872</p>
							</c>
							<c ca="center">
								<p>124/12/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AGCCAGCA(+)</p>
							</c>
							<c ca="center">
								<p>5.7</p>
							</c>
							<c ca="center">
								<p>2.7</p>
							</c>
							<c ca="center">
								<p>0.0929</p>
							</c>
							<c ca="center">
								<p>189/73/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACCAGC [59, 60]</p>
							</c>
							<c ca="left">
								<p>AACCAGCA(+)</p>
							</c>
							<c ca="center">
								<p>3.8</p>
							</c>
							<c ca="center">
								<p>2.6</p>
							</c>
							<c ca="center">
								<p>0.1983</p>
							</c>
							<c ca="center">
								<p>239/8/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Swi6, Mbp1</p>
							</c>
							<c ca="left">
								<p>ACGCGT [19, 60]</p>
							</c>
							<c ca="left">
								<p>AACGCGT(+)</p>
							</c>
							<c ca="center">
								<p>13.7</p>
							</c>
							<c ca="center">
								<p>11.3</p>
							</c>
							<c ca="center">
								<p>0.1816</p>
							</c>
							<c ca="center">
								<p>1/1/199</p>
							</c>
							<c ca="left">
								<p>AACGCGT</p>
							</c>
							<c ca="left">
								<p>AAACGCGT</p>
							</c>
							<c ca="left">
								<p>ACGCGT</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GACGCGTC(+)</p>
							</c>
							<c ca="center">
								<p>9.3</p>
							</c>
							<c ca="center">
								<p>4.9</p>
							</c>
							<c ca="center">
								<p>0.2106</p>
							</c>
							<c ca="center">
								<p>41/4/867</p>
							</c>
							<c ca="left">
								<p>ACGCGTC</p>
							</c>
							<c ca="left">
								<p>ACGCGTAA</p>
							</c>
							<c ca="left">
								<p>ACGCGTAA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AAACGCGT(+)</p>
							</c>
							<c ca="center">
								<p>14.6</p>
							</c>
							<c ca="center">
								<p>10.2</p>
							</c>
							<c ca="center">
								<p>0.2093</p>
							</c>
							<c ca="center">
								<p>3/5/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AACGCGTC</p>
							</c>
							<c ca="left">
								<p>CGACGCGT</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AACGCGTC(*)</p>
							</c>
							<c ca="center">
								<p>10.8</p>
							</c>
							<c ca="center">
								<p>8.9</p>
							</c>
							<c ca="center">
								<p>0.2003</p>
							</c>
							<c ca="center">
								<p>9/7/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACGCGTCA</p>
							</c>
							<c ca="left">
								<p>GACGCGTA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACGCGTAA(*)</p>
							</c>
							<c ca="center">
								<p>9.6</p>
							</c>
							<c ca="center">
								<p>9.0</p>
							</c>
							<c ca="center">
								<p>0.1341</p>
							</c>
							<c ca="center">
								<p>7/36/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACGCGTCG</p>
							</c>
							<c ca="left">
								<p>AAACGCGT</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACGCGTCA(*)</p>
							</c>
							<c ca="center">
								<p>8.9</p>
							</c>
							<c ca="center">
								<p>7.3</p>
							</c>
							<c ca="center">
								<p>0.1291</p>
							</c>
							<c ca="center">
								<p>15/41/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AACGCGTT</p>
							</c>
							<c ca="left">
								<p>GACGCGTG</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CAACGCGT(+)</p>
							</c>
							<c ca="center">
								<p>6.3</p>
							</c>
							<c ca="center">
								<p>4.0</p>
							</c>
							<c ca="center">
								<p>0.1014</p>
							</c>
							<c ca="center">
								<p>73/59/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AACGCGTA</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Swi4, Swi6</p>
							</c>
							<c ca="left">
								<p>CACGAAA [19, 60]</p>
							</c>
							<c ca="left">
								<p>CACGAAA(*)</p>
							</c>
							<c ca="center">
								<p>4.6</p>
							</c>
							<c ca="center">
								<p>5.7</p>
							</c>
							<c ca="center">
								<p>0.0623</p>
							</c>
							<c ca="center">
								<p>10/17/199</p>
							</c>
							<c ca="left">
								<p>CGCGAAA</p>
							</c>
							<c ca="left">
								<p>ACGCGAAA</p>
							</c>
							<c ca="left">
								<p>ACGCGAAA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACACGAAA(-)</p>
							</c>
							<c ca="center">
								<p>6.6</p>
							</c>
							<c ca="center">
								<p>4.5</p>
							</c>
							<c ca="center">
								<p>0.1081</p>
							</c>
							<c ca="center">
								<p>57/55/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CGCGAAAA</p>
							</c>
							<c ca="left">
								<p>CACGAAAA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CACGAAAA(+)</p>
							</c>
							<c ca="center">
								<p>7.1</p>
							</c>
							<c ca="center">
								<p>5.5</p>
							</c>
							<c ca="center">
								<p>0.1053</p>
							</c>
							<c ca="center">
								<p>32/57/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CACGAAAA</p>
							</c>
							<c ca="left">
								<p>ACACGAAA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CGCGAAA [60]</p>
							</c>
							<c ca="left">
								<p>CGCGAAA(*)</p>
							</c>
							<c ca="center">
								<p>14.9</p>
							</c>
							<c ca="center">
								<p>10.6</p>
							</c>
							<c ca="center">
								<p>0.132</p>
							</c>
							<c ca="center">
								<p>3/2/199</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ACGCGAAA(*)</p>
							</c>
							<c ca="center">
								<p>15.2</p>
							</c>
							<c ca="center">
								<p>10.3</p>
							</c>
							<c ca="center">
								<p>0.1733</p>
							</c>
							<c ca="center">
								<p>1/15/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>CGCGAAAA(+)</p>
							</c>
							<c ca="center">
								<p>17.7</p>
							</c>
							<c ca="center">
								<p>9.4</p>
							</c>
							<c ca="center">
								<p>0.1352</p>
							</c>
							<c ca="center">
								<p>4/34/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Fkh1, Fkh2</p>
							</c>
							<c ca="left">
								<p>GTAAACA [25]</p>
							</c>
							<c ca="left">
								<p>GTAAACA(+)</p>
							</c>
							<c ca="center">
								<p>8.2</p>
							</c>
							<c ca="center">
								<p>7.4</p>
							</c>
							<c ca="center">
								<p>0.084</p>
							</c>
							<c ca="center">
								<p>8/10/199</p>
							</c>
							<c ca="left">
								<p>GTAAACA</p>
							</c>
							<c ca="left">
								<p>GTAAACAA</p>
							</c>
							<c ca="left">
								<p>GTAAACAA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GGTAAACA(+)</p>
							</c>
							<c ca="center">
								<p>7.2</p>
							</c>
							<c ca="center">
								<p>4.6</p>
							</c>
							<c ca="center">
								<p>0.1578</p>
							</c>
							<c ca="center">
								<p>48/21/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ATAAACAA</p>
							</c>
							<c ca="left">
								<p>AATAAACA</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GTAAACAA [60]</p>
							</c>
							<c ca="left">
								<p>GTAAACAA(*)</p>
							</c>
							<c ca="center">
								<p>9</p>
							</c>
							<c ca="center">
								<p>6.6</p>
							</c>
							<c ca="center">
								<p>0.098</p>
							</c>
							<c ca="center">
								<p>11/66/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>AATAAACA</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>ATAAACAA [60]</p>
							</c>
							<c ca="left">
								<p>ATAAACAA(*)</p>
							</c>
							<c ca="center">
								<p>8.8</p>
							</c>
							<c ca="center">
								<p>5.9</p>
							</c>
							<c ca="center">
								<p>0.0657</p>
							</c>
							<c ca="center">
								<p>23/142/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Mcm1</p>
							</c>
							<c ca="left">
								<p>TTTCCTAA [25]</p>
							</c>
							<c ca="left">
								<p>TTTCCTAA(+)</p>
							</c>
							<c ca="center">
								<p>5.5</p>
							</c>
							<c ca="center">
								<p>5.2</p>
							</c>
							<c ca="center">
								<p>0.0435</p>
							</c>
							<c ca="center">
								<p>35/307/867</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Ste12</p>
							</c>
							<c ca="left">
								<p>TGAAACA [61]</p>
							</c>
							<c ca="left">
								<p>TTGAAACA(*)</p>
							</c>
							<c ca="center">
								<p>4.3</p>
							</c>
							<c ca="center">
								<p>4.2</p>
							</c>
							<c ca="center">
								<p>0.0647</p>
							</c>
							<c ca="center">
								<p>66/145/867</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>TGAAACAA(*)</p>
							</c>
							<c ca="center">
								<p>5</p>
							</c>
							<c ca="center">
								<p>4.8</p>
							</c>
							<c ca="center">
								<p>0.0631</p>
							</c>
							<c ca="center">
								<p>46/149/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Met4, Met28</p>
							</c>
							<c ca="left">
								<p>TCACGTG [62]</p>
							</c>
							<c ca="left">
								<p>TCACGTG(-)</p>
							</c>
							<c ca="center">
								<p>5</p>
							</c>
							<c ca="center">
								<p>1.7</p>
							</c>
							<c ca="center">
								<p>0.0845</p>
							</c>
							<c ca="center">
								<p>129/9/199</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
							<c ca="left">
								<p>N/A</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cbf1</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>GTCACGTG(-)</p>
							</c>
							<c ca="center">
								<p>5</p>
							</c>
							<c ca="center">
								<p>0.9</p>
							</c>
							<c ca="center">
								<p>0.2205</p>
							</c>
							<c ca="center">
								<p>661/3/867</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The first two columns list the known TFs and their known binding motifs. The next five columns report the results from WordSpy, and the last three columns report the results from MobyDick, RSA tools, and Weeder. The motifs discovered by WordSpy are marked with (+) if on the up strand, (-) if on the down strand, or (*) if on both strands. Rank is given as three numbers: the rank by <it>Z</it><sub><it>g</it></sub>-score, the rank by <it>G</it>-score, and the total number of discovered motifs of the same length.</p>
					</tblfn>
				</tbl>
				<p>MBF and SBF are the predominant TFs in the G1/S phase of the yeast cell cycle. Their binding motifs, MCB (ACGCGT) and SCB (CRCGAAA) <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>, are consistent with the top motifs discovered by WordSpy. Among the 199 discovered motifs of length 7, AACGCGT ranks first in both <it>Z</it><sub><it>g</it></sub>-score and <it>G</it>-score, CGCGAAA ranks second in <it>G</it>-score and third in <it>Z</it><sub><it>g</it></sub>-score, and CACGAAA ranks 10th in <it>Z</it><sub><it>g</it></sub>-score and 17th in <it>G</it>-score. Another prominent motif, GTAAACA (8th in <it>Z</it><sub><it>g</it></sub>-score and 10th in <it>G</it>-score), has been reported to be the binding motif of Fkh2 (or Fkh1) <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, which is involved in cell-cycle control during pseudohyphal growth and in silencing of HMRa <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. WordSpy also identifies the binding motifs of Ace2/Swi5 and Met4/Met28 with high <it>G</it>-score rankings, and the binding motifs of Mcm1 and Ste12 with high <it>Z</it><sub><it>g</it></sub>-score rankings.</p>
				<p>Figure <figr fid="F3">3</figr> displays the distribution of all discovered motifs of length 8 with respect to the <it>Z</it><sub><it>g</it></sub>-score. The motifs that overlap with a known motif by at least six nucleotides are displayed in a different color. This result shows that most of the top-ranking motifs based on the <it>Z</it><sub><it>g</it></sub>-score resemble known motifs. To facilitate motif selection, we clustered similar motifs. The motifs were first sorted by <it>Z</it><sub><it>g</it></sub>-score or <it>G</it>-score. Proceeding from the highest to the lowest ranking, we took each motif that had not yet been clustered as a seed, and grouped it with all the motifs that shared a common substring of length 6 (out of 8 base pairs) with the seed or its reverse complement. When the top 20 clusters of motifs of length 8 based on <it>Z</it><sub><it>g</it></sub>-score are combined with the top 20 based on <it>G</it>-score, all the known motifs are identified (see Tables <tblr tid="T3">3</tblr> and 4 in Additional data file 1). Together, these results suggest that by combining <it>Z</it><sub><it>g</it></sub>-score and <it>G</it>-score analysis, WordSpy can comprehensively identify real motifs from a large set of regulatory sequences with high specificity.</p>
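				<p>The greedy clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the WordSpy implementation: the function names and the input format (a list of motif and score pairs) are hypothetical. The example motifs and their <it>Z</it><sub><it>g</it></sub>-scores are taken from Table 1.</p>

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def substrings(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_motifs(scored_motifs, k=6):
    """Greedy seed-based clustering (hypothetical helper).

    scored_motifs is a list of (motif, score) pairs. Motifs are visited from
    the highest to the lowest score. Each motif that is not yet clustered
    becomes a seed and collects every remaining motif that shares a length-k
    substring with the seed or with the seed's reverse complement.
    """
    ranked = sorted(scored_motifs, key=lambda pair: pair[1], reverse=True)
    assigned = set()
    clusters = []
    for seed, _ in ranked:
        if seed in assigned:
            continue
        seed_kmers = substrings(seed, k) | substrings(reverse_complement(seed), k)
        members = [seed]
        assigned.add(seed)
        for motif, _ in ranked:
            if motif in assigned:
                continue
            if not substrings(motif, k).isdisjoint(seed_kmers):
                members.append(motif)
                assigned.add(motif)
        clusters.append(members)
    return clusters


# Four length-8 motifs with their Zg-scores from Table 1.
example = [
    ("ACGCGAAA", 15.2),
    ("CGCGAAAA", 17.7),
    ("GTAAACAA", 9.0),
    ("GGTAAACA", 7.2),
]
clusters = cluster_motifs(example)
print(clusters)
```

				<p>In this sketch a motif joins the first seed it matches, so cluster membership depends on the ranking used (<it>Z</it><sub><it>g</it></sub>-score or <it>G</it>-score).</p>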
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>Distribution of discovered yeast motifs of length 8</p>
					</caption>
					<text>
						<p>Distribution of discovered yeast motifs of length 8. The <it>x</it>-axis is the genome <it>Z</it>-score (<it>Z</it><sub><it>g</it></sub>-score) of a motif, which measures the motif's specificity to the cell-cycle genes. Motifs resembling known ones are marked in blue.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-3"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>Identifying <it>Arabidopsis </it>cell-cycle regulatory motifs</p>
				</st>
				<p>Cell-cycle regulation in plants is more complicated than that in yeast or even mammals. One possible explanation is that the sessile lifestyle of plants requires a more sophisticated mechanism for growth and development to adapt to adverse environmental conditions <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. What makes the study of the cell cycle in plants more appealing is that some plant cells have surprisingly long life spans and are extremely resistant to cancerous conditions. Understanding how plant cells are controlled during development may shed light on the control of human cell proliferation <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>.</p>
				<p>In this study, we applied WordSpy to identify regulatory elements of 1,081 cell-cycle regulated genes of <it>A. thaliana</it>, which were identified by a high-throughput expression profiling experiment <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. After removing homologous genes with an E-value threshold of 10<sup>-12</sup>, we had 1,030 genes left for analysis. The promoter sequences were obtained from the TAIR database <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. We ran WordSpy to find motifs with lengths up to 10. The <it>Arabidopsis </it>whole-genome transcription-profiling data under normal growth conditions from the Weigel lab <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> were used to calculate motif <it>G</it>-scores.</p>
				<p>Figure <figr fid="F4">4</figr> shows the distribution of 5,277 discovered over-represented words over gene specificity in <it>Z</it><sub><it>g</it></sub>-score (<it>x</it>-axis) and gene expression coherence in <it>G</it>-score (<it>y</it>-axis). We considered words with a <it>G</it>-score greater than 0.2 as biologically significant, and used <it>Z</it><sub><it>g</it></sub>-score thresholds of greater than 3.0 or less than -1.0 to select cell-cycle-related or unrelated motifs. With these criteria, motifs are split into six categories, as shown in Figure <figr fid="F4">4</figr>. The motifs in region I are putative cell-cycle-related motifs, which are of greatest interest to us. Region II also contains many putative binding motifs for cell-cycle genes, which may not be specific to cell-cycle processes. The motifs in region IV are putative motifs that are more plentiful in non-cell-cycle genes. The motifs in regions III and V are statistically significant, although their target genes are not expressed coherently. We consider the remaining words in the middle region to be background words, as they satisfy neither criterion.</p>
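				<p>The thresholding above can be written as a small decision rule. The sketch below is an illustration under assumptions, not the authors' code: the function name and constants are hypothetical, the mapping of score ranges to regions is inferred from the text (the figure itself is not reproduced here), and regions III and V are not distinguished because the text does not say which side of the <it>Z</it><sub><it>g</it></sub>-score axis each one occupies.</p>

```python
G_THRESHOLD = 0.2   # G-score above this: coherent target-gene expression
ZG_HIGH = 3.0       # Zg-score above this: specific to cell-cycle genes
ZG_LOW = -1.0       # Zg-score below this: more plentiful outside cell-cycle genes


def classify_word(zg_score, g_score):
    """Assign a word to a region label (hypothetical helper).

    The region boundaries are inferred from the description in the text.
    """
    coherent = g_score > G_THRESHOLD
    enriched = zg_score > ZG_HIGH
    depleted = ZG_LOW > zg_score
    if coherent and enriched:
        return "I"
    if coherent and depleted:
        return "IV"
    if coherent:
        return "II"
    if enriched or depleted:
        return "III/V"   # significant Zg-score, but incoherent expression
    return "background"


# Two E2F-like words with scores taken from Table 2.
print(classify_word(4.384, 0.438))    # TTTGGCGCCT
print(classify_word(-0.598, 0.508))   # TTTCCCGCCA
```

				<p>With the Table 2 values, the first word falls in region I and the second in region II, in agreement with the table footnote.</p>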
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Distribution of all discovered motifs from <it>Arabidopsis </it>cell-cycle-related genes</p>
					</caption>
					<text>
						<p>Distribution of all discovered motifs from <it>Arabidopsis </it>cell-cycle-related genes. The <it>x</it>-axis is the genome <it>Z</it>-score (<it>Z</it><sub><it>g</it></sub>-score) of a motif, which measures the motif's specificity to the cell-cycle genes. The <it>y</it>-axis is the <it>G</it>-score of a motif, which measures the coherency of the expression profiles of the genes whose promoters contain the motif.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-4"/>
				</fig>
				<p>There are 110 motifs in region I of Figure <figr fid="F4">4</figr> (see Tables 5 and 6 in Additional data file 1). We clustered them to obtain 55 motifs (see Additional data file 2). We selected 14 of the 55 motifs, which are similar to some known motifs listed in the plant motif databases PLACE <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> and PLANTCARE <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, and present them in Figure <figr fid="F5">5</figr>.</p>
				<fig id="F5">
					<title>
						<p>Figure 5</p>
					</title>
					<caption>
						<p>Selected putative <it>Arabidopsis </it>cell-cycle-related motifs</p>
					</caption>
					<text>
						<p>Selected putative <it>Arabidopsis </it>cell-cycle-related motifs. ID, the ranking of a motif in the overall list. The third column gives the number of cell-cycle genes whose promoters contain the motif. The following four columns are the number of target genes in S and M phases of the cell cycle and the corresponding <it>P</it> value. GO analysis gives the functional group with the best <it>P</it> value, which is shown in the last column.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-5"/>
				</fig>
				<p>To further evaluate whether WordSpy can indeed find functional <it>cis</it>-regulatory elements, we analyzed these 55 clustered motifs with respect to different cell-cycle phases. The expression of 247, 343, 131, and 247 of the 1,081 cell-cycle genes peaks in G1, S, G2, and M phases, respectively <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. On the basis of this target gene distribution in each phase, we calculated the specificity of each motif to every phase of the cell cycle. For example, 79 of the 122 target genes containing motif 2 (ID = 2, Figure <figr fid="F5">5</figr>) are M-phase genes. When 122 genes are selected at random from the set of cell-cycle genes, the chance of obtaining 79 M-phase genes is less than 3 &#215; 10<sup>-14</sup>. Therefore, motif 2 is very likely to be an M-phase motif. Surprisingly, all the motifs in Figure <figr fid="F5">5</figr> have very low <it>p </it>values in either M phase or S phase. More interestingly, most motifs with low <it>p </it>values in M phase closely match the mitotic-specific activation (MSA) elements (consensus YCYAACGGYY) <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>, and the motifs with low <it>p </it>values in S phase resemble the E2F motif (TTTYYCGYY) <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> and the Octamer and Hexamer motifs <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, which are known S-phase motifs.</p>
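				<p>A phase-specificity probability of this kind can be computed with a hypergeometric tail. The sketch below is an illustration only and rests on assumptions: the text does not state the exact test used, so a one-sided hypergeometric test is assumed, the population is taken to be all 1,081 cell-cycle genes, and the function name is hypothetical.</p>

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """Return P(X >= observed) when draws genes are sampled without
    replacement from population genes, of which successes belong to the
    phase of interest (hypothetical helper).
    """
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # Exact integer arithmetic until the final division.
    return favourable / comb(population, draws)


# Motif 2: 79 of 122 target genes peak in M phase; 247 of the 1,081
# cell-cycle genes are M-phase genes.
p_value = hypergeom_tail(population=1081, successes=247, draws=122, observed=79)
print(p_value)
```

				<p>Under these assumptions the expected number of M-phase genes among 122 random draws is about 28, so observing 79 gives a tail probability far below the bound of 3 &#215; 10<sup>-14</sup> quoted above.</p>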
				<p>Furthermore, to reveal possible functions for each of the 55 motifs, we calculated the enrichment of gene ontology (GO) terms <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> within the genes containing the motif (see Materials and methods). Figure <figr fid="F5">5</figr> shows that almost every motif has some enriched functional categories (<it>p </it>value &lt; 10<sup>-2</sup>). The most common functional category is cyclin-dependent protein kinase regulator activity (CDK). Interestingly, many motifs related to CDK are MSA elements or resemble MYB-like motifs, suggesting that MYB-like TFs regulate cyclin kinase-like proteins in the G2/M phase of the cell cycle. Motif 28 (TTCACCTAC, Figure <figr fid="F5">5</figr>) does not match any known motif. However, all 11 of its target genes peak in S phase, and all seven target genes with GO annotations are related to catalytic activity, implying that this is a novel functional motif. We report all new putative functional motifs in Additional data file 2.</p>
				<sec>
					<st>
						<p>MSA motifs are position dependent</p>
					</st>
					<p>The top four motifs of length 7 ordered by <it>G</it>-score (AGCCGTT, GACCGTT, ACCGTTG, and GGCGCCA) have both a significant <it>Z</it><sub><it>g</it></sub>-score (&gt; 3.0) and a significant <it>G</it>-score (&gt; 0.2). The first three of these motifs resemble MSA elements (consensus YCYAACGGYY) <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. We investigated their position distribution on the promoters of the cell-cycle genes containing the motifs. The result is shown in Figure <figr fid="F6">6</figr>. The three MSA motifs (AGCCGTT, GACCGTT, and ACCGTTG) are significantly over-represented near the transcription start sites (TSSs).</p>
					<fig id="F6">
						<title>
							<p>Figure 6</p>
						</title>
						<caption>
							<p>Distribution of the locations of putative <it>Arabidopsis </it>motifs</p>
						</caption>
						<text>
							<p>Distribution of the locations of putative <it>Arabidopsis </it>motifs. The location distribution of the top four putative motifs of length 7 in the promoters of <it>Arabidopsis </it>cell-cycle genes is shown.</p>
						</text>
						<graphic file="gb-2006-7-6-r49-6"/>
					</fig>
					<p>We further studied the most significant motif of length 10, ACTAGCCGTT, which ranks first in <it>Z</it><sub><it>g</it></sub>-score (11.4) and second in <it>G</it>-score (0.718) (see Table 5 in Additional data file 1). Figure <figr fid="F7">7</figr> shows the expression patterns of the genes whose promoters contain ACTAGCCGTT on either strand. Both the heat map and the profile chart show a highly coherent expression pattern, except for three outliers, AT3G61640, AT5G13100, and AT5G23480. Remarkably, the motif occurrences in these outliers lie far from their TSSs, as shown in Figure <figr fid="F8">8</figr>. Moreover, these cell-cycle genes, except the outliers, are all M-phase related according to the experiment in <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. These results suggest that MSA motifs are position dependent and usually lie close to TSSs.</p>
					<fig id="F7">
						<title>
							<p>Figure 7</p>
						</title>
						<caption>
							<p>Expression patterns of <it>Arabidopsis </it>genes associated with ACTAGCCGTT</p>
						</caption>
						<text>
							<p>Expression patterns of <it>Arabidopsis </it>genes associated with ACTAGCCGTT. The gene-expression profiles are highly coherent except for three outliers: AT3G61640, AT5G13100, and AT5G23480. <b>(a)</b> Heat-map analysis of microarray expression patterns. <b>(b)</b> Profile analysis of microarray expression patterns. Expression profiles are clustered into two groups. The profiles in red and in blue have similar patterns, but the profiles in red have relatively low values.</p>
						</text>
						<graphic file="gb-2006-7-6-r49-7"/>
					</fig>
					<fig id="F8">
						<title>
							<p>Figure 8</p>
						</title>
						<caption>
							<p>Distribution of the positions of the motif ACTAGCCGTT in the promoters of <it>Arabidopsis </it>cell-cycle genes</p>
						</caption>
						<text>
							<p>Distribution of the positions of the motif ACTAGCCGTT in the promoters of <it>Arabidopsis </it>cell-cycle genes.</p>
						</text>
						<graphic file="gb-2006-7-6-r49-8"/>
					</fig>
				</sec>
				<sec>
					<st>
						<p>E2F binding motifs may vary in cell-cycle related and unrelated genes</p>
					</st>
					<p>Various studies have shown that, in addition to the cell cycle, genes containing the E2F binding motif appear in many functional categories, including transcription, stress defense, and signaling <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. As expected, we also identified many E2F-like motifs in region II. Table <tblr tid="T2">2</tblr> shows the discovered motifs that match the known E2F binding elements (consensus TTTYYCGYY) <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. The motifs in cluster 1 are in region I of Figure <figr fid="F4">4</figr>, with <it>Z</it><sub><it>g</it></sub>-score greater than 3.0. This cluster of motifs corresponds to motif 8 in Figure <figr fid="F5">5</figr>. The motifs in cluster 2 are in region II, with <it>Z</it><sub><it>g</it></sub>-score less than 3.0. The motifs in cluster 1 are therefore more specific to the cell cycle than those in cluster 2. These two sets of motifs differ only by two nucleotides in their core sequences. The motifs that are more cell-cycle specific have 'GG' in the middle (TTT<b>GG</b>CGCC), whereas the motifs that are abundant in the genome contain 'CC' in their core sequences (TTT<b>CC</b>CGCC). Among the cell-cycle genes, TTT<b>GG</b>CGCC appears in 14 promoters and TTT<b>CC</b>CGCC in 10 promoters. In the whole genome, 100 genes have TTT<b>GG</b>CGCC in their promoters and 257 genes have TTT<b>CC</b>CGCC.</p>
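					<p>The promoter counts quoted above can be compared directly. The short calculation below is only a worked illustration; it assumes that the cell-cycle promoters are a subset of the genome-wide promoters counted for each motif, and the variable and function names are hypothetical.</p>

```python
# Promoter counts quoted in the text for the two E2F core variants.
counts = {
    "TTTGGCGCC": {"cell_cycle": 14, "genome": 100},
    "TTTCCCGCC": {"cell_cycle": 10, "genome": 257},
}


def cell_cycle_fraction(entry):
    """Fraction of a motif's genome-wide promoters that are cell-cycle promoters."""
    return entry["cell_cycle"] / entry["genome"]


fractions = {motif: cell_cycle_fraction(entry) for motif, entry in counts.items()}
for motif, fraction in fractions.items():
    print(f"{motif}: {fraction:.3f}")

# Relative preference of the GG variant for cell-cycle promoters.
ratio = fractions["TTTGGCGCC"] / fractions["TTTCCCGCC"]
print(f"ratio: {ratio:.1f}")
```

					<p>Under this assumption, about 14% of the promoters carrying the 'GG' variant belong to cell-cycle genes, compared with about 4% for the 'CC' variant.</p>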
					<tbl id="T2">
						<title>
							<p>Table 2</p>
						</title>
						<caption>
							<p>Discovered E2F motifs with <it>G</it>-score greater than 0.2</p>
						</caption>
						<tblbdy cols="7">
							<r>
								<c ca="left">
									<p>
										<b>Motif</b>
									</p>
								</c>
								<c ca="left">
									<p>
										<b><it>Z</it><sub><it>g</it></sub>-score</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b><it>Z</it>-score</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b><it>G</it>-score</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>Number of occurrences</b>
									</p>
								</c>
								<c ca="center">
									<p>
										<b>Number of promoters</b>
									</p>
								</c>
								<c ca="left">
									<p>
										<b>Known motifs</b>
									</p>
								</c>
							</r>
							<r>
								<c cspan="7">
									<hr/>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>Word cluster 1:</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>TT<it>GG</it>CGCCTC(-)</p>
								</c>
								<c ca="left">
									<p>3.768</p>
								</c>
								<c ca="center">
									<p>11.6</p>
								</c>
								<c ca="center">
									<p>0.633</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="center">
									<p>4</p>
								</c>
								<c ca="left">
									<p>E2F(TTTYYCGYY)</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>TTT<it>GG</it>CGCCT(-)</p>
								</c>
								<c ca="left">
									<p>4.384</p>
								</c>
								<c ca="center">
									<p>9.5</p>
								</c>
								<c ca="center">
									<p>0.438</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="left">
									<p>E2F(TTTYYCGYY)</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>T<it>GG</it>CGCC(*)</p>
								</c>
								<c ca="left">
									<p>3.006</p>
								</c>
								<c ca="center">
									<p>5.6</p>
								</c>
								<c ca="center">
									<p>0.255</p>
								</c>
								<c ca="center">
									<p>20</p>
								</c>
								<c ca="center">
									<p>20</p>
								</c>
								<c ca="left">
									<p>E2F(TTTYYCGYY)</p>
								</c>
							</r>
							<r>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>Word cluster 2:</p>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
								<c>
									<p/>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>TTT<it>CC</it>CGCCA(-)</p>
								</c>
								<c ca="left">
									<p>-0.598</p>
								</c>
								<c ca="center">
									<p>12.9</p>
								</c>
								<c ca="center">
									<p>0.508</p>
								</c>
								<c ca="center">
									<p>6</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="left">
									<p>E2FANTRNR(TTTCCCGC)</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>TTT<it>CC</it>CGCC(+)</p>
								</c>
								<c ca="left">
									<p>-0.613</p>
								</c>
								<c ca="center">
									<p>4.7</p>
								</c>
								<c ca="center">
									<p>0.289</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="center">
									<p>5</p>
								</c>
								<c ca="left">
									<p>E2FANTRNR(TTTCCCGC)</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>TT<it>CC</it>CGC(+)</p>
								</c>
								<c ca="left">
									<p>0.236</p>
								</c>
								<c ca="center">
									<p>5.7</p>
								</c>
								<c ca="center">
									<p>0.285</p>
								</c>
								<c ca="center">
									<p>36</p>
								</c>
								<c ca="center">
									<p>32</p>
								</c>
								<c ca="left">
									<p>E2FANTRNR(TTTCCCGC)</p>
								</c>
							</r>
							<r>
								<c ca="center">
									<p>TTT<it>CC</it>CGCT(+)</p>
								</c>
								<c ca="left">
									<p>0.227</p>
								</c>
								<c ca="center">
									<p>4.3</p>
								</c>
								<c ca="center">
									<p>0.273</p>
								</c>
								<c ca="center">
									<p>7</p>
								</c>
								<c ca="center">
									<p>7</p>
								</c>
								<c ca="left">
									<p>E2FANTRNR(TTTCCCGC)</p>
								</c>
							</r>
						</tblbdy>
						<tblfn>
							<p>Motifs in cluster 1 are in motif region I (Figure 4) with <it>Z</it><sub><it>g</it></sub>-score greater than 3.0. Motifs in cluster 2 are in motif region II with <it>Z</it><sub><it>g</it></sub>-score less than 3.0. The motifs are marked with (+) if on the up strand, (-) if on the down strand or (*) if on both strands. Number of occurrences is the number of occurrences of a motif and Number of promoters is the number of promoters containing the motif.</p>
						</tblfn>
					</tbl>
					<p>In summary, these observations indicate that the preferential cell-cycle-related E2F motif is TTT<b>GG</b>CGCC, and the non-cell-cycle-related E2F motif is TTT<b>CC</b>CGCC. In other words, the E2F binding motifs differ depending on whether or not they are cell-cycle related. Our results also demonstrate that the WordSpy method can detect such subtle but important differences in regulatory elements.</p>
				</sec>
			</sec>
			<sec>
				<st>
					<p>Finding discriminative motifs</p>
				</st>
				<p>Given two sets of scripts or sequences, a discriminative motif is one that is over-represented in one set but not in the other. WordSpy is, in essence, an algorithm for finding discriminative motifs, because it models motifs and background words in a single integrated model. Here, background words can be extracted from one set of sequences (the negative set), while the discriminative motifs are identified from another set of sequences (the positive set).</p>
				<p>We applied WordSpy as a discriminative algorithm to find regulatory motifs in <it>S. cerevisiae</it>. We constructed positive and negative sequence sets based on the ChIP-chip experiments of Lee <it>et al</it>. <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. For a particular TF, we selected as the positive dataset those promoters that the TF could bind to with <it>p </it>values &lt; 0.01 in the ChIP-chip experiments and as the negative dataset those promoters with <it>p </it>values &gt; 0.99. We also applied two widely used algorithms, MEME <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and AlignACE <abbrgrp><abbr bid="B7">7</abbr></abbrgrp> to the same data. MEME was executed with a sixth-order Markov model on the yeast noncoding regions as background. Table <tblr tid="T3">3</tblr> lists the motifs that are closest to the known cell-cycle-related motifs from these three algorithms. As shown, WordSpy not only found all known motifs for each TF but also the known motifs of cofactors. MEME and AlignACE were able to find most known motifs, but missed some binding sites of cofactors.</p>
			</sec>
			<sec>
				<st>
					<p>Evaluation with a benchmark study</p>
				</st>
				<p>Recently, Tompa <it>et al</it>. <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> developed a benchmark of a set of well-curated regulatory sequences and <it>cis</it>-regulatory elements of budding yeast, fruit fly, mouse, and human for evaluating motif-finding algorithms. They introduced seven statistical measurements to assess the performance of 13 motif-finding programs. An interesting observation on their results is that the enumeration-based methods, represented by Weeder <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and YMF <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, outperformed the model-based approaches, represented by MEME <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and AlignACE <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>.</p>
				<p>Almost all the sets of sequences in the benchmark are relatively small; none of them has more than 35 sequences. WordSpy was designed to find motifs in a large number of sequences, for example, more than 1,000 promoters of cell-cycle-related genes in <it>Arabidopsis</it>, and was not originally intended for small sets of sequences. Nevertheless, it can be used to find motifs in a small set of sequences and has a very competitive performance, as we show here. We applied WordSpy to the sets of sequences in the benchmark and compared it with the other programs studied by Tompa <it>et al</it>. <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. For fair comparison, we did not use gene-expression information in WordSpy, but rather used only genomic sequences to calculate the <it>Z</it><sub><it>g</it></sub>-scores. Moreover, although WordSpy discovered a set of motifs for each sequence set, we reported only the most significant motif, selected as follows. For all the experiments, we built a dictionary up to word length 10. We then filtered out the motifs with <it>Z</it><sub><it>g</it></sub>-scores less than 4. Finally, we selected the motif with the highest <it>Z</it>-score or <it>Z</it><sub><it>g</it></sub>-score, depending on the site distributions, always choosing motifs whose sites are close to the TSSs.</p>
				<p>Figure <figr fid="F9">9</figr> shows the comparison results of WordSpy with the 13 programs (Weeder <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, YMF <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, RSA-tool <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, QuickScore <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, AlignACE <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>, ANN-Spec <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, MEME <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, Consensus <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, MITRA <abbrgrp><abbr bid="B41">41</abbr></abbrgrp>, GLAM <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>, Improbizer <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, MotifSampler <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, SeSiMCMC <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>) on the seven statistics introduced in <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. A detailed description of these statistics is available on the benchmark website <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>. As shown in Figure <figr fid="F9">9</figr> and Additional data file 3, WordSpy outperforms the other programs by all the measures. Figure <figr fid="F10">10</figr> shows true positives versus false positives at both the nucleotide level and the site level for all the programs. WordSpy has the highest numbers of true positives and relatively low numbers of false positives in both cases. The success of WordSpy may be due to the following reasons. First, WordSpy aims to discover all over-represented motifs; the chance of it missing a significant motif is low. Second, the <it>Z</it><sub><it>g</it></sub>-scores computed in WordSpy help it to select the right motifs that are specific to a given set of sequences. Third, WordSpy uses a strategy of first searching for over-represented exact words and then combining them to form degenerate motifs. This strategy makes the motif representation in WordSpy more stringent than that in the other methods, and as a result, it has a smaller false-positive rate. Note that WordSpy performs better on the budding yeast and human datasets than on the fruit fly datasets.</p>
				<fig id="F9">
					<title>
						<p>Figure 9</p>
					</title>
					<caption>
						<p>The results of a comparison of 14 motif-detection programs on a benchmark study [17]</p>
					</caption>
					<text>
						<p>The results of a comparison of 14 motif-detection programs on a benchmark study [17]. At the nucleotide level, sensitivity (<it>nSn</it>), positive predictive value (<it>nPPV</it>), performance coefficient (<it>nPC</it>), and correlation coefficient (<it>nCC</it>) were measured. With <it>nTP</it>, <it>nFN</it>, <it>nFP </it>and <it>nTN </it>as nucleotide-level true positive, false negative, false positive, and true negative, respectively, <it>nSn </it>= <it>nTP</it>/(<it>nTP </it>+ <it>nFN</it>); <it>nPPV </it>= <it>nTP</it>/(<it>nTP </it>+ <it>nFP</it>); <it>nPC </it>= <it>nTP</it>/(<it>nTP </it>+ <it>nFN </it>+ <it>nFP</it>); and <it>nCC </it>= (<it>nTP</it>&#183;<it>nTN </it>- <it>nFN</it>&#183;<it>nFP</it>)/<graphic file="gb-2006-7-6-r49-i11.gif"/>. At the site level, sensitivity (<it>sSn</it>), positive predictive value (<it>sPPV</it>), and average site performance (<it>sASP</it>) were measured. With <it>sTP</it>, <it>sFN</it>, <it>sFP </it>as site-level true positive, false negative, and false positive, respectively, <it>sSn </it>= <it>sTP</it>/(<it>sTP </it>+ <it>sFN</it>); <it>sPPV </it>= <it>sTP</it>/(<it>sTP </it>+ <it>sFP</it>); and <it>sASP </it>= (<it>sSn </it>+ <it>sPPV</it>)/2.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-9"/>
				</fig>
				<fig id="F10">
					<title>
						<p>Figure 10</p>
					</title>
					<caption>
						<p>True positives and false positives of the 14 motif-detection programs compared</p>
					</caption>
					<text>
						<p>True positives and false positives of the 14 motif-detection programs compared. <b>(a)</b> Nucleotide-level true positive (<it>nTP</it>) is the number of nucleotide positions in both known sites and predicted sites; nucleotide-level false positive (<it>nFP</it>) is the number of nucleotide positions not in known sites but in predicted sites. <b>(b)</b> Site-level true positive (<it>sTP</it>) is the number of known sites overlapped by predicted sites; site-level false positive (<it>sFP</it>) is the number of predicted sites not overlapped by known sites.</p>
					</text>
					<graphic file="gb-2006-7-6-r49-10"/>
				</fig>
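				<p>As a minimal illustration, the benchmark statistics defined in the legend of Figure 9 can be sketched in Python as follows. The function names are hypothetical and are not part of WordSpy or of the benchmark software. The denominator of <it>nCC </it>appears only as an image in the legend, so the sketch assumes the standard correlation-coefficient form, the square root of (<it>nTP </it>+ <it>nFN</it>)(<it>nTN </it>+ <it>nFP</it>)(<it>nTP </it>+ <it>nFP</it>)(<it>nTN </it>+ <it>nFN</it>). Zero denominators are not handled.</p>
				<p>
```python
import math


def nucleotide_stats(tp, fn, fp, tn):
    """Nucleotide-level statistics from the Figure 9 legend."""
    sn = tp / (tp + fn)            # sensitivity, nSn
    ppv = tp / (tp + fp)           # positive predictive value, nPPV
    pc = tp / (tp + fn + fp)       # performance coefficient, nPC
    # Assumed denominator: standard correlation-coefficient form.
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return {"nSn": sn, "nPPV": ppv, "nPC": pc, "nCC": cc}


def site_stats(tp, fn, fp):
    """Site-level statistics from the Figure 9 legend."""
    sn = tp / (tp + fn)            # sSn
    ppv = tp / (tp + fp)           # sPPV
    return {"sSn": sn, "sPPV": ppv, "sASP": (sn + ppv) / 2}
```
				</p>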
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusion</p>
			</st>
			<p>We propose a new approach to the challenging problem of genome-wide motif finding, which combines a novel steganalysis method for discovering over-represented motifs and methods for selecting biologically significant motifs. By taking a steganalysis perspective on the motif-finding problem, we were able to accurately identify a large number of motifs of nearly optimal lengths. By considering all the genes of interest altogether, we avoided the problem of subjectively partitioning the genes into small clusters, which may make some motifs difficult to detect. By applying our approach to all cell-cycle-related genes in budding yeast and <it>A. thaliana</it>, we demonstrated its power as an effective genome-wide motif finding approach that compared favorably to many existing methods.</p>
			<p>The core motif-finding algorithm, WordSpy, combines both word counting and statistical modeling. Like word-counting methods, WordSpy can simultaneously detect a large number of putative motifs. Unlike the existing word-counting methods, however, the word-counting procedure of WordSpy is progressive and retrospective. It considers short to long words, adjusts the over-representation of shorter words after examining longer ones, and subsequently eliminates shorter words that are not truly over-represented. As a result, WordSpy produces fewer spurious motifs and is able to find motifs with optimal lengths. Furthermore, instead of using statistical models to characterize a small number of motifs with multiple local alignments, WordSpy models a large number of motifs, their compositions, and their usage to fit the whole of the given sequences. Consequently, all significant words in regulatory regions can be identified.</p>
			<p>WordSpy is a dictionary-based approach, which was initiated in the innovative MobyDick algorithm by Bussemaker <it>et al</it>. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. Nevertheless, we significantly extended their work in many important aspects. First, we took a novel steganographic view of the problem of motif finding. This allows us to combine a grammar with a dictionary in a statistical model to capture both conserved motifs and background words. Second, WordSpy accurately quantifies the over-representation of a word by considering the probability that the word can be generated by the best model that has been built so far, whereas MobyDick computes the over-representation by counting the occurrences of a word in a large synthetic dataset. Third, WordSpy considers only those words that occur in the given sequences without enumerating all possible words, which saves a substantial amount of computation, especially for long words.</p>
			<p>In the current implementation of WordSpy, we assumed that the motifs and words in a dictionary were used independently. For some applications, however, spatial relationships among motifs may be biologically important. For such cases, we may resort to a more complex grammar, such as a stochastic context-free or context-sensitive grammar <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. However, the incurred computational cost could be prohibitively high for even small problems. A more efficient way to capture motif correlations is to construct motif modules using the motifs identified by a simple grammar model. Similar post-processing strategies have been proposed <abbrgrp><abbr bid="B47">47</abbr><abbr bid="B48">48</abbr></abbrgrp>.</p>
			<p>In this research, we adopted two schemes to measure the biological significance of motifs. One is the expression coherence of the genes whose promoters contain a motif, and the other is the specificity of a motif to the genes of interest with respect to the rest of the genome. Similar ideas have been proposed <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>. As shown in this study, these two biological relevance measures are effective in identifying cell-cycle-related TF-binding motifs of yeast and <it>A. thaliana</it>. However, we caution that a high <it>G</it>-score is neither necessary nor sufficient for a good motif, a limitation shared with the clustering-first approaches, and that gene-expression information may not be available for all genomes. Therefore, we suggest using the <it>Z</it><sub><it>g</it></sub>-score as the major criterion, and the <it>G</it>-score and other information as supporting evidence.</p>
			<p>In this study, we applied our approach to identify significant <it>cis</it>-elements from sequences of a single species. Like most algorithms that use information of a single species, WordSpy may be vulnerable to noisy promoter sequences as a result of the uncertainty of the annotation, especially in the genomes of higher eukaryotes. A comparative approach may have an advantage in such situations by utilizing conservation information from multiple species. Therefore, we will consider using evolutionary information to improve our method in future work. Nevertheless, computational tools for large-scale <it>de novo </it>motif finding for a single species are still important, especially for applications where no sequences of closely related species are available and for problems where species-specific motifs are needed. It is interesting to note that single-species motif finding can be competitive when compared with comparative genomics methods using multiple species <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>.</p>
		</sec>
		<sec>
			<st>
				<p>Materials and methods</p>
			</st>
			<sec>
				<st>
					<p>Word sampling</p>
				</st>
				<p>The goal of word sampling is to discover over-represented motifs as completely and accurately as possible. Word sampling determines the structure of the model and initializes its parameters. For biological sequences, a regulatory motif is usually represented by a series of position profiles, each of which is the distribution of four nucleotides at that position. In our model, the emission probability of each position node is equivalent to such a profile. However, such motifs, named as 'profile motifs', exist in a continuous space. It is almost impossible to comprehensively search for all over-represented profile motifs directly. Here, we combine methods of word counting and statistical modeling. We apply a word-counting method to detect over-represented words in the discrete sequence space of four nucleotides, and then cluster similar words to form a profile motif. All resulting profile motifs will be further improved in the model optimization phase.</p>
				<p>We develop an efficient algorithm for word sampling to identify all over-represented words of length <it>k </it>in the sequence space against the optimal model <graphic file="gb-2006-7-6-r49-i9.gif"/> in linear time and linear space. The algorithm scans the script <it>S </it>once, tabulates, using a hashing scheme, all exact words of length <it>k </it>in <it>S</it>, and computes their over-representativeness. A word is considered over-represented if it occurs more frequently in <it>S </it>than would be expected under the current best model <graphic file="gb-2006-7-6-r49-i9.gif"/>. We measure the over-representativeness by a <it>Z</it>-score. Let <it>N</it><sub><it>w </it></sub>be the number of occurrences of a word <it>w </it>in <it>S </it>and let the random variable <graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w </it></sub>be the number of occurrences of <it>w </it>in a script of the same length as <it>S </it>generated by model <graphic file="gb-2006-7-6-r49-i9.gif"/>. Let <it>E</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>) and <it>&#963;</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>) denote the mean and standard deviation of <graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>. The <it>Z</it>-score of <it>w </it>is defined as <it>Z</it><sub><it>w </it></sub>= (<it>N</it><sub><it>w </it></sub>- <it>E</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>))/<it>&#963;</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>). It is nontrivial to compute the statistics of the random variable <graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>. Consider a word <it>w </it>of length <it>k </it>in a sequence of length <it>L </it>generated by model <graphic file="gb-2006-7-6-r49-i9.gif"/>. There are various ways to produce <it>w </it>using the model, for example, by concatenating words of a single letter, or by merging a word's suffix with another word's prefix. To compute the expected number of occurrences of <it>w</it>, <it>E</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>), we define <graphic file="gb-2006-7-6-r49-i13.gif"/>(<it>i</it>) (and respectively <graphic file="gb-2006-7-6-r49-i14.gif"/>(<it>j</it>)) to be the set of words in <graphic file="gb-2006-7-6-r49-i8.gif"/> whose suffixes (and respectively prefixes) match the first <it>i </it>(and respectively the last <it>j</it>) letters of <it>w</it>. The expectation <it>E</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>) can be computed as</p>
				<p>
					<graphic file="gb-2006-7-6-r49-i15.gif"/>
				</p>
				<p>where</p>
				<p>
					<graphic file="gb-2006-7-6-r49-i16.gif"/>
				</p>
				<p>and <it>w</it><sub>[<it>j</it>,<it>k</it>] </sub>represents the subsequence of <it>w </it>from its <it>j</it>th to <it>k</it>th positions, <graphic file="gb-2006-7-6-r49-i17.gif"/> is the transition probability of motif <it>W</it><sub><it>u</it></sub>, and <graphic file="gb-2006-7-6-r49-i18.gif"/> and <graphic file="gb-2006-7-6-r49-i19.gif"/> are the emission probabilities of the last <it>i </it>and first <it>j </it>positions of &#920;<sub><it>u</it></sub>, respectively. The computation of <it>&#963;</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>) is complex and costly <abbrgrp><abbr bid="B51">51</abbr><abbr bid="B52">52</abbr></abbrgrp>. Following the practice in the existing methods, in our current implementation, we approximate <it>&#963;</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>) by <it>E</it>(<graphic file="gb-2006-7-6-r49-i12.gif"/><sub><it>w</it></sub>).</p>
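				<p>The single-scan tabulation and the <it>Z</it>-score can be illustrated with the following Python sketch. It is a simplification under stated assumptions rather than the WordSpy implementation: the expectation is computed under an independent, identically distributed base model (a stand-in for the full dictionary model used above), the standard deviation is replaced by the expectation as described in the text, and all function names are hypothetical.</p>
				<p>
```python
from collections import Counter


def count_words(sequence, k):
    """Tabulate every exact word of length k in one scan (hash table)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def expected_count_iid(word, seq_len, base_freq):
    """Expected occurrences of word under an i.i.d. base model.

    Simplifying assumption: the paper's expectation is taken under the
    current dictionary model, not under independent bases.
    """
    p = 1.0
    for base in word:
        p *= base_freq[base]
    return (seq_len - len(word) + 1) * p


def z_score(n_obs, expected):
    """Z = (N_w - E) / sigma, with sigma approximated by E as in the text."""
    return (n_obs - expected) / expected
```
				</p>
				<p>For example, with a uniform base model, <it>count_words</it> applied to ACGTACGTAC with <it>k </it>= 2 finds AC three times against an expectation of 9/16, so AC scores well above the other dinucleotides.</p>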
				<p>All the words with <it>Z</it>-scores greater than a threshold are considered over-represented. Thereafter, all new words are classified as background words or motif words using motif evaluation methods. Two evaluation methods are described in the section on 'Motif evaluation' below. After evaluation, background words are added to the background sub-dictionary of <graphic file="gb-2006-7-6-r49-i8.gif"/>. The motif words are further clustered to form profile motifs.</p>
				<p>The current implementation of word clustering is a greedy algorithm. Let <it>C </it>= {<it>w</it><sub>1</sub>,<it>w</it><sub>2</sub>,...,<it>w</it><sub><it>m</it></sub>} be a set of words of length <it>k</it>, sorted in a non-increasing order of their <it>Z</it>-scores. From the beginning to the end of list <it>C</it>, we take a word <it>w</it><sub><it>j </it></sub>as a seed and search the words in <it>C </it>that match <it>w</it><sub><it>j </it></sub>by at least <it>&#954; </it>letters, where <it>&#954; </it>is determined so that the chance of two random words of length <it>k </it>having <it>&#954; </it>matched letters is less than 0.001. All such matched words are then merged with <it>w</it><sub><it>j </it></sub>and subsequently removed from the seed candidate list. The procedure terminates after all seeds have been examined. This heuristic assumes that the degeneracy is uniform over all positions of a motif. Regulatory motifs may, however, have one or two core parts that are more conserved than their flanking sequences, which sometimes may be 'do-not-care' positions. Fortunately, the current model <graphic file="gb-2006-7-6-r49-i9.gif"/> keeps all short but over-represented motifs that may include those possible cores of longer motifs. We can also make a nonuniform seed by parsing a word in <it>C </it>through <graphic file="gb-2006-7-6-r49-i8.gif"/>, finding some cores (substrings), fixing the seed at those core positions, and allowing mismatches at the other positions. Note that these word clusters are not final. During the model optimization, word clusters are dynamically changed as profile motifs are updated.</p>
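				<p>The greedy clustering step can be sketched in Python as follows. This is one reading of the description, not the WordSpy code: the threshold is chosen as the smallest number of matches whose upper-tail probability (at least that many matches) is below 0.001, positions are assumed to match independently with probability 0.25, and merged words lose their eligibility as seeds but may still join later clusters. Function names are hypothetical.</p>
				<p>
```python
from math import comb


def min_matches(k, p_match=0.25, alpha=0.001):
    """Smallest kappa with P(at least kappa matches in k positions) below alpha."""
    for kappa in range(k + 1):
        tail = sum(comb(k, m) * p_match ** m * (1 - p_match) ** (k - m)
                   for m in range(kappa, k + 1))
        if alpha > tail:
            return kappa
    return k + 1  # sentinel: even a perfect match is not rare enough


def matches(a, b):
    """Number of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))


def greedy_cluster(words_by_z, kappa):
    """Cluster words given in non-increasing order of Z-score."""
    used = set()      # words no longer eligible as seeds
    clusters = []
    for seed in words_by_z:
        if seed in used:
            continue
        # The seed matches itself, so it is always a member of its cluster.
        members = [w for w in words_by_z if matches(seed, w) >= kappa]
        used.update(members)
        clusters.append(members)
    return clusters
```
				</p>
				<p>Under these assumptions, words of length 8 require at least 7 matching letters to be merged.</p>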
				<p>At the end of word sampling, the new profile motifs are added to the motif sub-dictionary of <graphic file="gb-2006-7-6-r49-i8.gif"/> (<it>I</it><sub><it>W </it></sub>is set to 1) to form the next dictionary <it>D</it><sub><it>k</it></sub>. The model is retrofitted to accommodate the new motifs, leading to the next grammar <it>G</it><sub><it>k</it></sub>. The new model <it>G</it><sub><it>k </it></sub>is then optimized in the model optimization phase. The overall process repeats until the model covers motifs up to the maximum length.</p>
			</sec>
			<sec>
				<st>
					<p>Model optimization</p>
				</st>
				<p>The goal of model optimization is to optimize the profile motifs as well as their usage probabilities. In this phase, motif statistics are recomputed and insignificant motifs are discarded. Given a stegoscript <it>S </it>and a grammar <it>G</it><sub><it>k </it></sub>= (&#936;, &#920;, <it>I</it>), where <it>I </it>has been determined in word sampling, an optimized grammar <graphic file="gb-2006-7-6-r49-i10.gif"/> can be derived using the expectation maximization (EM) algorithm <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>.</p>
				<p>Without loss of generality, we view a set of sequences as a long sequence <it>S </it>= <it>s</it><sub>1</sub><it>s</it><sub>2</sub>...<it>s</it><sub><it>q</it></sub>. Let <graphic file="gb-2006-7-6-r49-i20.gif"/> = (&#936;<sup>(<it>t</it>)</sup>, &#920;<sup>(<it>t</it>)</sup>) be <it>G</it><sub><it>k</it></sub>'s parameters in the <it>t</it>th iteration. We can adopt a dynamic programming forward-backward algorithm <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> to compute the most probable state when observing <it>s</it><sub><it>l </it></sub>&#8712; <it>S</it>. Specifically, we compute the probability of observing <it>s</it><sub><it>l </it></sub>at the <it>j</it>th position of a motif <it>W </it>given &#936;<sup>(<it>t</it>) </sup>and &#920;<sup>(<it>t</it>) </sup>as follows,</p>
				<p>
					<graphic file="gb-2006-7-6-r49-i21.gif"/>
				</p>
				<p>where <it>W</it>[<it>j</it>] is the <it>j</it>th position of <it>W</it>, <it>f</it>(<it>&#956;</it>) is the probability of observing <it>S </it>up to <it>s</it><sub><it>&#956; </it></sub>(inclusive) given <graphic file="gb-2006-7-6-r49-i20.gif"/>, <it>&#956; </it>= <it>i </it>- <it>k</it>, <it>&#961;</it><sub><it>W </it></sub>= <it>P</it><sub><it>W</it></sub>(<it>I</it><sub><it>W</it></sub><it>P</it><sub><it>M </it></sub>+ (1 - <it>I</it><sub><it>W</it></sub>)<it>P</it><sub><it>B</it></sub>), <it>b</it>(<it>&#957; </it>+ 1) is the probability of observing <it>S </it>from <it>s</it><sub><it>&#957; </it>+ 1 </sub>(inclusive) to the end of <it>S</it>, <it>&#957; </it>= <it>i </it>- <it>k </it>+ <it>l</it>(<it>W</it>), and <it>l</it>(<it>W</it>) is the length of <it>W</it>. The function <it>f</it>(<it>i</it>) can be recursively computed as <it>f</it>(<it>i</it>) = <graphic file="gb-2006-7-6-r49-i22.gif"/>&#183;<it>&#964;</it><sub><it>W</it></sub>(<it>i </it>- <it>l</it>(<it>W</it>) + 1, <it>i</it>)&#183;<it>f</it>(<it>i </it>- <it>l</it>(<it>W</it>)). Similarly, <it>b</it>(<it>i</it>) can be computed as <it>b</it>(<it>i</it>) = <graphic file="gb-2006-7-6-r49-i22.gif"/>&#183;<it>&#964;</it><sub><it>W</it></sub>(<it>i</it>, <it>i </it>+ <it>l</it>(<it>W</it>) - 1)&#183;<it>b</it>(<it>i </it>+ <it>l</it>(<it>W</it>)). Evidently, <it>P</it>(<it>S</it>|<graphic file="gb-2006-7-6-r49-i20.gif"/>) = <it>f</it>(<it>q</it>) = <it>b</it>(1).</p>
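				<p>The forward and backward recursions can be illustrated with a small Python sketch. It assumes a dictionary of exact words only, so that the emission term for a word is 1 on an exact match and 0 otherwise; profile motifs would replace this test with a product of position-specific emission probabilities. Positions are 0-based, the mapping <it>rho</it> from words to usage probabilities is hypothetical, and the posterior is recomputed from scratch on each call for clarity rather than speed.</p>
				<p>
```python
def forward(seq, rho):
    """f[i]: probability of generating the first i letters of seq."""
    q = len(seq)
    f = [0.0] * (q + 1)
    f[0] = 1.0
    for i in range(1, q + 1):
        for word, p in rho.items():
            n = len(word)
            # Exact-word emission: contributes only if the word ends at i.
            if i >= n and seq[i - n:i] == word:
                f[i] += p * f[i - n]
    return f


def backward(seq, rho):
    """b[i]: probability of generating the suffix seq[i:]."""
    q = len(seq)
    b = [0.0] * (q + 1)
    b[q] = 1.0
    for i in range(q - 1, -1, -1):
        for word, p in rho.items():
            if seq.startswith(word, i):
                b[i] += p * b[i + len(word)]
    return b


def word_posterior(seq, rho, start, word):
    """Posterior probability that word is used at position start."""
    if not seq.startswith(word, start):
        return 0.0
    f, b = forward(seq, rho), backward(seq, rho)
    return f[start] * rho[word] * b[start + len(word)] / f[len(seq)]
```
				</p>
				<p>In this sketch the total likelihood is the last entry of the forward table and the first entry of the backward table, mirroring the identity <it>f</it>(<it>q</it>) = <it>b</it>(1) above.</p>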
				<p>With this posterior probability, we can easily have, <graphic file="gb-2006-7-6-r49-i23.gif"/> and <graphic file="gb-2006-7-6-r49-i24.gif"/>, where <graphic file="gb-2006-7-6-r49-i25.gif"/> is the average number of <it>W</it><sub><it>i </it></sub>likely to be observed and <graphic file="gb-2006-7-6-r49-i26.gif"/>(<it>&#962;</it>, <it>j</it>) is the average number of letter <it>&#962; </it>likely to be observed at the <it>j</it>th position of <it>W</it><sub><it>i</it></sub>, in all the possible parses of <it>S </it>given &#936;<sup>(<it>t</it>) </sup>and &#920;<sup>(<it>t</it>)</sup>. On the basis of the maximum likelihood principle, a model that fits the data better will have the following parameters,</p>
				<p>
					<graphic file="gb-2006-7-6-r49-i27.gif"/>
				</p>
				<p>where &#923; is the alphabet, <it>&#962; </it>&#8712; &#923;, <it>j </it>= 1,2,...,<it>l</it>(<it>W</it><sub><it>i</it></sub>), <it>l</it>(<it>W</it><sub><it>i</it></sub>) is the length of <it>W</it><sub><it>i</it></sub>, and <it>&#948; </it>(<it>x</it>, <it>y</it>) equals 1 if <it>x </it>= <it>y</it>, or 0 otherwise. The model optimization is done iteratively using equations in (3) until convergence.</p>
				<p>In this procedure, the forward-backward algorithm becomes more costly as the number of motifs in the dictionary increases, because its time complexity is <it>O</it>(<it>L</it>&#183;<it>N</it>), where <it>L </it>is the sequence length and <it>N </it>the size of the dictionary. We introduce a hash scheme to index a word <it>w </it>directly to the profile motifs in the dictionary that may emit <it>w</it>, which reduces the average cost of the forward-backward algorithm to <it>O</it>(<it>&#945;</it><it>L</it>), where <it>&#945; </it>is the average link length of the words in the hash table. The links are initially created during word clustering. When a profile motif is generated from a word cluster, every word in the cluster adds a link to the motif in its hash field. Because a word may appear in multiple clusters, its hash field may contain multiple links. These links are also dynamically changed at the end of each iteration, as the profile motifs are updated.</p>
			</sec>
			<sec>
				<st>
					<p>Motif evaluation</p>
				</st>
				<p>WordSpy is designed to identify a complete list of putative motifs and usually gives a large number of significant words. How to separate true motifs from background words is critical. As the covertext consists of random strings, a proper <it>Z</it>-score threshold can be used to filter out most background words. However, the regulatory regions of a genome are not purely random. There exist many highly over-represented pseudo-motifs that make it harder to find real, functional motifs. Fortunately, functional motifs often have intrinsic properties that make them separable from spurious ones.</p>
				<sec>
					<st>
						<p>Specificity to the target promoters</p>
					</st>
					<p>An extracted motif cannot be considered as a genuine motif specific to the genes of interest if it is prevalent in other promoter regions of the genome. We utilize this property to discriminate real motifs from fake ones. This is done by a whole genome analysis with a Monte Carlo simulation of thousands of runs. In each run, a set of promoters are randomly selected from the genome and the occurrence of a motif is counted. A genome <it>Z</it>-score, shortened as <it>Z</it><sub><it>g</it></sub>-score, is calculated to measure the specificity of the motif to the target promoters from which it was discovered with respect to randomly selected promoters. A high positive <it>Z</it><sub><it>g</it></sub>-score is desired, as it means that the motif is unlikely to be a background word.</p>
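					<p>A minimal Python sketch of the Monte Carlo estimate follows. It makes several assumptions not fixed by the text: motifs are matched as exact strings, overlapping occurrences are counted, random promoter sets have the same size as the target set and are drawn without replacement from the whole genome (target promoters included), and the population standard deviation of the simulated counts is used. The names <it>count_occurrences</it> and <it>zg_score</it> are hypothetical.</p>
					<p>
```python
import math
import random
import statistics


def count_occurrences(motif, promoters):
    """Count exact, possibly overlapping, occurrences of motif."""
    total = 0
    for seq in promoters:
        total += sum(seq.startswith(motif, i) for i in range(len(seq)))
    return total


def zg_score(motif, target, genome, runs=1000, seed=0):
    """Z-score of the target count against random promoter sets."""
    rng = random.Random(seed)
    observed = count_occurrences(motif, target)
    samples = [count_occurrences(motif, rng.sample(genome, len(target)))
               for _ in range(runs)]
    mean = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    if sd == 0:
        # Degenerate case: every random set gave the same count.
        if observed == mean:
            return 0.0
        return math.copysign(math.inf, observed - mean)
    return (observed - mean) / sd
```
					</p>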
				</sec>
				<sec>
					<st>
						<p>Gene-expression coherence</p>
					</st>
					<p>Statistically a set of genes sharing a motif will have more similar expression profiles than a set of arbitrary genes. Therefore, we can measure the likelihood of a motif being biologically meaningful by the coherence of the expressions of all the genes whose promoters contain the motif. We use the average coherence of pairwise gene expression to measure the coherence of a set of expression profiles. We call this measure the <it>G</it>-score, where <it>G </it>stands for genes. A higher <it>G</it>-score indicates a more biologically significant motif. The pairwise gene-expression coherence can be measured in many ways, such as Euclidean distances and Pearson correlation coefficients. Here, we present our results using Pearson correlation coefficients. We have also analyzed the expression coherence score in <abbrgrp><abbr bid="B49">49</abbr></abbrgrp> and a normalized version of the <it>G</it>-score. Our results on yeast (see Additional data file 1) indicate that the simple Pearson correlation-coefficient <it>G</it>-score works slightly better than the other two.</p>
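					<p>The Pearson-based <it>G</it>-score can be sketched in Python as the mean correlation over all gene pairs. This is an illustration only: each expression profile is assumed to be an equal-length list of measurements with non-zero variance, at least two genes are assumed, and the function names are hypothetical.</p>
					<p>
```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def g_score(profiles):
    """Average pairwise Pearson correlation over all gene pairs."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)
```
					</p>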
				</sec>
			</sec>
			<sec>
				<st>
					<p>GO functional analysis</p>
				</st>
				<p>To determine whether any GO terms are enriched in a specified list of genes, we use the GO::TermFinder Perl module <abbrgrp><abbr bid="B54">54</abbr></abbrgrp> to calculate a <it>p </it>value with the cumulative hypergeometric distribution,</p>
				<p>
					<graphic file="gb-2006-7-6-r49-i28.gif"/>
				</p>
				<p>where <it>N </it>is the total number of genes, <it>M </it>is the number of genes annotated to have a specific function, <it>n </it>is the number of genes tested, and <it>k </it>is the number of tested genes that are annotated to have the specific function. The <it>p </it>values are adjusted by Bonferroni corrections for multiple tests <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. GO annotations of <it>Arabidopsis </it>were retrieved from the TAIR database (version January 2006) <abbrgrp><abbr bid="B56">56</abbr></abbrgrp>. The significantly enriched functional categories were discovered with a false-discovery rate (FDR) of less than 0.05 <abbrgrp><abbr bid="B57">57</abbr></abbrgrp>.</p>
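				<p>For illustration, the enrichment <it>p </it>value can be sketched in Python as the upper tail of the hypergeometric distribution, using the symbols defined above. The sketch assumes the usual enrichment form (the probability of observing <it>k </it>or more annotated genes among the <it>n </it>tested) and does not reproduce the GO::TermFinder implementation; the function names are hypothetical.</p>
				<p>
```python
from math import comb


def hypergeom_pvalue(N, M, n, k):
    """P(X >= k) when n genes are drawn from N, of which M are annotated."""
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)


def bonferroni(p, num_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * num_tests)
```
				</p>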
			</sec>
			<sec>
				<st>
					<p>WordSpy webserver</p>
				</st>
				<p>An online server has been set up for the WordSpy algorithm to support direct access to the software at <abbrgrp><abbr bid="B58">58</abbr></abbrgrp>.</p>
				<suppl id="S1">
					<title>
						<p>Additional file 1</p>
					</title>
					<caption>
						<p/>
					</caption>
					<text>
						<p/>
					</text>
					<file name="gb-2006-7-6-r49-S1.pdf">
						<p>Click here for file</p>
					</file>
				</suppl>
				<suppl id="S2">
					<title>
						<p>Additional file 2</p>
					</title>
					<caption>
						<p/>
					</caption>
					<text>
						<p/>
					</text>
					<file name="gb-2006-7-6-r49-S2.xls">
						<p>Click here for file</p>
					</file>
				</suppl>
				<suppl id="S3">
					<title>
						<p>Additional file 3</p>
					</title>
					<caption>
						<p/>
					</caption>
					<text>
						<p/>
					</text>
					<file name="gb-2006-7-6-r49-S3.xls">
						<p>Click here for file</p>
					</file>
				</suppl>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Additional data files</p>
			</st>
			<p>Additional data are available with this article. Additional data file <supplr sid="S1">1</supplr> contains supplementary material; Additional data file <supplr sid="S2">2</supplr> contains <it>Arabidopsis </it>cell-cycle motifs; Additional data file <supplr sid="S3">3</supplr> contains evaluation results on the benchmark.</p>
			<tbl id="T3">
				<title>
					<p>Table 3</p>
				</title>
				<caption>
					<p>Discovered motifs using positive and negative data</p>
				</caption>
				<tblbdy cols="8">
					<r>
						<c ca="left">
							<p>
								<b>Transcription factors</b>
							</p>
						</c>
						<c ca="left">
							<p>
								<b>Known motifs</b>
							</p>
						</c>
						<c ca="center" cspan="4">
							<p>
								<b>WordSpy</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>MEME</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>AlignACE</b>
							</p>
						</c>
					</r>
					<r>
						<c cspan="8">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>ACE2</p>
						</c>
						<c ca="left">
							<p>CCAGCA</p>
						</c>
						<c ca="left">
							<p>GCTGG(1)</p>
						</c>
						<c ca="left">
							<p>CCAGC(2)</p>
						</c>
						<c ca="left">
							<p>GCTGGC(1)</p>
						</c>
						<c ca="left">
							<p>AACCAGC(2)</p>
						</c>
						<c ca="left">
							<p>AACCAGCA(7)</p>
						</c>
						<c ca="left">
							<p>AACCAGC(12)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Fkh1</p>
						</c>
						<c ca="left">
							<p>GTAAACA</p>
						</c>
						<c ca="left">
							<p>GTAAACA(1)</p>
						</c>
						<c ca="left">
							<p>TGTTTAC(2)</p>
						</c>
						<c ca="left">
							<p>GTAAACAA(1)</p>
						</c>
						<c ca="left">
							<p>TTGTTTAC(2)</p>
						</c>
						<c ca="left">
							<p>GTAAACAA(1)</p>
						</c>
						<c ca="left">
							<p>AAANGTAAACA(5)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Fkh2</p>
						</c>
						<c ca="left">
							<p>GTAAACA</p>
						</c>
						<c ca="left">
							<p>GTAAACA(1)</p>
						</c>
						<c ca="left">
							<p>TGTTTAC(2)</p>
						</c>
						<c ca="left">
							<p>GTAAACAA(1)</p>
						</c>
						<c ca="left">
							<p>TTGTTTAC(2)</p>
						</c>
						<c ca="left">
							<p>TTGTTTAC(1)</p>
						</c>
						<c ca="left">
							<p>AANRWAAACA(3)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Mbp1</p>
						</c>
						<c ca="left">
							<p>ACGCGT</p>
						</c>
						<c ca="left">
							<p>ACGCGT(1)</p>
						</c>
						<c ca="left">
							<p>AACGCGT(1)</p>
						</c>
						<c ca="left">
							<p>ACGCGTT(2)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>AACGCGTT(2)</p>
						</c>
						<c ca="left">
							<p>RACGCGWY(3)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>CRCGAAA</p>
						</c>
						<c ca="left">
							<p>GACGCGA(3)</p>
						</c>
						<c ca="left">
							<p>TCGCGTC(5)</p>
						</c>
						<c ca="left">
							<p>ACGCGAA(6)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
						<c ca="left">
							<p>ACGCGWAAAA(9)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Mcm1</p>
						</c>
						<c ca="left">
							<p>TTTCCTAATTAGGAAA</p>
						</c>
						<c ca="left">
							<p>TAGGAAA(1)</p>
						</c>
						<c ca="left">
							<p>TTTCCTAA(9)</p>
						</c>
						<c ca="left">
							<p>TTAGGAAA(10)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>CCTAATTAGG(1)</p>
						</c>
						<c ca="left">
							<p>TTNCCNNNTNNGGAAA(1)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Met4</p>
						</c>
						<c ca="left">
							<p>TCACGTG</p>
						</c>
						<c ca="left">
							<p>CACGTGA(1)</p>
						</c>
						<c ca="left">
							<p>TCACGTG(2)</p>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>CACGTGA(1)</p>
						</c>
						<c ca="left">
							<p>CACGTGAY(2)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>AAACTGTGG</p>
						</c>
						<c ca="left">
							<p>GTGGC(1)</p>
						</c>
						<c ca="left">
							<p>CCACA(3)</p>
						</c>
						<c ca="left">
							<p>TGTGG(5)</p>
						</c>
						<c ca="left">
							<p>CTGTG(6)</p>
						</c>
						<c ca="left">
							<p>CCACAGTT(3)</p>
						</c>
						<c ca="left">
							<p>AAACTGTGG(4)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>TGTGGC(2)</p>
						</c>
						<c ca="left">
							<p>CCACAGT(3)</p>
						</c>
						<c ca="left">
							<p>GCCACAC(4)</p>
						</c>
						<c ca="left">
							<p>ACTGTGG(5)</p>
						</c>
						<c ca="left">
							<p>AACTGTGG(7)</p>
						</c>
						<c>
							<p/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Met31</p>
						</c>
						<c ca="left">
							<p>AAACTGTGG</p>
						</c>
						<c ca="left">
							<p>TGTGGC(1)</p>
						</c>
						<c ca="left">
							<p>GCCACA(2)</p>
						</c>
						<c ca="left">
							<p>GCCACAC(2)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>TGTGGCG(10)</p>
						</c>
						<c ca="left">
							<p>AAAANTGTGGC(4)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>TCACGTG</p>
						</c>
						<c ca="left">
							<p>CACGTGA(1)</p>
						</c>
						<c ca="left">
							<p>TCACGTG(3)</p>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>GCACGTGA(2)</p>
						</c>
						<c ca="left">
							<p>CACGTGANNT(7)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Stb1</p>
						</c>
						<c ca="left">
							<p>ACGCGA</p>
						</c>
						<c ca="left">
							<p>AACGCG(4)</p>
						</c>
						<c ca="left">
							<p>TCGCGTT(3)</p>
						</c>
						<c ca="left">
							<p>TCGCGTT(3)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>TTCGCGTT(3)</p>
						</c>
						<c ca="left">
							<p>AACGCSAAAA(3)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>CRCGAAA</p>
						</c>
						<c ca="left">
							<p>TTCGCG(1)</p>
						</c>
						<c ca="left">
							<p>TTTCGCG(1)</p>
						</c>
						<c ca="left">
							<p>TTTGGCG(2)</p>
						</c>
						<c ca="left">
							<p>TTTCGTG(5)</p>
						</c>
						<c ca="left">
							<p>CGCGAAAA(1)</p>
						</c>
						<c ca="left">
							<p>AACGCSAAAA(3)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>ACGCGT</p>
						</c>
						<c ca="left">
							<p>ACGCGT(3)</p>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Ste12</p>
						</c>
						<c ca="left">
							<p>TGAAACA</p>
						</c>
						<c ca="left">
							<p>TGAAACA(1)</p>
						</c>
						<c ca="left">
							<p>ATGAAAC(2)</p>
						</c>
						<c ca="left">
							<p>TGAAACAA(2)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>TGAAACA(2)</p>
						</c>
						<c ca="left">
							<p>ATGMAAC(13)</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Swi4</p>
						</c>
						<c ca="left">
							<p>CGCGAAA</p>
						</c>
						<c ca="left">
							<p>ACGCGAA(1)</p>
						</c>
						<c ca="left">
							<p>GACGCGA(2)</p>
						</c>
						<c ca="left">
							<p>AAACGCG(3)</p>
						</c>
						<c ca="left">
							<p>CACGAAA(7)</p>
						</c>
						<c ca="left">
							<p>GACGCGAA(1)</p>
						</c>
						<c ca="left">
							<p>RACGCGAAAA(2)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>ACGCGT</p>
						</c>
						<c ca="left">
							<p>AACGCGT(10)</p>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Swi5</p>
						</c>
						<c ca="left">
							<p>CCAGCA</p>
						</c>
						<c ca="left">
							<p>GCTGG(1)</p>
						</c>
						<c ca="left">
							<p>CCAGC(2)</p>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
						<c ca="left">
							<p>n/a</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>Swi6</p>
						</c>
						<c ca="left">
							<p>ACGCGT</p>
						</c>
						<c ca="left">
							<p>ACGCGT(1)</p>
						</c>
						<c ca="left">
							<p>AACGCGT(2)</p>
						</c>
						<c ca="left">
							<p>ACGCGTT(3)</p>
						</c>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>AACGCGTT(2)</p>
						</c>
						<c ca="left">
							<p>AAACGCGW(4)</p>
						</c>
					</r>
					<r>
						<c>
							<p/>
						</c>
						<c ca="left">
							<p>ACGCGA</p>
						</c>
						<c ca="left">
							<p>AAACGCG(5)</p>
						</c>
						<c ca="left">
							<p>CGCGTTT(6)</p>
						</c>
						<c ca="left">
							<p>ACGCGAA(10)</p>
						</c>
						<c ca="left">
							<p>TTCGCGT(12)</p>
						</c>
						<c ca="left">
							<p>TTTCGCG(3)</p>
						</c>
						<c ca="left">
							<p>AAACGCGW(4)</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>For each of the 12 yeast cell-cycle TFs, the table lists the motifs found by the three algorithms that are closest to the known regulatory motifs. Promoters were chosen based on the ChIP-chip experiments of Lee <it>et al</it>. [38]. The ranking of each motif within the output of the corresponding algorithm is given in parentheses. For WordSpy, motifs are ranked among words of the same length.</p>
				</tblfn>
			</tbl>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We are grateful to Gary Stormo and Hao Li for insightful comments on this work. Thanks to Harmen Bussemaker, Hao Li, and Eric Siggia for their MobyDick program. Thanks to Tompa <it>et al</it>. <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> for a motif-finding algorithm assessment benchmark and the corresponding website used in our study. The research was funded in part by NSF grants EIA-0113618 and IIS-0535257.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Orchestrated response: A symphony of transcription factors for gene control.</p>
				</title>
				<aug>
					<au>
						<snm>Lemon</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Tjian</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Genes Dev</source>
				<pubdate>2000</pubdate>
				<volume>14</volume>
				<fpage>2551</fpage>
				<lpage>2569</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11040209</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>Genome-wide discovery of transcriptional modules from DNA sequence and gene expression.</p>
				</title>
				<aug>
					<au>
						<snm>Segal</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Yelensky</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Koller</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19 Suppl 1</volume>
				<fpage>273</fpage>
				<lpage>282</lpage>
			</bibl>
			<bibl id="B3">
				<title>
					<p>Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection.</p>
				</title>
				<aug>
					<au>
						<snm>Tamada</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Bannai</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Imoto</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Tashiro</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Kuhara</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Miyano</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19 Suppl 2</volume>
				<fpage>II227</fpage>
				<lpage>II236</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">14534194</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.</p>
				</title>
				<aug>
					<au>
						<snm>Lawrence</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Altschul</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Boguski</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Liu</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Neuwald</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Wootton</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1993</pubdate>
				<volume>262</volume>
				<fpage>208</fpage>
				<lpage>214</lpage>
				<xrefbib>
					<pubid idtype="pmpid">8211139</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>Unsupervised learning of multiple motifs in biopolymers using EM.</p>
				</title>
				<aug>
					<au>
						<snm>Bailey</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Elkan</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Machine Learning</source>
				<pubdate>1995</pubdate>
				<volume>21</volume>
				<fpage>51</fpage>
				<lpage>80</lpage>
			</bibl>
			<bibl id="B6">
				<title>
					<p>Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.</p>
				</title>
				<aug>
					<au>
						<snm>Hertz</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Stormo</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>1999</pubdate>
				<volume>15</volume>
				<fpage>563</fpage>
				<lpage>577</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10487864</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>Computational identification of <it>cis</it>-regulatory elements associated with groups of functionally related genes in <it>Saccharomyces cerevisiae</it>.</p>
				</title>
				<aug>
					<au>
						<snm>Hughes</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Estep</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Tavazoie</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Church</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>J Mol Biol</source>
				<pubdate>2000</pubdate>
				<volume>296</volume>
				<fpage>1205</fpage>
				<lpage>1214</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10698627</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation.</p>
				</title>
				<aug>
					<au>
						<snm>Sinha</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Tompa</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2003</pubdate>
				<volume>31</volume>
				<fpage>3586</fpage>
				<lpage>3588</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">169024</pubid>
						<pubid idtype="pmpid" link="fulltext">12824371</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Discovery of conserved sequence patterns using a stochastic dictionary model.</p>
				</title>
				<aug>
					<au>
						<snm>Gupta</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Liu</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>J Am Stat Assoc</source>
				<pubdate>2003</pubdate>
				<volume>98</volume>
				<fpage>55</fpage>
				<lpage>66</lpage>
			</bibl>
			<bibl id="B10">
				<title>
					<p>Large scale gene expression data analysis: a new challenge to computational biologists.</p>
				</title>
				<aug>
					<au>
						<snm>Zhang</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Genome Res</source>
				<pubdate>1999</pubdate>
				<volume>9</volume>
				<fpage>681</fpage>
				<lpage>688</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10447504</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Sequencing and comparison of yeast species to identify genes and regulatory elements.</p>
				</title>
				<aug>
					<au>
						<snm>Kellis</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Patterson</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Endrizzi</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Birren</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2003</pubdate>
				<volume>423</volume>
				<fpage>241</fpage>
				<lpage>254</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12748633</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Human-mouse genome comparisons to locate regulatory sites.</p>
				</title>
				<aug>
					<au>
						<snm>Wasserman</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Palumbo</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Thompson</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Fickett</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Lawrence</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>2000</pubdate>
				<volume>26</volume>
				<fpage>225</fpage>
				<lpage>228</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11017083</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<aug>
					<au>
						<snm>Wayner</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Disappearing Cryptography</source>
				<publisher>San Francisco, California: Morgan Kaufmann</publisher>
				<edition>2</edition>
				<pubdate>2002</pubdate>
			</bibl>
			<bibl id="B14">
				<aug>
					<au>
						<snm>Durbin</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Eddy</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Krogh</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Mitchison</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids</source>
				<publisher>Cambridge: Cambridge University Press</publisher>
				<pubdate>1998</pubdate>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis.</p>
				</title>
				<aug>
					<au>
						<snm>Bussemaker</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Siggia</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>2000</pubdate>
				<volume>97</volume>
				<fpage>10096</fpage>
				<lpage>10100</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">27717</pubid>
						<pubid idtype="pmpid" link="fulltext">10944202</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>DNA binding sites: representation and discovery.</p>
				</title>
				<aug>
					<au>
						<snm>Stormo</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2000</pubdate>
				<volume>16</volume>
				<fpage>16</fpage>
				<lpage>23</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10812473</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B17">
				<title>
					<p>Assessing computational tools for the discovery of transcription factor binding sites.</p>
				</title>
				<aug>
					<au>
						<snm>Tompa</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Bailey</snm>
						<fnm>TL</fnm>
					</au>
					<au>
						<snm>Church</snm>
						<fnm>GM</fnm>
					</au>
					<au>
						<snm>De Moor</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Eskin</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Favorov</snm>
						<fnm>AV</fnm>
					</au>
					<au>
						<snm>Frith</snm>
						<fnm>MC</fnm>
					</au>
					<au>
						<snm>Fu</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Kent</snm>
						<fnm>WJ</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nat Biotechnol</source>
				<pubdate>2005</pubdate>
				<volume>23</volume>
				<fpage>137</fpage>
				<lpage>144</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15637633</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B18">
				<aug>
					<au>
						<snm>Hopcroft</snm>
						<fnm>JE</fnm>
					</au>
					<au>
						<snm>Motwani</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Ullman</snm>
						<fnm>JD</fnm>
					</au>
				</aug>
				<source>Introduction to Automata Theory, Languages, and Computation</source>
				<publisher>Reading, MA: Addison-Wesley</publisher>
				<edition>2</edition>
				<pubdate>2000</pubdate>
			</bibl>
			<bibl id="B19">
				<title>
					<p>Comprehensive identification of cell cycle-regulated genes of the yeast <it>Saccharomyces cerevisiae </it>by microarray hybridization.</p>
				</title>
				<aug>
					<au>
						<snm>Spellman</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Iyer</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Anders</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Eisen</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Brown</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Futcher</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Mol Biol Cell</source>
				<pubdate>1998</pubdate>
				<volume>9</volume>
				<fpage>3273</fpage>
				<lpage>3297</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">25624</pubid>
						<pubid idtype="pmpid" link="fulltext">9843569</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B20">
				<title>
					<p>A web site for the computational analysis of yeast regulatory sequences.</p>
				</title>
				<aug>
					<au>
						<snm>van Helden</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Andre</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Collado-Vides</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Yeast</source>
				<pubdate>2000</pubdate>
				<volume>16</volume>
				<fpage>177</fpage>
				<lpage>187</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10641039</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B21">
				<title>
					<p>Discovering regulatory elements in noncoding sequences by analysis of spaced dyads.</p>
				</title>
				<aug>
					<au>
						<snm>van Helden</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Rios</snm>
						<fnm>AF</fnm>
					</au>
					<au>
						<snm>Collado-Vides</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2000</pubdate>
				<volume>28</volume>
				<fpage>1808</fpage>
				<lpage>1818</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">102821</pubid>
						<pubid idtype="pmpid" link="fulltext">10734201</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B22">
				<title>
					<p>Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes.</p>
				</title>
				<aug>
					<au>
						<snm>Pavesi</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Mereghetti</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Mauri</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Pesole</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2004</pubdate>
				<issue>32 Web Server</issue>
				<fpage>W199</fpage>
				<lpage>W203</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">441603</pubid>
						<pubid idtype="pmpid" link="fulltext">15215380</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B23">
				<title>
					<p>A gene coexpression network for global discovery of conserved genetic modules.</p>
				</title>
				<aug>
					<au>
						<snm>Stuart</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Segal</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Koller</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Kim</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2003</pubdate>
				<volume>302</volume>
				<fpage>249</fpage>
				<lpage>255</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12934013</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B24">
				<title>
					<p>A role for the transcription factors Mbp1 and Swi4 in progression from G1 to S phase.</p>
				</title>
				<aug>
					<au>
						<snm>Koch</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Moll</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Neuberg</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Ahorn</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Nasmyth</snm>
						<fnm>K</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>1993</pubdate>
				<volume>261</volume>
				<fpage>1551</fpage>
				<lpage>1557</lpage>
				<xrefbib>
					<pubid idtype="pmpid">8372350</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B25">
				<title>
					<p>Identifying combinatorial regulation of transcription factors and binding motifs.</p>
				</title>
				<aug>
					<au>
						<snm>Kato</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hata</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Banerjee</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Futcher</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Genome Biol</source>
				<pubdate>2004</pubdate>
				<volume>5</volume>
				<fpage>R56</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">507881</pubid>
						<pubid idtype="pmpid" link="fulltext">15287978</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B26">
				<title>
					<p>Forkhead genes in transcriptional silencing, cell morphology and the cell cycle: overlapping and distinct functions for FKH1 and FKH2 in <it>Saccharomyces cerevisiae</it>.</p>
				</title>
				<aug>
					<au>
						<snm>Hollenhorst</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Bose</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mielke</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>M&#252;ller</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Fox</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Genetics</source>
				<pubdate>2000</pubdate>
				<volume>154</volume>
				<fpage>1533</fpage>
				<lpage>1548</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10747051</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B27">
				<title>
					<p>Why should we study the plant cell cycle?</p>
				</title>
				<aug>
					<au>
						<snm>Inz&#233;</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>J Exp Bot</source>
				<pubdate>2003</pubdate>
				<volume>54</volume>
				<fpage>1125</fpage>
				<lpage>1126</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12654862</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B28">
				<title>
					<p>Genome-wide gene expression in <it>Arabidopsis </it>cell suspension.</p>
				</title>
				<aug>
					<au>
						<snm>Menges</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hennig</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Gruissem</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Murray</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Plant Mol Biol</source>
				<pubdate>2003</pubdate>
				<volume>53</volume>
				<fpage>423</fpage>
				<lpage>442</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15010610</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B29">
				<title>
					<p>TAIR database</p>
				</title>
				<url>http://www.arabidopsis.org</url>
			</bibl>
			<bibl id="B30">
				<title>
					<p>A gene expression map of <it>Arabidopsis thaliana </it>development.</p>
				</title>
				<aug>
					<au>
						<snm>Schmid</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Davison</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Henz</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Pape</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Demar</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Vingron</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Scholkopf</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Weigel</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Lohmann</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>2005</pubdate>
				<volume>37</volume>
				<fpage>501</fpage>
				<lpage>506</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15806101</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B31">
				<title>
					<p>Plant <it>cis</it>-acting regulatory DNA elements (PLACE) database.</p>
				</title>
				<aug>
					<au>
						<snm>Higo</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Ugawa</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Iwamoto</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Korenaga</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>1999</pubdate>
				<volume>27</volume>
				<fpage>297</fpage>
				<lpage>300</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">148163</pubid>
						<pubid idtype="pmpid" link="fulltext">9847208</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B32">
				<title>
					<p>PlantCARE, a database of plant <it>cis</it>-acting regulatory elements and a portal to tools for <it>in silico</it> analysis of promoter sequences.</p>
				</title>
				<aug>
					<au>
						<snm>Lescot</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Dehais</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Thijs</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Marchal</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Moreau</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>van de Peer</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Rouze</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Rombauts</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2002</pubdate>
				<volume>30</volume>
				<fpage>325</fpage>
				<lpage>327</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">99092</pubid>
						<pubid idtype="pmpid" link="fulltext">11752327</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B33">
				<title>
					<p>A novel <it>cis</it>-acting element in promoters of plant B-type cyclin genes activates M phase specific transcription.</p>
				</title>
				<aug>
					<au>
						<snm>Ito</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Iwase</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Kodama</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Lavisse</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Komamine</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Nishihama</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Machida</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Watanabe</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Plant Cell</source>
				<pubdate>1998</pubdate>
				<volume>10</volume>
				<fpage>331</fpage>
				<lpage>341</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">144003</pubid>
						<pubid idtype="pmpid" link="fulltext">9501108</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B34">
				<title>
					<p>Cell cycle-regulated gene expression in <it>Arabidopsis</it>.</p>
				</title>
				<aug>
					<au>
						<snm>Menges</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Hennig</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Gruissem</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Murray</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>J Biol Chem</source>
				<pubdate>2002</pubdate>
				<volume>277</volume>
				<fpage>41987</fpage>
				<lpage>42002</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12169696</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B35">
				<title>
					<p>Nucleotide sequences of two corn histone H3 genes. Genomic organization of the corn histone H3 and H4 genes.</p>
				</title>
				<aug>
					<au>
						<snm>Chaubet</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Philipps</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Chaboute</snm>
						<fnm>ME</fnm>
					</au>
					<au>
						<snm>Ehling</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Gigot</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Plant Mol Biol</source>
				<pubdate>1986</pubdate>
				<volume>6</volume>
				<fpage>253</fpage>
				<lpage>263</lpage>
			</bibl>
			<bibl id="B36">
				<title>
					<p>The Gene Ontology (GO) project in 2006.</p>
				</title>
				<aug>
					<au>
						<snm>Harris</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Clark</snm>
						<fnm>JI</fnm>
					</au>
					<au>
						<snm>Ireland</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Lomax</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Ashburner</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Collins</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Eilbeck</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Lewis</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Mungall</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Richter</snm>
						<fnm>J</fnm>
					</au>
					<etal/>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2006</pubdate>
				<volume>34</volume>
				<issue>Database issue</issue>
				<fpage>D322</fpage>
				<lpage>D326</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1347384</pubid>
						<pubid idtype="pmpid" link="fulltext">16381878</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B37">
				<title>
					<p>A genome-wide identification of E2F-regulated genes in <it>Arabidopsis</it>.</p>
				</title>
				<aug>
					<au>
						<snm>Ramirez-Parra</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Fr&#252;ndt</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Gutierrez</snm>
						<fnm>C</fnm>
					</au>
				</aug>
				<source>Plant J</source>
				<pubdate>2003</pubdate>
				<volume>33</volume>
				<fpage>801</fpage>
				<lpage>811</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12609051</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B38">
				<title>
					<p>Transcriptional regulatory networks in <it>Saccharomyces cerevisiae</it>.</p>
				</title>
				<aug>
					<au>
						<snm>Lee</snm>
						<fnm>TI</fnm>
					</au>
					<au>
						<snm>Rinaldi</snm>
						<fnm>NJ</fnm>
					</au>
					<au>
						<snm>Robert</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Odom</snm>
						<fnm>DT</fnm>
					</au>
					<au>
						<snm>Bar-Joseph</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Gerber</snm>
						<fnm>GK</fnm>
					</au>
					<au>
						<snm>Hannett</snm>
						<fnm>NM</fnm>
					</au>
					<au>
						<snm>Harbison</snm>
						<fnm>CT</fnm>
					</au>
					<au>
						<snm>Thompson</snm>
						<fnm>CM</fnm>
					</au>
					<au>
						<snm>Simon</snm>
						<fnm>I</fnm>
					</au>
					<etal/>
				</aug>
				<source>Science</source>
				<pubdate>2002</pubdate>
				<volume>298</volume>
				<fpage>799</fpage>
				<lpage>804</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12399584</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B39">
				<title>
					<p>Rare events and conditional events on random strings.</p>
				</title>
				<aug>
					<au>
						<snm>R&#233;gnier</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Denise</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Discrete Math Theor Comput Sci</source>
				<pubdate>2004</pubdate>
				<volume>6</volume>
				<fpage>191</fpage>
				<lpage>214</lpage>
			</bibl>
			<bibl id="B40">
				<title>
					<p>ANN-Spec: a method for discovering transcription factor binding sites with improved specificity.</p>
				</title>
				<aug>
					<au>
						<snm>Workman</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Stormo</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Pac Symp Biocomput</source>
				<pubdate>2000</pubdate>
				<volume>5</volume>
				<fpage>464</fpage>
				<lpage>475</lpage>
			</bibl>
			<bibl id="B41">
				<title>
					<p>Finding composite regulatory patterns in DNA sequences.</p>
				</title>
				<aug>
					<au>
						<snm>Eskin</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Pevzner</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<issue>Suppl 1</issue>
				<fpage>S354</fpage>
				<lpage>S363</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">12169566</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B42">
				<title>
					<p>Finding functional sequence elements by multiple local alignment.</p>
				</title>
				<aug>
					<au>
						<snm>Frith</snm>
						<fnm>MC</fnm>
					</au>
					<au>
						<snm>Hansen</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Spouge</snm>
						<fnm>JL</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>Z</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2004</pubdate>
				<volume>32</volume>
				<fpage>189</fpage>
				<lpage>200</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">373279</pubid>
						<pubid idtype="pmpid" link="fulltext">14704356</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B43">
				<title>
					<p>Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR.</p>
				</title>
				<aug>
					<au>
						<snm>Ao</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Gaudet</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Kent</snm>
						<fnm>WJ</fnm>
					</au>
					<au>
						<snm>Muttumu</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Mango</snm>
						<fnm>SE</fnm>
					</au>
				</aug>
				<source>Science</source>
				<pubdate>2004</pubdate>
				<volume>305</volume>
				<fpage>1743</fpage>
				<lpage>1746</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15375261</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B44">
				<title>
					<p>A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling.</p>
				</title>
				<aug>
					<au>
						<snm>Thijs</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Lescot</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Marchal</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Rombauts</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>De Moor</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Rouze</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Moreau</snm>
						<fnm>Y</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>1113</fpage>
				<lpage>1122</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11751219</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B45">
				<title>
					<p>A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length.</p>
				</title>
				<aug>
					<au>
						<snm>Favorov</snm>
						<fnm>AV</fnm>
					</au>
					<au>
						<snm>Gelfand</snm>
						<fnm>MS</fnm>
					</au>
					<au>
						<snm>Gerasimova</snm>
						<fnm>AV</fnm>
					</au>
					<au>
						<snm>Ravcheev</snm>
						<fnm>DA</fnm>
					</au>
					<au>
						<snm>Mironov</snm>
						<fnm>AA</fnm>
					</au>
					<au>
						<snm>Makeev</snm>
						<fnm>VJ</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2005</pubdate>
				<volume>21</volume>
				<fpage>2240</fpage>
				<lpage>2245</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15728117</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B46">
				<title>
					<p>Assessment Statistics</p>
				</title>
				<url>http://bio.cs.washington.edu/assessment/statistics.html</url>
			</bibl>
			<bibl id="B47">
				<title>
					<p>Detection of <it>cis</it>-element clusters in higher eukaryotic DNA.</p>
				</title>
				<aug>
					<au>
						<snm>Frith</snm>
						<fnm>MC</fnm>
					</au>
					<au>
						<snm>Hansen</snm>
						<fnm>U</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>Z</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2001</pubdate>
				<volume>17</volume>
				<fpage>878</fpage>
				<lpage>889</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11673232</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B48">
				<title>
					<p>A probabilistic method to detect regulatory modules.</p>
				</title>
				<aug>
					<au>
						<snm>Sinha</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>van Nimwegen</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Siggia</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2003</pubdate>
				<volume>19</volume>
				<issue>Suppl 1</issue>
				<fpage>292</fpage>
				<lpage>301</lpage>
			</bibl>
			<bibl id="B49">
				<title>
					<p>Identifying regulatory networks by combinatorial analysis of promoter elements.</p>
				</title>
				<aug>
					<au>
						<snm>Pilpel</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Sudarsanam</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Church</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Nat Genet</source>
				<pubdate>2001</pubdate>
				<volume>29</volume>
				<fpage>153</fpage>
				<lpage>159</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">11547334</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B50">
				<title>
					<p>Computational methods for transcriptional regulation.</p>
				</title>
				<aug>
					<au>
						<snm>Siggia</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Curr Opin Genet Dev</source>
				<pubdate>2005</pubdate>
				<volume>15</volume>
				<fpage>214</fpage>
				<lpage>221</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15797205</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B51">
				<title>
					<p>A unified approach to word statistics.</p>
				</title>
				<aug>
					<au>
						<snm>R&#233;gnier</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>RECOMB (Proceedings of the Second Annual International Conference on Research in Computational Molecular Biology)</source>
				<pubdate>1998</pubdate>
				<fpage>207</fpage>
				<lpage>213</lpage>
				<note>[DOI: 10.1145/279069.279116]</note>
			</bibl>
			<bibl id="B52">
				<title>
					<p>Probabilistic and statistical properties of words: an overview.</p>
				</title>
				<aug>
					<au>
						<snm>Reinert</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Schbath</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Waterman</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>J Comput Biol</source>
				<pubdate>2000</pubdate>
				<volume>7</volume>
				<fpage>1</fpage>
				<lpage>46</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10890386</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B53">
				<title>
					<p>Maximum likelihood from incomplete data via the EM algorithm.</p>
				</title>
				<aug>
					<au>
						<snm>Dempster</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Laird</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Rubin</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>J R Stat Soc</source>
				<pubdate>1977</pubdate>
				<volume>39</volume>
				<fpage>1</fpage>
				<lpage>38</lpage>
			</bibl>
			<bibl id="B54">
				<title>
					<p>GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes.</p>
				</title>
				<aug>
					<au>
						<snm>Boyle</snm>
						<fnm>EI</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Gollub</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Jin</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Botstein</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Cherry</snm>
						<fnm>JM</fnm>
					</au>
					<au>
						<snm>Sherlock</snm>
						<fnm>G</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2004</pubdate>
				<volume>20</volume>
				<fpage>3710</fpage>
				<lpage>3715</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15297299</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B55">
				<aug>
					<au>
						<snm>Sokal</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Rohlf</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>Biometry: The Principles and Practice of Statistics in Biological Research</source>
				<publisher>New York: Freeman</publisher>
				<edition>3</edition>
				<pubdate>1995</pubdate>
			</bibl>
			<bibl id="B56">
				<title>
					<p>Functional annotation of the <it>Arabidopsis</it> genome using controlled vocabularies.</p>
				</title>
				<aug>
					<au>
						<snm>Berardini</snm>
						<fnm>TZ</fnm>
					</au>
					<au>
						<snm>Mundodi</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Reiser</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Huala</snm>
						<fnm>E</fnm>
					</au>
					<au>
						<snm>Garcia-Hernandez</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Mueller</snm>
						<fnm>LA</fnm>
					</au>
					<au>
						<snm>Yoon</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Doyle</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>G</fnm>
					</au>
					<etal/>
				</aug>
				<source>Plant Physiol</source>
				<pubdate>2004</pubdate>
				<volume>135</volume>
				<fpage>745</fpage>
				<lpage>755</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">514112</pubid>
						<pubid idtype="pmpid" link="fulltext">15173566</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B57">
				<title>
					<p>Controlling the false discovery rate: a practical and powerful approach to multiple testing.</p>
				</title>
				<aug>
					<au>
						<snm>Benjamini</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Hochberg</snm>
						<fnm>Y</fnm>
					</au>
				</aug>
				<source>J R Stat Soc</source>
				<pubdate>1995</pubdate>
				<volume>57</volume>
				<fpage>289</fpage>
				<lpage>300</lpage>
			</bibl>
			<bibl id="B58">
				<title>
					<p>WordSpy</p>
				</title>
				<url>http://cic.cs.wustl.edu/wordspy</url>
			</bibl>
			<bibl id="B59">
				<title>
					<p>Role of negative regulation in promoter specificity of the homologous transcriptional activators Ace2p and Swi5p.</p>
				</title>
				<aug>
					<au>
						<snm>Dohrmann</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Voth</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Stillman</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Mol Cell Biol</source>
				<pubdate>1996</pubdate>
				<volume>16</volume>
				<fpage>1746</fpage>
				<lpage>1758</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">231161</pubid>
						<pubid idtype="pmpid" link="fulltext">8657150</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B60">
				<title>
					<p>SCPD: a promoter database of yeast <it>Saccharomyces cerevisiae</it>.</p>
				</title>
				<aug>
					<au>
						<snm>Zhu</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Zhang</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>1999</pubdate>
				<volume>15</volume>
				<fpage>607</fpage>
				<lpage>611</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">10487868</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B61">
				<title>
					<p>The yeast STE12 protein binds to the DNA sequence mediating pheromone induction.</p>
				</title>
				<aug>
					<au>
						<snm>Dolan</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Kirkman</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Fields</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Proc Natl Acad Sci USA</source>
				<pubdate>1989</pubdate>
				<volume>86</volume>
				<fpage>5703</fpage>
				<lpage>5707</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">297698</pubid>
						<pubid idtype="pmpid" link="fulltext">2668945</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B62">
				<title>
					<p>Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA.</p>
				</title>
				<aug>
					<au>
						<snm>Blaiseau</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Thomas</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>EMBO J</source>
				<pubdate>1998</pubdate>
				<volume>17</volume>
				<fpage>6327</fpage>
				<lpage>6336</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1170957</pubid>
						<pubid idtype="pmpid" link="fulltext">9799240</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
