<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2164-9-S1-S2</ui>
	<ji>1471-2164</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>Prediction-based approaches to characterize bidirectional promoters in the mammalian genome</p>
			</title>
			<aug>
				<au id="A1">
					<snm>Yang</snm>
					<fnm>Mary Qu</fnm>
					<insr iid="I1"/>
					<email>yangma@mail.nih.gov</email>
				</au>
				<au id="A2" ca="yes">
					<snm>Elnitski</snm>
					<mi>L</mi>
					<fnm>Laura</fnm>
					<insr iid="I1"/>
					<email>elnitski@mail.nih.gov</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>National Human Genome Research Institute, National Institutes of Health, US Department of Health and Human Services, Bethesda, MD 20892, USA</p>
				</ins>
			</insg>
			<source>BMC Genomics</source>
			<supplement>
				<title>
					<p>The 2007 International Conference on Bioinformatics &amp; Computational Biology (BIOCOMP'07)</p>
				</title>
				<editor>Jack Y Yang, Mary Qu Yang, Mengxia (Michelle) Zhu, Youping Deng and Hamid R Arabnia</editor>
				<note>Research</note>
			</supplement>
			<conference>
				<title>
					<p>The 2007 International Conference on Bioinformatics &amp; Computational Biology (BIOCOMP'07)</p>
				</title>
				<location>Las Vegas, NV, USA</location>
				<date-range>25-28 June 2007</date-range>
				<url>http://www.world-academy-of-science.org</url>
			</conference>
			<issn>1471-2164</issn>
			<pubdate>2008</pubdate>
			<volume>9</volume>
			<issue>Suppl 1</issue>
			<fpage>S2</fpage>
			<url>http://www.biomedcentral.com/1471-2164/9/S1/S2</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">18366609</pubid><pubid idtype="doi">10.1186/1471-2164-9-S1-S2</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>20</day>
					<month>03</month>
					<year>2008</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2008</year>
			<collab>Yang and Elnitski; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>Machine learning approaches are emerging as a way to discriminate various classes of functional elements. Previous attempts to create Regulatory Potential (RP) scores to discriminate functional DNA from nonfunctional DNA used Markov models trained to distinguish promoter and enhancer sequences from ancestral repeats. We proposed that knowledge gleaned from those methods could be further refined using a multiple class predictor to separate classes of promoter elements from enhancers or nonfunctional DNA.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We extended our previous work, which identified over 5,000 candidate bidirectional promoters in the human genome, to map the orthologous promoter regions in the mouse genome. Our algorithm measured the robustness of evidence provided by the spliced EST annotations and incorporated evidence from annotations of UCSC Known Genes and GenBank mRNA. In preparation for de novo prediction of this promoter type, we examined characteristic features of the dataset as a whole. For instance, bidirectional promoters score very highly among all functional elements for Regulatory Potential Scores. This result was unexpected due to the limited sequence conservation found in these noncoding regions. We demonstrate that bidirectional promoters can be classified apart from other genomic features including non-bidirectional promoters, i.e. those promoters having no nearby upstream genes. Furthermore, bidirectional promoters consistently score at the level of developmental enhancers, which are among the most highly conserved functional elements in the genome. The high scores are due to sequence-based characteristics within the promoters, not the surrounding exons. These results indicate that high-scoring RP regions can be deconvoluted into various functional classes of genomic elements. Using a multiple class predictor, we are able to discriminate bidirectional promoters from enhancers, non-bidirectional promoters, and non-promoter regions on the basis of RP scores and CpG islands.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>We examine orthology at bidirectional promoters, use discriminatory machine learning approaches to differentiate multiple types of promoters from other functional and nonfunctional features in the genome and begin the process of deconvoluting classes of functional regions that score well with RP scores. These types of approaches precede supervised learning techniques to discover unannotated promoter regions.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>The intricate details of regulated gene expression are not well characterized in the human genome. Currently, our understanding relies greatly on our ability to experimentally identify prospective regulatory regions and to computationally evaluate features of those experimental datasets. We have found that searching for genes arranged in a &#8216;head-to-head&#8217; configuration can precisely identify a set of candidate regulatory regions, without the intermediate step of experimental identification. Because the 5&#8242; and 3&#8242; ends of a gene run from start to stop (i.e. from head to tail), a head-to-head arrangement places the transcription start sites (TSSs) of two genes in close proximity. The directionality of transcription (from 5&#8242; to 3&#8242;) by RNA polymerase allows these adjacent genes to produce products without interfering with each other. Two genes in a head-to-head configuration that have their 5&#8242; ends located within 1000 base pairs of each other are assumed to have a shared promoter region located between the two 5&#8242; ends. This promoter is defined as a bidirectional promoter, because it influences expression of the two genes simultaneously. This influence can be concordant or discordant.</p>
			<p>Bidirectional promoters occur frequently in the human genome <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. Despite their prevalence, their full biological significance is not yet known. Nevertheless, evidence of significant biological implications is emerging <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Further elucidation may come from studies in other species' genomes. The process of mapping bidirectional promoters in other species is fairly simple once the algorithms are developed. More importantly, a comprehensive set of these regulators in multiple species allows comparative analyses across species. Predictions made within a single species can be validated by their appearance in another. Bidirectional promoters represent a special class of promoter sequences, specifically those having an exon on either side of the promoter region (i.e. the first exon of each gene regulated by the promoter). Thus, the promoter region is &#8216;bounded&#8217; by sequences with described functions on both sides, and thereby limited to the intervening portion. This arrangement solves the problem of defining the upstream boundary of the promoter, which is a troublesome reality of studying promoters with no discernible upstream endpoints. If fundamental differences are present in the sequences underlying functional elements, machine-learning approaches may be able to identify them. The key to success lies in a precise description of each of the functional categories. For instance, sequences characterizing bidirectional promoters can be compared to non-promoter regions found between the &#8216;tails&#8217; of adjacent genes arranged in a tail-to-tail configuration. Additionally, further characterization may be possible by discriminating bidirectional promoter sequences from enhancer regions, which are often highly conserved and can act at extreme distances from a responsive gene. 
The most challenging regions to distinguish from bidirectional promoters are other promoter regions, including unidirectional promoters that have a neighboring gene (head-to-tail arrangement) and unbounded promoters, which have no upstream neighboring gene.</p>
			<p>Progress in discerning classes of functional elements from each other, without the aid of experimental data, represents a significant goal in our ability to decode the human genome. In this manuscript, we present a detailed mapping of bidirectional promoters in the mouse genome, analogous to our work in the human genome <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Furthermore, we compare data from human and mouse as a means to validate our predictions, and to further characterize features within bidirectional promoters. Using bidirectional promoters as a model dataset, we describe results of machine learning approaches to score functional elements in genomic sequences. We conclude with a multiple class predictor that aims to accurately discriminate classes of promoters from one another, from enhancers, and from nonfunctional regions.</p>
		</sec>
		<sec>
			<st>
				<p>Results and Discussion</p>
			</st>
			<sec>
				<st>
					<p>Mapping bidirectional promoters in the mouse genome</p>
				</st>
				<p>In an analogous approach to our studies in the human genome, we systematically mapped bidirectional promoters in the mouse genome. These promoters were defined by their position between two oppositely-oriented transcription units, whose transcription start sites (TSSs) were no more than 1000 bp apart. All transcripts used in the analysis originated from one of three repositories:</p>
				<p>&#8226; The UCSC List of Known Genes <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>.</p>
				<p>&#8226; GenBank mRNA data <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
				<p>&#8226; Spliced EST data from the GenBank dbEST database <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>.</p>
				<p>As discussed in <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, the procedure for mapping bidirectional promoters from the Known Gene annotations is quite straightforward due to the quality of these gene descriptions. Initially, all genes are represented as clusters containing overlapping transcripts. Each cluster extends from the farthest 5&#8242; to the farthest 3&#8242; coordinate of any included transcript. Neighboring clusters are then examined with respect to the distance and orientation of their 5&#8242; ends. If the 5&#8242; ends of two genes are no more than 1000 bp apart and the genes are transcribed in opposite directions, the region between them is considered to be a bidirectional promoter. Identifying bidirectional promoters from other annotation sources in the mouse genome can be more complex due to the diversity and fragmented nature of the current transcripts. For instance, both the spliced ESTs and the GenBank mRNA transcripts contain multiple overlapping segments of transcribed regions, which are frequently updated as new information becomes available. To handle the complexity of the data in the spliced ESTs, we applied an algorithm to extract the bidirectional promoters that passed a variety of conditional tests. These included conformity to the rules of distance and orientation.</p>
				<p>Furthermore, transcripts were classified as intergenic or intragenic by comparison with the Known Genes as a reference track. Additional criteria requiring majority agreement with the orientation of co-localized ESTs and with the orientation of Known Genes are described in Yang and Elnitski (2007) <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
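				<p>The clustering and neighbor test described above for the Known Gene annotations can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the <it>Transcript</it> record, the function names, and the restriction to non-overlapping head-to-head pairs are assumptions made for illustration, and the additional EST-specific tests are omitted. Only the 1000 bp threshold and the opposite-orientation rule come from the text.</p>
				<p><![CDATA[
```python
from dataclasses import dataclass

MAX_TSS_DISTANCE = 1000  # bp; threshold stated in the text


@dataclass
class Transcript:  # hypothetical record type, for illustration only
    chrom: str
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"


def cluster_transcripts(transcripts):
    """Merge overlapping same-strand transcripts into one cluster per gene."""
    clusters = []
    for t in sorted(transcripts, key=lambda t: (t.chrom, t.strand, t.start)):
        last = clusters[-1] if clusters else None
        if (last is not None and last.chrom == t.chrom
                and last.strand == t.strand and t.start <= last.end):
            last.end = max(last.end, t.end)  # extend to the farthest coordinate
        else:
            clusters.append(Transcript(t.chrom, t.start, t.end, t.strand))
    return clusters


def tss(cluster):
    """The 5' end: leftmost coordinate on '+', rightmost on '-'."""
    return cluster.start if cluster.strand == "+" else cluster.end


def bidirectional_promoters(transcripts):
    """Return (chrom, left_tss, right_tss) for neighboring head-to-head clusters."""
    clusters = cluster_transcripts(transcripts)
    promoters = []
    for chrom in sorted({c.chrom for c in clusters}):
        ordered = sorted((c for c in clusters if c.chrom == chrom), key=tss)
        for left, right in zip(ordered, ordered[1:]):
            # Head-to-head: '-' gene on the left, '+' gene on the right.
            if left.strand == "-" and right.strand == "+":
                if tss(right) - tss(left) <= MAX_TSS_DISTANCE:
                    promoters.append((chrom, tss(left), tss(right)))
    return promoters
```
]]></p>
				<p>In this sketch, neighbors are taken to be clusters with adjacent 5&#8242; ends on the same chromosome; pairs whose first exons overlap would need separate handling.</p>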
				<p>The mapping algorithm identified 5,647 candidate bidirectional promoter regions in the mouse genome. This number is similar to the number of candidate bidirectional promoters identified in the human genome using a similar strategy <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. In both genomes, the number of bidirectional promoters was larger than previously reported <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>, as a result of updated gene annotations and the use of spliced EST data. The validity of these candidate regions was assessed by comparison to the RIKEN CAGE dataset <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. The CAGE technique captures the true 5&#8242; ends of transcripts, allowing a direct comparison to our bidirectional promoters by their coordinates in the mouse genome. Figure <figr fid="F1">1</figr> shows bidirectional promoters that are fully validated when a CAGE transcript flanks both sides of the promoter region. In the human genome, bidirectional promoters from the Known Gene, mRNA, and EST data are validated at 96%, 78%, and 81%, respectively (Figure <figr fid="F1">1</figr>, upper panel), while in the mouse genome, bidirectional promoters from the Known Gene, mRNA, and EST data are validated at 95%, 40%, and 65%, respectively (Figure <figr fid="F1">1</figr>, lower panel). The low validation score for mouse mRNA appears to reflect an incomplete description of the mouse genes in the mouse genome assembly mm5 (May 2004).</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Validation of bidirectional promoters using the RIKEN CAGE dataset. Pie charts depict the number of bidirectional promoters with CAGE transcripts that correspond to detectable transcripts on both sides (black), only one side (gray), or no evidence (white). Note that these do not have to be transcribed in the same tissues to be included in our study. The upper panel is based on human transcripts from the human sequence assembly, hg17, while the lower panel uses CAGE data and transcripts from the mouse sequence assembly, mm5. Bidirectional promoters were mapped in Known Genes (left column), GenBank mRNA (middle column), and spliced ESTs (right column)</p>
					</caption>
					<text>
						<p>Validation of bidirectional promoters using the RIKEN CAGE dataset. Pie charts depict the number of bidirectional promoters with CAGE transcripts that correspond to detectable transcripts on both sides (black), only one side (gray), or no evidence (white). Note that these do not have to be transcribed in the same tissues to be included in our study. The upper panel is based on human transcripts from the human sequence assembly, hg17, while the lower panel uses CAGE data and transcripts from the mouse sequence assembly, mm5. Bidirectional promoters were mapped in Known Genes (left column), GenBank mRNA (middle column), and spliced ESTs (right column).</p>
					</text>
					<graphic file="1471-2164-9-S1-S2-1"/>
				</fig>
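				<p>The three validation categories shown in Figure <figr fid="F1">1</figr> (both sides, one side, or no evidence) can be expressed as a short Python sketch. The tolerance <it>window</it> around each TSS, the tag representation as (position, strand) pairs, and the function names are assumptions introduced here for illustration; the matching criteria actually applied to the CAGE data are not specified in this section.</p>
				<p><![CDATA[
```python
def side_supported(tss, strand, cage_tags, window=50):
    """True if any CAGE tag on `strand` lies within `window` bp of `tss`.

    `window` is a hypothetical tolerance, not a value taken from the study.
    """
    return any(s == strand and abs(pos - tss) <= window for pos, s in cage_tags)


def validation_class(left_tss, right_tss, cage_tags, window=50):
    """Classify a bidirectional promoter by CAGE support on each side."""
    left = side_supported(left_tss, "-", cage_tags, window)    # '-' strand gene
    right = side_supported(right_tss, "+", cage_tags, window)  # '+' strand gene
    return {2: "both", 1: "one", 0: "none"}[left + right]
```
]]></p>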
			</sec>
			<sec>
				<st>
					<p>Comparison of human and mouse bidirectional promoter sets</p>
				</st>
				<p>Bidirectional promoters are ancient features, exhibiting orthology from human to <it>Fugu rubripes</it><abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. To compare the co-occurrence of bidirectional promoters in the human and mouse genomes, we mapped human genes regulated by bidirectional promoters to the mouse genome and assessed whether the corresponding mouse gene also formed a bidirectional promoter with its 5&#8242; neighbor. Of 1637 Known Genes, as shown in Figure <figr fid="F2">2</figr>, 41% were associated with bidirectional promoters in the mouse genome by the same gene name. An additional 4% were added from GenBank mRNA and 7% from the spliced ESTs. Roughly 7% of the set had a gene in the mouse genome but showed no evidence of a bidirectional promoter. The remaining 40% could not be mapped to the mouse using this method. Table <tblr tid="T1">1</tblr> shows the orthologous pairs of mouse genes corresponding to ten human genes involved in cancer that have bidirectional promoters. From these data we predict that four mouse genes will be positioned closer together than they currently appear: BRCA2, ERBB2, FANCA and FANCF are much farther apart in mouse than in human. Table <tblr tid="T2">2</tblr> shows the GO terms for genes that are regulated by bidirectional promoters in human, but not in mouse, implying that regulatory changes could change the expression of these genes between species. It should be noted that strategies such as ours to map orthologs by gene name provide high confidence assignments, but underestimate the number of orthologous bidirectional promoters in the human and mouse genomes. We have further demonstrated this point by mapping orthologous gene pairs regulated by bidirectional promoters in twelve species using rigorous genomic alignment information <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>.</p>
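				<p>The name-based assignment described above amounts to placing each human gene into one of five categories. The following Python sketch shows one way to do this; the function names, the case-insensitive name comparison, and the order in which annotation sources are consulted (Known Genes, then GenBank mRNA, then spliced ESTs) are assumptions for illustration rather than a description of the pipeline used here.</p>
				<p><![CDATA[
```python
from collections import Counter

# Assumed priority order, following the order in which the text reports them.
SOURCES = ("Known Genes", "GenBank mRNA", "spliced EST")


def classify_ortholog(human_gene, mouse_bidirectional, mouse_genes):
    """Assign one human gene to a mapping category by gene name.

    mouse_bidirectional: dict mapping a source name to the set of mouse gene
        names that flank a bidirectional promoter in that annotation source.
    mouse_genes: set of all mouse gene names considered.
    """
    name = human_gene.upper()  # assumption: names are compared case-insensitively
    for source in SOURCES:
        if name in {g.upper() for g in mouse_bidirectional.get(source, ())}:
            return source
    if name in {g.upper() for g in mouse_genes}:
        return "no bidirectional promoter"
    return "unmapped"


def summarize(human_genes, mouse_bidirectional, mouse_genes):
    """Percentage of human genes falling into each category."""
    counts = Counter(
        classify_ortholog(g, mouse_bidirectional, mouse_genes) for g in human_genes
    )
    total = len(human_genes)
    return {category: 100.0 * n / total for category, n in counts.items()}
```
]]></p>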
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Orthologous mapping of human bidirectional promoters to mouse. Promoter orthology was determined by identifying orthologous genes in mouse and checking for evidence of bidirectional promoters. Genes that had a 5&#8242; neighbor transcribed in the opposite direction are shown for promoters of Known Genes (maroon), GenBank mRNA (pink), and ESTs (red). Genes with no neighbor in mouse lack evidence for bidirectional promoters (green). Genes that could not be mapped to mouse are shown in blue</p>
					</caption>
					<text>
						<p>Orthologous mapping of human bidirectional promoters to mouse. Promoter orthology was determined by identifying orthologous genes in mouse and checking for evidence of bidirectional promoters. Genes that had a 5&#8242; neighbor transcribed in the opposite direction are shown for promoters of Known Genes (maroon), GenBank mRNA (pink), and ESTs (red). Genes with no neighbor in mouse lack evidence for bidirectional promoters (green). Genes that could not be mapped to mouse are shown in blue.</p>
					</text>
					<graphic file="1471-2164-9-S1-S2-2"/>
				</fig>
				<tbl id="T1" hint_layout="single">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>Tumor suppressor genes in human and mouse</p>
					</caption>
					<tblbdy cols="5">
						<r>
							<c>
								<p>BOC gene</p>
							</c>
							<c>
								<p>Bidirectional partner</p>
							</c>
							<c>
								<p>Annotation of partner</p>
							</c>
							<c>
								<p>Distance between TSSs</p>
							</c>
							<c>
								<p>CpG island at TSSs</p>
							</c>
						</r>
						<r>
							<c cspan="5">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>BARD1 (Human) </p>
							</c>
							<c>
								<p>DA865307 </p>
							</c>
							<c>
								<p>mRNA, EST </p>
							</c>
							<c>
								<p>518</p>
							</c>
							<c>
								<p>Across/First exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>BARD1 (Mouse)</p>
							</c>
							<c>
								<p>AK007117</p>
							</c>
							<c>
								<p>mRNA</p>
							</c>
							<c>
								<p>-425</p>
							</c>
							<c>
								<p>Across/First exon of both</p>
							</c>
						</r>
						<r>
							<c>
								<p>BRCA1 (Human)</p>
							</c>
							<c>
								<p>NBR2</p>
							</c>
							<c>
								<p>KG, mRNA, EST </p>
							</c>
							<c>
								<p>81</p>
							</c>
							<c>
								<p>Inside NBR2 </p>
							</c>
						</r>
						<r>
							<c>
								<p>BRCA1 (Mouse)</p>
							</c>
							<c>
								<p>NBR1</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>259</p>
							</c>
							<c>
								<p>No CpG</p>
							</c>
						</r>
						<r>
							<c>
								<p>BRCA2 (Human)</p>
							</c>
							<c>
								<p>DR731263 </p>
							</c>
							<c>
								<p>EST</p>
							</c>
							<c>
								<p>955</p>
							</c>
							<c>
								<p>Overlaps First Exon of BRCA2 </p>
							</c>
						</r>
						<r>
							<c>
								<p>BRCA2 (Mouse)</p>
							</c>
							<c>
								<p>CO801197</p>
							</c>
							<c>
								<p>EST</p>
							</c>
							<c>
								<p>2505</p>
							</c>
							<c>
								<p>Overlaps First Exon of BRCA2</p>
							</c>
						</r>
						<r>
							<c>
								<p>CHK2 (Human) </p>
							</c>
							<c>
								<p>HSC20 </p>
							</c>
							<c>
								<p>KG, EST </p>
							</c>
							<c>
								<p>32</p>
							</c>
							<c>
								<p>Overlaps First Exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>CHK2 (Mouse)</p>
							</c>
							<c>
								<p>AW049829</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>276</p>
							</c>
							<c>
								<p>Across/First exon of both</p>
							</c>
						</r>
						<r>
							<c>
								<p>ERBB2 (Human) </p>
							</c>
							<c>
								<p>Perld1</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>60</p>
							</c>
							<c>
								<p>Overlaps First Exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>ERBB2 (Mouse)</p>
							</c>
							<c>
								<p>Perld1</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>11,994</p>
							</c>
							<c>
								<p>CpG at first exon of both</p>
							</c>
						</r>
						<r>
							<c>
								<p>P53 (Human) </p>
							</c>
							<c>
								<p>AK001247 </p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>491</p>
							</c>
							<c>
								<p>Overlaps First Exon of P53 Partner </p>
							</c>
						</r>
						<r>
							<c>
								<p>P53 (Mouse)</p>
							</c>
							<c>
								<p>WDR79</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>657</p>
							</c>
							<c>
								<p>Across/First exon of both</p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCA (Human) </p>
							</c>
							<c>
								<p>Spisre2</p>
							</c>
							<c>
								<p>mRNA, EST</p>
							</c>
							<c>
								<p>1,533</p>
							</c>
							<c>
								<p>Overlaps First Exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCA (Mouse)</p>
							</c>
							<c>
								<p>Spisre2</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>14,137</p>
							</c>
							<c>
								<p>CpG at first exon of both</p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCB (Human) </p>
							</c>
							<c>
								<p>MOSPD2 </p>
							</c>
							<c>
								<p>KG, mRNA, EST </p>
							</c>
							<c>
								<p>372</p>
							</c>
							<c>
								<p>Across/First exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCB (Mouse)</p>
							</c>
							<c>
								<p>AK035985</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>257</p>
							</c>
							<c>
								<p>No CpG</p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCD2 (Human)</p>
							</c>
							<c>
								<p>BC043599 </p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>64</p>
							</c>
							<c>
								<p>Across/First exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCD2 (Mouse)</p>
							</c>
							<c>
								<p>Tmem111</p>
							</c>
							<c>
								<p>KG, mRNA, EST</p>
							</c>
							<c>
								<p>47</p>
							</c>
							<c>
								<p>Across/First exon of both</p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCF (Human)</p>
							</c>
							<c>
								<p>GAS2</p>
							</c>
							<c>
								<p>mRNA, EST</p>
							</c>
							<c>
								<p>-199</p>
							</c>
							<c>
								<p>Across/First exon of both </p>
							</c>
						</r>
						<r>
							<c>
								<p>FANCF (Mouse)</p>
							</c>
							<c>
								<p>AK014509</p>
							</c>
							<c>
								<p>mRNA</p>
							</c>
							<c>
								<p>1,966</p>
							</c>
							<c>
								<p>Overlaps First Exon of FANCF partner</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<tbl id="T2" hint_layout="single">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>Molecular function (<it>P</it> &lt; 0.05) of human genes having a unique bidirectional promoter not detected in mouse</p>
					</caption>
					<tblbdy cols="2">
						<r>
							<c>
								<p>GO ID</p>
							</c>
							<c>
								<p>Molecular Function</p>
							</c>
						</r>
						<r>
							<c cspan="2">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0004004</p>
							</c>
							<c>
								<p>ATP-dependent RNA helicase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0008186</p>
							</c>
							<c>
								<p>RNA-dependent adenosinetriphosphatase</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0047804</p>
							</c>
							<c>
								<p>ATP-dependent RNA helicase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0004042</p>
							</c>
							<c>
								<p>N-acetylglutamate synthase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0019145</p>
							</c>
							<c>
								<p>aminobutyraldehyde dehydrogenase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0000250</p>
							</c>
							<c>
								<p>oxidosqualene-lanosterol cyclase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0008321</p>
							</c>
							<c>
								<p>Ral guanyl-nucleotide exchange factor activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0031559</p>
							</c>
							<c>
								<p>oxidosqualene cyclase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0047316</p>
							</c>
							<c>
								<p>glutamine-phenylpyruvate aminotransferase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0008176</p>
							</c>
							<c>
								<p>tRNA (guanine-N7-)-methyltransferase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0008609</p>
							</c>
							<c>
								<p>alkyl-DHAP synthase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0047105</p>
							</c>
							<c>
								<p>4-trimethylammoniobutyraldehyde dehydrogenase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0004961</p>
							</c>
							<c>
								<p>TXA(2) receptor activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0047787</p>
							</c>
							<c>
								<p>delta4-3-oxosteroid 5beta-reductase activity</p>
							</c>
						</r>
						<r>
							<c>
								<p>GO:0003991</p>
							</c>
							<c>
								<p>acetylglutamate kinase activity</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<p>Although bidirectional promoters are orthologous between humans and mice, they exhibit sparse conservation signals in multi-species alignments. This is a slightly surprising result, given that sequence conservation is a reliable marker for functional elements. Nevertheless, it is possible that alternative methods may reveal similarities in bidirectional promoters across species.</p>
				<p>To test for similarity in sequence characteristics that may reveal subtle similarities between the sets of human and mouse bidirectional promoters, we calculated a log-likelihood score called Regulatory Potential (RP). The RP score was used in ESPERR (Evolutionary and Sequence Pattern Extraction through Reduced Representations) <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> to capture information in sequence alignments over seven vertebrate species. This method has been shown to discriminate regulatory regions from nonfunctional regions with an accuracy of 80% <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
				<p>The RP score cumulative distribution functions plotted in Figure <figr fid="F3">3</figr> reveal that regulatory potential scores are similar for bidirectional promoters defined by Known Genes, ESTs, and mRNA in both human and mouse. The similarity in profiles exhibited by all three datasets for each species indicates that sequence characteristics are similar in bidirectional promoter regions, both across species (human vs. mouse) and across datasets (Known Genes, mRNA, and ESTs). The strategy used to map these gene pairs across species strongly identifies orthologous genes that are characterized by name. Therefore, the conclusions should not change as more data are added.</p>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>RP score cumulative distribution functions for bidirectional promoters in human and mouse. Bidirectional promoters identified from Known Genes (KG), mRNA, and ESTs all yield similar scores in both human and mouse genomes. RP scores were calculated based on genome assemblies hg17 (human) and mm8 (mouse)</p>
					</caption>
					<text>
						<p>RP score cumulative distribution functions for bidirectional promoters in human and mouse. Bidirectional promoters identified from Known Genes (KG), mRNA, and ESTs all yield similar scores in both human and mouse genomes. RP scores were calculated based on genome assemblies hg17 (human) and mm8 (mouse).</p>
					</text>
					<graphic file="1471-2164-9-S1-S2-3"/>
				</fig>
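				<p>For readers who wish to reproduce curves of the kind shown in Figure <figr fid="F3">3</figr>, an empirical cumulative distribution function can be computed from a list of per-region RP scores as sketched below in Python. The largest vertical gap between two curves (the two-sample Kolmogorov-Smirnov statistic) is included only as one possible way to quantify similarity; it is an addition made here and was not part of the analysis reported in this paper. Function names are illustrative.</p>
				<p><![CDATA[
```python
from bisect import bisect_right


def ecdf(scores):
    """Return (score, cumulative fraction) points of the empirical CDF."""
    xs = sorted(scores)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def max_cdf_gap(scores_a, scores_b):
    """Largest vertical distance between the empirical CDFs of two samples."""
    a, b = sorted(scores_a), sorted(scores_b)
    gap = 0.0
    for x in a + b:  # the maximum is attained at one of the observed scores
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        gap = max(gap, abs(fa - fb))
    return gap
```
]]></p>
				<p>A gap near zero indicates nearly overlapping curves, whereas a gap near one indicates almost complete separation of the two score distributions.</p>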
			</sec>
			<sec>
				<st>
					<p>Discriminating functional elements based on RP scores</p>
				</st>
				<p>Having established the orthology of bidirectional promoters between human and mouse, we now shift our attention to the problem of discriminating functional elements in the human genome. We again make use of RP scores, which have proven useful for discriminating functional elements from nonfunctional elements, yet their ability to discriminate among types of functional elements remains unknown.</p>
				<p>To test the hypothesis that sequence characteristics differ between classes of functional elements, thereby allowing these classes to be discriminated, we compared RP scores for human bidirectional promoters to those for other functional regions, including enhancers, unidirectional promoters, unbounded promoters, non-promoters (i.e. tail-to-tail regions), coding regions, and neutral regions.</p>
				<p>The cumulative distribution functions of RP score for the different functional classes are shown in Figure <figr fid="F4">4</figr>. We observe that:</p>
				<p>&#8226; As expected, neutral regions (represented by ancestral repeats) separated very distinctly from functional regions such as enhancers.</p>
				<p>&#8226; Despite the fact that bidirectional promoters do not have a strong signal for sequence conservation, they have slightly higher RP scores than enhancers. This is significant because the enhancers used in this analysis are enhancers of genes involved in essential developmental processes, such as neurogenesis <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, which are characterized by strong signals of sequence conservation known as Multi-species Conserved Sequences (MCSs) <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.</p>
				<p>&#8226; Bidirectional promoters have high RP scores, similar to unidirectional promoters, which are promoter regions that are defined by two genes in a head-to-tail configuration. Like bidirectional promoters, unidirectional promoters are bounded on both sides by exons.</p>
				<p>&#8226; High scores are not a feature of all promoter regions. For example, unbounded promoters, which are promoters having no neighboring upstream gene, tend not to have high RP scores. We examined unbounded promoter regions with no upstream gene within 1,000, 5,000, and 10,000 bp and found moderately low RP scores for all three classes. Furthermore, the range of these scores was indistinguishable from that of non-promoter regions.</p>
				<p>&#8226; Coding regions score nearly as well as bidirectional promoters. This suggests that the types of nucleotide substitutions and the &#8220;word&#8221; content of bidirectional promoters and coding regions may be governed by the same rules, despite the fact that coding regions are strongly conserved and bidirectional promoters are not.</p>
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>Cumulative distribution functions of RP scores for different functional classes. These include bidirectional promoters (red, green, blue), non-bidirectional promoters (purple) and unbounded promoters (light blue, pink, light green). Other functional elements are coding regions (aqua), tail-to-tail regions (yellow) and enhancers (maroon). The nonfunctional elements are represented by ancestral repeats (black)</p>
					</caption>
					<text>
						<p>Cumulative distribution functions of RP scores for different functional classes. These include bidirectional promoters (red, green, blue), non-bidirectional promoters (purple) and unbounded promoters (light blue, pink, light green). Other functional elements are coding regions (aqua), tail-to-tail regions (yellow) and enhancers (maroon). The nonfunctional elements are represented by ancestral repeats (black).</p>
					</text>
					<graphic file="1471-2164-9-S1-S2-4"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>Prediction of bidirectional promoters from RP scores</p>
				</st>
				<p>On the basis of Figure <figr fid="F4">4</figr>, it is apparent that bidirectional promoter regions tend to have higher RP scores than either non-promoter or unbounded promoter regions. Another way to see this is to plot the class-conditional density functions <it>p</it>(<it>x</it>|<it>C</it>), where <it>x</it> is the RP score, and <it>C</it> is a functional class; this is simply the probability density function of RP scores, restricted to the functional class <it>C</it>. Given the class-conditional density functions <it>p</it>(<it>x</it>|<it>C</it><sub>1</sub>) and <it>p</it>(<it>x</it>|<it>C</it><sub>2</sub>) for classes <it>C</it><sub>1</sub> and <it>C</it><sub>2</sub>, respectively, we can construct a likelihood ratio classifier that maps an RP score <it>x</it> to a functional class using the rule:</p>
				<p>
					<display-formula>
						<m:math name="1471-2164-9-S1-S2-i1" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mtext>If</m:mtext>
									<m:mtext>&#8201;</m:mtext>
									<m:malignmark/>
									<m:mfrac>
										<m:mrow>
											<m:mi>p</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>x</m:mi>
													<m:mo>|</m:mo>
													<m:msub>
														<m:mi>C</m:mi>
														<m:mn>1</m:mn>
													</m:msub>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
										<m:mrow>
											<m:mi>p</m:mi>
											<m:mrow>
												<m:mo>(</m:mo>
												<m:mrow>
													<m:mi>x</m:mi>
													<m:mo>|</m:mo>
													<m:msub>
														<m:mi>C</m:mi>
														<m:mn>2</m:mn>
													</m:msub>
												</m:mrow>
												<m:mo>)</m:mo>
											</m:mrow>
										</m:mrow>
									</m:mfrac>
									<m:mtext>&#8201;</m:mtext>
									<m:mrow>
										<m:mo>{</m:mo>
										<m:mtable columnalign="left">
											<m:mtr>
												<m:mtd>
													<m:mo>&gt;</m:mo>
													<m:mtext>&#8201;</m:mtext>
													<m:mi>&#956;</m:mi>
													<m:mtext>&#8201;</m:mtext>
													<m:mtext>Decide&#160;class</m:mtext>
													<m:mtext>&#8201;</m:mtext>
													<m:msub>
														<m:mi>C</m:mi>
														<m:mn>1</m:mn>
													</m:msub>
												</m:mtd>
											</m:mtr>
											<m:mtr>
												<m:mtd>
													<m:mo>&lt;</m:mo>
													<m:mtext>&#8201;</m:mtext>
													<m:mi>&#956;</m:mi>
													<m:mtext>&#8201;</m:mtext>
													<m:mtext>Decide&#160;class</m:mtext>
													<m:mtext>&#8201;</m:mtext>
													<m:msub>
														<m:mi>C</m:mi>
														<m:mn>2</m:mn>
													</m:msub>
												</m:mtd>
											</m:mtr>
										</m:mtable>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGcbaqcLbuacqqGjbqscqqGMbGzcaaMe8UaaCjaVNqbaoaalaaakeaajugqbiabdchaWPWaaeWaaeaajugqbiabdIha4jabcYha8jabdoeadPWaaSbaaSqaaiabigdaXaqabaaakiaawIcacaGLPaaaaeaajugqbiabdchaWPWaaeWaaeaajugqbiabdIha4jabcYha8jabdoeadPWaaSbaaSqaaiabikdaYaqabaaakiaawIcacaGLPaaaaaqcLbuacaaMe8Ecfa4aaiqaaKqzafabaeqakeaajugqbiabg6da+iaaysW7cqaH8oqBcaaMe8UaeeiraqKaeeyzauMaee4yamMaeeyAaKMaeeizaqMaeeyzauMaeeiiaaIaee4yamMaeeiBaWMaeeyyaeMaee4CamNaee4CamNaaGjbVlabdoeadPWaaSbaaSqaaiabigdaXaqabaaakeaajugqbiabgYda8iaaysW7cqaH8oqBcaaMe8UaeeiraqKaeeyzauMaee4yamMaeeyAaKMaeeizaqMaeeyzauMaeeiiaaIaee4yamMaeeiBaWMaeeyyaeMaee4CamNaee4CamNaaGjbVlabdoeadPWaaSbaaSqaaiabikdaYaqabaaaaOGaay5Eaaaaaa@81BD@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
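				<p>The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: it assumes each class-conditional density is a Gaussian fitted to the training scores (the density estimator is an assumption), the RP scores are made-up values, and the names <it>fit_gaussian</it>, <it>log_density</it> and <it>classify</it> are hypothetical. The ratio is compared on the log scale to avoid numerical underflow.</p>

```python
import math
from statistics import mean, stdev


def fit_gaussian(scores):
    """Fit a Gaussian to the RP scores of one class (a modelling assumption)."""
    return mean(scores), stdev(scores)


def log_density(x, params):
    """Log of the Gaussian density p(x|C) for params = (mean, std)."""
    m, s = params
    return -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)


def classify(x, params_c1, params_c2, mu=1.0):
    """Decide C1 when p(x|C1) / p(x|C2) exceeds mu, otherwise decide C2."""
    log_ratio = log_density(x, params_c1) - log_density(x, params_c2)
    return "C1" if log_ratio > math.log(mu) else "C2"


# Made-up RP scores, for illustration only
bp_scores = [0.08, 0.10, 0.12, 0.15, 0.11]   # stand-in for class C1
np_scores = [-0.05, 0.00, 0.02, -0.02, 0.03]  # stand-in for class C2
p_c1 = fit_gaussian(bp_scores)
p_c2 = fit_gaussian(np_scores)

print(classify(0.12, p_c1, p_c2))   # C1
print(classify(-0.01, p_c1, p_c2))  # C2
```

				<p>Setting <it>mu</it> to 1 gives the maximum likelihood rule; larger values make the classifier more reluctant to decide class <it>C</it><sub>1</sub>.</p>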
				<p>The performance of this classifier for different values of the threshold &#956; is summarized by a Receiver Operating Characteristic (ROC), which is a plot of sensitivity against (1 &#8722; specificity). We constructed two such classifiers: one to discriminate bidirectional promoters from non-promoters, and the other to discriminate bidirectional promoters from unbounded promoters.</p>
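				<p>As a rough sketch of how such a curve can be traced, the following Python function sweeps the threshold over the observed likelihood-ratio values and records one point per threshold. The function name <it>roc_points</it> and the example values are hypothetical. Using a greater-or-equal comparison at each observed value is equivalent to placing &#956; just below that value in the strict rule.</p>

```python
def roc_points(ratios, labels):
    """Return (1 - specificity, sensitivity) pairs, one per threshold.

    ratios: likelihood-ratio value of each test instance
    labels: True for the positive class C1, False for the negative class C2
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every ratio: nothing called C1
    for mu in sorted(set(ratios), reverse=True):
        tp = sum(1 for r, y in zip(ratios, labels) if y and r >= mu)
        fp = sum(1 for r, y in zip(ratios, labels) if not y and r >= mu)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up likelihood ratios and true classes, for illustration only
example_ratios = [5.0, 3.0, 1.5, 0.8, 0.4, 0.2]
example_labels = [True, True, False, True, False, False]

for one_minus_spec, sens in roc_points(example_ratios, example_labels):
    print(round(one_minus_spec, 2), round(sens, 2))
```

				<p>Each point corresponds to one operating threshold, so a desired sensitivity can be read off together with the specificity that must be given up to reach it.</p>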
				<sec>
					<st>
						<p>Distinguishing bidirectional promoters from non-promoters</p>
					</st>
					<p>We constructed a likelihood-based classifier to distinguish bidirectional promoters from non-promoters; this is a two-class classification problem, in which the two classes are:</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S2-i2" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mtable columnalign="left">
										<m:mtr>
											<m:mtd>
												<m:msub>
													<m:mi>C</m:mi>
													<m:mn>1</m:mn>
												</m:msub>
												<m:mo>=</m:mo>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>{</m:mo>
												<m:mtext>bidirectional&#160;promoters</m:mtext>
												<m:mo>}</m:mo>
											</m:mtd>
										</m:mtr>
										<m:mtr>
											<m:mtd>
												<m:msub>
													<m:mi>C</m:mi>
													<m:mn>2</m:mn>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>=</m:mo>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>{</m:mo>
												<m:mtext>non-promoters</m:mtext>
												<m:mo>}</m:mo>
											</m:mtd>
										</m:mtr>
									</m:mtable>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaqabeaajugqbiabdoeadPWaaSbaaSqaaiabigdaXaqabaGccqGH9aqpjugqbiaaysW7cqGG7bWEcqqGIbGycqqGPbqAcqqGKbazcqqGPbqAcqqGYbGCcqqGLbqzcqqGJbWycqqG0baDcqqGPbqAcqqGVbWBcqqGUbGBcqqGHbqycqqGSbaBcqqGGaaicqqGWbaCcqqGYbGCcqqGVbWBcqqGTbqBcqqGVbWBcqqG0baDcqqGLbqzcqqGYbGCcqqGZbWCcqGG9bqFaOqaaKqzafGaem4qamKcdaWgaaWcbaGaeGOmaidabeaajugqbiaaysW7cqGH9aqpcaaMe8Uaei4EaSNaeeOBa4Maee4Ba8MaeeOBa4Maeeyla0IaeeiCaaNaeeOCaiNaee4Ba8MaeeyBa0Maee4Ba8MaeeiDaqNaeeyzauMaeeOCaiNaee4CamNaeeyFa0haaaa@747D@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>The class-conditional probability density functions <it>p</it>(<it>x</it>|BP) and <it>p</it>(<it>x</it>|NP) are shown in Figure <figr fid="F5">5</figr>(a) (here &#8220;BP&#8221; denotes the class of bidirectional promoters, and &#8220;NP&#8221; denotes the class of non-promoters). The corresponding ROC curve is shown in Figure <figr fid="F6">6</figr>(a). A Maximum Likelihood classification rule (obtained by setting &#956; = 1 in the likelihood ratio classifier) yielded a test set accuracy of 74.5%, a specificity of 92.2% (relatively high), and a sensitivity of 65.5% (relatively low), as shown in Table <tblr tid="T3">3</tblr>. The ROC curve reveals that the sensitivity can be raised above 80% at the cost of a specificity below 80%.</p>
					<fig id="F5">
						<title>
							<p>Figure 5</p>
						</title>
						<caption>
							<p>(a) Class-conditional probability density functions <it>p</it>(<it>x</it>|BP) (bidirectional promoters) and <it>p</it>(<it>x</it>|NP) (non-promoters). (b) Class-conditional probability density functions <it>p</it>(<it>x</it>|BP) (bidirectional promoters) and <it>p</it>(<it>x</it>|UBP1000) (unbounded promoters)</p>
						</caption>
						<text>
							<p>(a) Class-conditional probability density functions <it>p</it>(<it>x</it>|BP) (bidirectional promoters) and <it>p</it>(<it>x</it>|NP) (non-promoters). (b) Class-conditional probability density functions <it>p</it>(<it>x</it>|BP) (bidirectional promoters) and <it>p</it>(<it>x</it>|UBP1000) (unbounded promoters).</p>
						</text>
						<graphic file="1471-2164-9-S1-S2-5"/>
					</fig>
					<fig id="F6">
						<title>
							<p>Figure 6</p>
						</title>
						<caption>
							<p>(a) Receiver operating characteristic (ROC) for classifier that discriminates bidirectional promoters from non-promoters. (b) Receiver operating characteristic (ROC) for classifier that discriminates bidirectional promoters from unbounded promoters</p>
						</caption>
						<text>
							<p>(a) Receiver operating characteristic (ROC) for classifier that discriminates bidirectional promoters from non-promoters. (b) Receiver operating characteristic (ROC) for classifier that discriminates bidirectional promoters from unbounded promoters.</p>
						</text>
						<graphic file="1471-2164-9-S1-S2-6"/>
					</fig>
					<tbl id="T3" hint_layout="single">
						<title>
							<p>Table 3</p>
						</title>
						<caption>
							<p>Performance of classifiers on test data</p>
						</caption>
						<tblbdy cols="4">
							<r>
								<c>
									<p>Classifier</p>
								</c>
								<c>
									<p>Accuracy (%)</p>
								</c>
								<c>
									<p>Sensitivity (%)</p>
								</c>
								<c>
									<p>Specificity (%)</p>
								</c>
							</r>
							<r>
								<c cspan="4">
									<hr/>
								</c>
							</r>
							<r>
								<c>
									<p>Bidirectional promoter vs. Non-promoter</p>
								</c>
								<c>
									<p>74.54</p>
								</c>
								<c>
									<p>65.53</p>
								</c>
								<c>
									<p>92.16</p>
								</c>
							</r>
							<r>
								<c>
									<p>Bidirectional promoter vs. Unbounded promoter</p>
								</c>
								<c>
									<p>80.37</p>
								</c>
								<c>
									<p>67.94</p>
								</c>
								<c>
									<p>81.10</p>
								</c>
							</r>
						</tblbdy>
					</tbl>
				</sec>
				<sec>
					<st>
						<p>Distinguishing bidirectional from unbounded promoters</p>
					</st>
					<p>We constructed a likelihood-based classifier to distinguish bidirectional promoters from unbounded promoters (specifically, the class of promoters with no upstream gene within 1000 base pairs); this is a two-class classification problem, in which the two classes are:</p>
					<p>
						<display-formula>
							<m:math name="1471-2164-9-S1-S2-i3" xmlns:m="http://www.w3.org/1998/Math/MathML">
								<m:semantics>
									<m:mtable columnalign="left">
										<m:mtr>
											<m:mtd>
												<m:msub>
													<m:mi>C</m:mi>
													<m:mn>1</m:mn>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>=</m:mo>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>{</m:mo>
												<m:mtext>bidirectional&#160;promoters</m:mtext>
												<m:mo>}</m:mo>
											</m:mtd>
										</m:mtr>
										<m:mtr>
											<m:mtd>
												<m:msub>
													<m:mi>C</m:mi>
													<m:mn>2</m:mn>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>=</m:mo>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>{</m:mo>
												<m:mtext>unbounded&#160;promoters</m:mtext>
												<m:mtext>&#8201;</m:mtext>
												<m:mtext>(1000</m:mtext>
												<m:mtext>&#8201;</m:mtext>
												<m:mtext>bp)</m:mtext>
												<m:mo>}</m:mo>
											</m:mtd>
										</m:mtr>
									</m:mtable>
									<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaqabeaajugqbiabdoeadPWaaSbaaSqaaiabigdaXaqabaqcLbuacaaMe8Uaeyypa0JaaGjbVlabcUha7jabbkgaIjabbMgaPjabbsgaKjabbMgaPjabbkhaYjabbwgaLjabbogaJjabbsha0jabbMgaPjabb+gaVjabb6gaUjabbggaHjabbYgaSjabbccaGiabbchaWjabbkhaYjabb+gaVjabb2gaTjabb+gaVjabbsha0jabbwgaLjabbkhaYjabbohaZjabb2ha9bGcbaqcLbuacqWGdbWqkmaaBaaaleaacqaIYaGmaeqaaKqzafGaaGjbVlabg2da9iaaysW7cqGG7bWEcqqG1bqDcqqGUbGBcqqGIbGycqqGVbWBcqqG1bqDcqqGUbGBcqqGKbazcqqGLbqzcqqGKbazcqqGGaaicqqGWbaCcqqGYbGCcqqGVbWBcqqGTbqBcqqGVbWBcqqG0baDcqqGLbqzcqqGYbGCcqqGZbWCcaaMe8UaeeikaGIaeeymaeJaeeimaaJaeeimaaJaeeimaaJaaGjbVlabbkgaIjabbchaWjabbMcaPiabb2ha9baaaa@891C@</m:annotation>
								</m:semantics>
							</m:math>
						</display-formula>
					</p>
					<p>The class-conditional probability density functions <it>p</it>(<it>x</it>|BP) and <it>p</it>(<it>x</it>|UBP1000) are shown in Figure <figr fid="F5">5</figr>(b) (here &#8220;BP&#8221; denotes the class of bidirectional promoters, and &#8220;UBP1000&#8221; denotes the class of promoters with no upstream gene within 1000 base pairs). The corresponding ROC curve is shown in Figure <figr fid="F6">6</figr>(b). A Maximum Likelihood classification rule (obtained by setting &#956; = 1 in the likelihood ratio classifier) yielded a test set accuracy of 80.4%, a specificity of 81.1% (relatively high), and a sensitivity of 67.9% (relatively low), as shown in Table <tblr tid="T3">3</tblr>. The ROC curve reveals that the sensitivity can be raised above 80% at the cost of a specificity below 75%.</p>
				</sec>
			</sec>
			<sec>
				<st>
					<p>Multiple Class Prediction</p>
				</st>
				<p>We then tackled the more challenging problem of constructing a classifier that distinguishes the following four classes:</p>
				<p>
					<display-formula>
						<m:math name="1471-2164-9-S1-S2-i4" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mtable columnalign="left">
									<m:mtr>
										<m:mtd>
											<m:msub>
												<m:mi>C</m:mi>
												<m:mn>1</m:mn>
											</m:msub>
											<m:mo>=</m:mo>
											<m:mo>{</m:mo>
											<m:mtext>bidirectional&#160;promoters</m:mtext>
											<m:mo>}</m:mo>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:msub>
												<m:mi>C</m:mi>
												<m:mn>2</m:mn>
											</m:msub>
											<m:mo>=</m:mo>
											<m:mo>{</m:mo>
											<m:mtext>unbounded&#160;promoters&#160;(1000&#160;bp)</m:mtext>
											<m:mo>}</m:mo>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:msub>
												<m:mi>C</m:mi>
												<m:mn>3</m:mn>
											</m:msub>
											<m:mo>=</m:mo>
											<m:mo>{</m:mo>
											<m:mtext>enhancers</m:mtext>
											<m:mo>}</m:mo>
										</m:mtd>
									</m:mtr>
									<m:mtr>
										<m:mtd>
											<m:msub>
												<m:mi>C</m:mi>
												<m:mn>4</m:mn>
											</m:msub>
											<m:mo>=</m:mo>
											<m:mo>{</m:mo>
											<m:mtext>non-promoters</m:mtext>
											<m:mo>}</m:mo>
										</m:mtd>
									</m:mtr>
								</m:mtable>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegm0B1jxALjhiov2DaeHbuLwBLnhiov2DGi1BTfMBaebbnrfifHhDYfgasaacH8qrps0lbbf9q8WrFfeuY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfea0=yr0RYxir=Jbba9q8aq0=yq=He9q8qqQ8frFve9Fve9Ff0dmeaabaqaciGacaGaaeqabaWaaqaafaaakqaabeqaaKqzafGaem4qamKcdaWgaaWcbaqcLbuacqaIXaqmaSqabaqcLbuacqGH9aqpcqGG7bWEcWaJagOyaiMamWiGbMgaPjadmcyGKbazcWaJagyAaKMamWiGbkhaYjadmcyGLbqzcWaJag4yamMamWiGbsha0jadmcyGPbqAcWaJag4Ba8MamWiGb6gaUjadmcyGHbqycWaJagiBaWMaeeiiaaIamWiGbchaWjadmcyGYbGCcWaJag4Ba8MamWiGb2gaTjadmcyGVbWBcWaJagiDaqNamWiGbwgaLjadmcyGYbGCcWaJag4CamNaeiyFa0hakeaajugqbiabdoeadPWaaSbaaSqaaKqzafGaeGOmaidaleqaaKqzafGaeyypa0Jaei4EaSNaeeyDauNaeeOBa4MaeeOyaiMaee4Ba8MaeeyDauNaeeOBa4MaeeizaqMaeeyzauMaeeizaqMaeeiiaaIaeeiCaaNaeeOCaiNaee4Ba8MaeeyBa0Maee4Ba8MaeeiDaqNaeeyzauMaeeOCaiNaee4CamNaeeiiaaIaeeikaGIaeeymaeJaeeimaaJaeeimaaJaeeimaaJaeeiiaaIaeeOyaiMaeeiCaaNaeeykaKIaeiyFa0hakeaajugqbiabdoeadPWaaSbaaSqaaKqzafGaeG4mamdaleqaaKqzafGaeyypa0Jaei4EaSNaeeyzauMaeeOBa4MaeeiAaGMaeeyyaeMaeeOBa4Maee4yamMaeeyzauMaeeOCaiNaee4CamNaeiyFa0hakeaajugqbiabdoeadPWaaSbaaSqaaKqzafGaeGOmaidaleqaaKqzafGaeyypa0Jaei4EaSNaeeOBa4Maee4Ba8MaeeOBa4Maeeyla0IaeeiCaaNaeeOCaiNaee4Ba8MaeeyBa0Maee4Ba8MaeeiDaqNaeeyzauMaeeOCaiNaee4CamNaeiyFa0haaaa@D656@</m:annotation>
							</m:semantics>
						</m:math>
					</display-formula>
				</p>
				<p>Bidirectional promoters and unbounded promoters are enriched in CpG islands, while enhancers and non-promoters are depleted in CpG islands. Furthermore, bidirectional promoters and enhancers tend to have relatively high RP scores compared with unbounded promoters and non-promoters. It follows that by making use of both features (presence of CpG islands and RP score), we may be able to separate the four classes. We therefore implemented a two-stage hierarchical classifier (Figure <figr fid="F7">7</figr>).</p>
				<p>The first stage looks only at the CpG island feature: if CpG islands are present, the instance is passed to the left child at level 2 (node N2), while if CpG islands are not present, the instance is passed to the right child at level 2 (node N3). The first stage also has a classification outcome <it>Z</it><sub>1</sub>: if the instance was passed to the left child, then <it>Z</it><sub>1</sub> = 1; otherwise <it>Z</it><sub>1</sub> = 0. Ideally, instances that end up in node N2 should be either bidirectional or unbounded promoters, while instances that end up in node N3 should be either enhancers or non-promoters.</p>
				<p>The second stage then refines the classification. Node N2 uses a support vector machine to separate bidirectional from unbounded promoters based on two features, the presence of CpG islands and the RP score, while node N3 uses a decision tree to separate enhancers from non-promoters based on one feature, the RP score (these two classes cannot be distinguished based on the presence of CpG islands, so this feature would not be helpful). A decision tree was used at node N3 because it gave better results than a support vector machine. There is a classification outcome <it>Z</it><sub>2</sub> associated with each node at level 2. For node N2, <it>Z</it><sub>2</sub> = 1 implies that the instance is classified as a bidirectional promoter, while <it>Z</it><sub>2</sub> = 0 implies that the instance is classified as an unbounded promoter. For node N3, <it>Z</it><sub>2</sub> = 1 implies that the instance is classified as an enhancer, while <it>Z</it><sub>2</sub> = 0 implies that the instance is classified as a non-promoter. The overall classification is then given by the pair (<it>Z</it><sub>1</sub>, <it>Z</it><sub>2</sub>) as follows:</p>
				<tbl id="T5" hint_layout="single">
					<title>
						<p/>
					</title>
					<tblbdy cols="2">
						<r>
							<c>
								<p>Class</p>
							</c>
							<c>
								<p>(<it>Z</it><sub>1</sub>, <it>Z</it><sub>2</sub>)</p>
							</c>
						</r>
						<r>
							<c>
								<p>Bidirectional promoters</p>
							</c>
							<c>
								<p>(1,1)</p>
							</c>
						</r>
						<r>
							<c>
								<p>Unbounded promoters</p>
							</c>
							<c>
								<p>(1,0)</p>
							</c>
						</r>
						<r>
							<c>
								<p>Enhancers</p>
							</c>
							<c>
								<p>(0,1)</p>
							</c>
						</r>
						<r>
							<c>
								<p>Non-promoters</p>
							</c>
							<c>
								<p>(0,0)</p>
							</c>
						</r>
					</tblbdy>
				</tbl>
				<fig id="F7">
					<title>
						<p>Figure 7</p>
					</title>
					<caption>
						<p>Algorithm for classifying regions into one of four classes: bidirectional promoter, unbounded promoter, non-promoter, or enhancer</p>
					</caption>
					<text>
						<p>Algorithm for classifying regions into one of four classes: bidirectional promoter, unbounded promoter, non-promoter, or enhancer.</p>
					</text>
					<graphic file="1471-2164-9-S1-S2-7"/>
				</fig>
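				<p>The control flow of the two-stage classifier can be sketched in Python as follows. Only the routing on the CpG island feature and the mapping from the outcome pair to a class are taken from the description above. The second-stage functions <it>toy_n2</it> and <it>toy_n3</it> are hypothetical threshold rules standing in for the trained support vector machine and decision tree, and their cut-off values are invented for illustration.</p>

```python
# Mapping from the outcome pair (Z1, Z2) to the predicted class
CLASS_BY_OUTCOME = {
    (1, 1): "bidirectional promoter",
    (1, 0): "unbounded promoter",
    (0, 1): "enhancer",
    (0, 0): "non-promoter",
}


def classify_region(has_cpg_island, rp_score, node_n2, node_n3):
    """Two-stage classification; returns ((Z1, Z2), class label).

    node_n2 and node_n3 are callables taking (has_cpg_island, rp_score)
    and returning True for the Z2 = 1 outcome of that node.
    """
    z1 = 1 if has_cpg_island else 0          # stage 1: CpG island feature only
    stage2 = node_n2 if z1 == 1 else node_n3  # route to the left or right child
    z2 = 1 if stage2(has_cpg_island, rp_score) else 0
    return (z1, z2), CLASS_BY_OUTCOME[(z1, z2)]


def toy_n2(has_cpg_island, rp_score):
    """Hypothetical stand-in for the SVM at node N2 (invented cut-off)."""
    return rp_score > 0.05


def toy_n3(has_cpg_island, rp_score):
    """Hypothetical stand-in for the decision tree at node N3; uses RP score only."""
    return rp_score > 0.02


print(classify_region(True, 0.10, toy_n2, toy_n3))   # ((1, 1), 'bidirectional promoter')
print(classify_region(False, 0.00, toy_n2, toy_n3))  # ((0, 0), 'non-promoter')
```

				<p>Passing the second-stage classifiers as arguments keeps the routing logic separate from the choice of model at each node.</p>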
			</sec>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>Bidirectional promoters aid in the analysis of promoter regions, as they are bounded on both sides by other functional elements, and thus precisely delineate the promoter region. Moreover, despite a lack of strong sequence conservation, bidirectional promoters exhibit conserved structure across species, which will undoubtedly be helpful in tracing evolutionary and species-specific events.</p>
			<p>Predictive approaches to classifying functional elements in the human genome are frequently based on a variety of experimental characteristics (e.g. <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp>). Here we have demonstrated that machine learning approaches can be effective without experimental data; this is the first evidence that different types of promoters can be discriminated from one another through machine learning approaches.</p>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<p>Bidirectional promoters from the mouse genome were mapped to annotated transcripts in mouse assemblies mm5 and mm8 using the approach outlined in <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Comparison to CAGE data was accomplished by extracting all promoters from the RIKEN database and comparing genomic coordinates (from the assembly mm5). Any coordinates within 50 bp of each other on the same strand of DNA were considered to be a match. RP scores were collected over the range of each functional element using tools developed by David King of Penn State University (manuscript in preparation). Scores are available for the mouse mm8 assembly. Conserved occurrences of bidirectional promoters were identified by mapping the gene name from human to mouse and searching the Known Gene annotations for the 5&#8242; end of a neighboring gene that falls within 1000 bp.</p>
			<p>From the Known Gene track of the human genome, we identified approximately 1006 bidirectional promoters, 525 non-promoters, 275 enhancers, and over 15,000 unbounded promoters. These data were used to train and test both our two-class classifiers and our four-class classifier.</p>
			<p>The accuracy, sensitivity, and specificity values for the two-class case (Table <tblr tid="T3">3</tblr>) were calculated using:</p>
			<p>
				<display-formula>
					<m:math name="1471-2164-9-S1-S2-i5" xmlns:m="http://www.w3.org/1998/Math/MathML">
						<m:semantics>
							<m:mtable>
								<m:mtr>
									<m:mtd>
										<m:mtext>Overall&#160;Accuracy&#160;=</m:mtext>
										<m:mtext>&#8201;</m:mtext>
										<m:mfrac>
											<m:mrow>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mn>11</m:mn>
													</m:mrow>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
												<m:mo>+</m:mo>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mn>22</m:mn>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mrow>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>2</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>j</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>2</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mi>j</m:mi>
													</m:mrow>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
											</m:mrow>
										</m:mfrac>
									</m:mtd>
								</m:mtr>
								<m:mtr>
									<m:mtd>
										<m:mtext>Sensitivity&#160;=</m:mtext>
										<m:mtext>&#8201;</m:mtext>
										<m:mfrac>
											<m:mrow>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mn>1</m:mn>
														<m:mn>1</m:mn>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mrow>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>2</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mn>1</m:mn>
													</m:mrow>
												</m:msub>
											</m:mrow>
										</m:mfrac>
									</m:mtd>
								</m:mtr>
								<m:mtr>
									<m:mtd>
										<m:mtext>Specificity</m:mtext>
										<m:mtext>&#8201;</m:mtext>
										<m:mtext>=</m:mtext>
										<m:mtext>&#8201;</m:mtext>
										<m:mfrac>
											<m:mrow>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mn>22</m:mn>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mrow>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>2</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mn>2</m:mn>
													</m:mrow>
												</m:msub>
											</m:mrow>
										</m:mfrac>
									</m:mtd>
								</m:mtr>
							</m:mtable>
							<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXafv3ySLgzGmvETj2BSbqeeuuDJXwAKbsr4rNCHbGeaGqipu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq=Jc9vqaqpepm0xbba9pwe9Q8fs0=yqaqpepae9pg0FirpepeKkFr0xfr=xfr=xb9adbaqaaeGaciGaaiaabeqaaeaabaWaaaGceaabbeaajugqbiabb+eapjabbAha2jabbwgaLjabbkhaYjabbggaHjabbYgaSjabbYgaSjabbccaGiabbgeabjabbogaJjabbogaJjabbwha1jabbkhaYjabbggaHjabbogaJjabbMha5jabbccaGiabb2da9iaaysW7juaGdaWcaaGcbaqcLbuacqWGobGtkmaaBaaaleaacqaIXaqmcqaIXaqmaeqaaKqzafGaaGjbVlabgUcaRiabd6eaoPWaaSbaaSqaaiabikdaYiabikdaYaqabaaakeaajugqbiabggHiLRWaa0baaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabikdaYaaajugqbiaaysW7cqGHris5kmaaDaaaleaacqWGQbGAcqGH9aqpcqaIXaqmaeaacqaIYaGmaaqcLbuacaaMe8UaemOta4KcdaWgaaWcbaGaemyAaKMaemOAaOgabeaajugqbiaaysW7aaaakeaajugqbiabbofatjabbwgaLjabb6gaUjabbohaZjabbMgaPjabbsha0jabbMgaPjabbAha2jabbMgaPjabbsha0jabbMha5jabbccaGiabb2da9iaaysW7juaGdaWcaaGcbaacbiqcLbuacqWFobGtkmaaBaaaleaacqWFXaqmcqWFXaqmaeqaaaGcbaqcLbuacqGHris5kmaaDaaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWFYaGmaaqcLbuacaaMe8UaemOta4ucfa4aaSbaaSqaaiabdMgaPjabigdaXaqabaaaaaGcbaqcLbuacqqGtbWucqqGWbaCcqqGLbqzcqqGJbWycqqGPbqAcqqGMbGzcqqGPbqAcqqGJbWycqqGPbqAcqqG0baDcqqG5bqEcaaMe8Uaeeypa0JaaGjbVNqbaoaalaaakeaacqWGobGtdaWgaaWcbaGaeGOmaiJaeGOmaidabeaaaOqaaKqzafGaeyyeIuUcdaqhaaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaeGOmaidaaKqzafGaaGjbVlabd6eaoLqbaoaaBaaaleaacqWGPbqAcqaIYaGmaeqaaaaaaaaa@B50B@</m:annotation>
						</m:semantics>
					</m:math>
				</display-formula>
			</p>
			<p>where <it>N<sub>ij</sub></it> is the number of class <it>C<sub>j</sub></it> instances classified to class <it>C<sub>i</sub></it>, and for the purpose of calculating sensitivity and specificity we have taken the positive class to be <it>C</it><sub>1</sub> and the negative class to be <it>C</it><sub>2</sub>.</p>
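			<p>For concreteness, the two-class formulas can be evaluated from a 2 by 2 table of counts as in the following Python sketch. The counts are hypothetical and are not the test data behind Table 3. The table is indexed so that entry [i][j] holds the number of instances of the (j+1)-th class that were classified to the (i+1)-th class, matching the convention used for <it>N<sub>ij</sub></it>.</p>

```python
def two_class_metrics(n):
    """Accuracy, sensitivity and specificity from a 2 x 2 table of counts.

    n[i][j] is the number of class C_(j+1) instances classified to class
    C_(i+1); C1 is the positive class and C2 the negative class.
    """
    total = sum(sum(row) for row in n)
    accuracy = (n[0][0] + n[1][1]) / total
    sensitivity = n[0][0] / (n[0][0] + n[1][0])  # N11 / (N11 + N21)
    specificity = n[1][1] / (n[0][1] + n[1][1])  # N22 / (N12 + N22)
    return accuracy, sensitivity, specificity


# Hypothetical counts: 100 true C1 instances and 100 true C2 instances
counts = [
    [65, 8],   # classified to C1: 65 from C1, 8 from C2
    [35, 92],  # classified to C2: 35 from C1, 92 from C2
]

print(two_class_metrics(counts))  # (0.785, 0.65, 0.92)
```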
			<p>For the four-class case (Table <tblr tid="T4">4</tblr>), the overall accuracy and the accuracy over a specific class are given by</p>
			<p>
				<display-formula>
					<m:math name="1471-2164-9-S1-S2-i6" xmlns:m="http://www.w3.org/1998/Math/MathML">
						<m:semantics>
							<m:mtable columnalign="left">
								<m:mtr>
									<m:mtd>
										<m:mtext>Overall&#160;Accuracy&#160;=</m:mtext>
										<m:mtext>&#8201;</m:mtext>
										<m:mfrac>
											<m:mrow>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>4</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mi>i</m:mi>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mrow>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>4</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>j</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>4</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mi>j</m:mi>
													</m:mrow>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
											</m:mrow>
										</m:mfrac>
									</m:mtd>
								</m:mtr>
								<m:mtr>
									<m:mtd>
										<m:mtext>Accuracy&#160;over&#160;class&#160;</m:mtext>
										<m:msub>
											<m:mi>C</m:mi>
											<m:mi>j</m:mi>
										</m:msub>
										<m:mtext>&#160;=</m:mtext>
										<m:mtext>&#8201;</m:mtext>
										<m:mfrac>
											<m:mrow>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>j</m:mi>
														<m:mi>j</m:mi>
													</m:mrow>
												</m:msub>
											</m:mrow>
											<m:mrow>
												<m:msubsup>
													<m:mo>&#8721;</m:mo>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mo>=</m:mo>
														<m:mn>1</m:mn>
													</m:mrow>
													<m:mn>4</m:mn>
												</m:msubsup>
												<m:mtext>&#8201;</m:mtext>
												<m:msub>
													<m:mi>N</m:mi>
													<m:mrow>
														<m:mi>i</m:mi>
														<m:mi>j</m:mi>
													</m:mrow>
												</m:msub>
												<m:mtext>&#8201;</m:mtext>
											</m:mrow>
										</m:mfrac>
									</m:mtd>
								</m:mtr>
							</m:mtable>
						</m:semantics>
					</m:math>
				</display-formula>
			</p>
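			<p>As a minimal sketch of the two accuracy formulas above, the following Python snippet computes both quantities from a 4-by-4 count matrix. The function names and the example counts are hypothetical and are not taken from this study. The matrix is indexed as N[i][j], as in the formulas, so the per-class denominator sums over the first index.</p>

```python
def overall_accuracy(n):
    """Sum of the diagonal counts N_ii divided by the sum of all counts N_ij."""
    diagonal = sum(n[i][i] for i in range(len(n)))
    total = sum(sum(row) for row in n)
    return diagonal / total


def class_accuracy(n, j):
    """N_jj divided by the sum over i of N_ij (the total of column j)."""
    column_total = sum(n[i][j] for i in range(len(n)))
    return n[j][j] / column_total


# Hypothetical counts for illustration only (not data from the paper).
counts = [
    [70, 10, 5, 5],
    [15, 60, 10, 5],
    [10, 20, 65, 10],
    [5, 10, 20, 80],
]

if __name__ == "__main__":
    print("overall:", overall_accuracy(counts))
    for j in range(4):
        print("class", j + 1, class_accuracy(counts, j))
```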
			<tbl id="T4" hint_layout="single">
				<title>
					<p>Table 4</p>
				</title>
				<caption>
					<p>Performance of four-class hierarchical classifier based on three-fold cross-validation</p>
				</caption>
				<tblbdy cols="3">
					<r>
						<c>
							<p>Class</p>
						</c>
						<c>
							<p>(<it>Z</it><sub>1</sub>, <it>Z</it><sub>2</sub>)</p>
						</c>
						<c>
							<p>Accuracy (%)</p>
						</c>
					</r>
					<r>
						<c cspan="3">
							<hr/>
						</c>
					</r>
					<r>
						<c>
							<p>Bidirectional promoters</p>
						</c>
						<c>
							<p>(1,1)</p>
						</c>
						<c>
							<p>71.31</p>
						</c>
					</r>
					<r>
						<c>
							<p>Unbounded promoters</p>
						</c>
						<c>
							<p>(1,0)</p>
						</c>
						<c>
							<p>62.26</p>
						</c>
					</r>
					<r>
						<c>
							<p>Enhancers</p>
						</c>
						<c>
							<p>(0,1)</p>
						</c>
						<c>
							<p>66.13</p>
						</c>
					</r>
					<r>
						<c>
							<p>Non-promoters</p>
						</c>
						<c>
							<p>(0,0)</p>
						</c>
						<c>
							<p>81.41</p>
						</c>
					</r>
					<r>
						<c>
							<p>Overall</p>
						</c>
						<c>
							<p/>
						</c>
						<c>
							<p>70.56</p>
						</c>
					</r>
				</tblbdy>
			</tbl>
			<p>Because of the way the four-class hierarchical classifier is constructed, any promoter lacking a CpG island is diverted down the left child of node N1 and is therefore misclassified. Consequently, the performance of the algorithm is highly sensitive to the fraction of promoters with CpG islands in the test set. Because CpG islands are known to be present in roughly 70% of promoters, we constructed our test set using a stratified sampling approach that guaranteed that 70% of its promoters contained CpG islands; this reduced the variation in performance due to sampling.</p>
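			<p>One way such a stratified test set might be drawn is sketched below in Python. This is an illustration under stated assumptions rather than the procedure used in this study: the function name, its parameters, the fixed random seed and the toy promoter records are all hypothetical.</p>

```python
import random


def stratified_test_set(with_cpg, without_cpg, size, cpg_fraction=0.7, seed=0):
    """Draw a test set in which a fixed fraction of promoters have CpG islands.

    with_cpg and without_cpg are lists of promoter records (hypothetical);
    sampling is done without replacement inside each stratum.
    """
    rng = random.Random(seed)
    n_with = round(size * cpg_fraction)
    n_without = size - n_with
    sample = rng.sample(with_cpg, n_with) + rng.sample(without_cpg, n_without)
    rng.shuffle(sample)
    return sample


# Toy records for illustration only: (stratum label, identifier).
cpg_promoters = [("cpg", k) for k in range(50)]
non_cpg_promoters = [("no_cpg", k) for k in range(50)]

test_set = stratified_test_set(cpg_promoters, non_cpg_promoters, size=20)

if __name__ == "__main__":
    n_cpg = sum(1 for label, _ in test_set if label == "cpg")
    print(n_cpg, "of", len(test_set), "test promoters have CpG islands")
```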
		</sec>
		<sec>
			<st>
				<p>Competing interests</p>
			</st>
			<p>The authors declare that they have no competing interests.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>LE conceived of the study. MQY implemented the software and performed the analyses. Both authors contributed to writing the manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>We gratefully acknowledge discussions with faculty of the National Human Genome Research Institute, which helped to improve this manuscript. This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.</p>
				<p>This article has been published as part of <it>BMC Genomics</it> Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics &amp; Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2164/9?issue=S1</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>Bidirectional gene organization: a common architectural feature of the human genome</p>
				</title>
				<aug>
					<au>
						<snm>Adachi</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Lieber</snm>
						<fnm>MR</fnm>
					</au>
				</aug>
				<source>Cell</source>
				<pubdate>2002</pubdate>
				<volume>109</volume>
				<issue>7</issue>
				<fpage>807</fpage>
				<lpage>9</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1016/S0092-8674(02)00758-4</pubid>
						<pubid idtype="pmpid" link="fulltext">12110178</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>An Abundance of Bidirectional Promoters in the Human Genome</p>
				</title>
				<aug>
					<au>
						<snm>Trinklein</snm>
						<fnm>ND</fnm>
					</au>
					<au>
						<snm>Aldred</snm>
						<fnm>SF</fnm>
					</au>
					<au>
						<snm>Hartman</snm>
						<fnm>SJ</fnm>
					</au>
					<au>
						<snm>Schroeder</snm>
						<fnm>DI</fnm>
					</au>
					<au>
						<snm>Otillar</snm>
						<fnm>RP</fnm>
					</au>
					<au>
						<snm>Myers</snm>
						<fnm>RM</fnm>
					</au>
				</aug>
				<source>Genome Res.</source>
				<pubdate>2004</pubdate>
				<volume>14</volume>
				<fpage>62</fpage>
				<lpage>66</lpage>
				<url>http://www.genome.org/cgi/content/abstract/14/1/62</url>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">314279</pubid>
						<pubid idtype="pmpid" link="fulltext">14707170</pubid>
						<pubid idtype="doi">10.1101/gr.1982804</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>A computational study of bidirectional promoters in the human genome</p>
				</title>
				<aug>
					<au>
						<snm>Yang</snm>
						<fnm>MQ</fnm>
					</au>
					<au>
						<snm>Elnitski</snm>
						<fnm>LL</fnm>
					</au>
				</aug>
				<source>In Lecture Notes in Bioinformatics</source>
				<publisher>Springer-Verlag</publisher>
				<pubdate>2007</pubdate>
			</bibl>
			<bibl id="B4">
				<title>
					<p>Comprehensive annotation of human bidirectional promoters identifies co-regulatory relationships among somatic breast and ovarian cancer genes</p>
				</title>
				<aug>
					<au>
						<snm>Yang</snm>
						<fnm>MQ</fnm>
					</au>
					<au>
						<snm>Koehly</snm>
						<fnm>LM</fnm>
					</au>
					<au>
						<snm>Elnitski</snm>
						<fnm>LL</fnm>
					</au>
				</aug>
				<source>PLoS Computational Biology</source>
				<pubdate>2007</pubdate>
				<volume>3</volume>
				<issue>4</issue>
				<note>[E72]</note>
			</bibl>
			<bibl id="B5">
				<title>
					<p>The UCSC Known Genes</p>
				</title>
				<aug>
					<au>
						<snm>Hsu</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Kent</snm>
						<fnm>WJ</fnm>
					</au>
					<au>
						<snm>Clawson</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Kuhn</snm>
						<fnm>RM</fnm>
					</au>
					<au>
						<snm>Diekhans</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>22</volume>
				<issue>9</issue>
				<fpage>1036</fpage>
				<lpage>1046</lpage>
				<url>http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/9/1036</url>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/btl048</pubid>
						<pubid idtype="pmpid" link="fulltext">16500937</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B6">
				<title>
					<p>GenBank: update</p>
				</title>
				<aug>
					<au>
						<snm>Benson</snm>
						<fnm>DA</fnm>
					</au>
					<au>
						<snm>Karsch-Mizrachi</snm>
						<fnm>I</fnm>
					</au>
					<au>
						<snm>Lipman</snm>
						<fnm>DJ</fnm>
					</au>
					<au>
						<snm>Ostell</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Wheeler</snm>
						<fnm>DL</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2004</pubdate>
				<volume>32</volume>
				<issue>Database issue</issue>
				<fpage>D23</fpage>
				<lpage>D26</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/nar/gkh045</pubid>
						<pubid idtype="pmpid" link="fulltext">14681350</pubid>
						<pubid idtype="pmcid">308779</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>CAGE Basic/Analysis Databases: the CAGE resource for comprehensive promoter analysis</p>
				</title>
				<aug>
					<au>
						<snm>Kawaji</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Kasukawa</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Fukuda</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Katayama</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Kai</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Kawai</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Carninci</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Hayashizaki</snm>
						<fnm>Y</fnm>
					</au>
				</aug>
				<source>Nucleic Acids Res</source>
				<pubdate>2006</pubdate>
				<volume>34</volume>
				<issue>Database issue</issue>
				<fpage>D632</fpage>
				<lpage>D636</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/nar/gkj034</pubid>
						<pubid idtype="pmpid" link="fulltext">16381948</pubid>
						<pubid idtype="pmcid">1347397</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>Systematic Analysis of Head-to-Head Gene Organization: Evolutionary Conservation and Potential Biological Relevance</p>
				</title>
				<aug>
					<au>
						<snm>Li</snm>
						<fnm>YY</fnm>
					</au>
					<au>
						<snm>Yu</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Guo</snm>
						<fnm>ZM</fnm>
					</au>
					<au>
						<snm>Guo</snm>
						<fnm>TQ</fnm>
					</au>
					<au>
						<snm>Tu</snm>
						<fnm>K</fnm>
					</au>
					<au>
						<snm>Li</snm>
						<fnm>YX</fnm>
					</au>
				</aug>
				<source>PLoS Computational Biology</source>
				<pubdate>2006</pubdate>
				<volume>2</volume>
				<issue>7</issue>
				<note>[E74]</note>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Rigorous Mapping of Orthologous Bidirectional Promoters in Vertebrates Defines their Evolutionary History</p>
				</title>
				<aug>
					<au>
						<snm>Yang</snm>
						<fnm>MQ</fnm>
					</au>
					<au>
						<snm>Taylor</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Elnitski</snm>
						<fnm>LL</fnm>
					</au>
				</aug>
				<source>In Proceedings of International Multi-Symposiums on Computer and Computational Sciences</source>
				<pubdate>2007</pubdate>
				<note>in press</note>
			</bibl>
			<bibl id="B10">
				<title>
					<p>ESPERR: learning strong and weak signals in genomic sequence alignments to identify functional elements</p>
				</title>
				<aug>
					<au>
						<snm>Taylor</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Tyekucheva</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>King</snm>
						<fnm>DC</fnm>
					</au>
					<au>
						<snm>Hardison</snm>
						<fnm>RC</fnm>
					</au>
					<au>
						<snm>Miller</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Chiaromonte</snm>
						<fnm>F</fnm>
					</au>
				</aug>
				<source>Genome Res.</source>
				<pubdate>2006</pubdate>
				<volume>16</volume>
				<fpage>1596</fpage>
				<lpage>1604</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1665643</pubid>
						<pubid idtype="pmpid" link="fulltext">17053093</pubid>
						<pubid idtype="doi">10.1101/gr.4537706</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Predicting tissue-specific enhancers in the human genome</p>
				</title>
				<aug>
					<au>
						<snm>Pennacchio</snm>
						<fnm>LA</fnm>
					</au>
					<au>
						<snm>Loots</snm>
						<fnm>GG</fnm>
					</au>
					<au>
						<snm>Nobrega</snm>
						<fnm>MA</fnm>
					</au>
					<au>
						<snm>Ovcharenko</snm>
						<fnm>I</fnm>
					</au>
				</aug>
				<source>Genome Res.</source>
				<pubdate>2007</pubdate>
				<volume>17</volume>
				<issue>2</issue>
				<fpage>201</fpage>
				<lpage>211</lpage>
				<url>http://www.genome.org/cgi/content/abstract/17/2/201</url>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1781352</pubid>
						<pubid idtype="pmpid" link="fulltext">17210927</pubid>
						<pubid idtype="doi">10.1101/gr.5972507</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>Identification and Characterization of Multi-Species Conserved Sequences</p>
				</title>
				<aug>
					<au>
						<snm>Margulies</snm>
						<fnm>EH</fnm>
					</au>
					<au>
						<snm>Blanchette</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Haussler</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Green</snm>
						<fnm>ED</fnm>
					</au>
				</aug>
				<source>Genome Res.</source>
				<pubdate>2003</pubdate>
				<volume>13</volume>
				<issue>12</issue>
				<fpage>2507</fpage>
				<lpage>2518</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">403793</pubid>
						<pubid idtype="pmpid" link="fulltext">14656959</pubid>
						<pubid idtype="doi">10.1101/gr.1602203</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome</p>
				</title>
				<aug>
					<au>
						<snm>Heintzman</snm>
						<fnm>ND</fnm>
					</au>
					<au>
						<snm>Stuart</snm>
						<fnm>RK</fnm>
					</au>
					<au>
						<snm>Hon</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Fu</snm>
						<fnm>Y</fnm>
					</au>
					<au>
						<snm>Ching</snm>
						<fnm>CW</fnm>
					</au>
					<au>
						<snm>Hawkins</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Barrera</snm>
						<fnm>LO</fnm>
					</au>
					<au>
						<snm>Van Calcar</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Qu</snm>
						<fnm>C</fnm>
					</au>
					<au>
						<snm>Ching</snm>
						<fnm>KA</fnm>
					</au>
					<au>
						<snm>Wang</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Weng</snm>
						<fnm>Z</fnm>
					</au>
					<au>
						<snm>Green</snm>
						<fnm>RD</fnm>
					</au>
					<au>
						<snm>Crawford</snm>
						<fnm>GE</fnm>
					</au>
					<au>
						<snm>Ren</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Nature Genetics</source>
				<pubdate>2007</pubdate>
				<volume>39</volume>
				<fpage>311</fpage>
				<lpage>318</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/ng1966</pubid>
						<pubid idtype="pmpid" link="fulltext">17277777</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>The gateway to transcription: identifying, characterizing and understanding promoters in the eukaryotic genome</p>
				</title>
				<aug>
					<au>
						<snm>Heintzman</snm>
						<fnm>ND</fnm>
					</au>
					<au>
						<snm>Ren</snm>
						<fnm>B</fnm>
					</au>
				</aug>
				<source>Cell Mol Life Sci</source>
				<pubdate>2007</pubdate>
				<volume>64</volume>
				<issue>4</issue>
				<fpage>386</fpage>
				<lpage>400</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1007/s00018-006-6295-0</pubid>
						<pubid idtype="pmpid" link="fulltext">17171231</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
