<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>gb-2005-6-4-r33</ui>
	<ji>GBJ</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>Promoter features related to tissue specificity as measured by Shannon entropy</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Schug</snm>
					<fnm>Jonathan</fnm>
					<insr iid="I1"/>
					<email>jschug@pcbi.upenn.edu</email>
				</au>
				<au id="A2">
					<snm>Schuller</snm>
					<fnm>Winfried-Paul</fnm>
					<insr iid="I2"/>
				</au>
				<au id="A3">
					<snm>Kappen</snm>
					<fnm>Claudia</fnm>
					<insr iid="I2"/>
				</au>
				<au id="A4">
					<snm>Salbaum</snm>
					<fnm>J Michael</fnm>
					<insr iid="I2"/>
				</au>
				<au id="A5">
					<snm>Bucan</snm>
					<fnm>Maja</fnm>
					<insr iid="I3"/>
				</au>
				<au id="A6">
					<snm>Stoeckert</snm>
					<mi>J</mi>
					<fnm>Christian</fnm>
					<suf>Jr</suf>
					<insr iid="I1"/>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA</p>
				</ins>
				<ins id="I2">
					<p>Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA</p>
				</ins>
				<ins id="I3">
					<p>Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA</p>
				</ins>
			</insg>
			<source>Genome Biology</source>
			<issn>1465-6906</issn>
			<pubdate>2005</pubdate>
			<volume>6</volume>
			<issue>4</issue>
			<fpage>R33</fpage>
			<url>http://genomebiology.com/2005/6/4/R33</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">15833120</pubid><pubid idtype="doi">10.1186/gb-2005-6-4-r33</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<rec>
				<date>
					<day>16</day>
					<month>11</month>
					<year>2004</year>
				</date>
			</rec>
			<revrec>
				<date>
					<day>27</day>
					<month>1</month>
					<year>2005</year>
				</date>
			</revrec>
			<acc>
				<date>
					<day>16</day>
					<month>2</month>
					<year>2005</year>
				</date>
			</acc>
			<pub>
				<date>
					<day>29</day>
					<month>3</month>
					<year>2005</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2005</year>
			<collab>Schug et al.; licensee BioMed Central Ltd.</collab>
			<note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<shorttitle>
			<p>Promoter features related to tissue-specific expression</p>
		</shorttitle>
		<shortabs>
			<p>A genome-wide analysis of promoters was carried out in the context of gene expression patterns in tissue surveys using human microarray and EST-based expression data. The study revealed that most genes show statistically significant tissue-dependent variations of expression level and identified components of promoters that distinguish tissue-specific from ubiquitous genes.</p>
		</shortabs>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p>The regulatory mechanisms underlying tissue specificity are a crucial part of the development and maintenance of multicellular organisms. A genome-wide analysis of promoters in the context of gene-expression patterns in tissue surveys provides a means of identifying the general principles for these mechanisms.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>We introduce a definition of tissue specificity based on Shannon entropy to rank human genes according to their overall tissue specificity and by their specificity to particular tissues. We apply our definition to microarray-based and expressed sequence tag (EST)-based expression data for human genes and use similar data for mouse genes to validate our results. We show that most genes exhibit statistically significant tissue-dependent variations in expression level. We find that the most tissue-specific genes typically have a TATA box, no CpG island, and often code for extracellular proteins. As expected, CpG islands are found in most of the least tissue-specific genes, which often code for proteins located in the nucleus or mitochondrion. Genes with neither a CpG island nor a TATA box are the most common mid-specificity genes and commonly code for proteins located in a membrane. Sp1 was found to be a weak indicator of less-specific expression. YY1 binding sites, either as initiators or as downstream sites, were strongly associated with the least-specific genes.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusions</p>
					</st>
					<p>We have begun to understand the components of promoters that distinguish tissue-specific from ubiquitous genes, and to identify associations that can predict the broad class of gene expression from sequence data alone.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<meta>
		<classifications>
			<classification type="BMC" subtype="man_spc_id" id="30010016">Molecular biology</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010009">Genetics</classification>
			<classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
		</classifications>
	</meta>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>The development of an adult from the single cell of a fertilized egg requires a complex orchestration of genes to be expressed at the right time, place, and level. Basic cellular functions require the expression of certain genes in all cells and tissues (that is, in a ubiquitous manner) while specialized functions require restricted expression of other genes in a single or small number of cells and tissues (that is, in a tissue-specific manner). Both types of genes may be needed for embryonic development as well as for the function of adult cells and tissues. While the details of regulatory mechanisms will vary for individual genes, general features of promoters (and here we will restrict our focus to RNA polymerase II (Pol II) promoters) are likely to influence whether a gene will be expressed widely or in a restricted manner. For example, based on the limited number of genes available at the time of the analysis, promoters with CpG islands have been associated with housekeeping genes <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. It is desirable to re-examine this finding in the context of complete genomes for human and mouse and to place it in context with subsequent findings such as the association of CpG islands with embryonic expression <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
			<p>Furthermore, it would be informative to examine the relationship of CpG islands to the base composition of promoters, and the distribution of motifs thought to be bound by factors closely involved with (or part of) the basal transcription complex. The distribution of major components of the core promoter, the TATA box (TBP/TFIID binding site) and initiator element (Pol II binding site, Inr) <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, and proximal elements such as the Yin-Yang 1 (YY1) site <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>, among genes is not yet well understood. In addition, the functional correlations between tissue specificity and promoter structure are largely unknown beyond the CpG island association. Our goal is to place these components together in general models for tissue specificity using genome-wide surveys of expression in many tissues.</p>
			<p>Investigators have searched for combinations of transcription-factor-binding sites that confer tissue-specific expression on particular cell types such as muscle <abbrgrp><abbr bid="B9">9</abbr></abbrgrp> or liver <abbrgrp><abbr bid="B10">10</abbr></abbrgrp> in mammals, or in body plan specification in the fruit fly <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> (see <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> for a review). In support of these efforts, analyses of genome-wide expression data have largely focused on identifying common patterns for particular tissues, disease states or signaling inputs. For microarray data, investigators have begun defining these patterns, largely through the application of clustering algorithms <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. Our approach is to rank genes in the spectrum of tissue specificity that runs from expression restricted to one tissue to uniform ubiquitous expression. We can study in detail the distribution of human and mouse genes across the spectrum of tissue specificity and use this to identify commonalities and differences in their promoters with the available complete genome sequences <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, libraries enriched for full-length cDNAs <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp> and genome-wide surveys of gene expression using microarrays <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp>, SAGE <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, mRNAs <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> and expressed sequence tags (ESTs) <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. We validate patterns discovered in human sequence and expression data by comparison to similar mouse data.</p>
			<p>Measures have been developed for overall tissue specificity <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp> that amount to counting the number of tissues that express a gene. These are really measuring tissue restriction, as they do not consider any bias in the expression levels across the tissues that express the gene. Most specificity measures for a particular tissue are equivalent to the relative expression in a tissue compared to the total expression in all tissues considered (see, for example, <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>). We assert that overall tissue specificity measures should take into account the levels of expression in different tissues, not just presence and absence, and that specificity measures for particular tissues should consider the distribution of expression among all tissues in addition to the tissue of interest. Such measures would enable the correct identification of genes as specific for a tissue when that tissue is not the primary site of expression but there are only a few other tissues where the gene is expressed.</p>
			<p>A metric for characterizing the breadth and uniformity of the expression pattern of a gene that meets our criteria is the Shannon information theoretic measure entropy. Although entropy has been used previously to identify potential drug targets <abbrgrp><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr></abbrgrp> by considering the entropy of the variation of expression levels and to cluster microarray data <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, our direct application of entropy to measuring tissue specificity is unique. Entropy (<it>H</it>) measures the degree of overall tissue specificity of a gene, but does not indicate whether it is specific to a particular tissue. To quantify categorical tissue specificity, we introduce a new statistic (<it>Q</it>) that incorporates overall tissue specificity and relative expression level. We demonstrate that <it>H </it>and <it>Q </it>are effective metrics for ranking and selecting genes according to tissue specificity and then proceed to use them to investigate promoter features (CpG islands, base composition, transcription factor motifs) that may be used to distinguish tissue-specific genes from nonspecific genes. The association of promoter features with a quantitative assessment of tissue specificity using <it>H </it>and <it>Q </it>is an important step towards developing models for promoter function.</p>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<sec>
				<st>
					<p>Defining tissue specificity</p>
				</st>
				<p>We begin by defining the measurement of two kinds of tissue specificity, 'overall' tissue specificity and 'categorical' tissue specificity. (To avoid confusion we will always use the words 'specificity' and 'specific' to refer to the degree of tissue-restricted expression a gene exhibits and never as a synonym for the word 'particular'.) Overall tissue specificity ranks a gene according to the degree to which its expression pattern differs from ubiquitous uniform expression. We use the term 'ubiquitous' expression to mean expression at any level above background in all tissues. Categorical tissue specificity places special emphasis on a particular tissue of interest and ranks a gene according to the degree to which its expression pattern is skewed toward expression in only that particular tissue. In both cases, a gene's specificity to a tissue, cell type or other condition is decreased as the gene is more uniformly expressed in a wider variety of conditions. In addition, the categorical tissue specificity should decrease as the tissue of interest becomes a smaller component of the overall expression pattern of the gene.</p>
				<p>Given a static multi-tissue expression profile for a gene, there are at least two dimensions along which we can assess the profile to measure tissue specificity. The first dimension is the number of tissues that express the gene above some background level. It can be argued that this dimension measures tissue restriction, that is, a gene shows restricted expression if it is expressed in only a subset of tissues. The second dimension is the uniformity of expression over all tissues that express the gene. A gene that shows significant non-uniform expression is exhibiting tissue-dependent regulation, in addition to any tissue restriction that may be occurring. We assume that a gene that exhibits no tissue-specific regulation will be expressed at the same level in every tissue. We do not assert that such genes are not regulated, only that they are regulated in a way that is not sensitive to tissue.</p>
				<p>The term 'most tissue-specific' will refer to the range of genes that are closer to the extreme of expression in a single tissue than to the extreme of ubiquitous uniform expression. We will refer to genes close to the uniform and ubiquitous end as either 'least tissue-specific' or 'nonspecific', though the latter term may not be strictly true. The range in the middle will be termed 'semi-tissue specific'. The term 'housekeeping' has been applied to genes that are widely expressed and may show few tissue-specific changes in expression level. We can use such genes as an example of genes that will tend to be ubiquitously and uniformly expressed and thus ought to be nonspecific on average. We will use the phrase 'gene sharing' to refer to the situation that occurs when a gene is tissue-specific and is expressed in a small number of tissues that can be said to share the gene.</p>
			</sec>
			<sec>
				<st>
					<p>Measuring tissue specificity with entropy</p>
				</st>
				<p>We used two gene-expression datasets to evaluate our methods: Affymetrix-based data from the GNF Gene Expression Atlas (GNF-GEA) <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and the distribution of source tissues for EST libraries in the clusters and assemblies of ESTs in the DoTS mouse and human gene index <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. As described in Materials and methods, the GNF-GEA data were used as provided; EST counts in the DoTS gene index were adjusted with pseudocounts and normalized to account for the different number of ESTs sampled from each tissue across all libraries. Given expression levels of a gene in <it>N </it>tissues, we defined the relative expression of a gene <it>g </it>in a tissue <it>t </it>as <it>p</it><sub><it>t</it>|<it>g </it></sub>= <it>w</it><sub><it>g</it>,<it>t</it></sub>/&#8721;<sub>1 &#8804; <it>t </it>&#8804; <it>N</it></sub><it>w</it><sub><it>g</it>,<it>t </it></sub>where <it>w</it><sub><it>g</it>,<it>t </it></sub>is the expression level of the gene in the tissue. The entropy <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> of a gene's expression distribution is <it>H</it><sub><it>g </it></sub>= &#8721;<sub>1 &#8804; <it>t </it>&#8804; <it>N </it></sub>- <it>p</it><sub><it>t</it>|<it>g </it></sub>log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>). <it>H</it><sub><it>g </it></sub>has units of bits and ranges from zero for genes expressed in a single tissue to log<sub>2</sub>(<it>N</it>) for genes expressed uniformly in all tissues considered. The maximum value of <it>H</it><sub><it>g </it></sub>depends on the number of tissues considered, so we will report this number when appropriate. Because we use relative expression, the entropy of a gene is not sensitive to the absolute expression levels. To measure categorical tissue specificity we define <it>Q</it><sub><it>g</it>|<it>t </it></sub>= <it>H</it><sub><it>g </it></sub>- log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>). 
The quantity -log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>) also has units of bits; it has a minimum of zero, which occurs when a gene is expressed in a single tissue, and grows without bound as the relative expression level drops to zero. Thus <it>Q</it><sub><it>g</it>|<it>t </it></sub>is near its minimum of zero bits when a gene is relatively highly expressed in a small number of tissues including the tissue of interest, and becomes higher as either the number of tissues expressing the gene becomes higher, or as the relative contribution of the tissue to the gene's overall pattern becomes smaller. By itself, ranking genes by the term -log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>) is equivalent to ranking them by <it>p</it><sub><it>t</it>|<it>g</it></sub>. Adding the entropy term serves to favor genes that are not expressed highly in the tissue of interest, but are expressed only in a small number of other tissues. As described earlier, we want to consider such genes as categorically tissue-specific since their expression pattern is very restricted. Figure <figr fid="F1">1</figr> shows examples of patterns of GNF-GEA expression data for different values of <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t</it></sub>. The top five genes specific to mouse amygdala, lymph node, and liver as assessed by these data are listed in Table <tblr tid="T1">1</tblr>. Tables of <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t </it></sub>values for all genes in all tissues in the GNF-GEA datasets are available in Additional data files 1 and 2.</p>
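				<p>The definitions of <it>H</it><sub><it>g</it></sub> and <it>Q</it><sub><it>g</it>|<it>t</it></sub> above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names and the dictionary-based input are illustrative assumptions, expression levels are assumed to be non-negative, and the pseudocount adjustment and normalization applied to the EST data are not reproduced.</p>

```python
import math


def relative_expression(levels):
    """Return p_t|g: each tissue's share of the gene's total expression.

    `levels` maps tissue name -> non-negative expression level w_g,t
    (a hypothetical input layout chosen for this sketch).
    """
    total = sum(levels.values())
    if total == 0:
        raise ValueError("gene has no expression in any tissue")
    return {tissue: w / total for tissue, w in levels.items()}


def entropy_bits(levels):
    """H_g = sum over tissues of -p_t|g * log2(p_t|g), in bits.

    Tissues with zero expression contribute nothing (0 * log2(0) is taken as 0).
    """
    p = relative_expression(levels)
    return sum(-pt * math.log2(pt) for pt in p.values() if pt > 0)


def categorical_specificity(levels, tissue):
    """Q_g|t = H_g - log2(p_t|g), in bits.

    Returns infinity when the gene is not expressed in `tissue`,
    matching the unbounded growth of -log2(p_t|g) as p_t|g drops to zero.
    """
    p = relative_expression(levels)
    if p[tissue] == 0:
        return math.inf
    return entropy_bits(levels) - math.log2(p[tissue])
```

				<p>Under these assumptions, a gene expressed equally in four tissues has <it>H</it> = 2 bits and <it>Q</it> = 4 bits for each of those tissues, whereas a gene expressed in a single tissue has <it>H</it> = 0 bits and <it>Q</it> = 0 bits for that tissue.</p>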
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>Examples of GNF-GEA expression patterns for mouse genes at selected <it>H</it><sub><it>g </it></sub>and <it>Q</it></p>
					</caption>
					<text>
						<p>Examples of GNF-GEA expression patterns for mouse genes at selected <it>H</it><sub><it>g </it></sub>and <it>Q</it>. Liver, indicated in red, is the tissue of interest for <it>Q </it>values. <b>(a) </b>Serum albumin (94777_at <it>Alb1</it>) shows very specific liver expression: <it>H </it>= 1.3 bits and <it>Q</it><sub>liver </sub>= 2.1 bits. <b>(b) </b>For liver-specific bHLH-Zip transcription factor (99452_at <it>Lisch7</it>), liver is a strong but not dominant part of the expression pattern: <it>H </it>= 3.7 bits and <it>Q</it><sub>liver </sub>= 6.8 bits. <b>(c) </b>For chloride channel 7 (104391_s_at <it>Clcn7</it>) there is near uniform expression: <it>H </it>= 4.3 bits and <it>Q</it><sub>liver </sub>= 10.2 bits. <b>(d) </b>Gelsolin (93750_at <it>Gsn</it>) is an otherwise widely expressed gene but is expressed at a very low level in the liver: <it>H </it>= 4.4 bits and <it>Q</it><sub>liver </sub>= 15.1 bits.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-1"/>
				</fig>
				<tbl id="T1">
					<title>
						<p>Table 1</p>
					</title>
					<caption>
						<p>The top five most tissue-specific genes for representative tissues</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c ca="left">
								<p>Tissue</p>
							</c>
							<c ca="left">
								<p>Probe set ID</p>
							</c>
							<c ca="left">
								<p>
									<it>H</it>
								</p>
							</c>
							<c ca="left">
								<p>
									<it>Q</it>
								</p>
							</c>
							<c ca="left">
								<p>RefSeq</p>
							</c>
							<c ca="left">
								<p>Description</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Amygdala</p>
							</c>
							<c ca="left">
								<p>96055_at</p>
							</c>
							<c ca="left">
								<p>3.2</p>
							</c>
							<c ca="left">
								<p>5.8</p>
							</c>
							<c ca="left">
								<p>NM_031161</p>
							</c>
							<c ca="left">
								<p>Cholecystokinin</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>93178_at</p>
							</c>
							<c ca="left">
								<p>2.7</p>
							</c>
							<c ca="left">
								<p>5.8</p>
							</c>
							<c ca="left">
								<p>NM_019867</p>
							</c>
							<c ca="left">
								<p>Neuronal guanine nucleotide exchange factor</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>93273_at</p>
							</c>
							<c ca="left">
								<p>3.7</p>
							</c>
							<c ca="left">
								<p>5.8</p>
							</c>
							<c ca="left">
								<p>NM_009221</p>
							</c>
							<c ca="left">
								<p>Synuclein, alpha</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>92943_at</p>
							</c>
							<c ca="left">
								<p>3.5</p>
							</c>
							<c ca="left">
								<p>6.0</p>
							</c>
							<c ca="left">
								<p>NM_008165</p>
							</c>
							<c ca="left">
								<p>Glutamate receptor, ionotropic, AMPA1 (alpha 1)</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>95436_at</p>
							</c>
							<c ca="left">
								<p>3.3</p>
							</c>
							<c ca="left">
								<p>6.1</p>
							</c>
							<c ca="left">
								<p>NM_009215</p>
							</c>
							<c ca="left">
								<p>Somatostatin</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Lymph node</p>
							</c>
							<c ca="left">
								<p>98406_at</p>
							</c>
							<c ca="left">
								<p>2.7</p>
							</c>
							<c ca="left">
								<p>4.0</p>
							</c>
							<c ca="left">
								<p>NM_013653</p>
							</c>
							<c ca="left">
								<p>Chemokine (C-C motif) ligand 5</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>98063_at</p>
							</c>
							<c ca="left">
								<p>1.6</p>
							</c>
							<c ca="left">
								<p>4.1</p>
							</c>
							<c ca="left">
								<p>-</p>
							</c>
							<c ca="left">
								<p>Glycosylation dependent cell adhesion molecule 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>99446_at</p>
							</c>
							<c ca="left">
								<p>2.5</p>
							</c>
							<c ca="left">
								<p>4.1</p>
							</c>
							<c ca="left">
								<p>NM_007641</p>
							</c>
							<c ca="left">
								<p>Membrane-spanning 4-domains, subfamily A, member 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>92741_g_at</p>
							</c>
							<c ca="left">
								<p>3.3</p>
							</c>
							<c ca="left">
								<p>4.5</p>
							</c>
							<c ca="left">
								<p>-</p>
							</c>
							<c ca="left">
								<p>Immunoglobulin heavy chain 4 (serum IgG1)</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>102940_at</p>
							</c>
							<c ca="left">
								<p>2.8</p>
							</c>
							<c ca="left">
								<p>4.6</p>
							</c>
							<c ca="left">
								<p>NM_008518</p>
							</c>
							<c ca="left">
								<p>Lymphotoxin B</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Liver</p>
							</c>
							<c ca="left">
								<p>94777_at</p>
							</c>
							<c ca="left">
								<p>1.3</p>
							</c>
							<c ca="left">
								<p>2.1</p>
							</c>
							<c ca="left">
								<p>-</p>
							</c>
							<c ca="left">
								<p>Albumin 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>101287_s_at</p>
							</c>
							<c ca="left">
								<p>1.6</p>
							</c>
							<c ca="left">
								<p>2.2</p>
							</c>
							<c ca="left">
								<p>NM_010005</p>
							</c>
							<c ca="left">
								<p>Cytochrome P450, 2d10</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>99269_g_at</p>
							</c>
							<c ca="left">
								<p>1.5</p>
							</c>
							<c ca="left">
								<p>2.2</p>
							</c>
							<c ca="left">
								<p>NM_019911</p>
							</c>
							<c ca="left">
								<p>Tryptophan 2,3-dioxygenase</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>100329_at</p>
							</c>
							<c ca="left">
								<p>1.4</p>
							</c>
							<c ca="left">
								<p>2.3</p>
							</c>
							<c ca="left">
								<p>NM_009246</p>
							</c>
							<c ca="left">
								<p>Serine protease inhibitor 1-4</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>94318_at</p>
							</c>
							<c ca="left">
								<p>1.6</p>
							</c>
							<c ca="left">
								<p>2.3</p>
							</c>
							<c ca="left">
								<p>NM_013475</p>
							</c>
							<c ca="left">
								<p>Apolipoprotein H</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>Genes must be expressed at 200 AU in one or more tissues. A full list of all genes is available in <supplr sid="S1">Additional data files 1</supplr> and <supplr sid="S2">2</supplr>.</p>
					</tblfn>
				</tbl>
				<p>To compare results from microarray and EST-based expression data we mapped the tissues from the GNF-GEA study to the hierarchical controlled vocabulary of anatomical terms used by DoTS and chose a set of 45 tissue terms grouped into 32 groups shown in Table <tblr tid="T2">2</tblr>. In both cases, the vast majority of genes are widely expressed as measured by <it>H</it><sub><it>g </it></sub>as shown in Figure <figr fid="F2">2a</figr>. Of the 7,714 probe sets in the GNF-GEA data with an average normalized intensity value above 50 arbitrary units (AU), 6,167 (80%) of genes had <it>H</it><sub><it>g </it></sub>&#8805; 4 bits, which implies expression in at least 16 tissues and typically corresponds to wider, but uneven, expression. Only 87 (2%) of genes had <it>H</it><sub><it>g </it></sub>&#8804; 1.5 bits, which corresponds to expression in as few as three tissues. Both microarray- and EST-based data yielded similar overall curves. The EST curve peaked at a lower <it>H</it><sub><it>g </it></sub>than the microarray curve. This was due to the small numbers of EST sequences in some of the tissues we considered; EST counts for tissues ranged from 1,933 in the adrenal gland to 331,582 in the central nervous system (CNS). Genes that are ubiquitously expressed may not have ESTs from several of the lightly sequenced tissues, making them appear to have more restricted expression, and hence a lower entropy, than they really do. Figure <figr fid="F2">2b</figr> shows the correlation between estimates of <it>H</it><sub><it>g </it></sub>derived from microarray and EST data. Visual inspection of the plot reveals that while there are no strong contradictions between the two methods, quantitative agreement is limited. Detailed analysis shows that the standard deviation of the difference of paired <it>H</it><sub><it>g </it></sub>values is 0.61 bits. 
Under the null hypothesis that the estimates from the two data sources are totally uncorrelated, the average standard deviation was found to be 0.91 bits. We can therefore reject the null hypothesis (<it>P </it>&lt; 10<sup>-5 </sup>as estimated by Monte Carlo methods). The distribution of <it>Q</it><sub><it>g</it>|<it>t </it></sub>for selected tissues is shown in Figure <figr fid="F2">2c</figr>. These curves can be used to characterize tissues in terms of the number of tissue-specific genes and the amount of gene sharing; for example, liver has a relatively large number of genes shared with a small number of other tissues. In contrast, there were no genes in this set that are uniquely expressed in the amygdala.</p>
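				<p>The comparison of paired <it>H</it><sub><it>g</it></sub> estimates can be sketched as a permutation test. The exact Monte Carlo procedure is not specified in this section, so the following Python sketch rests on an assumption: the null distribution is generated by shuffling one set of estimates to break the gene-wise pairing. The function names, the number of shuffles and the use of the sample standard deviation are likewise illustrative choices.</p>

```python
import random
import statistics


def sd_of_differences(xs, ys):
    """Sample standard deviation of the paired differences x_i - y_i."""
    return statistics.stdev([x - y for x, y in zip(xs, ys)])


def permutation_test(xs, ys, n_shuffles=1000, seed=0):
    """Compare the observed SD of paired differences with a shuffled null.

    Assumption of this sketch: under the null hypothesis the two sets of
    estimates are unrelated, so shuffling `ys` breaks the pairing without
    changing either marginal distribution.
    Returns (observed SD, mean null SD, one-sided empirical P value).
    """
    rng = random.Random(seed)
    observed = sd_of_differences(xs, ys)
    shuffled = list(ys)
    null_sds = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_sds.append(sd_of_differences(xs, shuffled))
    # Count shuffles whose SD is no larger than the observed one.
    hits = sum(1 for s in null_sds if observed >= s)
    # Add-one correction keeps the estimate strictly positive.
    p_value = (hits + 1) / (n_shuffles + 1)
    return observed, statistics.mean(null_sds), p_value
```

				<p>With this estimator the smallest attainable <it>P </it>value is 1/(<it>n</it> + 1) for <it>n</it> shuffles, so resolving a threshold of 10<sup>-5</sup> would require on the order of 10<sup>5</sup> shuffles or more.</p>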
				<tbl id="T2">
					<title>
						<p>Table 2</p>
					</title>
					<caption>
						<p>The list of tissues used in this study</p>
					</caption>
					<tblbdy cols="3">
						<r>
							<c ca="left">
								<p>GNF-GEA tissues</p>
							</c>
							<c ca="left">
								<p>Comparison to EST</p>
							</c>
							<c ca="left">
								<p>Hierarchical clustering</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>DRG</p>
							</c>
							<c ca="left">
								<p>PNS</p>
							</c>
							<c ca="left">
								<p>Nervous system</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Trigeminal</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Hippocampus</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Amygdala</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Frontal_cortex</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cortex</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Striatum</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Olfactory_bulb</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Hypothalamus</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Spinal_cord_lower</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Spinal_cord_upper</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Cerebellum</p>
							</c>
							<c ca="left">
								<p>CNS</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Eye</p>
							</c>
							<c ca="left">
								<p>Eye</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Spleen</p>
							</c>
							<c ca="left">
								<p>Spleen</p>
							</c>
							<c ca="left">
								<p>Immune System + trachea</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Lymph_node</p>
							</c>
							<c ca="left">
								<p>Lymph_node</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Trachea</p>
							</c>
							<c ca="left">
								<p>Trachea</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Thymus</p>
							</c>
							<c ca="left">
								<p>Thymus</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Bone_marrow</p>
							</c>
							<c ca="left">
								<p>Bone</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Bone</p>
							</c>
							<c ca="left">
								<p>Bone</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Lung</p>
							</c>
							<c ca="left">
								<p>Lung</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Uterus</p>
							</c>
							<c ca="left">
								<p>Uterus</p>
							</c>
							<c ca="left">
								<p>Reproductive organs</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Umbilical cord</p>
							</c>
							<c ca="left">
								<p>Umbilical_cord</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Placenta</p>
							</c>
							<c ca="left">
								<p>Placenta</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Ovary</p>
							</c>
							<c ca="left">
								<p>Ovary</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Epidermis, snout_epidermis</p>
							</c>
							<c ca="left">
								<p>Epidermis</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Heart</p>
							</c>
							<c ca="left">
								<p>Heart</p>
							</c>
							<c ca="left">
								<p>Muscle</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Skeletal_muscle</p>
							</c>
							<c ca="left">
								<p>Skeletal_muscle</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Adipose_tissue, brown_fat</p>
							</c>
							<c ca="left">
								<p>Fat</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Adrenal_gland</p>
							</c>
							<c ca="left">
								<p>Adrenal_gland</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Stomach</p>
							</c>
							<c ca="left">
								<p>Stomach</p>
							</c>
							<c ca="left">
								<p>Digestive tract</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Bladder</p>
							</c>
							<c ca="left">
								<p>Bladder</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Small_intestine</p>
							</c>
							<c ca="left">
								<p>Small_intestine</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Large_intestine</p>
							</c>
							<c ca="left">
								<p>Large_intestine</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Gall bladder</p>
							</c>
							<c ca="left">
								<p>Gall_bladder</p>
							</c>
							<c ca="left">
								<p>Gall bladder, liver, and kidney</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Liver</p>
							</c>
							<c ca="left">
								<p>Liver</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Kidney</p>
							</c>
							<c ca="left">
								<p>Kidney</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Salivary_gland</p>
							</c>
							<c ca="left">
								<p>Salivary_gland</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Thyroid</p>
							</c>
							<c ca="left">
								<p>Thyroid</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Mammary_gland</p>
							</c>
							<c ca="left">
								<p>Mammary_gland</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Prostate</p>
							</c>
							<c ca="left">
								<p>Prostate</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Testis</p>
							</c>
							<c ca="left">
								<p>Testis</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Tongue</p>
							</c>
							<c ca="left">
								<p>Tongue</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Digits</p>
							</c>
							<c ca="left">
								<p>Digits</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The list of tissues available in the mouse GNF+GEA survey, groupings of tissues used to compare microarray and EST-based entropy estimates, and tissue groups discovered by clustering tissues on the basis of genes expressed in common.</p>
					</tblfn>
				</tbl>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>Distributions of <it>H </it>and <it>Q </it>for different data sources and tissues</p>
					</caption>
					<text>
						<p>Distributions of <it>H </it>and <it>Q </it>for different data sources and tissues. <b>(a) </b>Distribution of <it>H </it>as estimated from GNF-GEA (red line) and DoTS (blue line). The DoTS curve was generated from genes with at least six ESTs. <b>(b) </b>Correlation of <it>H </it>estimates from GNF-GEA and DoTS. Genes with at least 30 ESTs are shown in red; those with more than 100 ESTs in blue. <b>(c) </b>Cumulative distribution of <it>Q </it>values for selected mouse tissues and the average for all 39 tissues. Mammary gland, liver, muscle and the amygdala have decreasing numbers of highly tissue-specific genes. Liver has a very large number of relatively specific genes. All distributions peak at 2 log<sub>2</sub>(39) = 10.6 bits and have a tail at high <it>Q </it>(not shown) that corresponds to genes that are ubiquitously expressed except in the tissue of interest.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-2"/>
				</fig>
				<p>It is important to determine how well the <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t </it></sub>statistics can be estimated from a dataset, both to establish the smallest meaningful difference in scores and to guide interpretation of gene rankings. To assess the standard deviations of <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t</it></sub>, we sampled from the replicates in the GNF-GEA microarray data to compute a large number of <it>H</it><sub><it>g </it></sub>values for each probe set. We found that the standard deviation for <it>H</it><sub><it>g </it></sub>was less than 0.2 bits for 97% of genes. <it>Q</it><sub><it>g</it>|<it>t </it></sub>was not estimated as well; the standard deviation was 1 bit or less for 95% of gene and tissue pairs. This was probably due to the high standard deviation of the -log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>) term for low-expressing gene-tissue pairs. We found much more variation when we measured reproducibility by considering genes that have two or more probe sets (and therefore two or more different transcripts) in the microarray data. In this case, the standard deviation of <it>H</it><sub><it>g </it></sub>estimates was as high as 1 bit for 97% of the genes, but less than 0.3 bits for about 70-80% of the genes. We chose a minimum of 1 bit for <it>H</it><sub><it>g </it></sub>bins and 2 bits for <it>Q </it>bins in the rest of the analyses that require binning. This bin size ensured that most of the genes were in the proper bin, so that a bin could be reliably used to determine associations with the tissue specificity of a class of genes.</p>
			</sec>
			<sec>
				<st>
					<p>Evaluating a set of housekeeping genes</p>
				</st>
				<p>One test of the <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t </it></sub>statistics is to compute their values for a set of nonspecific genes such as housekeeping genes. A list of 797 human housekeeping genes <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> was evaluated with these statistics on the GNF-GEA dataset, using RefSeq accession numbers to identify the appropriate probe sets. The housekeeping genes had a mean <it>H</it><sub><it>g </it></sub>= 4.6 &#177; 0.27 bits in a set of 27 tissues with a maximum <it>H </it>= lg(27) = 4.75 bits; thus they are nonspecific, as expected. Interestingly, a small number of these genes did show some degree of tissue specificity yet were ubiquitously expressed. For example, the median expression of NM_021983, the major histocompatibility complex, class II, DR beta 4 gene (32035_at), is approximately 200 AU, but it shows much higher expression in a small set of tissues (spleen, thymus, lung, heart and whole blood), which lowered its entropy. A more extreme case is NM_001502, glycoprotein 2 (zymogen granule membrane protein 2), which is expressed between 250 and 1,000 AU in all tissues except pancreas, where it is expressed at 34,183 AU. This is a ubiquitously expressed gene that entropy categorizes as specific because it showed such extreme tissue-specific induction. The housekeeping genes had a mean <it>Q</it><sub><it>g</it>|<it>t </it></sub>= 9.5 &#177; 0.14 bits in the same set of tissues. The expected <it>Q </it>value for a uniformly and ubiquitously expressed gene is 2 lg(27) = 9.5 bits. Thus, the <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t </it></sub>statistics successfully captured the expected expression properties of housekeeping genes.</p>
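				<p>The reference values quoted above can be checked with a minimal sketch. It assumes the usual definitions, <it>H</it><sub><it>g </it></sub>= -&#931; <it>p</it><sub><it>t</it>|<it>g </it></sub>log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>) and <it>Q</it><sub><it>g</it>|<it>t </it></sub>= <it>H</it><sub><it>g </it></sub>- log<sub>2</sub>(<it>p</it><sub><it>t</it>|<it>g</it></sub>), which are consistent with the value 2 lg(27) for a uniform gene. The function names and the 'induced' profile (26 tissues at 500 AU, one at 34,183 AU) are hypothetical and only illustrate the glycoprotein 2 pattern; they are not the measured data.</p>

```python
import math


def entropy_bits(expression):
    """Shannon entropy H_g (in bits) of relative expression across tissues."""
    total = sum(expression)
    probs = [x / total for x in expression]
    # Tissues with zero expression contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def q_bits(expression, tissue_index):
    """Assumed definition: Q_g|t = H_g - log2(p_t|g) for the chosen tissue."""
    total = sum(expression)
    p_t = expression[tissue_index] / total
    return entropy_bits(expression) - math.log2(p_t)


# A uniformly expressed gene in 27 tissues: H = lg(27), Q = 2 lg(27).
uniform = [200.0] * 27
h_uniform = entropy_bits(uniform)
q_uniform = q_bits(uniform, 0)

# Hypothetical profile: ubiquitous expression with one strongly induced tissue.
induced = [500.0] * 26 + [34183.0]
h_induced = entropy_bits(induced)
q_induced = q_bits(induced, 26)
```

				<p>In this sketch the induced profile has a much lower entropy than the uniform one even though every tissue expresses the gene, which is the behavior described for glycoprotein 2.</p>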
			</sec>
			<sec>
				<st>
					<p>Most genes are regulated in a tissue-dependent manner</p>
				</st>
				<p>Although the housekeeping genes assessed above have relatively high entropies, they do show some small degree of overall tissue specificity. We therefore sought to determine how many genes show evidence of tissue-dependent regulation. Since random biological and experimental variation introduce fluctuations in the expression levels of genes, we made a probability model of the effect of these fluctuations on the observed entropy. The experimental variability was estimated from the GNF-GEA data using all normal tissues. The random tissue-to-tissue biological variability was modeled by assuming that each gene has an average expression level across all tissues and that the log base 2 of the tissue-dependent fold changes from the average level follows a normal distribution with mean equal to zero and some unknown, but 'small', standard deviation (<it>s</it>). We obtained a conservative estimate of the number of genes showing evidence of tissue-dependent regulation by using <it>s </it>= 0.5, which allows for a relatively large amount of variation: up to 1.4-fold tissue-to-tissue variation around the mean expression level in about 63% of tissues and larger changes in the remaining tissues. As a threshold for selecting genes with tissue-dependent expression, we chose <it>H</it><sub><it>g </it></sub>= 4.52 bits, which has a <it>p</it>-value of 0.005 under the null hypothesis that all genes are uniformly expressed. We then found that 5,837/8,703 (67%) of human genes have entropies less than this and so are probably regulated in a tissue-dependent manner. If we use a more stringent definition of uniform expression that allows half as much variation in tissue-to-tissue expression levels (<it>s </it>= 0.25), then the threshold is <it>H</it><sub><it>g </it></sub>= 4.62 bits and we find that 7,584/8,703 (87%) of human genes show evidence of tissue-dependent regulation. 
Similar results were found in mouse using all 42 distinct tissues, where the corresponding thresholds are <it>H</it><sub><it>g </it></sub>= 5.24 bits (<it>s </it>= 0.5) and <it>H</it><sub><it>g </it></sub>= 5.35 bits (<it>s </it>= 0.25) and the fractions of genes showing tissue-dependent expression are 5,467/7,913 (69%) and 7,482/7,913 (94%), respectively. Thus we conclude that most genes show evidence of tissue-dependent expression levels.</p>
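				<p>The biological part of this null model can be sketched as a small Monte Carlo simulation. This is an illustration under stated assumptions, not the procedure used to obtain the thresholds above: it draws log<sub>2</sub> fold changes from a normal distribution with standard deviation <it>s</it>, omits the experimental variability estimated from the replicates, and uses hypothetical function names, simulation size and seed. Its thresholds are therefore not expected to reproduce 4.52 and 4.62 bits exactly.</p>

```python
import math
import random


def entropy_bits(expression):
    """Shannon entropy (bits) of relative expression across tissues."""
    total = sum(expression)
    return -sum((x / total) * math.log2(x / total) for x in expression if x > 0)


def null_entropy_threshold(n_tissues, s, alpha=0.005, n_sim=20000, seed=0):
    """Empirical alpha-quantile of entropy for 'uniform' genes.

    Each simulated gene has a common mean level; its log2 fold change in
    each tissue is drawn from N(0, s). Experimental noise is NOT modeled.
    """
    rng = random.Random(seed)
    entropies = []
    for _ in range(n_sim):
        profile = [2.0 ** rng.gauss(0.0, s) for _ in range(n_tissues)]
        entropies.append(entropy_bits(profile))
    entropies.sort()
    # Entropies below this value occur with probability of about alpha
    # under the simulated null.
    return entropies[int(alpha * n_sim)]


threshold_loose = null_entropy_threshold(27, 0.5)    # more variation allowed
threshold_strict = null_entropy_threshold(27, 0.25)  # less variation allowed
```

				<p>As in the text, allowing less variation (<it>s </it>= 0.25) moves the threshold closer to the maximum entropy, so more genes fall below it.</p>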
			</sec>
			<sec>
				<st>
					<p>Clustering tissues using <it>Q</it></p>
				</st>
				<p>A test of <it>Q</it><sub><it>g</it>|<it>t </it></sub>with respect to specific genes is to evaluate the tissues in which they rank highly (that is, have low <it>Q</it>) for consistency. This was accomplished by clustering tissues with similar tissue-specific genes and inspecting the clusters formed. We used 27 normal human tissues and, separately, 39 tissues from the GNF-GEA data for mouse and selected the genes (<it>N </it>= 3,768 human and <it>N </it>= 1,786 mouse) that are expressed at 200 AU or more in at least one tissue and have <it>Q</it><sub><it>g</it>|<it>t </it></sub>&#8804; 7 bits in at least one tissue. With these genes, we made a consensus hierarchical clustering of the tissues as shown in Figure <figr fid="F3">3</figr>. We found that the tissues in the nervous system, reproductive structures (excluding testis), immune system, and digestive system reliably cluster together in both species. In addition, skeletal muscle and heart clustered in mouse; the human survey did not have skeletal muscle. These results suggest that <it>Q</it><sub><it>g</it>|<it>t </it></sub>is correctly identifying tissue-specific genes. Interestingly, testis is an outlier in both trees, indicating that the collection of genes expressed in testis is distinct from that of any other tissue or organ. Furthermore, <it>H</it><sub><it>g </it></sub>and <it>Q</it><sub><it>g</it>|<it>t </it></sub>can also be used in conjunction with a tissue hierarchy to answer more complex questions about the tissue distribution of genes, such as 'what genes are specific to the brain but are widely expressed throughout the brain?' In Table <tblr tid="T3">3</tblr> we list the top five mouse genes expressed specifically but uniformly across three of the highlighted groups in Figure <figr fid="F3">3b</figr>.</p>
				<tbl id="T3">
					<title>
						<p>Table 3</p>
					</title>
					<caption>
						<p>The top five most group-specific mouse genes for selected tissue groups</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c ca="left">
								<p>Tissue cluster</p>
							</c>
							<c ca="center">
								<p>Probe Set ID</p>
							</c>
							<c ca="center">
								<p>
									<it>H</it>
								</p>
							</c>
							<c ca="center">
								<p>
									<it>Q</it>
								</p>
							</c>
							<c ca="left">
								<p>RefSeq</p>
							</c>
							<c ca="left">
								<p>Description</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Nervous system</p>
							</c>
							<c ca="center">
								<p>100047_at</p>
							</c>
							<c ca="center">
								<p>3.3</p>
							</c>
							<c ca="center">
								<p>3.4</p>
							</c>
							<c ca="left">
								<p>NM_011428</p>
							</c>
							<c ca="left">
								<p>Synaptosomal-associated protein, 25 kDa</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>103030_at</p>
							</c>
							<c ca="center">
								<p>3.5</p>
							</c>
							<c ca="center">
								<p>3.6</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Dynamin</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>97983_s_at</p>
							</c>
							<c ca="center">
								<p>3.7</p>
							</c>
							<c ca="center">
								<p>3.8</p>
							</c>
							<c ca="left">
								<p>NM_009295</p>
							</c>
							<c ca="left">
								<p>Syntaxin binding protein 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>98339_at</p>
							</c>
							<c ca="center">
								<p>3.7</p>
							</c>
							<c ca="center">
								<p>3.8</p>
							</c>
							<c ca="left">
								<p>NM_018804</p>
							</c>
							<c ca="left">
								<p>Synaptotagmin 11</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>94545_at</p>
							</c>
							<c ca="center">
								<p>3.7</p>
							</c>
							<c ca="center">
								<p>3.8</p>
							</c>
							<c ca="left">
								<p>NM_153457</p>
							</c>
							<c ca="left">
								<p>Reticulon 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Immune system</p>
							</c>
							<c ca="center">
								<p>96648_at</p>
							</c>
							<c ca="center">
								<p>2.807</p>
							</c>
							<c ca="center">
								<p>2.882</p>
							</c>
							<c ca="left">
								<p>NM_009898</p>
							</c>
							<c ca="left">
								<p>Coronin, actin binding protein 1a</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>93584_at</p>
							</c>
							<c ca="center">
								<p>3.373</p>
							</c>
							<c ca="center">
								<p>3.622</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Immunoglobulin heavy chain 6 (heavy chain of IgM)</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>101048_at</p>
							</c>
							<c ca="center">
								<p>3.541</p>
							</c>
							<c ca="center">
								<p>3.876</p>
							</c>
							<c ca="left">
								<p>NM_011210</p>
							</c>
							<c ca="left">
								<p>Protein tyrosine phosphatase, receptor type, C</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>94278_at</p>
							</c>
							<c ca="center">
								<p>3.495</p>
							</c>
							<c ca="center">
								<p>3.923</p>
							</c>
							<c ca="left">
								<p>NM_008879</p>
							</c>
							<c ca="left">
								<p>Lymphocyte cytosolic protein 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>100156_at</p>
							</c>
							<c ca="center">
								<p>3.609</p>
							</c>
							<c ca="center">
								<p>4.039</p>
							</c>
							<c ca="left">
								<p>NM_008566</p>
							</c>
							<c ca="left">
								<p>Mini chromosome maintenance deficient 5</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Liver and gall bladder</p>
							</c>
							<c ca="center">
								<p>94777_at</p>
							</c>
							<c ca="center">
								<p>1.280</p>
							</c>
							<c ca="center">
								<p>1.326</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Albumin 1</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>100329_at</p>
							</c>
							<c ca="center">
								<p>1.394</p>
							</c>
							<c ca="center">
								<p>1.464</p>
							</c>
							<c ca="left">
								<p>NM_009246</p>
							</c>
							<c ca="left">
								<p>Serine protease inhibitor 1-4</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>99269_g_at</p>
							</c>
							<c ca="center">
								<p>1.471</p>
							</c>
							<c ca="center">
								<p>1.561</p>
							</c>
							<c ca="left">
								<p>NM_019911</p>
							</c>
							<c ca="left">
								<p>Tryptophan 2,3-dioxygenase</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>99862_at</p>
							</c>
							<c ca="center">
								<p>1.503</p>
							</c>
							<c ca="center">
								<p>1.595</p>
							</c>
							<c ca="left">
								<p>NM_013465</p>
							</c>
							<c ca="left">
								<p>Alpha-2-HS-glycoprotein</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>96846_at</p>
							</c>
							<c ca="center">
								<p>1.515</p>
							</c>
							<c ca="center">
								<p>1.607</p>
							</c>
							<c ca="left">
								<p>NM_080844</p>
							</c>
							<c ca="left">
								<p>Serine (or cysteine) proteinase inhibitor, clade C (antithrombin), member 1</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The tissue groups were identified in a consensus clustering of tissues based on common tissue-specific genes. The <it>Q </it>value is for the gene and tissue group. To ensure uniform expression across the tissue group, genes were required to have an entropy on the tissue group that was 90% of the maximum possible for the group.</p>
					</tblfn>
				</tbl>
				<fig id="F3">
					<title>
						<p>Figure 3</p>
					</title>
					<caption>
						<p>Consensus tissue tree of tissues from human and mouse data</p>
					</caption>
					<text>
						<p>Consensus tissue tree of tissues from human and mouse data. Trees are the consensus of trees created from 5,000 random samples of sets of 1,000 genes from <b>(a) </b>3,768 (human) or <b>(b) </b>1,786 (mouse) genes with <it>Q</it><sub><it>g</it>|<it>t </it></sub>&#8804; 7 bits in at least one tissue. The length of the line leading into a node indicates how many trees did not include the set of tissues to the right of the node. The shortest lines correspond to unanimous subgroups. We have highlighted all maximal subgroups that occurred in at least half of the sampled trees. The nervous system is indicated in red, immune system in blue, reproductive tissue in yellow, digestive organs in purple and magenta, muscle tissue in cyan, and glandular tissue in brown. The tissues not included in a highlighted subgroup typically have statistically significant overlap with many of the highlighted tissues as estimated using the hypergeometric distribution.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-3"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>CpG islands are associated with the least tissue-specific genes</p>
				</st>
				<p>It has been proposed that CpG islands are predominantly associated with promoters of housekeeping genes <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. We performed a quantitative test of this hypothesis using the GNF-GEA data, determining the frequency of CpG islands in promoters as a function of <it>H</it><sub><it>g</it></sub>. We considered only predicted CpG islands that span the start of transcription (see <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> for a justification of this definition), and genes that were expressed at or above the median level of 200 AU (that is, were moderately expressed) in at least one tissue and were represented by a single probe set on the Affymetrix chip used in the GNF-GEA experiments. Promoter sequences were obtained from DBTSS and were based on the 5' ends of full-length transcripts <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. We found that there is a strong, roughly linear, correlation between a gene's entropy <it>H</it><sub><it>g </it></sub>and the probability that the gene will have a predicted start CpG island, as shown in Figure <figr fid="F4">4</figr>. Start CpG islands were associated with only 9% of the 100 most tissue-specific human genes, compared with 80% of the least tissue-specific genes. Similar numbers were found for mouse (7% start CpG island frequency for the 100 most tissue-specific genes; about 64% for the least tissue-specific genes). A comparison of CpG islands from the most and least tissue-specific genes did not reveal any significant difference in the overall base composition or in the ratio of observed to expected CpG dinucleotides. The distribution of the position of the 5' end point of CpG islands was also very similar for the most and least tissue-specific genes, although CpG islands tend to start further upstream in the least tissue-specific genes (data not shown).</p>
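				<p>The ranking analysis described above, in which genes are ordered by <it>H</it><sub><it>g </it></sub>and the start CpG island frequency is computed in consecutive groups of 100 genes, can be sketched as follows. The function name, the input layout (pairs of entropy and a true/false island flag), the handling of a trailing partial group and the toy data are assumptions made for illustration only.</p>

```python
def cpg_fraction_by_rank(genes, group_size=100):
    """Fraction of genes with a start CpG island in consecutive rank groups.

    genes: list of (entropy_bits, has_start_cpg_island) pairs.
    Genes are ranked from lowest entropy (most tissue-specific) to highest.
    A trailing group smaller than group_size is dropped (an assumption).
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    fractions = []
    for start in range(0, len(ranked), group_size):
        group = ranked[start:start + group_size]
        if len(group) == group_size:
            with_island = sum(1 for _, has_cgi in group if has_cgi)
            fractions.append(with_island / group_size)
    return fractions


# Toy data: 100 low-entropy genes, 10 of which have a start CpG island,
# followed by 100 high-entropy genes, 90 of which have one.
toy_genes = [(0.01 * i, i % 10 == 0) for i in range(100)]
toy_genes += [(4.0 + 0.001 * i, i % 10 != 0) for i in range(100)]
toy_fractions = cpg_fraction_by_rank(toy_genes)
```

				<p>Plotting such fractions against the rank of each group gives a curve of the kind shown in Figure <figr fid="F4">4</figr>.</p>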
				<fig id="F4">
					<title>
						<p>Figure 4</p>
					</title>
					<caption>
						<p>The fraction of start CpG islands in genes ranked by entropy <it>H</it><sub><it>g </it></sub>increases with entropy</p>
					</caption>
					<text>
						<p>The fraction of start CpG islands in genes ranked by entropy <it>H</it><sub><it>g </it></sub>increases with entropy. Each point represents the fraction of genes with a start CpG island in consecutive groups of 100 genes ranked by entropy <it>H</it><sub><it>g </it></sub>computed from GNF-GEA data. Genes in this set are expressed above 200 AU in at least one tissue. The human dataset (diamonds) has 26 tissues (maximum <it>H </it>= 4.7 bits); the mouse dataset (squares) has 42 tissues (maximum <it>H </it>= 5.3 bits).</p>
					</text>
					<graphic file="gb-2005-6-4-r33-4"/>
				</fig>
				<p>Another group of genes observed to be associated with CpG islands comprises those expressed in the early embryo <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>, from the fertilized egg to the blastocyst. The question arises as to whether there is an association between genes having start CpG islands and the developmental stage of expression (that is, embryonic versus adult) in addition to the one for tissue specificity. We investigated this possibility in the mouse using DoTS <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> EST and mRNA assemblies by tabulating the number of DoTS genes that contain at least two ESTs from a mouse early embryo library, as shown in Table <tblr tid="T4">4</tblr>. We considered 933 genes with start CpG islands (CGI+) and 1,007 genes without start CpG islands (CGI-) that were expressed in the adult. If there were no developmental bias, this distribution of CGI+ and CGI- genes should be maintained in genes expressed in the embryo. However, only 139 (14%) of the CGI- genes were expressed in the early embryo, in contrast to 365 (39%) of the CGI+ genes (<it>P </it>= 3 &#215; 10<sup>-70</sup>, exact binomial). Therefore, a gene expressed in the adult was 2.8 (= 0.39/0.14) times more likely to be expressed in the early embryo if it contained a start CpG island. Furthermore, the most tissue-specific genes expressed in the adult were four times more likely to have been expressed in the early embryo if their promoter contained a start CpG island. These results strongly suggest that CpG islands are promoter features for both embryonic and the least tissue-specific genes.</p>
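				<p>The fraction ratios quoted above follow directly from the gene counts reported in Table <tblr tid="T4">4</tblr>. The short sketch below recomputes them; the function name is hypothetical, and only the arithmetic is shown, not the binomial or hypergeometric tests.</p>

```python
def fraction_ratio(expressed_with, total_with, expressed_without, total_without):
    """Return (fraction CGI+, fraction CGI-, ratio of the two fractions)."""
    frac_with = expressed_with / total_with
    frac_without = expressed_without / total_without
    return frac_with, frac_without, frac_with / frac_without


# All adult-expressed genes: 365 of 933 CGI+ and 139 of 1,007 CGI- genes
# were also expressed in the early embryo.
embryo = fraction_ratio(365, 933, 139, 1007)

# Adult tissue-specific genes: 8 of 29 CGI+ and 12 of 180 CGI- genes.
adult_specific = fraction_ratio(8, 29, 12, 180)
```

				<p>The third element of each result is the fraction ratio: about 2.8 for all adult-expressed genes and about 4 for the adult-specific genes.</p>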
				<tbl id="T4">
					<title>
						<p>Table 4</p>
					</title>
					<caption>
						<p>CpG islands are correlated with embryonic expression even for tissue-specific genes</p>
					</caption>
					<tblbdy cols="6">
						<r>
							<c ca="left">
								<p>Gene type</p>
							</c>
							<c ca="center">
								<p>CpG island state</p>
							</c>
							<c ca="center">
								<p>Total genes considered</p>
							</c>
							<c ca="center">
								<p>Expressed genes</p>
							</c>
							<c ca="center">
								<p>Fraction</p>
							</c>
							<c ca="center">
								<p>Fraction ratio</p>
							</c>
						</r>
						<r>
							<c cspan="6">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Embryo</p>
							</c>
							<c ca="center">
								<p>CGI+</p>
							</c>
							<c ca="center">
								<p>933</p>
							</c>
							<c ca="center">
								<p>365</p>
							</c>
							<c ca="center">
								<p>39%</p>
							</c>
							<c ca="center">
								<p>2.8</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>CGI-</p>
							</c>
							<c ca="center">
								<p>1,007</p>
							</c>
							<c ca="center">
								<p>139</p>
							</c>
							<c ca="center">
								<p>14%</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c cspan="6">
								<p>&#160;</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>Adult-specific</p>
							</c>
							<c ca="center">
								<p>CGI+</p>
							</c>
							<c ca="center">
								<p>29</p>
							</c>
							<c ca="center">
								<p>8</p>
							</c>
							<c ca="center">
								<p>28%</p>
							</c>
							<c ca="center">
								<p>4</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="center">
								<p>CGI-</p>
							</c>
							<c ca="center">
								<p>180</p>
							</c>
							<c ca="center">
								<p>12</p>
							</c>
							<c ca="center">
								<p>7%</p>
							</c>
							<c>
								<p/>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>We determined the fraction of genes with (39%) and without (14%) start CpG islands that are expressed in the early embryo. A gene is 2.8 (= 0.39/0.14) times more likely to be expressed in the early embryo if it has a start CpG island. If we then consider only genes that go on to be tissue-specific in the adult, the ratio of the CGI+ to the CGI- fraction rises to 4 (= 0.28/0.07). The differences in rates between CpG island states within each stage are significant (<it>P </it>&lt; 0.0005; binomial). Of the between-stage comparisons, only the CGI- adult-specific/embryo change is significant (<it>P </it>= 0.0009; hypergeometric).</p>
					</tblfn>
				</tbl>
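				<p>The fraction and fraction-ratio columns of Table 4 follow directly from the gene counts. The sketch below is illustrative only and is not the code used for the analysis; the dictionary layout and the function names are assumptions introduced here. Because it divides unrounded fractions, the adult-specific ratio comes out near 4.1 rather than the 4 obtained from the rounded values 0.28/0.07.</p>
				<p>
```python
# Counts transcribed from Table 4:
# (total genes considered, genes expressed in the early embryo).
TABLE4 = {
    "embryo": {"CGI+": (933, 365), "CGI-": (1007, 139)},
    "adult_specific": {"CGI+": (29, 8), "CGI-": (180, 12)},
}


def fraction(total, expressed):
    """Fraction of the considered genes that are expressed in the early embryo."""
    return expressed / total


def fraction_ratio(row):
    """Ratio of the CGI+ fraction to the CGI- fraction for one gene type."""
    return fraction(*row["CGI+"]) / fraction(*row["CGI-"])


for gene_type, row in TABLE4.items():
    plus = fraction(*row["CGI+"])
    minus = fraction(*row["CGI-"])
    ratio = fraction_ratio(row)
    print(f"{gene_type}: CGI+ {plus:.2f}, CGI- {minus:.2f}, ratio {ratio:.1f}")
```
				</p>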
			</sec>
			<sec>
				<st>
					<p>Base composition of promoters depends on specificity</p>
				</st>
				<p>Analysis of base-composition profiles of promoters provides clues to common features, including motifs associated with promoter categories. We examined the base-composition profiles of human promoters of high (0 &#8804; <it>H</it><sub><it>g </it></sub>&#8804; 3.5 bits) and low (4.4 &#8804; <it>H</it><sub><it>g </it></sub>&#8804; 4.71 bits) tissue-specificity genes. We considered CGI+ and CGI- genes separately, as it is clear that the presence of a CpG island will strongly influence the base composition and that the fraction of start CpG islands varies with entropy. In addition, the presence of a start CpG island may indicate a different regulation mechanism related to either tissue specificity or embryonic expression (or both). The numbers of promoters from DBTSS in these four classes that were used in the analysis were: 310 CGI- and 129 CGI+ high specificity; 342 CGI- and 1,501 CGI+ low specificity. Genes that have only non-start CpG islands represented a minor component and were not included in this analysis. We used the full set of normal tissues in the first GNF-GEA microarray study for human and mouse. Base-composition profiles with 10 base-pair (bp) windows are shown in Figure <figr fid="F5">5</figr> for human genes. Each of the features we report was observed in both human and mouse (unless noted otherwise) and compares G to C or A to T over spans of at least 10 positional bins; the probability of observing a feature at least this long by chance is at most 0.5<sup>10</sup>, which is approximately 0.001. Promoters of CGI+ genes (Figure <figr fid="F5">5a,b</figr>) shared features but could also be distinguished on the basis of tissue specificity. A common feature of CGI+ promoters was the increase in C+G content that starts 1,000 bp upstream of the transcription start site and continues to 200 bp downstream. The C+G bias reached p(C+G) = 0.7 at the start of transcription and continued into the 5' UTR. 
Nonspecific (Figure <figr fid="F5">5c</figr>) and tissue-specific (Figure <figr fid="F5">5d</figr>) CGI- genes still showed a C+G bias around the start of transcription, but it was much smaller in magnitude at p(C+G) = 0.54. The low-specificity CGI+ genes (Figure <figr fid="F5">5a</figr>) showed upstream base-composition biases that were not found in any of the other three gene classes. There was a preference for C over G (p(C) &gt; p(G)) in the (-350, -150) region and a preference for A over T (p(A) &gt; p(T)) in the (-600, -200) region in human (the corresponding region in mouse is (-400, -150)). In tissue-specific CGI+ genes (Figure <figr fid="F5">5b</figr>) the strong C+G bias held but p(C) = p(G), except for the (+50, +100) region where p(C) &gt; p(G). These base-composition differences between nonspecific and tissue-specific promoters, which extend over regions of hundreds of base pairs even in the context of a CpG island, suggest different structural features and regulatory mechanisms for these CGI+ classes.</p>
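				<p>The chance bound quoted above treats each 10-bp bin as an independent, unbiased comparison between two bases. Under that simplifying assumption, which is made here for illustration and may not hold exactly because neighbouring bins in real promoters are unlikely to be fully independent, the bound can be checked with a short Python sketch; the function name and its parameters are hypothetical.</p>
				<p>
```python
def chance_run_probability(bins, p_same_direction=0.5):
    """Probability that `bins` independent bins all favour the same base,
    assuming each bin favours it with probability `p_same_direction`."""
    return p_same_direction ** bins


# Ten consecutive bins all showing the same direction of bias.
print(chance_run_probability(10))  # 0.0009765625
```
				</p>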
				<fig id="F5">
					<title>
						<p>Figure 5</p>
					</title>
					<caption>
						<p>Base-composition profiles for ubiquitous and tissue-specific genes with and without start CpG islands</p>
					</caption>
					<text>
						<p>Base-composition profiles for ubiquitous and tissue-specific genes with and without start CpG islands. Data are for human genes; similar patterns were observed in mouse. <b>(a) </b>Ubiquitous genes with a CpG island; <b>(b) </b>tissue-specific genes with a CpG island; <b>(c) </b>ubiquitous genes with no CpG island; and <b>(d) </b>tissue-specific genes with no CpG island. Note the differences in upstream C+G content, in peak sizes at the TATA box (-35 bp) and initiator positions, and in downstream C versus G content.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-5"/>
				</fig>
				<p>Most striking were differences between nonspecific and tissue-specific promoters that are independent of the presence of a CpG island. A sharp spike in the proportion of A and T was seen in the (-50,-1) region for all classes but was most pronounced in the tissue-specific promoters (Figure <figr fid="F5">5b,d</figr>). These spikes correspond to the presence of a TATA box and suggest a correlation of this motif with tissue-specific genes (explored more fully later). Conversely, all low-specificity genes (Figure <figr fid="F5">5a,c</figr>) shared a common feature in the (+1, +200) region where p(G) &gt; p(C) and p(T) &gt; p(A) that was not seen in tissue-specific genes (Figure <figr fid="F5">5b,d</figr>). As shown later, this low-specificity feature could be partially explained by the presence of a YY1 motif. These base-composition differences observed between nonspecific and tissue-specific promoters are likely to indicate motifs that distinguish the two classes.</p>
			</sec>
			<sec>
				<st>
					<p>Selected transcription factor motifs in the core promoter</p>
				</st>
				<p>We next examined the distribution of basic core promoter features (the TATA box, the initiator element, and binding sites for two selected ubiquitous transcription factors, Sp1 and YY1) to see whether their presence in the proximal promoter was correlated with the tissue specificity of a gene. We took two approaches that used different datasets and motif-searching methods; they gave similar results and thus confirm each other independently. First, we searched for core motifs using weight matrix hits in promoters of genes selected using <it>H</it><sub><it>g </it></sub>calculated from the GNF-GEA data. Second, we searched for core motif consensus sites in promoters of genes selected using <it>Q</it><sub><it>g</it>|<it>t </it></sub>calculated from EST data.</p>
			</sec>
			<sec>
				<st>
					<p>TATA boxes are associated with tissue-specific genes</p>
				</st>
				<p>We grouped the human genes that were expressed at a level of at least 200 AU (average value) in the GNF-GEA data by entropy and start CpG island status. The number of genes in each category is shown in Table <tblr tid="T5">5</tblr> along with a summary of results. We used alignments of position-specific scoring matrices and scoring thresholds included in the Eukaryotic Promoter Database (EPD) <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> to identify the TATA box and initiator element. Matches to these motifs were preferentially located at the expected positions relative to the transcription start site, as judged by the ratio of the number of observed matches to the number expected from a set of random sequences with the same position-dependent base composition as each of the promoters.</p>
				<tbl id="T5">
					<title>
						<p>Table 5</p>
					</title>
					<caption>
						<p>The most significant indicators of the degree of tissue specificity: start CpG island, TATA box, and YY1 site</p>
					</caption>
					<tblbdy cols="7">
						<r>
							<c cspan="3" ca="left">
								<p>Features</p>
							</c>
							<c ca="left">
								<p>Total fraction</p>
							</c>
							<c ca="left">
								<p><it>H </it>0-3</p>
							</c>
							<c ca="left">
								<p><it>H </it>3-4</p>
							</c>
							<c ca="left">
								<p><it>H </it>4-5</p>
							</c>
						</r>
						<r>
							<c cspan="3">
								<hr/>
							</c>
							<c>
								<p/>
							</c>
							<c cspan="3">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI</p>
							</c>
							<c ca="left">
								<p>TATA</p>
							</c>
							<c ca="left">
								<p>YY1</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Most specific</p>
							</c>
							<c ca="left">
								<p>Semi-specific</p>
							</c>
							<c ca="left">
								<p>Least specific</p>
							</c>
						</r>
						<r>
							<c cspan="7">
								<hr/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>3,552</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>271</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>602</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>2,679</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>1.00</p>
							</c>
							<c ca="left">
								<p>0.08</p>
							</c>
							<c ca="left">
								<p>0.17</p>
							</c>
							<c ca="left">
								<p>0.75</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI+</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>2,434</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>56</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>306</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>2,072</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.69</p>
							</c>
							<c ca="left">
								<p>0.02</p>
							</c>
							<c ca="left">
								<p>0.13</p>
							</c>
							<c ca="left">
								<p>0.85</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.30</p>
							</c>
							<c ca="left">
								<p>0.74</p>
							</c>
							<c ca="left">
								<p>1.13</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-</p>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>1,118</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>215</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>296</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>607</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.31</p>
							</c>
							<c ca="left">
								<p>0.19</p>
							</c>
							<c ca="left">
								<p>0.26</p>
							</c>
							<c ca="left">
								<p>0.54</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>2.52</p>
							</c>
							<c ca="left">
								<p>1.56</p>
							</c>
							<c ca="left">
								<p>0.72</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>TATA+</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>604</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>136</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>175</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>293</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.17</p>
							</c>
							<c ca="left">
								<p>0.23</p>
							</c>
							<c ca="left">
								<p>0.29</p>
							</c>
							<c ca="left">
								<p>0.49</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>2.95</p>
							</c>
							<c ca="left">
								<p>1.71</p>
							</c>
							<c ca="left">
								<p>0.64</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>TATA-</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>2,949</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>135</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>427</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>2,387</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.83</p>
							</c>
							<c ca="left">
								<p>0.05</p>
							</c>
							<c ca="left">
								<p>0.14</p>
							</c>
							<c ca="left">
								<p>0.81</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.60</p>
							</c>
							<c ca="left">
								<p>0.85</p>
							</c>
							<c ca="left">
								<p>1.07</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI+</p>
							</c>
							<c ca="left">
								<p>TATA+</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>284</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>19</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>82</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>183</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.08</p>
							</c>
							<c ca="left">
								<p>0.07</p>
							</c>
							<c ca="left">
								<p>0.29</p>
							</c>
							<c ca="left">
								<p>0.64</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.88</p>
							</c>
							<c ca="left">
								<p>1.70</p>
							</c>
							<c ca="left">
								<p>0.85</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-</p>
							</c>
							<c ca="left">
								<p>TATA+</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>320</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>117</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>93</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>110</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.09</p>
							</c>
							<c ca="left">
								<p>0.37</p>
							</c>
							<c ca="left">
								<p>0.29</p>
							</c>
							<c ca="left">
								<p>0.34</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>4.79</p>
							</c>
							<c ca="left">
								<p>1.71</p>
							</c>
							<c ca="left">
								<p>0.46</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI+</p>
							</c>
							<c ca="left">
								<p>TATA-</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>2,150</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>37</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>224</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>1,889</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.61</p>
							</c>
							<c ca="left">
								<p>0.02</p>
							</c>
							<c ca="left">
								<p>0.10</p>
							</c>
							<c ca="left">
								<p>0.88</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.23</p>
							</c>
							<c ca="left">
								<p>0.61</p>
							</c>
							<c ca="left">
								<p>1.16</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-</p>
							</c>
							<c ca="left">
								<p>TATA-</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>798</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>98</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>203</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>497</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.22</p>
							</c>
							<c ca="left">
								<p>0.12</p>
							</c>
							<c ca="left">
								<p>0.25</p>
							</c>
							<c ca="left">
								<p>0.62</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>1.61</p>
							</c>
							<c ca="left">
								<p>1.50</p>
							</c>
							<c ca="left">
								<p>0.83</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>YY1+</p>
							</c>
							<c ca="left">
								<p>
									<b>293</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>1</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>16</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>276</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.08</p>
							</c>
							<c ca="left">
								<p>0.00</p>
							</c>
							<c ca="left">
								<p>0.05</p>
							</c>
							<c ca="left">
								<p>0.94</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.04</p>
							</c>
							<c ca="left">
								<p>0.32</p>
							</c>
							<c ca="left">
								<p>1.25</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI+</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>YY1+</p>
							</c>
							<c ca="left">
								<p>
									<b>261</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>1</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>10</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>250</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.07</p>
							</c>
							<c ca="left">
								<p>0.00</p>
							</c>
							<c ca="left">
								<p>0.04</p>
							</c>
							<c ca="left">
								<p>0.96</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.05</p>
							</c>
							<c ca="left">
								<p>0.23</p>
							</c>
							<c ca="left">
								<p>1.27</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI+</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>YY1-</p>
							</c>
							<c ca="left">
								<p>
									<b>2,173</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>55</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>296</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>1,822</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.61</p>
							</c>
							<c ca="left">
								<p>0.03</p>
							</c>
							<c ca="left">
								<p>0.14</p>
							</c>
							<c ca="left">
								<p>0.84</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.33</p>
							</c>
							<c ca="left">
								<p>0.80</p>
							</c>
							<c ca="left">
								<p>1.11</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>YY1-</p>
							</c>
							<c ca="left">
								<p>
									<b>1,086</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>215</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>290</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>581</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.31</p>
							</c>
							<c ca="left">
								<p>0.20</p>
							</c>
							<c ca="left">
								<p>0.27</p>
							</c>
							<c ca="left">
								<p>0.53</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>2.59</p>
							</c>
							<c ca="left">
								<p>1.58</p>
							</c>
							<c ca="left">
								<p>0.71</p>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-</p>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>YY1+</p>
							</c>
							<c ca="left">
								<p>
									<b>32</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>0</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>6</b>
								</p>
							</c>
							<c ca="left">
								<p>
									<b>26</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.01</p>
							</c>
							<c ca="left">
								<p>0.00</p>
							</c>
							<c ca="left">
								<p>0.19</p>
							</c>
							<c ca="left">
								<p>0.81</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>0.00</p>
							</c>
							<c ca="left">
								<p>1.11</p>
							</c>
							<c ca="left">
								<p>1.08</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>The three columns on the left indicate the combination of features considered; empty cells indicate that the feature is not considered. The 'Total fraction' column indicates the number of promoters with each feature combination (in bold) and the corresponding fraction of all genes considered. The three columns on the right give statistics for matching genes in three bands of tissue specificity. The top two lines give the number and corresponding fraction of all genes considered for each band. For each feature combination, the numbers indicate the number (top, bold), fraction (middle), and enrichment ratio (bottom) of matching genes. The enrichment ratio is the fraction of the genes with the feature combination that fall in the entropy band, divided by the band's fraction among all genes considered. For example, specific genes are best recognized by the combination of a TATA box (TATA+) and the lack of a CpG island (CGI-), which enriches the fraction of such genes from 8% to 37%, a factor of 4.79. Nonspecific genes are most specifically recognized by the combination of a CpG island and a YY1 site, which returns a set that is 96% nonspecific genes but matches only about 9% (250/2,679) of the nonspecific genes.</p>
					</tblfn>
				</tbl>
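				<p>The enrichment ratios in Table 5 can be recomputed from the bold counts. The following Python sketch is not the implementation used for the analysis, and its variable and function names are hypothetical; it divides the share of feature-matching genes that fall in each entropy band by that band's share of all 3,552 genes considered.</p>
				<p>
```python
# Counts transcribed from Table 5 (human genes).
# Band order: (H 0-3 most specific, H 3-4 semi-specific, H 4-5 least specific).
ALL_GENES = (271, 602, 2679)
FEATURE_COUNTS = {
    "CGI- TATA+": (117, 93, 110),
    "CGI+ YY1+": (1, 10, 250),
}


def enrichment(feature_counts, band_counts):
    """Per-band enrichment: share of feature-matching genes in a band
    divided by that band's share of all genes considered."""
    feature_total = sum(feature_counts)
    all_total = sum(band_counts)
    return [
        (f / feature_total) / (b / all_total)
        for f, b in zip(feature_counts, band_counts)
    ]


for name, counts in FEATURE_COUNTS.items():
    ratios = [round(r, 2) for r in enrichment(counts, ALL_GENES)]
    print(name, ratios)
```
				</p>
				<p>For the CGI- TATA+ promoters this gives 4.79, 1.71 and 0.46 after rounding to two decimals, in agreement with the tabulated values.</p>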
				<p>We searched for the TATA box in the (-45, -10) region, where the average observed/expected ratio for the TATA box was 3.1. As shown in Table <tblr tid="T5">5</tblr>, the most specific CGI- genes were six times more likely to have a TATA box than the least specific CGI+ genes (117/215 (54%) versus 183/2,072 (9%), <it>P </it>&#8776; 0, exact binomial). Similar numbers were found in mouse (52%/11% = 4.7). This trend also held within CGI- genes and within CGI+ genes. The most specific CGI- genes were three times more likely to have a TATA box than the least specific CGI- genes (117/215 versus 110/607, <it>P </it>&#8776; 0, exact binomial). While less common in CGI+ genes, TATA boxes were still almost four times as likely to be found in the most specific CGI+ genes as in the least specific CGI+ genes (19/56 versus 183/2,072, <it>P </it>= 2 &#215; 10<sup>-7</sup>, exact binomial). Thus TATA boxes are clearly associated with tissue-specific genes and provide a second axis (with CpG islands) for distinguishing between the most and least specific genes.</p>
				<p>In contrast, the frequency of the initiator element (Pol II binding site) was roughly constant across all tissue-specificity classes for both CGI+ and CGI- genes. We searched for the initiator element in the (-10, +10) region. It occurred in 762 of 1,118 (68%) CGI- genes and 1,273 of 2,434 (52%) CGI+ genes. Similarly, it occurred in 149 of 215 (69%) of the most specific CGI- genes and 388 of 607 (64%) of the least specific CGI- genes. The observed frequency of TATA+/Inr+ promoters was not significantly different from the rate expected assuming independence of the two individual features (data not shown).</p>
			</sec>
			<sec>
				<st>
					<p>Sp1-binding sites are weakly associated with the least tissue-specific genes</p>
				</st>
				<p>Sp1 <abbrgrp><abbr bid="B37">37</abbr><abbr bid="B38">38</abbr></abbrgrp> is a ubiquitous transcription factor with a G-rich binding site (consensus sequence GGGCGGG) that might explain the observed G-richness of the 5' UTR in nonspecific genes. We used the GC-box weight matrix and scoring threshold from EPD <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> to identify Sp1 sites. We found that Sp1 sites were preferentially located in the (-150, +1) region in all sets of genes, where they occurred on average at twice the expected rate, in agreement with previous findings <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. In both human and mouse, Sp1 sites were rarely found in the 5' UTR despite the G-richness of this region; they occurred at the expected rate of between 2 and 5%. Thus Sp1 sites were not the cause of the G-richness in the 5' UTR.</p>
				<p>Sp1 sites are associated with CpG islands but are an important component of CGI- promoters as well. Considering just the (-150, +1) region, Sp1 sites occurred in 1,105/2,434 (45%) of human CGI+ gene promoters and 316/1,118 (28%) of CGI- gene promoters, at about 2.5 to 3.0 times the expected frequency in both cases. Frequencies in mouse were 927/2,075 (45%) of CGI+ promoters and 464/1,652 (28%) of CGI- promoters. Sp1 sites were also weakly associated with the least specific genes, occurring in 1,105/2,679 (41%) of these genes as compared to 94/271 (32%) of the most tissue-specific genes (<it>P </it>= 0.016). Similar numbers were found in the mouse: 38% of the least specific and 26% of the most specific promoters had Sp1 sites. Thus, although Sp1 shows a preference for the least tissue-specific promoters, it is not a strong predictor of the tissue specificity of a gene.</p>
			</sec>
			<sec>
				<st>
					<p>YY1 binding sites are associated with low-specificity genes</p>
				</st>
				<p>The transcription factor YY1 <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp> is also ubiquitously expressed and is thought to bind close to <abbrgrp><abbr bid="B39">39</abbr></abbrgrp> and downstream of the transcription start site. There is evidence that the function of YY1 depends on its orientation <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. The location and G-richness of the reverse complement consensus sequence (AANATGGCG) make YY1 a candidate for explaining the prominent G &gt; C feature in the (+1, +200) region of low-specificity genes. We consider YY1 because a YY1-like motif was frequently included among the most statistically significant motifs identified by the motif discovery programs AlignACE <abbrgrp><abbr bid="B41">41</abbr></abbrgrp> and MEME <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> in the (+1, +60) region of nonspecific CGI+ promoters (Figure <figr fid="F6">6a</figr>). Our form is most similar to the activating form <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>, which may be associated with low-specificity genes. Because of the demonstrated functional sensitivity to the orientation of binding sites we considered each orientation separately. Indeed, as shown in Figure <figr fid="F6">6b</figr> we found each orientation exhibits different position preferences. Sites in the reverse orientation (YY1<sub>r</sub>) were preferentially located in the (+1, +25) region but with some elevated levels to +80 bp. Start positions of sites in the forward orientation (YY1<sub>f</sub>) showed a very sharp preference for -3 bp, which probably represents a YY1-like initiator sequence reviewed elsewhere <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>. Both orientations were found predominantly in the least specific genes (Table <tblr tid="T5">5</tblr>). YY1<sub>f </sub>initiator sites are rare; only 55/2,679 (2%) were found above background in human low-specificity genes. 
The rate in mouse, 22/2,832 (0.8%) of low-specificity promoters, is even lower. The YY1<sub>r </sub>sites are more common and were found above background in 217 (8%) of the 2,679 least specific genes. YY1<sub>r </sub>sites were more common in CGI+ genes than in CGI- genes (202/2,072 (10%) versus 15/607 (2%); <it>P </it>= 3.7 &#215; 10<sup>-9</sup>, two-population binomial). The corresponding rates in mouse confirm these observations: 178/2,832 (6%) of all low-specificity promoters, 152/1,779 (9%) of CGI+ and 26/1,053 (2%) of CGI- low-specificity promoters. These YY1-like sites therefore constitute a feature strongly associated with the least specific genes and may partially explain the observed G &gt; C ratio in the (+1, +200) region.</p>
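				<p>As an illustration of how a comparison such as the CGI+ versus CGI- YY1<sub>r </sub>rates can be tested, the following minimal sketch applies a pooled two-proportion z-test to the human counts quoted above. The text does not spell out the exact form of the 'two-population binomial' test, so the normal approximation, the one-sided tail and the function name <it>two_proportion_z_test</it> are assumptions made for this example only.</p>

```python
import math


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Pooled two-proportion z-test; returns (z, one-sided p-value).

    Assumption: normal approximation to the difference of two binomial
    proportions. This is an illustrative choice, not necessarily the
    procedure used in the study.
    """
    rate_a = hits_a / total_a
    rate_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    z = (rate_a - rate_b) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_one_sided


# Human YY1r counts quoted in the text: CGI+ versus CGI- low-specificity genes.
z_score, p_value = two_proportion_z_test(202, 2072, 15, 607)
print(f"z = {z_score:.2f}, one-sided P = {p_value:.1e}")
```

				<p>With these counts the statistic is close to 5.8 and the one-sided tail probability is of the order of 10<sup>-9</sup>, the same order of magnitude as the <it>P</it>-value reported above.</p>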
				<fig id="F6">
					<title>
						<p>Figure 6</p>
					</title>
					<caption>
						<p>YY1 motifs are found downstream of the transcription start site, depending on their orientation</p>
					</caption>
					<text>
						<p>YY1 motifs are found downstream of the transcription start site, depending on their orientation. <b>(a) </b>The top image shows a logo [69] representation of the YY1 motif in the (+10, +20) region of human CGI+ promoters identified using AlignACE. It is based on 102 sequences. The other two logos are for weight matrices contained in TRANSFAC v7.3 that represent activating and repressing YY1 binding sites. <b>(b) </b>Plot of the positional distribution of predicted YY1 sites and the fraction of genes with a predicted YY1 site in the (+1, +60) region. YY1 sites were predicted using a weight matrix generated using AlignACE. YY1 sites are almost three times (<it>P </it>&#8804; 2 &#215; 10<sup>-7</sup>) as common in nonspecific CGI+ genes (11%; <it>N </it>= 2,072) as in nonspecific CGI- genes (4%; <it>N </it>= 607) and occur at more than 10 times the expected rate. Similar trends are observed in genes with 3 &#8804; <it>H </it>&#8804; 4, though with lower absolute and relative rates. The difference between CGI+ and CGI- genes is not statistically significant for genes in the 3 &#8804; <it>H </it>&#8804; 4 bin. Essentially no YY1 sites were observed in specific genes with <it>H </it>&#8804; 3 bits, whether or not they had a CpG island.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-6"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p><it>Q</it>-based analysis of core promoter motifs</p>
				</st>
				<p>A second analysis of TATA box and Inr motifs was done to determine if the association of the TATA box with tissue-specific genes is also found in genes ranked by <it>Q </it>and is robust to using EST data as well as promoters that did not specifically rely on full-length cDNA clones. The definition of <it>Q </it>implies that genes with a particular <it>Q</it>-value can have a variety of <it>H</it><sub><it>g </it></sub>values and thus it may be more difficult to identify features related to tissue specificity. We tabulated all DoTS genes that contained at least two ESTs from an islet-cell library, then ranked the genes by <it>Q</it><sub>pancreas </sub>computed using EST counts. We used <it>Q</it><sub>pancreas </sub>&#8804; 7 bits as the criterion for selecting pancreas-specific genes, which we grouped into 2-bit <it>Q </it>intervals. For comparison we selected 50 genes with <it>Q</it><sub>pancreas </sub>= 8.5 bits, and 50 genes with 10 &#8804; <it>Q</it><sub>pancreas </sub>&#8804; 10.6 bits. Genes with high specificity for the pancreas (0 &#8804; <it>Q</it><sub>pancreas </sub>&#8804; 2 bits, <it>N </it>= 9) preferentially had TATA boxes (8 of 9), with half of these also having an initiator element (4 of 9; Figure <figr fid="F7">7a</figr>). With decreasing specificity, the fraction of genes containing TATA boxes drops, with only 18 of 81 (2/9) genes with <it>Q </it>&gt; 6 bits having TATA boxes. Thus, the strong correlation of TATA boxes with specific genes found with <it>H</it><sub><it>g </it></sub>and microarray data was also seen with <it>Q </it>and EST data for pancreas-expressed genes. Also consistent is the observation that initiator elements were found at similar frequencies (around 60%) across all specificity classes (Figure <figr fid="F7">7b</figr>). Similar patterns were observed in other tissues (data not shown).</p>
				<fig id="F7">
					<title>
						<p>Figure 7</p>
					</title>
					<caption>
						<p>The distribution of TATA box and initiator element (Inr) in pancreas-specific genes</p>
					</caption>
					<text>
						<p>The distribution of TATA box and initiator element (Inr) in pancreas-specific genes. One hundred and sixty pancreas genes were divided into bins according to their <it>Q</it>-value. Genes that have a TATA box, an initiator with the motif YYANWYY, both, or none of these two motifs, are shown. <b>(a) </b>Absolute numbers of genes with core promoter motifs. Red bars, TATA only; blue bars, TATA and Inr; green bars, Inr only; purple bars, none. The <it>p</it>-values for pairwise comparison of distributions (TATA/total) are given below the graph. <it>P</it>-values were calculated for the sum of genes with TATA box (with and without initiator). <b>(b) </b>Results from (a) plotted as fractions of genes with each motif status within a bin. <b>(c) </b>Number of TATA boxes found in orthologous human and mouse gene pairs. Statistical significance of differences between <it>Q </it>bins are indicated.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-7"/>
				</fig>
				<p>The consistency of findings for the TATA box with human islet genes based on <it>Q </it>and ESTs was next tested with orthologous genes in mouse. This test provides a measure for whether the global pattern observed (TATA box with tissue-specific genes) is also found for the same set of genes in another mammal. We also added bins of genes with higher <it>Q</it>-values that represent more widely expressed genes. For each human gene, the orthologous mouse gene was determined (see Materials and methods for details) and analyzed as described above. Overall, 18.8% of the human genes and 22.9% of the mouse genes that were analyzed carry the TATA box motif. Except for the last group (<it>Q </it>&gt; 10 bits), the percentage of genes with TATA box motifs decreases as the <it>Q</it>-value increases. The exception is to be expected, since genes with high <it>Q </it>may be specific to other tissues and hence are more likely to have a TATA box. Discrepancies between human and mouse promoters were noted for only about 10% of all human-mouse pairs analyzed and may reflect sequence differences and possible annotation discrepancies for the transcription start site. Nevertheless, there is overall excellent agreement for the presence of TATA motifs in human and mouse genes. Thus, our assessment of preferential presence of transcription regulatory motifs in the human pancreas-expressed genes also applies to their mouse orthologs. We conclude that genes expressed with restricted tissue distribution may be preferentially regulated via TATA-mediated transcription, and that genes with broader expression profiles are more likely to be regulated by non-TATA-mediated mechanisms (such as YY1).</p>
			</sec>
			<sec>
				<st>
					<p>Promoter classes</p>
				</st>
				<p>Since the presence or absence of a start CpG island and of a TATA box appear to be the primary sequence features that correlate with tissue specificity, we consider them in more detail. We observe that CpG islands and TATA boxes are not mutually exclusive features of promoters and so we consider all possible combinations of these features.</p>
			</sec>
			<sec>
				<st>
					<p>Frequency of promoter classes</p>
				</st>
				<p>Figure <figr fid="F8">8</figr> shows the cumulative fraction of each class of promoter as a function of increasing <it>H</it><sub><it>g </it></sub>in human (Figure <figr fid="F8">8a</figr>) and mouse (Figure <figr fid="F8">8b</figr>). The data from human and mouse follow similar trends even though the mouse has a lower proportion of CGI+ genes. Overall, CGI+/TATA- genes are the most common, at 50-60% depending on the species. Interestingly, the CGI-/TATA- class is the second most common overall, comprising 20-30% of genes, depending on the species. Genes in this promoter class are roughly equally common across the entire entropy range and are the most common promoters in the mid-specificity range in both species. The classes CGI-/TATA+ and CGI+/TATA+ are the least common (8 to 12% overall). CGI-/TATA+ genes are concentrated in the most specific genes. CGI+/TATA+ are found relatively uniformly across all but the most specific genes. Although the TATA box and CpG islands are strongly predictive of a gene's entropy, Figure <figr fid="F8">8</figr> also illustrates the limitations of the promoter classes as an explanation for expression patterns. First, although the CGI-/TATA+ and CGI+/TATA- classes are strongly associated with the most and least tissue-specific genes (respectively), instances of genes in each class cover virtually the entire range of tissue specificities. Second, the CGI-/TATA- class is the second most common, illustrating that any degree of tissue specificity can be obtained without these sequence features.</p>
				<fig id="F8">
					<title>
						<p>Figure 8</p>
					</title>
					<caption>
						<p>The cumulative distribution of promoter classes as a function of entropy is similar in human and mouse</p>
					</caption>
					<text>
						<p>The cumulative distribution of promoter classes as a function of entropy is similar in human and mouse. The cumulative fractions of genes with all possible combinations of CGI and TATA box features for <b>(a) </b>human and <b>(b) </b>mouse as a function of entropy <it>H</it><sub><it>g </it></sub>as computed from GNF-GEA data are shown. For example, in human about 50% of the genes with <it>H</it><sub><it>g </it></sub>&#8804; 2.5 have a CGI-/TATA+ promoter. The gray bars indicate the entropy range that is not significantly different from uniform ubiquitous expression. Curves are compiled from genes that express above 200 AU in at least one tissue. As expected, CGI+/TATA- genes are most common in less specific genes and CGI-/TATA+ genes are most common in tissue-specific genes. CGI-/TATA- genes are very common and are found nearly uniformly at every level of specificity. Furthermore, CGI+/TATA- and CGI-/TATA+ genes are both common in mid-specificity (3 &#8804; <it>H</it><sub><it>g </it></sub>&#8804; 4) genes, showing that specificity is not determined by these features alone. The trends in human and mouse data are nearly identical despite the lower rate of CpG islands in mouse. The large variations in the graph at low entropy are due to the noise inherent in the small number of genes in this range.</p>
					</text>
					<graphic file="gb-2005-6-4-r33-8"/>
				</fig>
			</sec>
			<sec>
				<st>
					<p>Functional assessment of promoter classes using Gene Ontology terms</p>
				</st>
				<p>To try to understand the functional correlates of the four promoter classes, we looked for trends in the cellular localization and biological process of the products of genes from each promoter class. We used the DAVID system <abbrgrp><abbr bid="B45">45</abbr><abbr bid="B46">46</abbr></abbrgrp>, which identifies over-represented Gene Ontology (GO) <abbrgrp><abbr bid="B47">47</abbr></abbrgrp> terms in a set of genes. A summary of the results for human and mouse genes is shown in Table <tblr tid="T6">6</tblr>. In each case the set of genes in each promoter class was compared to all genes on the corresponding Affymetrix chip.</p>
				<tbl id="T6">
					<title>
						<p>Table 6</p>
					</title>
					<caption>
						<p>Over-represented Gene Ontology (GO) terms for cellular component and biological process of genes by promoter class</p>
					</caption>
					<tblbdy cols="4">
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Cellular component/biological process</p>
							</c>
							<c ca="left">
								<p>Human only</p>
							</c>
							<c ca="left">
								<p>Mouse only</p>
							</c>
						</r>
						<r>
							<c cspan="4">
								<hr/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-/TATA+</p>
							</c>
							<c ca="left">
								<p>Extracellular, extracellular space</p>
							</c>
							<c ca="left">
								<p>
									<b>-</b>
								</p>
							</c>
							<c ca="left">
								<p>Intermediate filament (cytoskeleton)</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Response to stimulus</p>
							</c>
							<c ca="left">
								<p>Cell-cell signaling, organismal physiological process, inflammatory response, innate immune response, response to pest/pathogen/parasite</p>
							</c>
							<c ca="left">
								<p>-</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI+/TATA-</p>
							</c>
							<c ca="left">
								<p>Cell, cytoplasm, intracellular, mitochondrion</p>
							</c>
							<c ca="left">
								<p>Nucleus, ribonucleoprotein complex</p>
							</c>
							<c ca="left">
								<p>
									<b>-</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>
									<b>-</b>
								</p>
							</c>
							<c ca="left">
								<p>Nucleobase, nucleoside, nucleotide and nucleic acid metabolism, intracellular transport, metabolism, protein transport, intracellular protein transport, RNA processing, RNA metabolism, cell cycle, mitotic cell cycle</p>
							</c>
							<c ca="left">
								<p>
									<b>-</b>
								</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
							<c>
								<p/>
							</c>
						</r>
						<r>
							<c ca="left">
								<p>CGI-/TATA-</p>
							</c>
							<c ca="left">
								<p>(Integral to) (plasma) membrane</p>
							</c>
							<c ca="left">
								<p>-</p>
							</c>
							<c ca="left">
								<p>Extracellular, extracellular space</p>
							</c>
						</r>
						<r>
							<c>
								<p/>
							</c>
							<c ca="left">
								<p>Organismal physiological process, defense response, immune response, response to biotic stimulus, response to stimulus, response to external stimulus</p>
							</c>
							<c ca="left">
								<p>Response to pest/pathogen/parasite, cell communication, response to wounding, cellular defense response, signal transduction</p>
							</c>
							<c ca="left">
								<p>Complement activation, complement activation (classical pathway), humoral defense mechanism (<it>sensu </it>Vertebrata), humoral immune response</p>
							</c>
						</r>
					</tblbdy>
					<tblfn>
						<p>All terms were selected using a <it>p</it>-value &#8804; 0.05 (corrected for multiple testing). Terms common to human and mouse are listed in the second column. The two columns on the right indicate any additional terms found in only one species. The CGI-/TATA+ terms are consistent with a model of strong condition-specific induction, CGI+/TATA- terms are consistent with housekeeping functions. CGI-/TATA- terms indicate support for cell sensing and communication functions. No significant results were found for CGI+/TATA+ genes.</p>
					</tblfn>
				</tbl>
				<p>Products of genes in the CGI-/TATA+ class were often (70/198) located extracellularly. Examples of such genes are the insulin-like growth factor family, serum albumin and chymotrypsin. Some extracellular CGI-/TATA+ genes, such as luteinizing hormone beta (LHB) and bone morphogenetic protein 10 (Bmp10) in the mouse, have a high <it>H</it><sub><it>g </it></sub>because they are not induced in the tissues or at the developmental stages surveyed, but otherwise fit the pattern of secreted proteins. Gene products that are secreted from the cell must be produced at high levels to be effective. Indeed, we found that the maximum expression level of TATA+ genes is higher than that of TATA- genes; 454/745 (61%) of TATA+ genes express at least 1,000 AU in one or more tissues, whereas only 1,321/3,773 (35%) of TATA- genes express that highly (<it>p</it>-value = 0; two-sample binomial population). A second group of CGI-/TATA+ genes that is common, but with a <it>p</it>-value just over the cutoff, comprises the muscle contraction-related genes: actin, troponin and members of the myosin family. Products of these genes are also required in large amounts to create the contractile apparatus but are only produced in a few cell types. The biological processes that are enriched in the CGI-/TATA+ class differ between human and mouse, but nearly all of them are descendants of the GO term 'response to stimulus' (GO:0050896).</p>
				<p>The CGI+/TATA- promoters produce proteins that are typically located in the cell, especially in the cytoplasm and mitochondrion. These locations are consistent with many housekeeping functions. The human results for biological process suggest a large number of housekeeping processes, but these were not confirmed in the mouse using all CGI+/TATA- genes. When we consider just the least specific CGI+/TATA- mouse genes (4.45 &#8804; <it>H</it><sub><it>g </it></sub>&#8804; 5.57 bits), we find cellular locations (including the nucleus) and biological processes that match the human results.</p>
				<p>No significant concentrations of cellular locations or biological processes were found among the CGI+/TATA+ genes. A manual examination of genes in both human and mouse identifies a number of heat-shock proteins, histones and ribosomal proteins although these are not statistically significant as a result of the multiple testing correction. Many of these genes fit the expected expression pattern in that they are widely expressed and at high levels.</p>
				<p>Interestingly, the products of CGI-/TATA- genes are often located in the plasma membrane (244/499 of human genes with a cellular location) and support signaling and response to the environment. Such products, for example, bradykinin receptor B2, prolactin receptor or protocadherin 9, may be expressed in a tissue-specific pattern, but not at the high levels required for secreted proteins. The exact biological process GO terms that are statistically significant vary between mouse and human, but a common core includes defense response (GO:0006952), immune response (GO:0006955) and response to stimulus (GO:0050896). Thus these genes are similar to CGI-/TATA+ genes in that they are involved in response, but are not (typically) required to be expressed at such high levels.</p>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Discussion</p>
			</st>
			<p>We have applied Shannon entropy as a novel measure of overall tissue specificity of gene expression and have created a new statistic <it>Q </it>to assess the categorical specificity of a gene for a particular tissue. We have evaluated the performance of entropy on microarray- and EST-based estimates of tissue-specific expression and found that it correctly identifies both tissue-specific and housekeeping genes. Ranking and binning genes by entropy allowed us to begin to deconstruct core promoters into features directing when and where the gene will be expressed. We verified and extended previous observations <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> about the correlation of CpG islands with housekeeping genes and embryonic genes. We then identified differences in the base composition profile of promoters of tissue-specific and nonspecific genes. Next, we identified correlations between, on the one hand, the TATA box and tissue-specific genes, and on the other hand, the YY1 site and nonspecific genes. Finally, we identified trends in promoter classes based on CpG island and TATA box status and associated them with common cellular locations and biological processes. Similar observations were also made for the TATA box in <it>Q</it>-selected genes in pancreas.</p>
			<p>The identification of an association between promoter type and cellular location and biological function, while an important step in a fundamental understanding of biology, also has practical significance, as the genes in the CGI-/TATA+ and CGI-/TATA- classes are enriched for tissue-specific extracellular and cell surface proteins. Such genes are likely to be useful drug targets. Thus entropy <it>H</it><sub><it>g </it></sub>and <it>Q </it>have allowed us to discover fundamental properties of mammalian Pol II promoters and should serve to aid understanding of expression in particular tissues of interest.</p>
			<p>The validity of our approach is supported by findings in other work and by the fact that our results are robust with respect to the algorithm used to process the expression data. Our finding that most genes are regulated in a tissue-dependent manner is consistent with another analysis of gene expression <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, which found that housekeeping genes cluster in a tissue-specific manner. Thus, it appears, even the most basic biological functions are subject to regulation. The tissue trees we produced contain relationships similar to those in an analysis <abbrgrp><abbr bid="B48">48</abbr></abbrgrp> of mid-specificity genes, including the close relation between lung and the immune system-related organs spleen and thymus. The fact that that analysis is based on a different method and a different set of expression data gives us confidence that <it>Q</it><sub><it>g</it>|<it>t </it></sub>is properly identifying genes that are specific to a tissue. The GNF-GEA expression data we analyzed were processed with the MAS4 <abbrgrp><abbr bid="B49">49</abbr></abbrgrp> algorithm. We reanalyzed the data from this study after reprocessing them with the more recent Robust Multichip Average (RMA) algorithm <abbrgrp><abbr bid="B50">50</abbr></abbrgrp>. This algorithm tends to suppress low-level signals and we found that most genes appeared to be more tissue specific, that is, had lower <it>H</it>, in the RMA-processed data compared to the reported values. Although this affects some of the precise values we have reported, it does not alter any of the fundamental trends or results. We include tissue specificities based on both analyses in Additional data files 1 and 2.</p>
			<p>Our analysis focused on only a few sequence features and although we found good correlations, two aspects of our results indicate that there are other regulatory mechanisms not yet identified. First, there is a gradual transition in the frequency of the TATA box and CpG islands between the most and least tissue-specific genes. Second, while these features are strong indicators of high and low specificity, they are far from perfect predictors. Indeed, the middle range of entropies contains a mix of all promoter classes in large numbers, indicating that it is possible to achieve tissue-specific expression with any promoter class. YY1 may be an example of such a supplementary mechanism. While occurring in only 16% of genes, it is very strictly confined to low-specificity genes and is a better indicator of low specificity than CpG islands. We expect that other such signals will be found.</p>
			<p>Anatomical resolution is an issue with the datasets used in this study. For example, the pancreas consists of exocrine cells, ductal cells and islet cells of several types. The bulk pancreas was used to generate the GNF-GEA data, so the reported expression level is the average of the mRNA concentrations weighted by the cell-type count. This approximation reduces the maximum possible entropy and, more significantly, can make the apparent entropy different from the true entropy. Genes highly and specifically expressed in a cell type with a small population may currently appear to be ubiquitous with very low overall expression. Genes expressed in a few tissues may be revealed to be less tissue specific as more cell types are measured in detail. Genes that appear to be ubiquitously expressed may turn out not to be expressed in a few cell types. It will be interesting to see whether data with higher anatomical resolution will help to increase the accuracy of the rules we have identified here for identifying tissue-specific and nonspecific promoters.</p>
			<p>Our method can also be applied to other sources of expression data including SAGE, reverse transcription PCR (RT-PCR) and <it>in situ </it>hybridization data. SAGE has the advantage of sensitivity, as these studies generally sequence to much greater depths than EST libraries <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. <it>In situ </it>hybridization data may increase the anatomical resolution of the data. Qualitative intensities, for example, '0', '+', or '+++', can be converted to representative numeric values as appropriate. Our method can also be applied to other collections of conditions besides normal tissues, for example, different types of cancers or samples of the same cancer from multiple patients. Modification of our method to account for temporal changes in tissue specificity represents another direction for future work.</p>
			<p>The analysis presented here focuses on genes rather than on transcripts generated from different promoters from the same gene. The rate of the occurrence of alternative transcription start sites is at least 9% <abbrgrp><abbr bid="B52">52</abbr></abbrgrp> and may be as high as 25% <abbrgrp><abbr bid="B53">53</abbr></abbrgrp>. The promoters we used were specified by the DBTSS dataset but there may be alternative promoters with different characteristics and tissue-specific usage patterns. Analyses based on different RNA species can easily be incorporated into our approach and are an area of future research.</p>
			<p>Our results for CpG island frequency in very tissue-specific genes are lower than those in recent reports <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> that were based upon present/absent calls, that is, tissue counting, using ESTs to measure tissue specificity. This may be due to two reasons. First, as we described in Results, a significant fraction of genes will show no evidence of expression in poorly sampled tissues. A poorly sampled nonspecific gene will therefore appear more tissue specific than it actually is, and this increases the number of apparently tissue-specific genes with CpG islands. Second, when we use microarray data and determine tissue specificity by counting tissues expressing above the median value of 200 AU, we see (data not shown) rates of CpG island occurrence in 'specific' genes similar to those reported in <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Thus, we conclude that including the variation of expression levels rather than mere presence/absence is important for identifying very tissue-specific genes as assessed by start CpG islands.</p>
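			<p>The effect of bulk sampling can be illustrated with a small sketch in which the measured level is the cell-type-weighted average described above. The cell-type fractions, the expression levels and the function name <it>bulk_signal</it> are hypothetical values chosen for illustration; they are not measurements from this study.</p>

```python
def bulk_signal(cell_fractions, levels_by_cell_type):
    """Average expression weighted by each cell type's share of the tissue."""
    return sum(cell_fractions[c] * levels_by_cell_type[c] for c in cell_fractions)


# Hypothetical pancreas composition and expression levels (AU); illustrative only.
fractions = {"exocrine": 0.93, "ductal": 0.05, "islet": 0.02}
islet_specific_gene = {"exocrine": 0.0, "ductal": 0.0, "islet": 5000.0}

bulk_level = bulk_signal(fractions, islet_specific_gene)
print(bulk_level)
```

			<p>Under these assumed numbers, a gene expressed at 5,000 AU only in islet cells contributes about 100 AU to the bulk pancreas measurement, so a strongly cell-type-specific gene can look weakly expressed at the tissue level.</p>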
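			<p>The difference between tissue counting and an entropy-based measure can be sketched as follows. The example computes the Shannon entropy of a relative expression profile, taking the relative expression in each tissue as the level in that tissue divided by the sum over tissues, and compares it with a simple count of tissues at or above a cutoff. The two expression profiles and the function names are hypothetical and serve only to illustrate the point; they are not taken from the study data.</p>

```python
import math


def shannon_entropy_bits(levels):
    """Entropy (bits) of the relative expression profile p_t = w_t / sum(w)."""
    total = sum(levels)
    probs = [w / total for w in levels if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def tissues_above(levels, cutoff=200.0):
    """Present/absent-style count of tissues at or above the cutoff."""
    return sum(1 for w in levels if w >= cutoff)


# Hypothetical profiles (AU) over eight tissues; not data from the study.
even_gene = [250.0] * 8
peaked_gene = [8000.0] + [250.0] * 7

for name, profile in (("even", even_gene), ("peaked", peaked_gene)):
    print(name, tissues_above(profile), round(shannon_entropy_bits(profile), 2))
```

			<p>Both hypothetical genes are counted as present in all eight tissues, yet the second profile has a much lower entropy because most of its signal comes from a single tissue. Tissue counting alone would not separate the two.</p>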
			<p>These results present an initial look at the correlation between tissue specificity, CpG islands and binding sites for selected transcription factors that interact with the basal transcription apparatus. Using a novel approach with entropy-based metrics, we have begun to lay out the framework for promoter function by identifying strong correlations between tissue-specific or ubiquitous expression and a number of these sequence features. We plan to extend this work in several ways. First, we plan to identify correlations with other known transcription-factor-binding sites and novel motifs identified as over-represented in promoter regions <abbrgrp><abbr bid="B54">54</abbr></abbrgrp>. Second, these results will help to understand regulation by combinations of multiple upstream transcription factors in genes specific to particular tissues or clusters of tissues.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusions</p>
			</st>
			<p>We have used Shannon entropy to quantify and rank the tissue specificity of genes using tissue-survey data. First, this has allowed us to assess the prevalence of tissue-specific regulation; we find that most genes show evidence of some degree of tissue-dependent variation in expression levels. It has also allowed us to find and evaluate associations between promoter features and tissue specificity. We have verified and extended understanding of known associations between, on the one hand, CpG islands and the least tissue-specific genes and, on the other hand, the TATA box and the most tissue-specific genes. However, they are not the sole determinants of tissue-specific expression, as indicated by mid-specificity genes that exhibit a mix of all promoter classes. The class of CGI-/TATA- promoters has emerged as the second most common class of promoter overall and the most common promoter class in mid-specificity genes. Therefore, additional determinants of tissue specificity remain to be found. We have identified one potential determinant, a downstream YY1 site, which is very strongly associated with the least tissue-specific genes but is a relatively rare feature of these promoters. Finally, we have also been able to associate trends in the localization and function of protein products of genes according to their promoter class. Many of the CGI-/TATA+ genes code for highly expressed, very tissue specific, extracellular proteins involved in a cell's response to the environment. CGI-/TATA- genes are also involved in response to the environment, but are found more uniformly across the spectrum of tissue specificity, are not as highly expressed as CGI-/TATA+ genes, and very often code for membrane-bound proteins. CGI+/TATA- genes are more likely to be located in the cytoplasm or nucleus and, as expected, carry out housekeeping functions. 
All of the results we report are found in both human and mouse and so may reflect general principles of all mammalian species.</p>
		</sec>
		<sec>
			<st>
				<p>Materials and methods</p>
			</st>
			<sec>
				<st>
					<p>Processing GNF-GEA <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> and DoTS <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> data</p>
				</st>
				<p>The GNF-GEA data are processed as described <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Given a set of <it>N </it>tissues we define <it>p</it><sub><it>t</it>|<it>g </it></sub>= <it>w</it><sub><it>g</it>,<it>t</it></sub>/&#8721;<sub>1 &#8804; <it>t </it>&#8804; <it>N</it></sub><it>w</it><sub><it>g</it>,<it>t </it></sub>where <it>w</it><sub><it>g</it>,<it>t </it></sub>is the expression level of gene <it>g </it>in tissue <it>t</it>. DoTS, available through the AllGenes <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> site, contains ESTs and mRNAs assembled into transcripts that are then clustered into genes. We did not consider any transcript that contains only one EST, as this may represent a spurious sequence, and did not consider any gene with fewer than five ESTs, because such genes provide a poor estimate of <it>H</it><sub><it>g</it></sub>. To accommodate the great disparity in sampling depth across tissues we normalized EST counts by tissue. To avoid artificially low entropies for genes that contain relatively few ESTs we used pseudocounts to smooth the data. The expression level of a gene in a tissue is computed as <it>w</it><sub><it>g</it>,<it>t </it></sub>= (<it>n</it><sub><it>g</it>,<it>t </it></sub>+ 1)/(<it>N</it><sub><it>t </it></sub>+ <it>N</it><sub><it>g</it></sub>) where <it>n</it><sub><it>g</it>,<it>t </it></sub>is the number of ESTs from libraries for tissue <it>t </it>that are included in gene <it>g</it>, <it>N</it><sub><it>t </it></sub>is the total number of ESTs from tissue <it>t </it>assembled into genes, and <it>N</it><sub><it>g </it></sub>is the number of genes. We used different sets of tissues depending on the task. The <it>H</it><sub><it>g </it></sub>and <it>Q </it>measures in Figure <figr fid="F1">1</figr> used the full GNF-GEA mouse set with a few modifications: adipose tissue and brown fat were merged, epidermis and snout epidermis were merged, and digits and tongue were not considered, as each is a combination of skeletal muscle and epidermis.
The expression level for a set of merged tissues is the median of the individual tissue replicate medians. For comparison of microarray and EST data we used a set of 27 tissues that were common to both datasets and