Email updates

Keep up to date with the latest news and content from BMC Genomics and BioMed Central.

Open Access Research article

Human and mouse switch-like genes share common transcriptional regulatory mechanisms for bimodality

Adam Ertel and Aydin Tozeren*

Author Affiliations

Center for Integrated Bioinformatics, School of Biomedical Engineering, Science and Health Systems, Drexel University, 3141 Chestnut Street, Philadelphia PA 19104 USA

For all author emails, please log on.

BMC Genomics 2008, 9:628  doi:10.1186/1471-2164-9-628


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/9/628


Received:15 May 2008
Accepted:23 December 2008
Published:23 December 2008

© 2008 Ertel and Tozeren; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Gene expression is controlled over a wide range at the transcript level through complex interplay between DNA and regulatory proteins, resulting in profiles of gene expression that can be represented as normal, graded, and bimodal (switch-like) distributions. We have previously performed genome-scale identification and annotation of genes with switch-like expression at the transcript level in mouse, using large microarray datasets for healthy tissue, in order to study the cellular pathways and regulatory mechanisms involving this class of genes. We showed that a large population of bimodal mouse genes encoding for cell membrane and extracellular matrix proteins is involved in communication pathways. This study expands on previous results by annotating human bimodal genes, investigating their correspondence to bimodality in mouse orthologs and exploring possible regulatory mechanisms that contribute to bimodality in gene expression in human and mouse.

Results

Fourteen percent of the human genes on the HGU133A array (1847 out of 13076) were identified as bimodal or switch-like. More than 40% were found to have bimodal mouse orthologs. KEGG pathways enriched for bimodal genes included ECM-receptor interaction, focal adhesion, and tight junction, showing strong similarity to the results obtained in mouse. Tissue-specific modes of expression of bimodal genes among brain, heart, and skeletal muscle were common between human and mouse. Promoter analysis revealed a higher than average number of transcription start sites per gene within the set of bimodal genes. Moreover, the bimodal gene set had differentially methylated histones compared to the set of the remaining genes in the genome.

Conclusion

The fact that bimodal genes were enriched within the cell membrane and extracellular environment make these genes as candidates for biomarkers for tissue specificity. The commonality of the important roles bimodal genes play in tissue differentiation in both the human and mouse indicates the potential value of mouse data in providing context for human tissue studies. The regulation motifs enriched in the bimodal gene set (TATA boxes, alternative promoters, methlyation) have known associations with complex diseases, such as cancer, providing further potential for the use of bimodal genes in studying the molecular basis of disease.

Background

Our recent work applied an automated high-throughput method to classify genes with bimodal expression profiles within the mouse genome based on microarray experiments performed on healthy tissues using the Affymetrix MGU74Av2 microarray platform [1]. The identification of genes with bimodal expression is useful to identify the biological variation of genes that are tightly regulated around two discrete levels at the transcript level [2]. Many of the bimodal genes were expressed in "high" or "low" modes on a tissue-dependent basis. Enrichment analysis using Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [3] and Gene Ontology (GO) annotation [4] within this set of bimodal genes revealed that they are utilized in cell-cell communication and communication with the extracellular environment. We had also evaluated the expression of the bimodal genes in disease states for diabetes types I and II to reveal some of these genes with altered modes of expression in the disease state, revealing the roles of these genes in cell communication and immune response. As a natural extension of this work, we have applied the same automated high-throughput method to classify genes with bimodal expression in the human genome and compared the list with human orthologs of mouse bimodal genes. Moreover, we looked into the transcript-level regulation of bimodal genes using a variety of bioinformatics databases.

The detection of bimodal genes in human is useful for determining a set of genes tightly controlled around two states at the transcript level. Additionally, the identification of these bimodal genes provides an indication of how well the previous methods extend across species and different microarray platforms. While it is expected that many orthologs between human and mouse would share patterns of regulation such as bimodality, the literature has also documented that many gene regulation promoters have changed over the course of evolution between human and mouse [5]. Genes with bimodal expression profiles in both organisms may indicate conservation of alternative promoters, which have been implicated in tissue-specific expression common among the bimodal genes. Alternately, genes with known orthologs that have been identified as bimodal in only one of these species, may illustrate the instability in mammalian promoters [5,6]. Investigation of the regulatory mechanisms at play in the expression of bimodal genes should provide insight into the stability of their expression as well as how these genes may lose regulation in the process of disease [7].

There are many factors contributing to the regulation of transcription, having varying impact in the difference in expression level and the time scale over which the expression level may change, either within a cell or over a course of cell divisions. Transcription factors may enhance or inhibit expression as they bind to regulatory gene promoters effecting transcription initiation [8]. Changes in transcription factor activity may account for bimodality in genes within a single tissue over time, such as in circadian rhythm pathways [9]. Transcript-level regulation may also be achieved through epigenetic modification, such as CpG island methylation, which inhibits transcription either at the promoter region or downstream [10,11]. Additional epigenetic mechanisms, including histone modifications such as methylation and acetylation were shown to be associated with transcription initiation and elongation [12]. These epigenetic regulatory mechanisms may be linked with bimodality resulting from differentiated tissues, where stable modifications maintain a high mode of expression in a certain number of tissues, and a low mode of expression in others. Regulatory mechanisms mentioned above may work in combination to produce a variety of expression profiles, where even in bimodal genes one mechanism may account for expression level changes within a single mode of expression and an alternative regulatory mechanism may account for expression level changes between modes of expression.

In this study, we extend our classification of bimodal genes to the human genome. Our results indicate that a sizable number of genes with bimodal expression in mouse are also bimodal in human, with recurring roles of cell-cell communication and communication with the extracellular environment. Furthermore, the set of bimodal genes identified by our method shows a strong connection to epigenetic regulation, namely methylation of histone tails in gene promoter regions.

Results

Identification of bimodal genes in the human genome

Microarray data for tissue types listed in table 1 (See Additional file 1 for dataset and sample accession numbers) were used to identify bimodal genes in the human genome. Two-component mixture analysis of the 13076 unique genes represented in both Affymetrix HGU133A and HGU133plus2 microarrays identified 1847 genes, or 14%, as bimodal, with p < 0.001. Among these genes with orthologs in mouse, 42% were identified in our previous study on MGU74Av2 microarrays. The probability of obtaining this overlap from a random selection of genes, estimated from the hypergeometric distribution, is indistinguishable from zero. Additional file 2 provides the list of bimodal genes accompanied by the mouse ortholog, as well as the p-value and threshold between high and low modes of expression, XT.

Additional file 1. Microarray samples used for the identification of human bimodal genes. This file contains a worksheet "samples used" listing the HGU133A and HGU133plus2 samples used for identification of bimodal genes, a worksheet "HGU133plus2 training set" listing all HGU133plus2 samples used for RMA quantile normalization and summarization, a worksheet "array quantiles" listing array quantiles, and a worksheet "row effects" listing row effects for 22,277 probesets calculated during the RMA summarization step. http://www.microsoft.com/downloads/details.aspx?FamilyID=1cd6acf9-ce06-4e1c-8dcf-f33f669dbc3a&DisplayLang=en webcite.

Format: ZIP Size: 10.5MB Download fileOpen Data

Additional file 2. Bimodal genes identified in the human genome. This table provides a comprehensive list of bimodal genes for the human genome including parameters used in the identification of bimodal gene expression. http://www.microsoft.com/downloads/details.aspx?FamilyID=1cd6acf9-ce06-4e1c-8dcf-f33f669dbc3a&DisplayLang=en webcite.

Format: ZIP Size: 248KB Download fileOpen Data

Table 1. Microarray datasets used in this study representing normal human tissue.

Functional enrichment analysis highlights themes of cell communication

Enrichment of KEGG pathways and GO terms extended the theme of communication with neighboring cells and the extracellular environment that was also evident from bimodal genes in mouse. Enriched KEGG pathways are presented in Table 2, while enriched GO terms are detailed in Table 3, including listings for cellular component, biological process, and molecular function terms. A majority of the KEGG pathways previously identified as enriched for bimodal genes in mouse were also significantly enriched in humans (Table 2). The most highly populated pathways common to bimodal human genes and human orthologs for bimodal mouse genes are the calcium signaling, focal adhesion and tight junction pathways. The cell communication pathways ECM-receptor interaction and focal adhesion, which were identified as significant within the last study, again appeared as highly populated with bimodal genes. The KEGG ECM-receptor interaction pathway is shown in Figure 1, with enriched nodes highlighted in orange, and nodes replicated in the mouse study outlined in bold. The figure shows that integrin subunits a7, B1, and B6 and a subset of their receptors including multiple collagen types I, II, and IV, fibronectin and laminin are bimodal, as would be expected since these proteins contribute to tissue specificity. Also identified bimodal are the cell membrane receptors CD44, SV2, CD36, and Syndecan. The KEGG focal adhesion pathway, which also interacts with the ECM, is depicted in Figure 2. The figure shows that the bimodal genes are not only positioned at ECM and cell membrane but also at different stages of cell signaling, indicating the extensive role that bimodal genes play in the processing of crosstalk between cells and the ECM. Proteins that are deemed as bimodal in this pathway include EGF, ELK1, FYN, HGF, vinculin, actinin and cyclins D1 and D2. Additional file 3 presents the list of bimodal genes along the enriched KEGG pathways. Further enrichment analysis for the sets of bimodal genes expressed in the "high" and "low" modes was performed on three tissue types with the most abundant samples. Enrichment of the "high" and "low" mode subsets revealed tissue-specific activation and deactivation, as may be expected for pathways and GO terms describing specialized functions such as muscle contraction or synaptic transmission. The KEGG pathways listed in Table 4 and the GO terms listed in Table 5 demonstrate the tissue-specific activation and deactivation of bimodal genes and show consistency between human and mouse for several terms.

Additional file 3. Bimodal genes within significant KEGG pathways. This file provides a list of bimodal genes organized into KEGG pathways that were identified as significantly enriched. http://www.microsoft.com/downloads/details.aspx?FamilyID=1cd6acf9-ce06-4e1c-8dcf-f33f669dbc3a&DisplayLang=en webcite.

Format: ZIP Size: 27KB Download fileOpen Data

Table 2. KEGG pathway enrichment for human switch-like genes.

Table 3. Gene ontology enrichment for human switch-like genes.

Table 4. KEGG pathways enriched for bimodal genes with "high" or "low" expression within brain, heart, and skeletal muscle tissues in human and mouse.

Table 5. Gene Ontology terms enriched for bimodal genes with "high" or "low" expression within brain, heart, and skeletal muscle tissues in human and mouse.

thumbnailFigure 1. ECM-receptor interaction pathway enriched by human switch-like genes. Nodes enriched for human bimodal genes are colored orange, while nodes also identified as bimodal in mouse orthologs are outlined in bold. In all, the overlap between bimodal human and bimodal mouse orthologs contains thirteen unique genes represented in seven unique nodes in the ECM-receptor pathway. Nodes colored in gray were not identified as bimodal, while white nodes are used for genes that are not represented on the HGU133A array.

thumbnailFigure 2. KEGG Focal adhesion pathway enriched by human switch-like genes. Nodes enriched for human bimodal genes are colored orange, while nodes also identified as bimodal in mouse orthologs are outlined in bold. In all, the overlap between human and mouse orthologs contains twenty-two unique genes represented in nine unique nodes in the focal adhesion pathway. Nodes colored in gray were not identified as bimodal, while white nodes are used for genes that are not represented on the HGU133A array.

Promoter analysis reveals bias for TATA boxes in bimodal genes

The mammalian promoter database (MPromDB) [13] was used to assess the distribution of common promoter types within the set of bimodal genes. MPromDB contained promoters for 840 genes represented on the HGU133 arrays and promoters for 536 genes represented on the MGU74Av2 array. The frequencies of the common promoter types AP-1, AP-2, SP1, TATA and CAAT are illustrated in Figure 3A and 3B for the sets of bimodal and non-bimodal genes in human and mouse. The remaining promoter types seldom appeared and were bundled together into an "other" category. The set of bimodal genes within human and mouse shows a statistically significant bias for TATA promoters, with significance of p = 9.5e-5 for human and p = 4.9e-7 for mouse, estimated from a hypergeometric distribution. The remaining promoter types present between bimodal and non-bimodal genes revealed no significant differences that were consistent between human and mouse, suggesting that bimodality in gene expression is largely independent of the regulatory promoter type. The SP1 and AP1 promoters appeared underrepresented in mouse bimodal genes, but this result is based on only a small subset of genes and was not consistent with the results in human.

thumbnailFigure 3. Summary of promoter usage between bimodal and non-bimodal subsets. The relative frequency of core promoter types cataloged in MPromDB [13] is shown for bimodal and non-bimodal gene subsets in A) human and B) mouse. The number of alternative promoters per gene is shown for bimodal genes and non-bimodal genes for C) human and D) mouse. For a subset of bimodal genes with multiple alternative promoters, tissue-dependent alternative promoters from DBTSS [14] corresponded to the mode of expression, as shown for glutamate receptors E) GRIA2 in human and F) Gria1 in mouse.

Alternative promoter sites more common in bimodal genes

The database of transcription start sites (DBTSS) [14] was used to evaluate the number of alternative promoters associated with genes in the bimodal and non-bimodal subsets for human and mouse. The distribution of alternative promoters was shifted towards a higher number of promoters per gene for those with bimodal distributions, as shown in Figure 3C and 3D, providing evidence for some contribution towards the dynamic range of gene expression required for bimodality. When compared against non-bimodal genes, two or more promoters are more common in bimodal genes for both human and mouse, with a respective significance level of p ≈ 0 and p = 1.9e-8, estimated from a hypergeometric distribution. Multiple promoters per gene may be prevalent but not required for bimodal expression. Alternative promoter sites in human and mouse were tested for tissue-selective usage corresponding to the mode of expression for bimodal genes. Alternative promoters for 168 bimodal genes in human were identified as corresponding with the mode of expression across within the nineteen tissues. Alternative promoter data were not available for skeletal muscle tissue in mouse, but data for the remaining 18 tissues identified 131 genes with at least one alternative promoter corresponding to the mode of expression. Random permutation of the tissue labels was used to estimate a median false discovery rate of 4%. Though there was no overlap between these tissue-selective promoter gene sets in human and mouse, there were several pathways in common for this comparison, including the neuroactive ligand-receptor interaction, gap junction, and calcium signaling pathway. The alternative promoters identified as corresponding to the mode of expression in both human and mouse were largely brain-specific. For example, the genes GRIA2 in human and Gria1 in mouse encoding for glutamate receptor proteins in the neuroactive ligand-receptor interaction pathway were associated with multiple alternative promoters specific to brain. Expression box plots across the nineteen tissues having at least one promoter specific to brain, is shown in Figure 3E and 3F for human GRIA2 and mouse Gria1 genes, respectively. These results indicate that multiple alternative promoters may provide redundancy and that a single mode of expression does not necessarily correspond with a unique alternative promoter.

DNA methylation shows a negligible contribution to bimodal gene expression

Cytosine methylation has been shown to provide a stable mechanism in mammals for altering DNA-protein interactions [10]. Genes can be transcribed from methylation-free promoters even when adjacent transcribed and nontranscribed regions are extensively methylated [10]. Methylation of CpG-rich promoters prevents transcriptional initiation and ensures the silencing of imprinted genes and genes in the X Chromosome [10]. Recent data given by Illingworth et al. [15] allowed us to investigate aspects of epigenetic regulation for their contribution to bimodal gene regulation. These authors surveyed methylation within blood, brain, muscle, and spleen and obtained lists of genes with methylated CpG islands in 5', intragenic, and 3' regions, which mapped to roughly 6–8% of human genes. The genes identified with intragenic DNA methylation were more common among the set of bimodal genes, suggesting that the inhibitory effect of DNA methylation on transcription elongation [11] may be a regulatory mechanism for bimodal genes. We also used the methylation data from Illingworth et al. [15] to test the relationship between DNA methylation status and the mode of expression within bimodal genes. The results varied for each tissue type, with the largest differences being decreased DNA methylation in bimodal genes with a "high" mode of expression in brain and increased DNA methylation in bimodal genes with a "high" mode of expression in muscle. DNA methylation is typically considered a gene silencing mechanism, which would correspond to low expression. However, the very small portion of genes represented in the CpG island methylation data for these four tissues may not be an adequate set to observe a consistent trend.

Histone methylation provides a switching mechanism for bimodal genes

Next, we considered the possible role of epigenetic regulation as a switching mechanism for bimodal genes. A recent dataset that mapped histone modifications across the human genome for three cell types, including human embryonic H9 stem cells (hES), liver cells (hepatocytes), and B-cell lymphocytes [12] was used to evaluate the enrichment of histone 3 lysine 4 trimethylation (H3K4me3) at the promoters of bimodal genes. The H3K4me3 enrichment based on each of these three tissue types did not suggest a role in the regulation of bimodal genes (Figure 4A). However, if histone methylation provided a switching mechanism for bimodal gene expression, this would be evident in the differential methylation between tissue types, and not methylation status pertaining to a single tissue type. We used the data from these three tissues to create lists of genes with differentially enriched H3K4me3 regions for liver versus H9 hES cells and for B-cells versus H9 hES cells. These sets of differentially enriched H3K4me3 regions appeared with 50 to 100% higher frequency in bimodal genes compared to non-bimodal genes, as seen in Figure 4A. To further investigate the correlation between histone methylation and bimodal gene expression, we gathered additional microarray samples corresponding to H9 stem cells (GEO dataset accession numbers GSE9865, GSE8884, and GSE2248) and evaluated the mode of expression for bimodal genes within those H9 stem cell samples as well as liver samples within our dataset. We identified a group of bimodal genes as I) "high" in liver but "low" in stem cells, II) "low" in liver but "high" in stem cells, and III) expressed in common modes between these two tissues ("high" in both or "low" in both). These results are plotted in Figure 4B. Group I (green "+" symbols in Figure 4B) had a corresponding increase in methylation enrichment for liver vs. stem cells for nearly 85% of the genes, while group I (blue "x" symbols in Figure 4B) had a corresponding decrease in liver vs. stem cells for 77% of the genes. Approximately, 65% of the remaining bimodal genes expressed in common modes between these two tissue types (black points in Figure 4B) were within the standard deviation around the line y = x. These results demonstrate a strong association between histone methylation status and the mode of expression for bimodal genes.

thumbnailFigure 4. Bimodal gene enrichment for promoter region methylation of lysine 4 of histone H3 (H3K4me3). A) Fraction of bimodal vs. non-bimodal genes enriched for histone methylation within their promoters as reported by Guenther et al. [12] for H9 hES cells, liver (hepatocytes), and B-cells. The fourth and fifth sets of bars represent the set of genes enriched at high confidence within one tissue but not the other liver versus hES cells and B-cell versus hES cells. B) H3K4me3 enrichment ratio from Guenther et al. [12] for liver vs. stem cells is shown for bimodal genes. Genes expressed with "high" mode in liver and "low" mode in H9 stem cells are shown with green "+" symbols, while bimodal genes expressed with "low" mode in liver and "high" mode in stem cells are shown with blue "x" symbols. Black points are used for the remaining bimodal genes expressed in common modes between these two tissue types. The standard deviation around the line y = x (solid red line) is shown as dashed red lines.

Discussion

In this study, using a large-scale microarray database, we have annotated 1847 human genes as having bimodal gene expression profiles. A recent study used again a large human microarray dataset for cancer samples to identify nearly 800 bimodal genes with the employment of a model-based clustering algorithm [7]. A comparison of their list against our list of bimodal genes resulted in 285 common elements, suggesting that bimodal genes in our list may not perform as bimodal in disease states in addition to possible switching of expression state in a disease state from one mode to another. Even in healthy tissue comparison, orthology argument did not entirely preserve bimodality in mouse and human data. Nearly 40% of the genes in this list corresponded to human orthologs of mouse bimodal genes that were annotated in our previous study [1]. Bimodality within 40% of human-mouse orthologs can be viewed as substantial overlap when considering that besides measurement noise and slightly different tissue types represented by datasets for each organism, differences exist in transcript sequences and transcript regions targeted by the microarray probes for orthologous genes among the two species. Further differences in gene expression between the two species arise from changes in regulatory sequences resulting from evolution [5,6]. This overlap demonstrates some degree of stability of bimodality in these datasets, even though we did not force identical tissue type quantities between the two organisms.

Our study shows that bimodal genes make a large contribution to the proteins composing the extracellular matrix as well as external membrane proteins. The enrichment within GO cellular component terms may at first appear contradictory, since the results include disparate terms related to the plasma membrane, cytoplasm, and extracellular matrix, while the KEGG pathway findings more highlighted extracellular communication. However, GO terms do not have a direct correspondence with KEGG pathways and gene membership is not exclusive to a single term. Mapping KEGG pathways to GO cellular component reveals that 66% of the bimodal genes in Focal Adhesion are contained in the cytoplasm; 40% of genes in both ECM-receptor interaction and Focal adhesion are contained in the GO cellular component "plasma membrane." Within the cell membrane side of the ECM-receptor interaction pathway, integrin subunits α7, β1, and β6 were identified as bimodal, while several others, including α2–α6 and β3–β5 were not. This finding suggests that integrin complexes are regulated by an interplay of transcript-level regulation as well as previously shown post-translational modifications [16,17]. In addition, several bimodal genes in the focal adhesion pathway are linked to phosporylation of β-catenin, a key element in the Wnt – signaling pathway, which plays a functional role in cell fate, proliferation, and apoptosis [18]. The Wnt-signaling pathway is active in development and is also a culprit in disease such as colorectal cancer and melanomas [18,19]. As such, bimodal genes upstream from these interactions provide potential markers for tissue-specific signaling as well as metabolic and chronic diseases.

Functional enrichment analysis of "high" and "low" expression subsets of the bimodal genes reveals that they play a role of activation and deactivation of tissue-specific pathways (table 4) and processes (Table 5). Bimodal gene sets involved in the Focal adhesion and ECM-receptor pathways demonstrated consistent modes of expression across human and mouse for brain, heart, and skeletal muscle, the three tissue types with the largest amount of samples in our investigation (Table 4). GO enrichment for "high" and "low" mode gene sets showed consistency between tissue-specific modes of expression in human and mouse, demonstrated by biological processes such as synaptic transmission in brain, and muscle contraction in skeletal muscle (Table 5). The consistency in expression modes of bimodal genes in the mouse and human is further reinforced by brain-specific expression for the cellular components of synapse, post-synaptic membrane, and muscle-specific expression for structural components of muscle, such as sarcoplasmic reticulum, collagen, and cytoskeleton (Table 5). Taken together, these findings indicate that the mode of expression for bimodal genes plays a role in the stable differentiation of specialized tissues, and pathway-specific usage of these genes is conserved across human and mouse in several cases.

Bimodality implies a high degree of transcript-level regulation, and bimodal genes may act as switches for the direction of signals and/or metabolic flow. Our study shows that bimodality appears to arise independently from the type of promoter present, even though we estimate the number of TATA boxes in bimodal genes is enriched, appearing in over 80% of bimodal genes with documented promoter sites. This may merely reflect a bias in gene annotation, as the involvement of these genes among pathways of interest, such as MAPK signaling and ECM-receptor interaction, may draw more focus for experimentation. The number of alternative (promoter) transcription start sites appears to have an influence on the bimodality of gene expression. Unlike the limited number of experimentally produced promoter binding sites, alternative promoter sites have been assessed by genome-wide mapping of transcript 5' ends [14]. While the number of known alternative promoter start sites for bimodal genes is shifted to a higher number than for non-bimodal genes, it is not sufficient to explain the phenomenon of bimodality. Additionally, previous studies investigating the usage of alternative promoters by gene ontology cellular component reveal that genes with several alternative promoters play a role in signaling, but do not contribute to the extracellular region, suggesting a difference from the set of bimodal genes [20]. This still allows for bimodal genes to include a subset of genes with a higher than average number of alternative promoters that work in concert with other regulatory mechanisms such as DNA methylation and histone modification.

We have shown that bimodal gene expression has a bias for multiple alternative promoters, as well as an association with histone methylation (H3K4me3), though a complete description of the links between all possible regulatory mechanisms cannot be made with currently available data. A recent study has shown that CpG-specific RNA polymerase II binding, associated with transcription initiation, is conserved among different tissue types [21]. A large portion of these may constitute the set of housekeeping genes, while others may appear at high modes of expression in some tissues, while silenced in other tissues via CpG methylation. A link has also been demonstrated between DNA methylation and histone methylation, where genes that undergo transcription initiation require H3K4 methylation as well as unmethylated DNA [22,23]. Consistent with our findings in this study, H3K4 trimethylation was previously associated with transcriptionally active genes [24]. The presence or absence of this modification can achieve switch-like regulation [25,26]. Histone methylation, along with DNA methylation, is a key player in cell differentiation during development and maintain cell lineage [15,27]. This stable regulation also maintains the balance between cell communication molecules and the extracellular environment [28]. Aberrant histone methylation patterns are among the epigenetic modifications that give rise to cancer [29]. As the mode of expression for bimodal genes is closely related to H3K4me3 status, gene expression levels may be used as a surrogate for detecting aberrant patterns of methylation associated with disease. While our knowledge of methylation associated with gene regulation may be incomplete, genes regulated through alternative promoters have an additional layer of complexity, as they can have largely different methylation status at individual promoters from tissue to tissue [30]. The regulatory mechanism for bimodal genes may therefore include a complex logic of DNA and histone methylation among alternative promoters, in addition to positive and negative regulation through transcription factor binding.

Alternative splicing events may present another explanation for bimodality in the expression of genes. Alternative-splicing isoforms have been identified as tissue-specific [31]. A substantial portion of alternative splicing isoforms are also associated with nonsense-mediated mRNA decay [32]. Three out of five genes identified with muscle-specific alternative splicing in Xu et al. [31] (PDLIM7, TPM2, and FHL1) were identified as bimodal and expressed in "high" mode in our microarray data for muscle but not other tissues. Five out of the twenty-two genes identified as having brain-specific alternative splicing in Xu et al. [31], were identified as having bimodal expression, but only one of these (CDC42) was expressed in a brain-specific manner. This indicates the possibility that stable transcript splice isoforms account for the high mode of expression in specific tissues, while alternative splice isoforms undergo nonsense-mediated decay. The methylation/promoter analysis presented here is a first step towards understanding the complex interplay of various molecular mechanisms affecting transcription in human and mouse genomes.

Conclusion

This research expanded our representation of "switch-like" gene expression by cataloging the bimodal genes evident in human microarray data for diverse tissue types. Results obtained from human data affirm that genes with bimodal, switch-like expression play a large role in cells communication with the extracellular environment. Tissue-specific modes of expression among the bimodal genes organized by KEGG pathways and GO cellular component revealed that they play a role in tissue specialization that is in common between human and mouse. Equally as important, our results indicate bimodal genes capture epigenetic aspects of gene regulation, indicative of gene expression levels that are stable across cell divisions. These findings verify that biologically relevant information can be inferred from bimodal distributions, much in the way that housekeeping genes have been used. Because the "high" mode of expression modes corresponds well with histone methylation enrichment in promoter regions, bimodal genes may serve as biomarkers for complex diseases such as cancer, where aberrant histone methylation is a known factor in disease progression. Through the identification of condition-specific modes of expression within healthy tissue and disease subtypes, the method presented allows for an alternative approach to differential gene expression analysis.

Methods

Data Selection

Human gene expression microarray datasets were obtained from both the Gene Expression Omnibus (GEO) [33,34] and ArrayExpress [35,36] online repositories. For the purpose of comparing a subset of human bimodal genes with those identified in mouse, we created a microarray dataset with comparable tissue samples to those used in the mouse study (Table 1) [1]. In order to adequately represent some tissue types, it was necessary to combine datasets from Affymetrix HGU133A and HGU133plus2 arrays, which have 22,277 probesets in common.

Microarray normalization

Affymetrix probe intensities were filtered to exclude probesets that are not shared between the HGU133A and HG133plus2 microarrays. The remaining probesets were normalized using Robust Multichip Average (RMA) background correction, quantile normalization, and summarization approach for large datasets described as the refRMA algorithm [37]. RMA background adjustment was performed on each chip. Quantile normalization was performed by computing probe-level quantiles from a 940-array (HGU133plus2) training set (Additional File 1) and applying these quantiles to additional HGU133A arrays as they were added to the dataset. RMA summarization was performed by median polishing on the 940-array training set and storing the row effects (probe effects) to be applied to additional arrays as they were added to the dataset. The quantile normalization and row effect vectors are provided in Additional File 1.

Gene annotation

Annotation for Entrez Gene ID, EMBL accession, gene symbol, and gene ontology biological process, cellular component, and molecular function were retrieved from the HGU133plus2 and MGU74Av2 annotation files updated March 2008 on the Affymetrix website [38]. KEGG pathway descriptions were retrieved April 29th, 2008 from the KEGG ftp site [3,39]. Orthologous gene pairs between mouse and human were identified using the EMBL accession number database from OMA browser [40,41], dated November 2007, in addition to matching official gene symbols. The data were then imported to Matlab R2007b (The Mathworks Inc., Natick, MA, USA), where all subsequent procedures were implemented.

Identification of bimodal genes in the human genome

Bimodal genes were identified in the human genome by fitting a two-component mixture model, as detailed in the methods of our previous work [1]. Briefly, we tested the hypothesis H1 that gene expression distribution follows a two-component (bimodal) mixture against the hypothesis H0 of a single normal distribution, adjusted for skewness using a box-cox transformation. The log likelihood ratio test statistic -2logλ was computed for the two-component mixture hypothesis H1 versus the null hypothesis H0 of a single component. Candidates for bimodal "switch-like" genes were selected as those with p-values no more than 0.001 based on a chi-square distribution with six degrees of freedom at the values of -2logλ. This subset of genes was further reduced by the imposing the requirement that the standardized area of intersection A was less than 0.10. We define A as the area of overlap between the two components, representing type I and type II error for the estimated bimodal distribution, divided by the total area. A typical two-component representation of bimodal gene expression is depicted in Figure 5. To verify that chip effect differences between the HGU133plus2 and HGU133A did not influence the identification of bimodal genes, this procedure was applied to the subset of samples both including and excluding HGU133A arrays. Muscle tissue was selected for this due to the large number of samples and the roughly equal portions of HGU133plus2 and HGU133A arrays for this tissue type. Bimodal expression patterns were identified within muscle tissue first for the HGU133plus2 samples, then for all muscle samples, which adds in potential chip effects as well as lab effects. Identification of bimodal profiles within these samples demonstrated consistency across array types, as classification was consistent for 97% of probesets.

thumbnailFigure 5. Bimodal gene expression. The histogram and normal mixture probability density function (pdf) are shown for a typical bimodal gene. The threshold between the high and low mode of expression is labeled as XT and darker shading is used to represent the misclassification region.

Functional enrichment analysis

KEGG pathway and GO annotations described above were used to compute functional enrichment scores for all switch-like genes. Functional enrichment analysis was performed in Matlab by calculating the ratio of genes belonging to a functional category within a gene set of interest against the total number of genes belonging to that functional category within the 22,277 common probesets between the HGU133A and HGU133plus2 arrays. Enrichment p-values were computed from a hypergeometric distribution [42]. The p-value cutoffs were selected at 0.01 for KEGG pathways and 0.001 for GO terms, to reduce the false discovery rate. The set of candidate bimodal genes was distributed among 186 unique KEGG pathways and 618 unique GO cellular component terms, for which an expected 1.9 and 0.6 of the terms may appear significant by chance at these p-value cutoffs, respectively. Functional enrichment analysis was repeated within the list of significant terms to identify tissue-specific behavior within brain, heart, and muscle tissue. These three tissues were selected because they are represented by a large number of samples within the data, and several terms that should be specific to these tissue types were identified as significantly enriched. Using the binomial distribution approach detailed in [1], we identified bimodal genes that were expressed either "high" or "low" within these three tissue types. KEGG pathway and GO subsets were tested for enrichment for sets of "high" and "low" mode genes against the hypergeometric distribution with p = 0.05.

Promoter analysis

The set of bimodal genes was evaluated for regulatory mechanisms including the core promoter type and the number of alternative promoters. Genes were separated into subsets of bimodal and non-bimodal for promoter analysis to evaluate differences within each of human and mouse. Promoter sites and sequence motifs were obtained from MPromDB [13]. Promoters corresponding to the targets of transcription factors were mapped to the sets of bimodal and non-bimodal genes using Entrez gene ID. The remaining promoter types seldom appeared and were bundled together into an "other" category. The frequency that each of these regulatory sites, including AP-1, AP-2, SP1, TATA and CAAT-signal, appear within bimodal genes and non-bimodal genes was assessed for the bimodal and non-bimodal subsets in human and mouse. Additional annotation for human and murine genes describing the number of alternative promoter sites (hspromoter.tab for human and mmpromoter.tab for mouse, both dated June 12th, 2007) were downloaded from the database of transcriptional start sites (DBTSS) [14]. The distribution of alternative promoters was computed as a histogram within the sets of bimodal genes and non-bimodal genes within human and mouse. The statistical significance for two, three, four, and two or more promoters was estimated using the hypergeometric distribution.

Analysis of DNA methylation effect on mode of expression

DNA methyaltion was explored as a regulatory mechanism for bimodal genes due to its known association with gene silencing. CpG methylation data were obtained from Illingworth at al.; supplementary information Dataset S1 [15]. This dataset documents methylation sites for roughly 8% of the human genome across four tissues – blood, brain, muscle, and spleen. The frequency of methylation sites that were mapped to 5', 3', and intragenic regions of known genes was computed and significance of enrichment within bimodal genes was estimated from the hypergeometric distribution. This methylation data were also used to evaluate correspondence between methylation status within either the 5' or intragenic regions and the mode of expression in bimodal genes. Genes were assigned to a mode of expression within each of the four tissues by treating expression measurements within each tissue as Bernoulli trials against the binomial distribution, as described in [1]. The frequency of DNA methylation was then calculated in the subsets of "high" and "low" genes for each of the four tissues.

Comparison of histone methylation enrichment versus mode of expression

The final regulatory mechanism that was assessed for a contribution to bimodal gene expression was histone methylation. Methylation data were obtained from Guenther et al. Table S3 and Table S4, which describes H3K4me3 enrichement scores and locations across the human genome for three cell types: human embryonic stem cells (hES), liver cells (hepatocytes), and B-cell lymphocytes [12]. Enrichment scores for H3K4me3 designated as high-confidence in Guenther et al. were used to create a gene set for each of these three tissue types [12]. Two additional gene sets were created based on differential H3K4me3 enrichment for liver versus H9 hES cells and for B-cells versus H9 hES cells. For example, the liver versus H9 hES gene set includes those enriched with high confidence in liver but not H9 hES cells in addition to those with high confidence in H9 hES cells but not liver. The frequency of histone methylation sites based on these three tissues as well as the differentially enriched sites were evaluated for the sets of bimodal genes and non-bimodal genes within human. Additionally, the significance of each list of sites was evaluated using the hypergeometric distribution.

To further investigate the interplay between histone methylation and bimodal gene expression, we gathered additional microarray samples corresponding to H9 stem cells (samples GSM249282, GSM225045, and GSM38629, from datasets GSE9865, GSE8884, and GSE2248, respectively) and evaluated the mode of expression for bimodal genes within that those H9 stem cells as well as liver samples within our dataset. Using the binomial distribution approach detailed in [1], we identified a group of bimodal genes as I) "high" in liver but "low" in stem cells, II) "low" in liver but "high" in stem cells, and III) expressed in common modes between these two tissues ("high" in both or "low" in both). These three subsets were then used to create a scatter plot of H3K4me3 enrichment, excluding genes with enrichment scores below a 2-fold enrichment in both hES and hepatocytes.

Authors' contributions

AE and AT worked together on this project. AE implemented the algorithms, performed the computations and provided a first draft for the manuscript. Both authors read and approved the final version of the manuscript.

Acknowledgements

This study was supported by the National Institute of Health (NIH) grant #232240 and by the National Science Foundation (NSF) grant # 235327.

References

  1. Ertel A, Tozeren A: Switch-like genes populate cell communication pathways and are enriched for extracellular proteins.

    BMC Genomics 2008, 9(1):3. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  2. Paliwal S, Iglesias PA, Campbell K, Hilioti Z, Groisman A, Levchenko A: MAPK-mediated bimodal gene expression and adaptive gradient sensing in yeast.

    Nature 2007, 446(7131):46-51. PubMed Abstract | Publisher Full Text OpenURL

  3. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y: KEGG for linking genomes to life and the environment.

    Nucleic Acids Res 2008, (36 Database):D480-484. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology.

    Nat Genet 2000, 25(1):25-29. PubMed Abstract | Publisher Full Text OpenURL

  5. Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, MacIsaac KD, Rolfe PA, Conboy CM, Gifford DK, Fraenkel E: Tissue-specific transcriptional regulation has diverged significantly between human and mouse.

    Nat Genet 2007, 39(6):730-732. PubMed Abstract | Publisher Full Text OpenURL

  6. Keightley PD, Lercher MJ, Eyre-Walker A: Evidence for Widespread Degradation of Gene Control Regions in Hominid Genomes.

    PLoS Biology 2005, 3(2):e42. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Zhang J, Ji Y, Zhang L: Extracting three-way gene interactions from microarray data.

    Bioinformatics 2007, 23(21):2903-2909. PubMed Abstract | Publisher Full Text OpenURL

  8. Karin M: Too many transcription factors: positive and negative interactions.

    New Biol 1990, 2(2):126-131. PubMed Abstract OpenURL

  9. Travnickova-Bendova Z, Cermakian N, Reppert SM, Sassone-Corsi P: Bimodal regulation of mPeriod promoters by CREB-dependent signaling and CLOCK/BMAL1 activity.

    Proceedings of the National Academy of Sciences 2002, 99(11):7728-7733. Publisher Full Text OpenURL

  10. Jones PA, Takai D: The Role of DNA Methylation in Mammalian Epigenetics.

    Science 2001, 293(5532):1068-1070. PubMed Abstract | Publisher Full Text OpenURL

  11. Lorincz MC, Dickerson DR, Schmitt M, Groudine M: Intragenic DNA methylation alters chromatin structure and elongation efficiency in mammalian cells.

    Nat Struct Mol Biol 2004, 11(11):1068-1075. PubMed Abstract | Publisher Full Text OpenURL

  12. Guenther MG, Levine SS, Boyer LA, Jaenisch R, Young RA: A chromatin landmark and transcription initiation at most promoters in human cells.

    Cell 2007, 130(1):77-88. PubMed Abstract | Publisher Full Text OpenURL

  13. Sun H, Palaniswamy SK, Pohar TT, Jin VX, Huang THM, Davuluri RV: MPromDb: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-chip experimental data.

    Nucl Acids Res 2006, 34(suppl_1):D98-103. Publisher Full Text OpenURL

  14. Suzuki Y, Yamashita R, Nakai K, Sugano S: DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs.

    Nucl Acids Res 2002, 30(1):328-331. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  15. Illingworth R, Kerr A, DeSousa D, Jorgensen H, Ellis P, Stalker J, Jackson D, Clee C, Plumb R, Rogers J, Humphray S, Cox T, Langford C, Bird A: A Novel CpG Island Set Identifies Tissue-Specific Methylation at Developmental Gene Loci.

    PLoS Biology 2008, 6(1):e22. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. Ahmad I, Hoessli DC, Walker-Nasir E, Choudhary MI, Rafik SM, Shakoori AR: Phosphorylation and glycosylation interplay: protein modifications at hydroxy amino acids and prediction of signaling functions of the human beta3 integrin family.

    J Cell Biochem 2006, 99(3):706-718. PubMed Abstract | Publisher Full Text OpenURL

  17. Veiga SS, Elias MCQB, Gremski W, Porcionatto MA, da Silva R, Nader HB, Brentani RR: Post-translational Modifications of alpha 5beta 1 Integrin by Glycosaminoglycan Chains. THE alpha 5beta 1 INTEGRIN IS A FACULTATIVE PROTEOGLYCAN.

    J Biol Chem 1997, 272(19):12529-12535. PubMed Abstract | Publisher Full Text OpenURL

  18. Maretto S, Cordenonsi M, Dupont S, Braghetta P, Broccoli V, Hassan AB, Volpin D, Bressan GM, Piccolo S: Mapping Wnt/beta-catenin signaling during mouse development and in colorectal tumors.

    Proceedings of the National Academy of Sciences 2003, 100(6):3299-3304. Publisher Full Text OpenURL

  19. Moon RT, Bowerman B, Boutros M, Perrimon N: The Promise and Perils of Wnt Signaling Through beta-Catenin.

    Science 2002, 296(5573):1644-1646. PubMed Abstract | Publisher Full Text OpenURL

  20. Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, Yamamoto J-i, Sekine M, Tsuritani K, Wakaguri H, Ishii S, Sugiyama T, Saito K, Isono Y, Irie R, Kushida N, Yoneyama T, Otsuka R, Kanda K, Yokoi T, Kondo H, Wagatsuma M, Murakawa K, Ishida S, Ishibashi T, Takahashi-Fujii A, Tanase T, Nagai K, Kikuchi H, Nakai K, Isogai T, Sugano S: Diversification of transcriptional modulation: Large-scale identification and characterization of putative alternative promoters of human genes.

    Genome Res 2006, 16(1):55-65. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Rozenberg J, Shlyakhtenko A, Glass K, Rishi V, Myakishev M, FitzGerald P, Vinson C: All and only CpG containing sequences are enriched in promoters abundantly bound by RNA polymerase II in multiple tissues.

    BMC Genomics 2008, 9(1):67. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  22. Irvine RA, Lin IG, Hsieh C-L: DNA Methylation Has a Local Effect on Transcription and Histone Acetylation.

    Mol Cell Biol 2002, 22(19):6689-6696. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  23. Okitsu CY, Hsieh C-L: DNA Methylation Dictates Histone H3K4 Methylation.

    Mol Cell Biol 2007, 27(7):2746-2757. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Santos-Rosa H, Schneider R, Bannister AJ, Sherriff J, Bernstein BE, Emre NCT, Schreiber SL, Mellor J, Kouzarides T: Active genes are tri-methylated at K4 of histone H3.

    Nature 2002, 419(6905):407-411. PubMed Abstract | Publisher Full Text OpenURL

  25. Bannister AJ, Kouzarides T: Reversing histone methylation.

    Nature 2005, 436(7054):1103-1106. PubMed Abstract | Publisher Full Text OpenURL

  26. Fischle W, Wang Y, David Allis C: Binary switches and modification cassettes in histone biology and beyond.

    Nature 2003, 425(6957):475-479. PubMed Abstract | Publisher Full Text OpenURL

  27. Margueron R, Trojer P, Reinberg D: The key to development: interpreting the histone code?

    Current Opinion in Genetics & Development 2005, 15(2):163-176. PubMed Abstract | Publisher Full Text OpenURL

  28. Song MR, Ghosh A: FGF2-induced chromatin remodeling regulates CNTF-mediated gene expression and astrocyte differentiation.

    Nat Neurosci 2004, 7(3):229-235. PubMed Abstract | Publisher Full Text OpenURL

  29. Kurdistani SK: Histone modifications as markers of cancer prognosis: a cellular view.

    Br J Cancer 2007, 97(1):1-5. PubMed Abstract | Publisher Full Text OpenURL

  30. Cheong J, Yamada Y, Yamashita R, Irie T, Kanai A, Wakaguri H, Nakai K, Ito T, Saito I, Sugano S, Suzuki Y: Diverse DNA Methylation Statuses at Alternative Promoters of Human Genes in Various Tissues.

    DNA Res 2006, 13(4):155-167. PubMed Abstract | Publisher Full Text OpenURL

  31. Xu Q, Modrek B, Lee C: Genome-wide detection of tissue-specific alternative splicing in the human transcriptome.

    Nucl Acids Res 2002, 30(17):3754-3766. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  32. Lareau LF, Brooks AN, Soergel DA, Meng Q, Brenner SE: The coupling of alternative splicing and nonsense-mediated mRNA decay.

    Adv Exp Med Biol 2007, 623:190-211. PubMed Abstract OpenURL

  33. Gene expression omnibus [http://www.ncbi.nlm.nih.gov/geo/] webcite

  34. Barrett T, Edgar R: Gene expression omnibus: microarray data storage, submission, retrieval, and analysis.

    Methods Enzymol 2006, 411:352-369. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  35. ArrayExpress [http://www.ebi.ac.uk/microarray-as/aer/] webcite

  36. Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA: ArrayExpress – a public repository for microarray gene expression data at the EBI.

    Nucleic Acids Res 2003, 31(1):68-71. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  37. Katz S, Irizarry R, Lin X, Tripputi M, Porter M: A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database.

    BMC Bioinformatics 2006, 7(1):464. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  38. Affymetrix [http://affymetrix.com] webcite

  39. KEGG FTP [ftp://ftp.genome.jp/pub/kegg/] webcite

  40. OMA Browser [http://omabrowser.org/] webcite

  41. Schneider A, Dessimoz C, Gonnet GH: OMA Browser – exploring orthologous relations across 352 complete genomes.

    Bioinformatics 2007, 23(16):2180-2182. PubMed Abstract | Publisher Full Text OpenURL

  42. Zhang B, Kirov S, Snoddy J: WebGestalt: an integrated system for exploring gene sets in various biological contexts.

    Nucleic Acids Res 2005, (33 Web Server):W741-748. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL