Recent genomic scale survey of epigenetic states in the mammalian genomes has shown that promoters and enhancers are correlated with distinct chromatin signatures, providing a pragmatic way for systematic mapping of these regulatory elements in the genome. With rapid accumulation of chromatin modification profiles in the genome of various organisms and cell types, this chromatin based approach promises to uncover many new regulatory elements, but computational methods to effectively extract information from these datasets are still limited.
We present here a supervised learning method to predict promoters and enhancers based on their unique chromatin modification signatures. We trained Hidden Markov models (HMMs) on the histone modification data for known promoters and enhancers, and then used the trained HMMs to identify promoter or enhancer like sequences in the human genome. Using a simulated annealing (SA) procedure, we searched for the most informative combination and the optimal window size of histone marks.
Compared with the previous methods, the HMM method can capture the complex patterns of histone modifications particularly from the weak signals. Cross validation and scanning the ENCODE regions showed that our method outperforms the previous profile-based method in mapping promoters and enhancers. We also showed that including more histone marks can further boost the performance of our method. This observation suggests that the HMM is robust and is capable of integrating information from multiple histone marks. To further demonstrate the usefulness of our method, we applied it to analyzing genome wide ChIP-Seq data in three mouse cell lines and correctly predicted active and inactive promoters with positive predictive values of more than 80%. The software is available at http://http:/nash.ucsd.edu/chromatin.tar.gz webcite.
Transcriptional regulation in eukaryotic cells requires highly orchestrated interactions between transcription factors (TFs), their co-factors, RNA polymerase and the chromatin [1,2]. Several classes of regulatory elements, including promoters, enhancers, silencer and insulators, are involved in this process. Systematic and precise mapping of these elements in the genome is essential for understanding transcriptional programs responsible for temporal and tissue specific gene expression. A high throughput experimental approach has recently been used to tackle this problem and it involves the chromatin immunoprecipitation assay followed by microarray (ChIP-chip)[3,4] or large scale sequencing (ChIP-Seq)[5-8]. Currently, this approach is still limited by the availability of antibody specifically recognizing individual TFs at different regulatory elements. Another method involves comparative genomic analysis of related genomes[9,10] and clustering of multiple sequence motifs[11-13]. This approach has been successfully applied to a number of eukaryotic genomes including yeast, Drosophila and mammal genomes (see review, for example, ). These methods rely on precise alignment of regulatory elements across multiple genomes which is not necessarily true for all elements, or prior knowledge of a set of cooperative TFs which is not always available.
Recently, a chromatin based regulatory element mapping approach has been proposed. This approach exploits the observation that transcriptional promoters and enhancers are associated with distinct chromatin signatures. Specifically, the active promoters are characterized by tri-methylation on Lys4 in H3 (H3K4me3), while the active enhancers are associated with mono methylation of this residue and a much reduced or non-existent signal of the tri-methylation . Currently, it is not yet clear what mechanisms underlie the different chromatin signatures at these two classes of cis-regulatory sequences, but the characteristic chromatin signatures of regulatory elements provide a pragmatic way to systematically identify these elements in the genome without prior knowledge of the underlying sequences. Compared with the other methods, there are several advantages of this chromatin-based approach. First, it requires no prior knowledge of the sequence features of the promoters or enhancers; Second, the chromatin modification profiles could be obtained for most organisms as the existing antibodies can specifically recognize the characteristic histone modifications in different species. Third, this approach does not make the assumption that promoters or enhancers are evolutionarily conserved, thereby can identify fast evolving regulatory elements in the genome.
Distinct chromatin signatures at promoter and enhancers have been explored by Heintzman et al to map promoters and enhancers. In their study, ChIP-chip analysis using high-resolution tiling array was performed to localize the core histone H3 (referred as H3) and monitor the status of five histone modification marks, i.e. H4 acetylation (H4ac), H3 acetylation (H3ac), mono-, di- and tri-methylation of Lys4 in H3(H3K4me1, H3K4me2 and H3K4me3) in HeLa cells before and after treatment with interferon-gamma (IFNγ). In addition, binding sites for components of transcription machinery (RNAPII and TAF1) and p300 (a transcriptional co-activator) were identified to locate active promoters and enhancers, respectively. Using these functional sites, Heintzman et al determined characteristic chromatin modification profiles at the promoters and enhancers – promoters have both H3K4me1 and H3K4me3 marks in contrast to the prominent H3K4me1 presence at enhancers with much reduced H3K4me3 signal. Using the average profiles of the promoters and enhancers as templates, Heintzman et al. identified additional genomic regions sharing similar profiles and confirmed that many of the predictions indeed correspond to promoters and enhancers. Figure 1 shows the averaged profile of the histone profiles they studied. By comparing the prediction performance of all possible combinations of the six histone marks using cross-validation, they concluded that the combination of H3K4me1 and H3K4me3 best discriminated promoters from enhancers.
Figure 1. Histone modification patterns of promoters and enhancers in untreated HeLa cells. This figure is re-generated from Heintzman et al . All signals of six histone marks are drawn centered on TSSs and p300 binding peaks. Average signal of histone marks of TSS and enhancers are drawn in black.
In spite of the success of this profile-based method in predicting promoters and enhancers, it is limited in two aspects. First, the optimal performance of the method involves only two histone modification marks, therefore the prediction accuracy was sensitive to the noise of measurements of these two marks. The contribution of other chromatin modifications marks to the classification method and the interdependency of the histone marks were not considered. Second, the window size of histone modification patterns (10 kb) was chosen in an arbitrary way. The larger the window size, the smaller the portion of the central regions with the strongest signal intensity. Thus, the profiles built for the promoter/enhancer may not be optimal. Figure 2(A) shows examples of histone modification patterns and annotated genes in human chromosome 1. The TSSs of these genes are well aligned with strong histone patterns of promoters. The profile-based method by Heintzman et al. correctly identified the promoter near chr:148185131 but not the one near chr1:148158254 because of the relatively weak H3K4me3 signal. An enhancer was also identified close to chr1:148158254 because of weak H3K4me3. In Figure 2(B) a DHS region and a p300 binding peak overlap at chr6:132486009, showing strong evidence of enhancer. Since the H3K4me3 signal is relatively stronger than a typical enhancer profile which almost has no signal, the profile based method missed this site. Another example near chr8:119170000 shows weak pattern of H3K4me3, which misleads the prediction (Figure 2(C)).
Figure 2. Examples of histone modification patterns in promoters and enhancers. (A) Promoter prediction using chromatin signature. TSS near chr1:148185131 shows a typical histone modification pattern for promoter while H3K4me3 has a relatively weak signal for the TSS near chr1:148158254. The predictions made by the profile based method of Heintzman et al. are labeled in green and the predictions made by the HMM developed in this study are in red. (B) Enhancer prediction using chromatin signature. A p300 binding site is shown at chr6:132486009 and overlaps with a DHS site, which is a strong evidence to support an enhancer site. (C) Enhancer prediction using chromatin signature. A DHS site near chr8:119170000 overlaps with a weak H3K4me3 signal but is not found as an enhancer by the profile-based method.
To overcome the above limitations, we developed a method coupling HMM with simulated annealing (SA)  (a HMM-SA procedure) to identify promoters and enhancers based on chromatin signatures. The HMM is capable of extracting more information from the chromatin modification profile signals, is less sensitive to the measurement noise of an individual histone mark, and can automatically select the most informative combination of histone marks as well as the optimal window size. In each run of SA we trained HMMs[17,18] using the 105 promoters and 73 enhancers determined by the ChIP-chip experiments on RNAPII, TAF1 and p300. Inside each HMM, the histone patterns are regarded as continuous observation densities emitted from the HMM states. The number of histone patterns is the input dimension of the HMM. The optimal combination and window size of histone modifications to discriminate promoters from enhancers were searched using the HMM-SA procedure. We then used the trained HMMs to predict promoters and enhancers in the entire ENCODE regions. Below, we describe this method and the results comparing the performance of our new method with the previous method. We also demonstrated that including more histone marks can further boost the performance of our method, which is also distinct from the profile-based method. In addition, we showed the usefulness of our method on predicting the activity of promoters in the mouse genome using histone modification data generated by ChIP-Seq .
Results and discussion
Find the most informative combination and the optimal window size of histone modifications
To characterize chromatin signature of promoters and enhancers, one needs to define the histone modifications that can discriminate different regulatory elements. Since the chromatin signals from ChIP-chip analysis typically span thousands of base pairs, a small window may not fully capture the chromatin signature while a large window may include non-informative regions to deteriorate the prediction accuracy. Therefore, an optimal window size is critical in predicting promoters and enhancers using histone modification patterns. To find the most informative combination and the optimal window size, we coupled the hidden Markov model (HMM) with simulated annealing (SA)  (see Methods).
To compare with the profile-based method, we considered the 105 promoters and 73 enhancers determined by the ChIP-chip experiments on RNAPII, TAF1 and p300 in the Heintzman et al. study. The datasets were divided into two equal sets, one for training and one for evaluation. The HMM-SA procedure started with a random combination of histone marks and a random window size chosen from 1, 2, 4, 6, 8, 10 and 12 kb centered on the TSSs or p300 peaks. We have conducted 100 independent simulations and collected all the final outputs of the combinations of histone marks and the window size.
We found the window size of 2 kb to be the optimal window size in 75 out of 100 simulated annealing runs. As shown in Figure 1 the strongest and the most informative signals are close to the center but 1 kb-window may be too small to capture the characteristic patterns. We also examined the occurrence of histone modification combinations in the 100 runs and compared their prediction accuracy on the evaluation set of 53 promoters and 37 enhancers (Table 1). The combination of all six marks was selected by the HMM-SA procedure 43 times, which is much higher than the other combinations. This observation is not totally unexpected because more information is included when including more histone marks. We also observed that the prediction accuracies for different combinations of multiple histone modifications are comparable with that of all six marks, which may be due to the small data set we have and the dependency of HMMs on the initial conditions. The combination of all six histone marks was chosen most often because it has higher chance to get better result in the SA test and insensitive to the choice of initial conditions. For a larger dataset for training and evaluation, the differences between the prediction accuracies of different histone mark combinations are expected to be more significant.
Table 1. The results of 100 HMM-SA runs.
Among the pair combinations, the H3K4me1 and H3K4me2 pair was chosen six times while the combination of H3K4me1 and H3K4me3 was not found by HMM-SA. Examining the histone modification patterns (Figure 1), it is obvious that H3K4me2 is more informative to locate enhancers than H3K4me3 because H3K4me2 shows stronger signal around p300 binding sites than H3K4me3. Since we did not just classify promoter against enhancer but rather we predicted promoter/enhancer against background, H3K4me2 was selected more often than H3K4me3. We next examined which single histone modification is the most informative by simply counting how many times a histone mark was included in the final combination (Table 2). Consistent with the above observation, H3K4me1 and H3K4me2 turned out to occur most often (99 and 97 times, respectively). This is not surprising because on average these two marks have the strongest signal among all six histone modifications (Figure 1). H3 was included 76 times even though its signal is relatively weak, which is surprising at the first glance. Further examination of H3 signals showed that they are consistent and well aligned (Figure 1), which makes it informative in the sense to help locate the center of promoter/enhancer.
Table 2. Occurrence of each histone modification in the most informative combinations found by the 100 HMM-SA runs.
Cross validation shows that HMM method predicts enhancers more accurately than the profile-based method
Using the optimal combination and the window size the HMM-SA procedure found, we conducted five-fold cross-validation tests to compare the performance of the proposed method with the profile-based method. There were 105 promoter and 73 enhancer profiles in our analysis (see Methods). We used three hidden states to train the HMM for promoters and enhancers separately (see Methods). In total, we performed 30 independent cross-validation tests and the averaged results are shown in Table 3. The HMM and the profiles-based method have a comparable accuracy on predicting promoters (positive predictive value (PPV = TP/(TP+FP)) = 97.87 ± 1.06% versus 96% using all six histone marks). In contrast, significant improvement over the profile-based method was indeed observed on enhancer prediction: 93.52 ± 1.83% using all the six histone marks by the HMM versus 78% using all six marks or 85% using H3K4me1 and H3K4me3 by the profile-based method. It is not surprising that the HMM using all six marks outperforms the profile-based method using only H3K4me1 and H3K4me3 because more information is included by using more marks. However, as shown by Heintzman et al., the profile-based method achieved the best performance using two but not all the six marks. This may explain that the HMM can capture the characteristic pattern better than using profile, particularly for enhancers that have relatively weaker signals than promoters. This is further supported by the observation that the HMM using only H3K4me1 and H3K4me3 still achieved much higher prediction accuracy on enhancers than the profile-based method (PPV = 94.06 ± 0.89% in Table 3).
Table 3. Comparison of cross-validation results for predicting promoters and enhancers.
Analysis on the trained HMM
After we validated the HMM model using cross-validations, we further examined the probability density distribution of each state in the HMM. The 3-state HMMs with no backward transition were trained on promoters and enhancers separately. This type of structures without backward transitions has been widely used in speech recognition to capture the pattern of speech . The second state usually corresponds to the location of TSS or p300 peaks in this configuration. The first and the third states correspond to the upstream and downstream profiles of chromatin modifications, respectively. Figure S1 (A,B,C,D,E,F) (see Additional files 1, 2, 3, 4, 5, 6) shows the probability density of Gaussian mixtures for the three states of every chromatin mark on promoters and enhancers. It is obvious that promoters and enhances present differences in distributions of probability density, which reflects the chromatin modification patterns in these regions. For example, the probability density of H3K4me3 showed peaks in the high ChIP-chip ratio regions for promoters compared to peaks in the low ratio regions for enhancers (Figure S1(E), see Additional file 5). In addition, examining the probability density distribution of the three states in the promoters suggested that the HMM model also captured the characteristics of the chromatin modification profiles. The probability density functions of the third state for H3K4me3 were peaked around 2.5 of ChIP-chip log ratio. The second state and the first state peaked at low ChIP-chip ratios with lower probability. This is indeed a bimodal pattern with higher ChIP-chip ratios on the downstream, which is consistent with the finding in the previous study .
Additional file 1. Figure S1. Probability density of Gaussian mixtures for the three HMM states trained on promoter and enhancer for each chromatin marker. The x-axis is the log ratio of ChIP-chip intensity. The black curve is the mixture of 3 Gaussian (red curves represent individual Gaussians. (A) H4ac. Analysis on the trained HMM for H4ac.
Format: EPS Size: 30KB Download file
Format: EPS Size: 30KB Download file
Format: EPS Size: 32KB Download file
Format: EPS Size: 31KB Download file
Format: EPS Size: 30KB Download file
Format: EPS Size: 29KB Download file
Promoter prediction in the ENCODE regions
We then applied our model to predicting promoters in the entire ENCODE regions of HeLa cells before and after treatment with IFNγ. We examined the classification performance of the HMM classifier by counting how many promoter predictions were supported by the annotated TSSs , CAGE tags  and active promoters . We compared the results with the promoter predictions of Heintzman et al. They reported 198 and 208 TSS predictions in the untreated and IFNγ treated HeLa cells in the ENCODE regions . Figure 3 plots the number of predictions supported by the annotated TSSs against the total number of predictions by varying the cutoff c1 from 0.4 to 3.5 (see Methods). We observed the same number of promoter predictions as the profile-based method at the cutoff (c1) of 2.205 in the untreated HeLa cells, and at c1 = 2.1 in the treated cells (Table 4). We found that 77% and 73% of predictions by the HMM and the profile based method were common for the untreated and treated cell, respectively (Table 4). The HMM method (PPV = 189/198 = 95.45%) did outperform the profile-based method (PPV = 181/198 = 91.41%). The HMM method predicted more annotated TSSs when the PPV of the two methods were similar to each other: in the untreated cell PPV = 234/256 = 91.41% and in the treated cells PPV = 247/279 = 88.50%. When we further increased the number of prediction the total number of correct predictions reached to around 270 TSSs using the HMM method, while the maximum number of the correct prediction of the profile method was 190 correct predictions (Figure 3).
Table 4. Comparison of PPV = TP/(TP+FP) in promoter predictions using the annotated TSS sites.
Figure 3. True positives (TPs) versus the total number of promoter predictions (a) in the untreated and (b) in the treated HeLa cells. The TF was calculated at different cutoff values of the log-odds (see Methods). Ideal predictors are shown with a black line.
CAGE tags have been generated to map promoters and we investigated if the predicted TSSs were supported by CAGE tags. Ideally, only one CAGE tag is needed to map a promoter. But due to the noise of generating the tags, confident promoter are usually supported by the overlapping with multiple tags. The larger the number of CAGE tags overlap with the predicted promoters, the more confident the predictions are. We counted the number of predicted promoters supported by at least 5, 10, and 15 CAGE tags. We predicted 198 and 208 promoters in the untreated and the treated cells, respectively, for both the HMM and the profile-based method (Figure 4). When the CAGE tag cutoff was 5, the HMM method found 192 (PPV = 96.97%) and 201 (PPV = 96.63%) promoters supported by CAGE tags in the untreated and the treated cell, respectively, compared with 184 (92.9%) and 180 (86.5%) supported predictions by the profile-based method, respectively. It is not surprising that the number of the supported promoters decreased when we increased the minimum number of overlapping CAGE tags. We found that the performance improvement of the HMM over the profile-based method was more significant in the treated cells. Considering both methods were trained using the untreated data, it suggests that the HMM method is more robust than the previous method.
Figure 4. Promoter predictions supported by CAGE tags. We compared the number of predicted promoters supported by CAGE tags when changing the minimum number of CAGE tags found within 2.5 kb from the predicted TSSs. We compared the results when the HMM method made same number of predictions as the profile-based method (untreated cells: 198 predicted sites, treated cells: 208 predicted sites).
We next compared the performance of the two methods on predicting active promoters. Gene expression measurements showed that there were 177 active and 155 inactive promoters in the untreated HeLa cells, and 181 active and 151 inactive promoters in the treated cells. The profile based method detected 127 active and 31 inactive promoters in the untreated cells (Table 5). The two methods correctly predicted similar number of active promoters (127 and 128) when the number of predictions was around 200. While increasing the number of predictions, the profile based method did not but the HMM method did make more correct predictions of active promoters in both treated and untreated cells (Table 5).
Table 5. Comparison of active promoter predictions.
We investigated the sensitivity and specificity of each method (for the definition see Methods). Figure 5 shows the receiver operator characteristic (ROC) curves of the HMM and the profile-based methods in the untreated and treated cells. Both methods achieved good performance with high sensitivity and specificity but the performance improvement of the HMM method is prominent.
Figure 5. ROC curves of the HMM and profile-based methods in the untreated and treated cells.
Enhancer prediction in the ENCODE regions
We also used the trained HMM to predict enhancers in the ENCODE regions. We found 319 (82.01%) and 243 (75.00%) common enhancers predicted in the untreated (389 predictions) and treated cells (324 predictions), respectively, by the HMM method and the profile-based method with the same number of total predictions. To compare the performances of the two methods, we checked how many of them were supported (within 2.5 kb) by nearby p300 and TRAP200 binding sites as well as DNase hypersensitivity sites (DHSs). p300 is a transcriptional co-activator[22,23]. TRAP220 is a component of the Mediator complex[22,23] that have been shown to bind to enhancers as well as promoters. DHSs are nucleosome free regions that are often occupied by enhancers . We only considered p300, TRAP220 and DHS sites that are distal (> 2.5 kb) from any TSS to avoid confusion with promoters.
In the untreated HeLa cells, all three sites have been mapped in the ENCODE regions. We calculated sensitivity = TP/(TP+FN) of the two methods (Table 6). More predictions by the HMM method were supported by any and all of the three lines of evidences. In total, 213 out of 389 predictions by the HMM method were supported (PPV = 54.76%) by any of the three evidences, while the profile-based method made 206/389 = 52.96% supported predictions (Table 6 and Figure 6b). We should point out that there may exist true enhancers among the predicted ones by HMM but not supported by the p300, DHS or TRAP220 data. This may explain the relatively smaller improvement of our method over the profile-based method on enhancer predictions than on promoter predictions.
Figure 6. Comparison of (a) promoter and (b) enhancer prediction. The prediction results using 10 histone marks are compared with those using 6 histone marks.
Table 6. Comparison of enhancer predictions in the untreated Hela cells.
In the treated cell, only p300 binding data was available and it was used to evaluate the predictions of the two methods. While Heintzman et al. had 104 out of 318 predictions overlapping with p300 sites (sensitivity = 104/147 = 70.75%, PPV = 32.70%), the HMM method found 109, out of 288, p300-supported predictions (sensitivity = 76.22%, PPV = 37.85%). Again, the HMM method outperformed the profile-based method in this test set.
Including additional histone marks can further improve the performance of the HMM method
Recently, Hon et al. conducted the same ChIP-chip experiments on more histone modification marks, H3K9Ac, H3K18Ac, H3K27Me3 and H3K27Ac, in the ENCODE regions. A robust method should achieve better performance when including additional data. We applied the HMM method to this larger dataset and evaluated its performance as above. After training the HMM predictor using all ten histone marks and a window size of 2 kb, same as in the six histone mark dataset, we predicted promoters and enhancers in the ENCODE regions. We observed a significant improvement in the promoter predictions (Figure 6a). The HMM method using 10 histone marks was quite close to the ideal line even when other methods reached plateau. For example, the HMM made 291, out of 341, correct predictions (PPV = 291/341 = 85.34%) using 10 histone marks and only 264 out of 337 correct predictions using 6 histone makers. Such improvement became more significant when more predictions were made.
The performance of the HMM method on enhancer prediction was also improved using more histone marks (Figure 6b). For example, 232 enhancers were correctly predicted (PPV = 54.46%) using the 10 histone marks, compared with 226 correct predictions (53.05%) among the same number (426) of the total predictions using the 6 histone marks. The improvement was not as significant as in the case of the promoters. It is possibly because the evidences of true enhancers (p300/TRAP200 binding and DHS sites) are not as direct as those for the promoters (the annotated TSSs were determined using full length cDNA).
Prediction of active and inactive promoters using genome-wide ChIP-Seq data
Compared with ChIP-chip, ChIP-Seq is more costly effective and probably also less noisy on mapping chromatin modifications at the genome-wide scale. We investigated how well our method works with the ChIP-Seq data generated by Mikkelsen et al. in the three mouse cell lines: embryonic stem (ES) cells, neural progenitor cells (NPCs) and embryonic fibroblasts (MEFs). We first compared the patterns of the four histone marks, H3K4me3, H9K4me3, H3K27me3, and H3K36me3, around TSSs because these four marks were measured in all the three cell lines. We assigned each promoter to one of the four groups based on the gene expression level measured in the same study. We averaged the sequencing read counts of each group around TSS. The active and inactive promoters exhibit distinct patterns of all but H3K9me3 marks (Figure S2, see Additional file 7). Strong signals of H3K4me3 in the active promoters and H3K27me3 in the inactive promoters are consistent with their known functions. H3K36me3, a mark for transcriptional elongation, shows a quite spread out pattern around TSS.
Format: EPS Size: 60KB Download file
Next, we trained two HMMs on 200 active and 200 inactive promoters randomly selected from the ES cell and predicted the promoters in all three cell lines. Because the active promoters contain stronger chromatin modification signals (more sequencing reads) than the inactive promoters, our method predicted more active promoters than inactive ones. The majority of the predicted promoters were within 2.5 kb of the annotated RefSeq TSSs (Table 7): > 81% for active promoters and > 66% for inactive promoters. For the predictions located more than 2b from the annotated TSSs, these sites can be unannotated promoters or false positives. We then assessed the prediction accuracy of our model using gene expression. Among the genes that could be unambiguously called active or inactive, our method correctly predicted the activity of the majority of the promoters. Considering the scale of our predictions, the PPVs of both active and inactive promoter predictions (expression supported) are satisfactory: > 88% and ≥ 74%, respectively, for the two classes.
Table 7. Predicted active and inactive promoters in the mouse genome.
We present here an HMM method to predict promoters and enhancers using their characteristic histone modification patterns. We used a HMM-SA procedure to automatically select the most informative and the optimal window size of histone modifications. We showed that the more histone marks are considered, the better the performance of the HMM can achieve. We compared the HMM method with the best prediction results using the profile-based method in the Heintzman et al. study. The cross-validation test showed that the HMM method performed better than the profile-based method, especially in the enhancer classification (Table 3). This observation suggests that the HMM method has a better capability to learn complicated patterns particularly for the weak signals around enhancers. Because correct identification of distal enhancers is critical in deciphering transcriptional regulation, this feature of HMM gives it an edge over the profile-based method.
We also found that the window size of 2 kb gave the best balance between inclusion of sufficiently strong signals and exclusion of non-informative ones that undermine the prediction accuracy. However, the improvement of using a 2 kb window instead of 10 kb was rather small compared to the use of HMM (Table 1). It suggests that the improvement in classification is mainly from the HMM's ability to capture the characteristic patterns of histone modifications for multiple marks.
We demons trated that the HMM method outperforms the previously developed profile-based method on predicting promoters and enhancers using chromatin signatures, particularly on the independent test dataset in the HeLa cells treated with IFNγ. The profile-based method performed well with small number of predictions. It reached the maximum true positives (TPs) when the number of promoter predictions was about 230 (Table 4 and Figure 6). Beyond 230, TPs almost do not increase with the number of predictions. In contrast, the HMM method keeps making correct predictions and it outperformed the profile-based method even more significantly (Figure 6). The improvement in enhancer prediction is not significant (Figure 6), which may be due to the limited knowledge of enhancer positions in the genome. We only evaluated the prediction accuracy using the DHS and the binding sites of p300 and TRAP220 that may miss many enhancers.
The HMM method is also less sensitive to noise in individual histone modifications. As shown in Figure 2(A) the profile method failed to find a TSS where H3K3me3 signal is weak. The HMM method predicted this TSS by using all the histone marks. In Figure 2(B) the HMM method predicted an enhancer that is supported by both p300 and DHS sites. Weak signal of H3K4me3 may cause the failure of the profile based method of identifying this site. An opposite example is shown in Figure 2(C) where a relatively stronger H3K4me3 signal than typical enhancers prevents identification of DHS site to be enhancers by the profile-based method while the HMM method was not affected.
In the present work, we did not further distinguish sub-clusters of promoters and enhancers as in the study of Heintzman et al. to avoid overfitting. It is very likely that promoter and enhancers may have distinct histone modification patterns depending upon their functional state (active, repressed or poised) . As histone modification data are becoming available on more histone marks and on the entire human genome , it is possible to train separate or refined HMMs for promoter/enhancer in different functional states, which should further improve the performance of our model.
We also demonstrated the success of our approach on analyzing ChIP-Seq data. By including chromatin marks that are characteristics of transcription, our method could successfully predicted the activities of promoters. If annotated enhancers are available for training the HMMs, it is straightforward to extend our predictions to enhancers. With the fast accumulation of chromatin modification data, we believe that our method will provide a useful tool in systematically mapping regulatory elements.
The histone modification data were obtained from the Heintzman et al. study . The averaged profile and individual histone marks are shown in Figure 1, comparing the histone patterns on promoter and enhancer. We followed their smoothing procedure. Data were grouped into 100 bp bins and the values of probes within each bin were averaged, e.g. a histone pattern of 2 Kb consists of 20 bins. The regions not covered by probes were linearly interpolated if the size of the uncovered region is less than 1000 bp. Heintzman et al. studied histone modifications in both untreated HeLa cells and HeLa cells treated with IFNγ. To design a classifier, HMMs were trained on promoters, enhancers and background, respectively. Previous studies demonstrated that p300 and related acetyltransferases are present at enhancers and promoters. Heintzman et al. determined 124 and 182 p300 binding sites in the untreated and treated HeLa cells, respectively. We used 74 p300 binding sites in the untreated cells after removing those within 2.5 kb of the known 5' ends of genes. These sites were enriched with DNaseI hypersensitive sites (69.7%) and over 60% of them were conserved across species . These evidences strongly support that distal p300 binding sites represent a subset of enhancers. Heintzman et al. used 106 active promoters in the untreated cells that were centered at annotated RefSeq TSSs as their training data for promoters . In the current study, one promoter and one enhancer were deleted from the training set used by Heintzman et al. because they included many unprobed regions.
While Heintzman et al. only tested on the window size of 10 kb centered on TSS and p300 binding sites, we tested various window size. The candidate window sizes of histone marks for the HMM-SA procedure were 1, 2, 4, 6, 8, 10, and 12 kb. Once the optimal window size 2 kb was selected by HMM-SA, all the training dataset of 105 promoters and 73 enhancers were used to train HMMs to predict promoters and enhancers in the ENCODE regions. The histone patterns in the cell treated with IFNγ were used as an independent test set.
The HMM classifier
We designed an HMM(Θ) with left-right structure to represent the histone modification patterns. Left-right structure has been widely used in speech recognition to capture signal pattern, which serves well for our purpose of capturing histone modification patterns. The HMM has Q states (Figure 7). An HMM state emits a signal according to a probability density function of mixture Gaussian of N dimension. Here N is the number of histone modification patterns under consideration. The probability density function of the mixture Gaussian is
Figure 7. The HMM Classifier. (a) A left-right HMM with Q states. Each state has a transition to itself and outgoing transitions toward higher states behind. Once a state is left it never comes back in a left-right model. (b) Three HMMs are trained separately for promoter, enhancer and background. Log-odds are calculated to classify a genomic region (see Methods).
where x is the vector being modeled and cjm is the mixture coefficient for the mth Gaussian in state j; G [x, μjm, Ujm] represents the Gaussian function with mean vector μjm and covariance matrix Ujm. The forward and backward algorithm was used to estimate the transition probabilities and the mixture coefficients in each state. We trained three HMMs for promoters, enhancers and background separately. We set Q = (number of bins)/k to change the number of states depending on the length of data (we set k = 8) and the minimum Q was set to 3. The background HMM was designed to have the minimum number of states (Q = 3). Each state is composed of 3 mixtures of Gaussian components (M = 3) to capture the complex histone modification patterns. Models with larger m did not improve the prediction performance (data not shown).
For a given genomic region, a log likelihood score was calculated using the three HMMs for promoter, enhancer and background. Two log-odd scores were calculated as the following.
The log-odd score reflects how strong a signal is compared to the background. If the log-odd is below a cutoff (c1, c2), it is regarded as a background signal. The number of prediction depends on these cut-off values. We plotted Figure 3 and 6 while changing the cut-off values.
When we scan the ENCODE regions, we smoothed results by averaging adjacent 3 log-odds and took peaks of the log-odds of promoter over enhancer. This smoothing procedure reduced fluctuations of log-odds along the chromosome, especially at the boundaries of the unprobed regions. If multiple predictions were made within 1.5 Kbp, only the prediction with the highest log-odds was kept. If a promoter and an enhancer were predicted within 1.5 Kb, we only kept the prediction with the higher log-odd. We examined the percentage of promoters and enhancers being correctly predicted while varying the cutoff values c1 and c2 (Figure 3, Figure 6 and Table 4). Using six histone marks we observed the same number of prediction of the HMM predictor as the profile-based method when c1 = 2.205 (untreated) and c1 = 2.1 (treated). We used c2 = 0.25 (untreated) and c2 = 0.0 (treated) to compare the prediction result of the enhancer (Table 6).
Search for the most informative histone modification combination
Automatic search for the most informative combination of histone modification is a typical feature selection problem. We took an approach that couples HMM with simulated annealing (SA) to find the optimal combination. SA  is a generic approach for global optimization problems. Incorporating a temperature parameter into the optimization procedure, it explores broad parameter space at high temperatures and restricts exploration at lower temperatures. The simulated annealing updates were made based on the Metropolis criterion. The Metropolis criterion makes a change from Ecurrent to E with a probability of
To adapt the SA method to our model, we hybridized HMMs with SA. Initially, SA randomly selected a candidate combination of histone modifications. Also, a window size was randomly selected among 1, 2, 4, 6, 8, 10, 12 Kbp. An HMM was trained with the candidate combination and evaluated by Ecurrent. Ecurrent is defined as:
The combination is accepted with the probability given in equation (3). Ecurrent is always accepted if Ecurrent>E; otherwise, it is accepted with a probability that generally decreases as the temperature (T) decreases. The next move is made by randomly adding or removing one or two histone patterns and increasing or decreasing one 2 Kb of the window size. This procedure is repeated while decreasing the temperature T. In the simulation we used
In the HMM-SA procedure, the 105 promoter and 73 enhancers in the training dataset was divided into training and evaluation sets, half of them were used to train the HMMs and the other half to calculate Ecurrent. The training set (52 promoters and 36 enhancers) and the test set are fixed for each run. We set the maximum number of iterations to be 200 to give SA enough burning period. In fact, most simulations were converged in less than 100 iterations. We recorded the results for 30 independent simulations.
We validated the prediction results in the ENCODE regions by calculating how many predicted promoters are supported by annotated TSSs in RefSeq. The adjacent 3 log-odds (1, 2) are averaged. If multiple peaks of promoters or enhancers are found within 1.5 kb, only the highest log-odd is selected. A prediction was considered as correct if the predicted center is within D = 2.5 kb to the closest annotated TSS of a gene. When multiple predicted sites are supported by the same TSS or any enhancer evidence, we merge these predictions. However, when multiple predicted sites are not within the distance, we counted all of them as FPs. The total number of the predicted promoters in Heintzman et al was 208. Since two promoters are referred to the same gene, we treated these two promoters as one and thus the total number of predictions becomes 207. We defined PPV = TP/(TP+FP).
To compare the performance of the two methods, we plotted ROC curves for promoter predictions in both untreated and treated cells (Figure 5). We defined FN as the number of active promoters that were missed in our predictions. It is not very straightforward to define true negatives. We chose to divide the entire ENCODE regions into 2.5 kb-long non-overlapping segments. There were 9928 segments in which no annotated TSSs were found within ± 2.5 kb. We defined TN as the number of segments that did not contain any predicted promoters. The sensitivity and the specificity were given as TP/(TP+FN) and TN/(FP+TN), respectively.
ChIP-Seq data in the three mouse cell lines
Mikkelsen et al. generated the genome-wide mapping of chromatin modifications in three mouse cell lines: embryonic stem (ES) cells, neural progenitor cells (NPCs) and embryonic fibroblasts (MEFs) . Four chromatin marks, H3K4me3, H9K4me3, H3K27me3, and H3K36me3, were measured in all these cell lines. We trained a HMM classifier using the chromatin modification patterns around TSS in the ES cells and tested it in all three cell lines. Based on the gene expression measured by Mikkelsen et al., we randomly selected 200 active and 200 inactive promoters in the ES cells as the training set. Because there were only four chromatin marks, we used all of them in the HMM model. Similar to analysis of ChIP-chip data, we first used a 2 Kb window to locate TSSs in the genome (see above). Considering the spread out pattern of H3K36me3 that distinguishes active from inactive promoters (Figure S2, see Additional file 7), we next used a 10 Kb window to classify the predicted promoters into active or inactive category. A background HMM was trained using the sequencing reads mapped to chromosome 1.
We evaluated the classification performance of our method using gene expression and RefSeq annotation on predictions that could be unambiguously assigned to a gene, namely located 2.5 Kb within an annotated TSS. Mikkelsen et. al conducted replicate measurements of gene expression in the same cell lines (GEO accession number is GSE8024). There were 13482 unique genes in their experiments. The numbers of active and inactive genes in each cell line were counted using the majority rule in the replicate experiments and the genes with marginal expression levels or conflicting calls were excluded (Table 7).
HMM: Hidden Markov Model; TSS: Transcription Start Site; SA: Simulated Annealing; PPV: Positive Predictive Value; TP: True Positive; FP: False Positive; FN: False Negative; TN: True Negative; DHS: DNaseI hypersensitive Site; ROC: Receiver Operator Characteristics; TF: Transcription Factor
The authors declare that they have no competing interests.
KJW implemented the algorithms, performed all tests, and made all images. IC implemented HMM algorithm. BR and WW conceived of the algorithm and participated in its design and coordination. KJW, IC, BR, and WW wrote the manuscript. All authors read and approved the final manuscript.
We are grateful to Gary Hon and Nathaniel Heintzman for insightful discussion. This work was supported in part by NIH (to WW).
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins.
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, Ruan Y, Snyder M: Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells.
Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, Coulombe B, Robert F: Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression.
Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, Wang W, Weng Z, Green RD, Crawford GE, Ren B: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome.
Proceedings of the IEEE 1989, 77:257-286. Publisher Full Text
Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, Forrest AR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F, Suzuki H, Grimmond SM, Wells CA, Orlando V, Wahlestedt C, Liu ET, Harbers M, Kawai J, Bajic VB, Hume DA, Hayashizaki Y: Genome-wide analysis of mammalian promoter architecture and evolution.
2007, in press.
2007, in press.