DNA methylation is an inheritable chemical modification of cytosine, and represents one of the most important epigenetic events. Computational prediction of the DNA methylation status can be employed to speed up the genome-wide methylation profiling, and to identify the key features that are correlated with various methylation patterns. Here, we develop CpGIMethPred, the support vector machine-based models to predict the methylation status of the CpG islands in the human genome under normal conditions. The features for prediction include those that have been previously demonstrated effective (CpG island specific attributes, DNA sequence composition patterns, DNA structure patterns, distribution patterns of conserved transcription factor binding sites and conserved elements, and histone methylation status) as well as those that have not been extensively explored but are likely to contribute additional information from a biological point of view (nucleosome positioning propensities, gene functions, and histone acetylation status). Statistical tests are performed to identify the features that are significantly correlated with the methylation status of the CpG islands, and principal component analysis is then performed to decorrelate the selected features. Data from the Human Epigenome Project (HEP) are used to train, validate and test the predictive models. Specifically, the models are trained and validated by using the DNA methylation data obtained in the CD4 lymphocytes, and are then tested for generalizability using the DNA methylation data obtained in the other 11 normal tissues and cell types. Our experiments have shown that (1) an eight-dimensional feature space that is selected via the principal component analysis and that combines all categories of information is effective for predicting the CpG island methylation status, (2) by incorporating the information regarding the nucleosome positioning, gene functions, and histone acetylation, the models can achieve higher specificity and accuracy than the existing models while maintaining a comparable sensitivity measure, (3) the histone modification (methylation and acetylation) information contributes significantly to the prediction, without which the performance of the models deteriorate, and, (4) the predictive models generalize well to different tissues and cell types. The developed program CpGIMethPred is freely available at http://users.ece.gatech.edu/~hzheng7/CGIMetPred.zip webcite.
Epigenetics refers to structural adaptation of chromosomal regions to register, signal or perpetuate altered activity states . A major type of epigenetic event is DNA methylation, which involves the addition of a methyl group to the number 5 carbon of the cytosine pyrimidine ring . In the human genome, is DNA methylation mostly restricted to the cytosines of CpG dinucleotides. Though the human genome generally shows a great deficit of CpG dinucleotides (the genome-wide observed-to-expected CpG ratio is ~0.2), and most of these CpG dinucleotides are methylated in somatic cells , the CpG dinucleotides are enriched around gene promoters and form CpG islands, and tend to be protected from DNA methylation . It has been shown that DNA methylation plays an instrumental roles during normal cell development and cell differentiation, and is also involved in a number of key processes including genetic imprinting, X-chromosome inactivation, suppression of retroviral elements, and carcinogenesis [5,6].
A variety of techniques, based on biochemical experiments and computational analysis, have been devised for DNA methylation profiling. The biochemical experiment-based approaches are mainly based on methylation-sensitive restriction, immunoprecipitation, or bisulfite conversion, combined with the next-generation sequencing technologies . Whereas, computational predictive models have been developed to identify CpG dinucleotides methylated or unmethylated [8,9], CpG islands (or CpG-rich regions) methylated or unmethylated [3,10-13], and CpG islands (or CpG-rich regions) differentially methylated in different tissue/cell types or phenotypes [4,14]. These computational approaches can effectively complement the biochemical-experiment based approaches to speed up genome-wide DNA methylation profiling and to identify critical factors or pathways controlling DNA methylation patterns.
A key step for building computational predictive models is to select features. Here we provide a brief review of the existing computational models based on their features for prediction. For the prediction of DNA methylation, the features can be roughly grouped into two broad categories: genetic and epigenetic. Given a region of interest (ROI, e.g., a CpG island or a genomic region centered around a particular CpG dinucleotide), the genetic features include (1) general attributes of the ROI (e.g., length of the ROI, and distribution of the CpG dinucleotides in the ROI), (2) patterns of the DNA sequence composition of the ROI, (3) patterns of conserved transcription factor binding sites (TFBSs) or conserved elements within or near the ROI, (4) structural and physicochemical properties of the ROI, (5) functions of the genes within or near the ROI, (6) the extent of the diversity of the ROI within the population, and (7) the extent of the conservation of the ROI among species. And, the epigenetic features mainly regard the methylation and acetylation status of the histones.
Bhasin et al. used DNA composition features to predict the methylation of single cytosines. A 39-nucleotide long DNA fragment centered around the cytosine of interest was considered as the ROI, and each nucleotide in the ROI was coded by using a 5-bit binary sparse code. In this way, each ROI was represented by a series of codes, and the difference between ROIs was able to be quantified. A ~75% accuracy was reported using a support vector machine-based classifier . Lu et al. also used DNA composition features for predicting whether a CpG dinucleotide is methylated or not. A 1,000 nucleotide long DNA fragment centered around the CpG dinucleotide was used as the ROI, and the frequencies of all pentamer oligonucleotides formed the features. A ~77% accuracy was reported for the CD4 lymphocytes data set using a nearest neighbor-based classifier . Feltus et al. used frequencies of seven DNA patterns, TCCCCCNC, TTTCCTNC, TCCNCCNCCC, GGAGNAAG, GAGANAAG, GCCACCCC, and GAGGAGGNNG with N representing any base, and achieved an ~82% accuracy on the human fibroblast data set when distinguishing between methylation-prone and methylation-resistant CpG islands using a linear programming-based classifier .
In addition to DNA composition features, Fang et al. also used the distribution of the repetitive element AluY as well as the distribution of TFBSs for predicting the methylation status of CpG rich segments, and reported an ~84% specificity and ~84% sensitivity on the human brain data set using a support vector machine-based classifier . Bock et al. used DNA composition features, predicted DNA helix structure, attributes of repeat elements and TFBSs, evolutionary conservation of PhastCons elements  and the number of single nucleotide polymorphisms (SNPs) for the prediction of CpG island methylation [10,11], and their method achieved a high specificity (~98%) but a relatively low sensitivity (~67%) on human lymphocytes using a support vector machine-based classifier . Ali et al. also used the DNA composition information, predicted DNA structure, and SNP features, and reported a ~72% accuracy on the human lymphocytes data set using a K nearest neighbor-based classifier . To predict tissue-specific differentially methylated regions (DMRs), Previti et al. used CpG island specific attributes, attributes of repetitive elements, number and frequency of PhastCons elements, as well as structural and physicochemical properties. When classifying CpG islands into four categories: constitutively methylated, constitutively unmethylated, tissue-specific DMR, and lack of methylation exclusively in sperm, they reported an ~89% accuracy using a decision tree-based classifier .
Computational prediction models that are solely based on genetic features can hardly fully characterize DNA methylation status. This is because DNA methylation, as an epigenetic phenomenon, is affected by some other epigenetic factors, such as histone methylation and histone acetylation. In light of the reported interaction between histone modification enzymes and DNA methylases [16,17], Fan et al. found four histone methylation marks that are highly correlated with the DNA methylation status of CpG islands, and then incorporated these histone methylation marks into the prediction of the methylation status of CpG islands. Compared to those methods without histone methylation information [13,11], the augmented features indeed led to improved performance: a ~94% specificity and ~74% sensitivity on the CD4 T cell data set using a support vector machine-based classifier .
In this study, we consider various attributes that are possibly related to the CpG island methylation. These attributes include those that have been previously investigated (CpG island specific attributes, DNA sequence composition patterns, DNA structure patterns, distribution patterns of conserved TFBS's and conserved elements, and histone methylation status), and those that have not been extensively investigated but are potentially related to DNA methylation from biochemical perspectives (nucleosome positioning propensities, gene functions, and histone acetylation status). The contribution of each individual feature is evaluated by statistical tests; and the correlation between features is reduced by principal component analysis (PCA). These DNA methylation-relevant yet non-intercorrelated features are then used to build support vector machine (SVM)-based models to predict the methylation status of CpG islands. The predictive models are evaluated by using the HEP data set. Specifically, the CpG island methylation profiles in the CD4 lymphocytes are used to train and validate the models, while the CpG island methylation profiles in the other 11 tissues/cell types are used to test the generalizability of the models. Through these experiments, we assess the individual and combinational influence of the newly added features and the impact of histone modification information.
The rest of the paper is organized as follows. In Section 2, we describe the data collection used to train, validate and test the computational models. In Section 3, we discuss the methods for feature extraction, feature selection, and building the predictive models. The experimental results are reported in Section 4. And finally in Section 5 we draw conclusions.
We obtain the methylation profiles of the human genome from HEP. bisulfite DNA sequencing technique, and provides high-resolution data of the genome-wide DNA methylation patterns in various tissues and cell lines . It currently covers chromosomes 6, 20 and 22, and contains ~1.9 million CpG methylation values of 2,524 amplicons from 12 different tissues and 43 different samples. The methylation values of the CpGs range from 0 to 100 inclusive, where 0 corresponds to the lowest and 100 to the highest methylation intensity.
We define the CpG island as a DNA stretch that is not a repetitive element but satisfies the Gardiner-Garden criteria, i.e., with length of ≥ 200 bps, GC content ≥ 50%, and observed to expected CpG ratio ≥ 0.6 . We construct our training data set based on the CpG islands extracted from the UCSC genome browser and the DNA methylation profiles specified by HEP. Specifically, we only consider those CpG islands more than 10% of whose CpG dinucleotides are annotated with methylation intensities. For each tissue or cell type, the methylation intensity of a CpG dinucleotide is calculated as the average in different samples ; and the methylation intensity of a CpG island is calculated as the average of all the CpG dinucleotides within it. The CpG islands with methylation intensity ≥ 50 are regarded as the methylated (positive), while those with methylation intensity ≤ 10 are regarded as the unmethylated (negative) . The number of so-obtained methylated and unmethylated CpG islands are summarized in Table 1. In particular, there are 101 methylated and 368 unmethylated CpG islands for the CD4 lymphocytes, which are used for training and validating the predictive models, while the CpG islands in the other tissues or cell types are used for generalizability testing.
Table 1. Number of methylated and unmethylated CpG islands in the twelve different tissue and cell types based on the DNA methylation profiles of HEP.
The core of our establishment of the computational predictive models consists of three parts, feature extraction, feature selection and model training and testing, as depicted in Figure 1. We here describe these three steps in detail.
Figure 1. Workflow used for the prediction of the methylation status of CpG island in human genome. The CpG island map is obtained by applying the traditional Gardiner-Garden sequence criteria on non-repetitive sequences of the human genome. The core steps of our model development consist of three parts - feature extraction, feature selection and predictive modeling.
A key step for building computational predictive models is to select features. It has been shown that the CpG island methylation status is correlated with the following features: CpG island specific attributes (e.g. length, GC content, GC observed/expected ratio) [14,21,3], patterns of DNA sequence composition [4,21,10], patterns of predicted DNA structure [14,10], patterns of conserved TFBS's and conserved elements , as well as the methylation status of nearby histones . Computational prediction of CpG island methylation status based on the statistical properties of these features could render fairly reasonable accuracy (e.g., ~89% [4,13]). In this study we incorporate three more sets of attributes that have not been extensively explored, including (i) the nucleosome positioning propensities of the CpG island, (ii) the acetylation status of nearby histones, and (iii) the functional roles of nearby genes. In the following paragraphs, we describe how these features are extracted.
Three attributes, including the GC content, length and observed/expected CpG ratio, are directly obtained from UCSC human genome browser for each CpG island .
DNA sequence composition
We use the tetramer frequencies and their corresponding z-scores to characterize the DNA composition patterns of the CpG island. The z-score of a tetramer, Z(N1N2N3N4), depicts how much the observed frequency of the tetramer N1N2N3N4, O(N1N2N3N4), deviates from its expected frequency E(N1N2N3N4).
where E(N1N2N3N4) is approximated by using a maximal-order Markov model :
and the standard deviation σ(N1N2N3N4) is calculated based on the observed frequencies of dimers and trimers:
Altogether, we extract 512 features about DNA sequence composition, including 256 for tetramer frequencies and 256 for their z-scores.
Conserved TFBS's and conserved elements
The distribution patterns of the conserved TFBS's and conserved elements in the CpG island and the flanking regions are also taken into account. Here a conserved TFBS refers to one that is conserved in human, mouse and rat genomes ; and there are 258 such TFBS's that can roughly be grouped into 115 groups according to their function similarity . Also, a conserved element refers to a genomic segment (other than TFBS) that is conserved across vertebrate, insect, worm and yeast genomes . Each conserved TFBS or conserved element is characterized by a score quantifying its degree of conservativeness. We consider both the short- and long-range associations between these elements and CpG islands, and therefore select the flanking regions of various lengths (ranging from 100 bps to 2,000 bps with an increment of 100 bps) upstream and downstream of each CpG island. Given a CpG island (and its flanking region of a particular length), for each TFBS group (or conserved element), we count the number of TFBS's (or conserved elements) that overlap with this CpG island (and its flanking region) and the average score of these TFBS's (or conserved elements). Therefore, in terms of conserved TFBS's and conserved elements, each CpG island is characterized by 210 (115 × 2, for conserved TFBS's) plus two features (for conserved elements).
We focus on those basic characteristics that capture the DNA 3-D conformation and newly added nucleosome positioning propensities. The DNA conformation related features measure the twist, tilt, roll, shift, slide and rise propensities of dinucleotides . For each of these six features, the average value over all dinucleotides in the CpG island is used.
Due to an accumulating body of evidence showing that DNA methylation is influenced by nucleosome positioning propensities , we also investigate these features. Nucleosome positioning propensities of the CpG islands are estimated based on the genome-wide prediction of the nucleosome organization map . There are two types of predictions, one at the nucleotide level, and the other at the DNA fragment level. The nucleotide level prediction regards the probability of each nucleotide being covered by any nucleosome, based on which we calculate the mean and standard deviation over the entire CpG island. The fragment level prediction regards the nucleosome positioning potential of each 147 bp (typical length of a nucleosome) DNA fragment, based on which we calculate the mean and standard deviation over all fragments overlapping with the CpG island. Altogether, we extract four features regarding nucleosome positioning propensities.
Functional roles of nearby genes
Since DNA methylation is heavily involved in biological processes such as tumor suppressor gene silencing [28,29], we examine whether a CpG island's nearby genes are involved in any cancer-related biological processes. A CpG island's nearby genes refer to those whose promoter region (from the 1,000 bps upstream to the 200 bps downstream of the transcription start site) overlaps with the CpG island. 37 biological processes (30 oncogene related, 11 tumor suppressor related, and 4 common) are determined through gene ontology enrichment analysis of the genes retrieved from the Cancer Gene Census . If the gene ontology annotations of a gene include one or more of these processes, the corresponding gene function feature is 1 and 0 otherwise. We have two features for functional roles of nearby genes, one for oncogene related and the other for tumor suppressor gene related biological processes.
Histone methylation and acetylation
We consider the methylation status of each CpG island's nearby histones. The histone methylation information is obtained from Barkski et al's data set, which characterizes the genome wide distribution of 20 histone methylations as well as histone variant H2A.Z, RNA polymerase II, and the insulator binding protein CTCF in CD4 lymphocytes .
Since DNA methylation has also been observed to be associated with histone acetylation , we further include the histone acetylation features in the feature set. The histone acetylation information is obtained from Wang et al.'s data set , which characterizes the genome-wide patterns of 18 histone acetylations in CD4 lymphocytes.
In both data sets, a nucleotide is tagged if its nearby histone undertakes a methylation or acetylation modification; hence, the number of tags at each nucleotide can be interpreted as being proportional to the modification level of nearby histones. We use the average and standard deviation of the number of tags over all nucleotides of a CpG island to represent the methylation (or acetylation) level of the CpG island's nearby histones. Altogether, we have 46 features for histone methylation and 36 features for histone acetylation.
Altogether, we generate 841 features using the above procedure as summarized in Table 2. Compared to the size of our training data set (see Table 1), this dimension of the feature space is prohibitively high, which will potentially lead to classifier designs that are too expensive to implement or that cannot well generalize to unseen data. Therefore, we perform a two-step feature selection procedure, where the statistical test is used to select those features that are highly correlated with the methylation status of CpG islands, and PCA is used to minimize the redundancy in the features.
Table 2. Number of features in each category and information resource for the feature extraction.
Three statistical tests, Fisher's exact , Chi-squared  and Kolmogorov-Smirnov (KS) tests , are used to identify those features whose statistical patterns are significantly different between the positive and negative datasets. Specifically, the Fisher's exact tests are used for functional roles of nearby genes, for which the feature variable is categorical and some expected values in the contingency tables are extremely small (< 5). The Chi-squared tests are applied to categorical features, including the number of conserved TFBS's and conserved elements. And, the KS tests are applied to the numeric features, including CpG island general attributes, DNA sequence composition features (frequencies and z-scores), average scores of conserved TFBS's and conserved elements, structural properties, histone methylation and histone acetylation. A feature is selected if the p-value rendered by the statistical test is less than 0.05.
Although statistical tests may identify those features showing correlation with the CpG island methylation, the identified features might be inter-correlated themselves. For example, DNA sequence and structure properties are likely to be correlated, because most DNA structures are predicted based on DNA sequences. The histone methylation and acetylation status are likely to be correlated, because some acetylation and methylation (e.g. histone H3 at lysine 9) play opposite roles in gene activity . The correlation between features makes the feature space unnecessarily high-dimensional. To minimize the redundancy in the features, we perform the PCA on those methylation-related features that are selected via the above statistical tests. The PCA uses an orthogonal transformation to convert a set of values of possibly correlated dimensions into a set of values of uncorrelated dimensions called principal components . After PCA transformation, the feature components are completely decorrelated, and the information contained in the original feature space before the transformation is maximally retained in the first several number of components of the new feature space. Therefore, by keeping only the first several components of the new feature space, most of the information can still be retained while the redundancy in the feature collection is greatly removed and the dimensionality of the feature space is greatly reduced.
Model training, validation and testing
After feature selection through statistical tests and PCA, each CpG island is represented by a multi-dimensional feature vector that corresponds to the retained principal components. The feature is then fed to the models to predict the methylation status of the CpG island. To examine the contribution of the newly added features as well as the impact of the inhibitive-to-acquire histone modification information, we establish 16 models, (1) M1: with all information being incorporated, (2) M2: with all but the histone modification information being incorporated, (3) M3-M9: models with individual or combinations of the newly added features being excluded, and (4) M10-M16: models with individual or combinations of the newly added features as well as the histone methylation information being excluded. Each model is based on the SVM, and outputs binary results indicating whether the CpG islands are methylated or unmethylated and continuous results ranging from 0 (minimum) to 100 (maximum) indicating the methylation intensities of the CpG islands. Given the binary predictions provided by a model and the true methylation status as specified in the HEP data set for a group of CpG islands, we can estimate the specificity, sensitivity and accuracy of the model as in Eqns. (4)-(6):
where SP, SE, ACC stand for specificity, sensitivity and accuracy, respectively. And, given the continuous predictions and the true methylation intensities of the CpG islands, we can calculate their correlation coefficient as:
where CC stand for correlation coefficient, cov(·) denotes the covariance, and σ denotes the standard deviation. Note that the specificity reflects the model's capabilities in dealing with the negative (unmethylated) data - a high specificity measure implies that a predicted unmethylated CpG island is highly likely truly unmethylated. And the sensitivity reflects the models's capabilities in dealing with the positive (methylated) data - a high sensitivity measure implies that a predicted methylated CpG island is highly likely truly methylated. Whereas, the accuracy and correlation coefficient reflect the model's overall capabilities in dealing with all types of CpG islands - high accuracy and high (close to one) correlation coefficient implies that the predictions are highly likely true.
All these models are trained and validated by using the CD4 lymphocyte data with a 10-fold cross validation scheme. The 469 CpG islands are randomly partitioned into 10 approximately equally-sized folds. Each fold is used in turn for validation while the remaining folds are used for training. The performance of the model is assessed based on the data in the validation fold. This partition-training-and-validation procedure is repeated for 20 times, and the performance of the model (in terms of specificity, sensitivity, accuracy and correlation coefficient) is averaged over the 200 validation folds (10 validation folds per partition ×20 partitions).
Two predictive models built on the CD4 lymphocyte data, M1 (using all information) and M2 (using all but histone modification information), are also tested for generalizability using the data of the other 11 tissues and cell types. For generalizability testing on M1, we apply the histone modification information of the CD4 lymphocyte to the other 11 tissues and cell types because correlation analysis by ourselves and others has indicated that histone modifications exhibit modest to strong correlations for different cell lines [39,13]. The generalizability performance of the model is also measured in terms of specificity, sensitivity, accuracy and correlation coefficient, which are averaged over all the models constructed from all the above training/validation partitions.
Results and discussions
Statistical tests and PCA
Out of a total number of 841 features, 342 features are retained whose p-values in the statistical tests are less than 0.05. These features include two of the CpG island specific attributes, 217 DNA sequence compositional features, and eight DNA structural features, 35 features regarding the conserved TFBSs, two features regarding the conserved elements, two features regarding the functional roles of the neighboring genes, and 76 features related to the modification status of nearby histones. Particularly, among the newly added features, two out of the four nucleosome positioning features, all of the 36 histone acetylation features, and both of the features regarding the functional roles of the neighboring genes are retained after statistical tests.
PCA is performed on these 342 selected features to minimize their correlations. Table 3 summarizes the number of principal components that must be retained to keep a certain percentage of the variance of the original feature space. Observe that the first eight principal components together can account for the ~99.90% of the total variance and are therefore used to build the predictive models. Figure 2 depicts the contribution of each of the 342 original feature dimensions to the eight principal components. Observe from Figure 2 that each of the following categories of features, (i) the CpG island general attributes, (ii) DNA sequence composition, (iii) distribution of the conserved TFBS's and conserved elements, (iv) DNA structure patterns, (v) gene functions, (vi) histone methylation and acetylation status, makes substantial contributions to one or more principal components, suggesting that these categories of information, though correlated, are complementary to a certain extent for predicting the CpG island methylation.
Table 3. Number of principal components (PCs) required to retain a certain percentage (Pcnt) of the variance of the original feature space of the 342 features selected through statistical tests.
Figure 2. Contribution of the 342 features to the eight principal components. Each column corresponds to a principal component, and each row corresponds to an original feature dimension. All feature categories make substantial contributions to one or more principal components, suggesting that these categories of information, though correlated, are complementary to a certain extent for predicting the CpG island methylation.
Performance of the predictive models based on the CD4 lymphocyte data
The specificity, sensitivity, accuracy and correlation coefficient measures of our predictive model M1 that incorporates all information are summarized in Table 4. The performance of our classifier is compared to that of Fan et al.'s method (which is based on a similar set of features and represents the state of the art ). Note that both models have incorporated the histone modification information. Observe that our model shows an improved specificity and accuracy while maintaining a comparable sensitivity.
Table 4. Performance of our classifiers M1 on CD4 lymphocytes with comparison to the existing method.
We could argue that the improvement of our model M1 over the existing model is partly due to the incorporation of the three new types of features - nucleosome positioning propensities, gene functions, and histone acetylation status. The performance of our models M3 through M9, each with an individual or a combination of the new types of features being excluded, are summarized in Table 5. Observe that the performance of the predictive model deteriorate to different extents when individual or combinations of the newly added features are excluded. Specifically, the models without histone acetylation information (M3, M6, M7, and M9) deteriorate more than those models with histone acetylation information but without the other two types of newly added features (M4, M5, and M8). Therefore, histone acetylation appears to be the most influential feature to the performance of the predictive model among the newly added features.
Table 5. Performance of the predictive models (M3 through M16), each with an individual or a combination of the newly added categories of features being excluded.
We suspect that the information carried by the histone methylation features is too dominant to fairly assess the influence of these newly added features; and therefore exclude the histone methylation features and repeat the above experiments excluding individual or combinations of the newly added features. The resultant models are M10 through M16, and their performance is summarized in Table 5. Similarly, the models without an individual or a combination of the newly added features deteriorate. It is noteworthy that (1) the histone methylation and acetylation information greatly affect the sensitivity of the models, and (2) the loss of histone methylation information could largely be made up by including the histone acetylation information. This is not surprising, given that these two forms of histone modifications are closely related as repeatedly observed in various tissues and cell types .
The two predictive models, one with the histone modification information (M1) and the other without (M2), that are both built on the human CD4 lymphocyte data are then tested on the data of the other 11 tissue and cell types for their generalizability. The sensitivity, specificity, accuracy and correlation coefficient of M1 and M2 during these testing experiments are summarized in Tables 6 and 7.
Table 6. Performance of the classifier model and the influence of newly added features on the data of 11 different tissues and cell types: with histone modification.
Table 7. Performances of the classifier model and the influence of newly added features on the data of 11 different tissues and cell types: without histone modification.
When the histone modification information is incorporated, the classifier model built on the CD4 lymphocyte data can be applied to most of the other tissues and cell types (except for sperm) with little or no performance deterioration. When the histone modification information is not used, the performance of the predictive model on the data of the other tissues and cell types deteriorate substantially, especially in terms of the sensitivity. However, if compared to the validation results where the histone modification information is not used (see Table 3), the performance on the testing data is not unexpected. Therefore, with or without the histone modification information, the predictive model established on the CD4 lymphocyte data can well generalize to the other tissue or cell type data.
Considering that DNA methylation is heavily involved in cellular differentiation, our results in Tables 6 and 7 may look suspicious. We therefore count the number of differentially methylated CpG islands (Table 8) and calculate the correlation of the CpG island methylation levels between any two different tissue and cell types (Figure 3). Observe that between somatic/placenta cells, the number of differentially methylated CpG islands is small and the correlation coefficients are very high, whereas between the somatic/placenta and sperm cells, the number of differentially correlated CpG islands is relatively larger and the correlation coefficients is relatively lower. This suggests that the methylation status of CpG islands are highly correlated in various somatic/placenta cells, and therefore do not represent tissue-specific differentially methylated regions. Our observations are consistent with recent studies [17,40] that there are few variance in methylation levels of autosomal CpG island promoters, and there is only a relatively small fraction of CpG islands with tissue-specific methylation. The difference between the somatic/placenta and sperm cells, as reflected by their moderate cross-correlations and the performance deteriorations of our prediction models being applied to the sperm cell data, suggests that gametes are epigenetically more deviated from somatic cells than somatic cells themselves. This difference is likely related to the meiotic process, the special conditions and gene expression required for gamete production .
Table 8. The number of CpG islands that are differentially methylated in any two tissues among 321 common CpG islands for all the 12 tissues.
Figure 3. Correlation coefficients of the CpG island methylation levels across different tissues and cell types. The methylation status of CpG islands are highly correlated among the somatic and placenta cells. The methylation status of CpG island in sperm exhibits much difference in comparison with other tissue and cell types.
Conclusions and future works
The establishment of DNA methylation pattern is a crucial part of cell differentiation and organ development, suppression of viral genes and deleterious elements, and carcinogenesis. Computational prediction of DNA methylation levels provides an effective, fast and cheap alternative approach for studying the DNA methylation patterns. In this study, we perform the computational prediction of the CpG island methylation by incorporating additional features and effectively selecting and decorrelating the features. We incorporate the information regarding the nucleosome positioning propensity, acetylation status of nearby histones, and the functional roles of nearby genes. These features are first screened through statistical tests and PCA. The most DNA methylation-relevant yet non-intercorrelated features are subsequently used to build computational models to predict the methylation status of CpG islands. Our experiments on the HEP data set demonstrated that (1) an eight-dimensional feature space, which combines all the eight categories of information, is effective in predicting the methylation status of CpG islands; (2) by incorporating the information regarding the nucleosome positioning propensities, gene functions, and histone acetylation, our predictive model achieves a higher specificity and accuracy than the existing model while maintaining a comparable sensitivity; (3) the histone modification attributes carry a weight of information for the prediction, without which the performance of the predictive model deteriorates substantially in terms of sensitivity; (4) with or without the histone modification information, the performance of the predictive models are consistent on the validation and testing data.
Though it is known that DNA methylation is heavily involved in the normal development and differentiation, as well as in the onset and progression of diseases, the exact mechanisms are yet to be discovered. It will certainly help to accelerate biomedical investigations if we can, through computational predictions, comparative analyses, and evolutionary studies, identify those DNA regions whose methylation variation patterns are correlated with, indicative of, and underlying of the variations in gene expressions, histone modifications and chromatin structures that are related to normal development, cell differentiation, genome imprinting, X-chromosome inactivation, and phenotypic changes, respectively. This computational model, with its evidently high specificity and sensitivity, provides an effective tool for identification of new methylation targets and therefore lays foundation for our future endeavors in the regulation mechanisms of DNA methylation.
An standalone program for the CpGIMethPred is freely available for download at http://users.ece.gatech.edu/~hzheng7/CGIMetPred.zip webcite. Given the chromosome location (hg18) of a CpG islands, CpGIMethPred is able to predict the methylation status of it.
The authors declare that they have no competing interests.
This study is partially supported by Georgia Cancer Coalition Award RC672 "Establishing the Computational Model to Detect Cancer-Specific Differentially Methylated CpG Islands in Human Genome."
This article has been published as part of BMC Medical Genomics Volume 6 Supplement 1, 2013: Proceedings of the 2011 International Conference on Bioinformatics and Computational Biology (BIOCOMP'11). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/6/S1. Publication of this supplement has been supported by the International Society of Intelligent Biological Medicine.
Proceedings of the National Academy of Sciences USA 2003, 100:12253-12258. Publisher Full Text
Tao Y, Xi S, Shan J, Maunakea A, Che A, Briones V, Lee E, Geiman T, Huang J, Stephens R, Leighty R, Zhao K, Muegge K: Lsh, chromatin remodeling family member, modulates genome-wide cytosine methylation patterns at nonrepeat sequences.
J Biomedical Science and Engineering 2010, 3:672-676. Publisher Full Text
Siepel A, Bejerano G, Pedersen J, Hinrichs A, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier L, Richards S, Weinstock G, Wilson R, Gibbs R, Kent W, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.
Bernstein B, Mikkelsen T, Xie X, Kamal M, Huebert D, Cu J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagscha A, Feil R, Schreiber S, Lander E: A bivalent chromatin structure marks key developmental genes in embryonic stem cells.
Eckhardt F, Lewin J, Cortese R, Rakyan V, Attwood J, Burger M, Burton J, Cox T, Davies R, Down T, Haefliger C, Horton R, Howe K, Jackson D, Kunde J, Koenig C, Liddle J, Niblett D, Otto T, Pettett R, Seemann S, Thompson C, West T, Rogers J, Olek A, Berlin K, Beck S: DNA methylation profiling of human chromosomes 6, 20 and 22.
Fujita P, Rhead B, Zweig A, Hinrichs A, Karolchik D, Cline M, Goldman M, Barber G, Clawson H, Coelho A, Diekhans M, Dreszer T, Giardine B, Harte R, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn R, Learned K, Li C, Meyer L, Pohl A, Raney B, Rosenbloom K, Smith K, Haussler D, Kent W: The UCSC Genome Browser database: update 2011.
Chodavarapu R, Feng S, Bernatavichute Y, Chen P, Stroud H, Yu Y, Hetzel J, Kuo F, Kim J, Cokus S, Casero D, Bernal M, Huijser P, Clark A, Kramer U, Merchant S, Zhang X, Jacobsen S, Pellegrini M: Relationship between nucleosome positioning and DNA methylation.
Nature Letter 2010, 466:388-392. Publisher Full Text
Nature Letter 2009, 458:362-366. Publisher Full Text
Koch C, Andrews R, Flicek P, Dillon S, Karaoz U, Clelland G, Wilcox S, Beare D, Fowler J, Couttet P, James K, Lefebvre G, Bruce A, Dovey O, Ellis P, Dhami P, Langford C, Weng Z, Birney E, Carter N, Vetrie D, Dunham I: The landscape of histone modifications across 1% of the human genome in five human cell lines.
Irizarry R, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, Ji H, Potash J, Sabunciyan S, Feinberg A: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores.
Nature Reviews Cancer 2009, 41:178-186. Publisher Full Text
Nawapen P, Junpen S, Dion H, Michael D, Bernie C, Mongkol T: Different DNA methylation patterns detected by the Amplified Methylation Polymorphism Polymerase Chain Reaction (AMP PCR) technique among various cell types of bulls.