J Craig Venter Institute, San Diego, CA, USA

Medical Sciences Program, Indiana University School of Medicine, Bloomington, IN, USA

School of Computer Science and Engineering, Bioinformatics Institute, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea

Abstract

Background

DNA methylation is essential for normal development and differentiation and plays a crucial role in the development of nearly all types of cancer. Aberrant DNA methylation patterns, including genome-wide hypomethylation and region-specific hypermethylation, are frequently observed and contribute to the malignant phenotype. A number of studies have recently identified distinct features of genomic sequences that can be used for modeling specific DNA sequences that may be susceptible to aberrant CpG methylation in both cancer and normal cells. Although it is now possible, using next generation sequencing technologies, to assess human methylomes at base resolution, no reports currently exist on modeling cell type-specific DNA methylation susceptibility. Thus, we conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility at three different resolutions: CpG dinucleotides, CpG segments, and individual gene promoter regions.

Results

Using a k-mer mixture logistic regression model, we effectively modeled DNA methylation susceptibility across five different cell types. Further, at the segment level, we achieved up to 0.75 in AUC prediction accuracy in a 10-fold cross validation study using a mixture of k-mers.

Conclusions

The significance of these results is threefold: 1) this is the first report to indicate that CpG methylation susceptible "segments" exist; 2) our model demonstrates the significance of certain k-mers for the mixture model, potentially highlighting DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences across different tissue types; 3) as only 3 or 4 bp patterns had previously been used for modeling DNA methylation susceptibility, ours is the first demonstration that 6-mer modeling can be performed without loss of accuracy.

Background

DNA methylation is the chemical modification of DNA bases, mostly on cytosines that precede a guanosine in the DNA sequence, i.e., the CpG dinucleotides. This epigenetic modification involves the addition of a methyl group to the number 5 carbon of the cytosine pyrimidine ring. DNA methylation is essential for cellular growth, development and differentiation.

Previous work

Several recent studies have attempted to predict CpG island methylation patterns in normal and cancer cells. DNA pattern recognition and supervised learning techniques were used by Feltus et al. to discriminate methylation-prone (MP) and methylation-resistant (MR) CpG islands based on seven DNA sequence patterns.

While the focus of the above studies was on CpG island methylation susceptibility, recent experiments have convincingly demonstrated that methylation levels of CpG sites, i.e., the genomic locations of CpG dinucleotides, within a CpG island can be highly variable. For example, Handa et al. found that certain sequence features flanking CpG sites were associated with high- and low-methylation CpG sites in an in vitro DNMT1 overexpression model.

Motivation

Previous CpG island methylation susceptibility prediction studies have not considered cell type-specific methylation status. Considering the variations in DNA methylation level, even in the same genomic regions, across different types of cells, we asked the question: can cell type-specific DNA methylation susceptibility be modeled? The significance of exploring this question is based on evidence supporting the strong association of genomic sequence features with DNA methylation status. Furthermore, recent studies strongly indicate the existence of methylation sensitive/resistant CpG islands in different cancer types.

Methods

The problem: methylation susceptible DNA segment modeling problem

The need for segment modeling

Bisulfite sequencing data clearly demonstrates that methylation levels, even within a single gene promoter, can be highly variable. Furthermore, a figure in Additional file

**DNA methylation level variation.** A figure in the file shows DNA methylation level variation in an amplicon from 5 cell types.


Definition of the problem

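The following notation was used to formally define the problem. A small set of pre-selected k-mers is denoted **x** = $\{x_i\}$, where $x_i$ is the number of occurrences of the $i$-th k-mer, and the labels are denoted **t** = $\{t_j\}$, where $t_j \in \{0, 1\}$ is the methylation label of the $j$-th DNA segment.

For each cell type, a k-mer mixture logistic regression model (equation 1) was built using a small set of pre-selected patterns, i.e.

$$P(t = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{i} \beta_i x_i\right)\right)} \qquad (1)$$

where $\beta_i$ is the regression coefficient of the $i$-th k-mer and $\beta_0$ is the intercept.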

The k-mer mixture modeling problem

Our goal was to test whether methylation susceptibility can be modeled by a logistic regression model using a small set of k-mers. Although using k-mers for DNA methylation modeling is not entirely new, to our knowledge, only short k-mers (3 or 4 bp in length) were used in previous studies.

1. First, we attempted to use longer k-mers (up to 6 bp) in order to exploit k-mers that occur only in methylation susceptible sequences, rather than relying on the frequencies of short k-mers as described above.

2. Our goal of determining whether machine learning predictors can be built by using k-mers required that we address two important issues: over-fitting and generalizability of prediction beyond the test data. The over-fitting problem was addressed by selecting a small number of k-mers from the training data set (using a larger number of k-mers can easily over-fit the training data). The cross validation technique was used to test the generalizability of prediction power. We selected k-mers and built machine predictors by using only the training data set. We then assessed the predictor on the test data set not used for either selecting k-mer features or building predictors.

Two k-mer feature selection methods

We used a selected set of k-mers for DNA methylation susceptibility modeling in the different cell types. The research question explored in this paper is the feasibility of modeling methylation susceptible segments given a set of k-mers. As selection of the "best set" of k-mers for modeling was not explored (a solution to the combined problem was too difficult), we used two standard pattern selection methods for a two-class data set.

1. Feature selection with t-test: A popular t-test method was used to select k-mers because of its simplicity and applicability for all modeling approaches. For each attribute

2. Feature selection with the random forest technique: The RF algorithm

In both methods, we extracted a set of patterns in the balanced data set. First, centered at each CpG site, we extracted a flanking sequence of length
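As a rough illustration of the t-test selection step, the sketch below ranks k-mers by the absolute value of a two-sample t statistic between the two classes and keeps the highest-ranked ones. The use of Welch's variant, the use of |t| as a stand-in for ranking by p-value, and the names `welch_t` and `select_kmers` are assumptions made for this example, not details taken from the study.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two samples given as lists of numbers."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    denom = math.sqrt(va / na + vb / nb)
    if denom == 0:
        # Both samples are constant: no difference, or perfect separation.
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / denom

def select_kmers(counts_pos, counts_neg, kmers, top_n):
    """Rank k-mers by |t| between the two classes and keep the top_n.

    counts_pos / counts_neg: one row of k-mer counts per sequence,
    with columns in the same order as `kmers`.
    """
    scored = []
    for j, kmer in enumerate(kmers):
        t = welch_t([row[j] for row in counts_pos],
                    [row[j] for row in counts_neg])
        scored.append((abs(t), kmer))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [kmer for _, kmer in scored[:top_n]]
```

Because the ranking uses only the rows passed in, calling `select_kmers` on training rows alone keeps the test data out of feature selection.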

Modeling methylation levels of DNA segments

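Definition A boundary variable $b_i \in \{0, 1\}$ indicates whether a segment boundary is placed at the $i$-th position. Consecutive CpG sites $c_a, \ldots, c_z$ with no boundary between them form one segment, and an assignment of values to all boundary variables $b_i$ is called a segment **configuration**.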


**Illustration of the initial segment definition.** Because all boundary variables are set to 1, 10 initial segments are defined. Later, the segment modeling algorithm considers alternative segment definition by changing the boundary variable values. Figure was modified from

Labeling data

Given a segment $s_i$, its label $t_i$ was derived from the methylation status of the CpG sites in $s_i$.

Attributes for modeling

K-mer occurrences in segments in the training data set were used as attributes. A small subset of k-mers features **x **was selected from all k-mers using the feature selection methods.
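A minimal sketch of how k-mer occurrences might be turned into an attribute vector is given below. Counting overlapping windows on a single strand and skipping windows with non-ACGT symbols are assumptions of this example, as are the helper names `kmer_counts` and `feature_vector`.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count overlapping k-mers in a DNA sequence (single strand)."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[kmer] += 1
    return counts

def feature_vector(sequence, selected_kmers):
    """Occurrences of each selected k-mer, in the order given.

    The selected k-mers may have mixed lengths (for example 4, 5 and 6 bp).
    """
    lengths = {len(kmer) for kmer in selected_kmers}
    tables = {k: kmer_counts(sequence, k) for k in lengths}
    return [tables[len(kmer)][kmer.upper()] for kmer in selected_kmers]

print(feature_vector("ACGCGT", ["CG", "ACG", "TT"]))  # -> [2, 1, 0]
```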

Modeling

A single logistic regression model was used to model all DNA segments for each cell line, using attributes **x **and labels **t**.
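To make the modeling step concrete, the following is a minimal sketch of a logistic regression fitted by batch gradient descent on k-mer occurrence vectors. The optimizer, learning rate, number of epochs, and function names are assumptions for illustration; the study does not state how its models were fitted.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    """Predicted probability of label 1; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, t, lr=0.1, epochs=2000):
    """Fit weights [intercept, w_1..w_d] by gradient descent on the log-loss.

    X: list of attribute vectors (k-mer counts), t: list of 0/1 labels.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, label in zip(X, t):
            err = predict(w, x) - label  # derivative of log-loss w.r.t. the logit
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Toy usage with made-up counts for two k-mers.
X = [[0, 2], [1, 3], [3, 0], [4, 1]]
t = [0, 0, 1, 1]
w = fit_logistic(X, t)
```

One model of this kind would be fitted per cell line, with the rows of `X` being segments rather than whole promoters.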

Segment-level modeling challenges: exponential search space

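Although the methylation status of a DNA segment is defined by an aggregation of the methylation status of individual promoter regions (as we did for the whole promoter region-modeling approach), how to define methylation susceptible DNA segments is currently unknown. For example, consider a DNA segment with five CpG sites. Each way of grouping neighboring sites gives a different candidate segment definition, and the number of possible groupings grows exponentially, on the order of $2^n$ for $n$ CpG sites.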
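The size of the search space can be illustrated with a short enumeration. The sketch assumes one binary boundary variable between each pair of adjacent CpG sites, so that five sites give four variables; this indexing and the function names are choices made for the example rather than the study's exact formulation.

```python
from itertools import product

def segments_from_boundaries(sites, boundaries):
    """Split `sites` into contiguous segments.

    boundaries[i] == 1 places a cut between sites[i] and sites[i + 1].
    """
    segments, current = [], [sites[0]]
    for site, cut in zip(sites[1:], boundaries):
        if cut:
            segments.append(current)
            current = [site]
        else:
            current.append(site)
    segments.append(current)
    return segments

def all_configurations(sites):
    """Enumerate every segmentation; there are 2 ** (len(sites) - 1) of them."""
    m = len(sites) - 1
    return [segments_from_boundaries(sites, b) for b in product((0, 1), repeat=m)]

configs = all_configurations(["c1", "c2", "c3", "c4", "c5"])
print(len(configs))  # -> 16
```

Doubling with every additional site is what makes exhaustive enumeration impractical for promoters with many CpG sites.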

A random binary segment merging algorithm

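A naïve approach to segment modeling simply enumerates all possible segment configurations, i.e., every combination of values of the boundary indicator variables $b_i$. With $m$ boundary variables there are $2^m$ configurations, so exhaustive enumeration quickly becomes infeasible. The search used here instead proceeds through the following steps.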

1. **Initialization of a configuration:** Define a boundary variable $b_i$ for each candidate boundary and set all of them to 1, which yields the initial segments.

2. **Computing a logistic regression model**: Given a k-mer occurrence and a segment configuration, compute a logistic regression model by (1). This is how

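3. **Computing an error of a segment configuration**: Errors in the segment set were measured by the weighted squared prediction error

$$E = \sum_{i} w_i \left(t_i - y_i\right)^2 \qquad (2)$$

where $t_i$ is the label of segment $s_i$, $y_i$ is the model prediction for $s_i$, and $w_i$ is the weight of $s_i$.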

The random binary segment merging algorithm

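Given the current segment configuration $\{s_i\}$, a segment $s_j$ is selected at random from the ordered list of segments $\langle s_1, \ldots, s_n \rangle$, where $s_1, \ldots, s_n$ are listed in the order in which they occur in the sequence.

Once a segment $s_j$ is selected, it is merged with its neighbor $s_{j+1}$ by removing the boundary between them, so that exactly two adjacent segments are joined at a time.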

**Input**: A set of pre-selected k-mers K = $\{k_i\}$.

**Output**: A logistic regression model; A segment configuration.

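**HillClimbingConfigurationSearch**(N)

**begin**

(C, M, E) ← the initial configuration, its logistic regression model, and its error

**for** i ← 1 **to** N **do**

(C', M', E') ← **RandomConfigurationSearch**(C, M, E)

**if** E' < E **then** (C, M, E) ← (C', M', E')

**end**

**report** (C, M, E)

**end**

**end**

**RandomConfigurationSearch**(C, M, E)

**begin**

**while** unmerged boundaries remain **do**

(C', M', E') ← **RandomBinaryMerging**(C)

**if** (E' < E) **then break**

**end**

**return** (C', M', E')

**end**

**RandomBinaryMerging**(**configuration** C)

**begin**

**bool** visited[j] ← **{false}** for every segment $s_j$

**while** ∃ j with visited[j] = **false do**

visited[j] ← **true** for a randomly chosen unvisited segment $s_j$

$b_j$ ← **false**, which merges $s_j$ and $s_{j+1}$

**if** the merge lowers the error **then**

visited[j+1] ← **true**

**else**

$b_j$ ← **true**

**end**

**end**

**return** the merged configuration with its refitted model and error

**end**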

**Algorithm 1:** Hill climbing configuration search algorithm. The algorithm tries to merge two adjacent segments at random until all segments have been considered for merging. A new configuration is accepted only when the error is reduced under a new logistic regression model; it is therefore a hill climbing algorithm.
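The search loop of Algorithm 1 can be approximated by the sketch below. It is an interpretation rather than the authors' implementation: boundaries are 0/1 flags between adjacent sites, the model is refitted after every single trial merge, and `fit_and_score` is a hypothetical callback standing in for fitting the logistic regression and computing its error. The toy scoring function exists only to make the example runnable.

```python
import random

def hill_climb(boundaries, fit_and_score, n_rounds, seed=0):
    """Greedy search over segment configurations by random pairwise merging.

    boundaries: list of 0/1 flags, 1 = cut between adjacent CpG sites.
    fit_and_score(boundaries) -> (model, error): hypothetical callback that
        refits the model on the segments implied by `boundaries`.
    Returns (boundaries, model, error, error_history).
    """
    rng = random.Random(seed)
    current = list(boundaries)
    model, error = fit_and_score(current)
    history = [error]
    for _ in range(n_rounds):
        # Consider every remaining cut once per round, in random order.
        order = [i for i, b in enumerate(current) if b == 1]
        rng.shuffle(order)
        for i in order:
            candidate = list(current)
            candidate[i] = 0  # merge the two segments on either side of cut i
            cand_model, cand_error = fit_and_score(candidate)
            if cand_error < error:  # hill climbing: accept only improvements
                current, model, error = candidate, cand_model, cand_error
        history.append(error)
    return current, model, error, history

def toy_score(boundaries, target=(1, 0, 0, 1, 0)):
    """Stand-in error: distance from a made-up 'true' configuration."""
    return None, sum(b != t for b, t in zip(boundaries, target))

best, _, err, history = hill_climb([1, 1, 1, 1, 1], toy_score, n_rounds=3)
```

Because a merge is kept only when the error drops, the recorded error history never increases, which matches the monotone curves one would expect from a hill climbing search. As with any greedy search, the result may be a local rather than a global optimum.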

Results

Data set

We used data from Zhang et al

Experimental setup

The 10-fold cross validation (described above) was used to compare the performance of the three modeling approaches. For each round of 10-fold cross validation, one of the 10 subsets was set aside for testing, and the k-mer features were selected only from the training set, ensuring that the test data had no influence on k-mer feature selection. Likewise, regression coefficients were computed only in the training stage. We measured the area under the ROC curve (AUC) for performance comparison.
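The protocol above, with feature selection and fitting confined to the training folds, might be sketched as follows. The `select`, `fit` and `score` callbacks are hypothetical placeholders for the k-mer selection, model fitting and prediction steps, and pooling the held-out predictions into a single AUC is an assumption; the study may instead have averaged per-fold scores.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n) into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validated_auc(X, t, select, fit, score, k=10, seed=0):
    """AUC over pooled held-out predictions; test rows never reach select/fit."""
    labels, preds = [], []
    for train, test in k_fold_indices(len(X), k, seed):
        X_train = [X[i] for i in train]
        t_train = [t[i] for i in train]
        cols = select(X_train, t_train)  # feature columns chosen on training rows only
        model = fit([[row[c] for c in cols] for row in X_train], t_train)
        for i in test:
            labels.append(t[i])
            preds.append(score(model, [X[i][c] for c in cols]))
    return auc(labels, preds)
```

An AUC of 0.5 corresponds to random ranking, which is the baseline against which the reported scores are read.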

Effectiveness of the segment modeling approach

We extensively tested the effectiveness of the segment modeling algorithm using 4-mer, 5-mer, and 6-mer patterns. For each of the experiments, the AUC score was measured from 10-fold cross validation for the initial segment definition vs. the final segment definition. The RF-based algorithm with 100 trees was used for k-mer feature selection. For each k-mer selection procedure, 30 random experiments were performed, and k-mers with z-score > 0 that appeared in at least 90% of experiments were selected as k-mer features. Using the set of k-mers, the optimal logistic regression model was computed.
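The repeated-selection filter described above can be sketched as follows. Only the frequency threshold is shown: the random forest importance scores and the z-score cutoff are assumed to have been applied already when producing each run's list, and `stable_kmers` is a hypothetical helper name.

```python
from collections import Counter

def stable_kmers(selected_per_run, min_fraction=0.9):
    """Keep k-mers that were selected in at least `min_fraction` of the runs.

    selected_per_run: one collection of k-mers per random experiment,
    each already filtered by importance (for example z-score > 0).
    """
    n_runs = len(selected_per_run)
    freq = Counter(kmer for run in selected_per_run for kmer in set(run))
    return sorted(kmer for kmer, c in freq.items() if c / n_runs >= min_fraction)

# Toy usage with made-up selections from 10 runs.
runs = [["CGCG", "ACGT"]] * 9 + [["CGCG", "TTTT"]]
print(stable_kmers(runs))  # -> ['ACGT', 'CGCG']
```

With 30 runs and a 90% threshold, a k-mer would need to be selected in at least 27 runs to be retained.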

10-fold cross validation experiments

The performance comparison between the initial segments and the final segments in the test set is shown in Figure


**Effectiveness of segment modeling in 10-fold cross-validation experiments.** Bars between adjacent dotted lines show the improvement in AUC between the prediction results of the two models, one with the initial segment setting and one with the final segment setting. We measured the performance improvement using 4-mer, 5-mer, and 6-mer features. For each cell type, the segment modeling algorithm identified significantly improved segment definitions. The five panels in each plot correspond to tissue types: (A) Fibroblast, (B) HEK293, (C) HepG2, (D) Leukocytes, and (E) Trisomy 21.

Search behavior

The search behavior of the segment modeling algorithm is shown in Figure


**The search behavior of the segment modeling algorithm using the whole data set.** Pairwise plots showing the reduction in learning error (equation 2) at each iteration of segment merging and model recalculation. The columns of the pairwise plots are k-mers; the rows are cell lines. In each plot, the X-axis denotes the number of iterations and the Y-axis denotes the weighted squared prediction error. The HillClimbing search algorithm effectively reduced the error between prediction and observation. In fitting the whole data set, as opposed to 10-fold cross validation, the final model predicted methylation susceptibility in the different cell types.

Discussion on the predictive power of the model

The predictive power of the model measured by 10-fold cross validation is encouraging. For 6-mers, the AUC was 0.69 for Fibroblast, 0.70 for HEK293, 0.54 for HepG2, 0.73 for Leukocytes, and 0.65 for Trisomy 21. These values exceed the AUC of 0.5 expected on random data sets. Variations in the prediction accuracy across the five cell types, especially for HepG2, may be due to cell type-specific characteristics. On the other hand, the data obtained from

Effect of the number of k-mers used for prediction

The three modeling approaches were compared in terms of AUC obtained by 10-fold cross validation. We conducted comprehensive modeling of cell type-specific DNA methylation susceptibility at three different resolutions: individual CpG sites, CpG segments, and promoter regions. The methods for modeling at individual CpG sites and at promoter regions are described in Additional file

**Competing modeling approaches.** The two modeling approaches compared against segment modeling, CpG site-specific modeling and promoter region modeling, are described.



**Effect of the number of k-mers used for three modeling approaches.** The performance of the three modeling approaches was measured by 10-fold cross validation. Each bar is the AUC value of one experiment. The X-axis is the number of most significant variables (by p-value in the t-test) used in each experiment. Consistently from 4-mers to 6-mers, and regardless of the number of patterns, segment modeling outperformed the other modeling approaches. More importantly, the experiments using between 10 and 100 k-mers show that the number of k-mers selected does not have a large impact on model performance. The higher accuracy of the segment modeling approach, compared to the promoter and site-specific modeling approaches, is therefore likely due to the effectiveness of the segment model itself.

Conclusion

We conducted a comprehensive modeling study of cell type-specific DNA methylation susceptibility. By performing extensive computational experiments on data from five distinct cell types, we show that DNA methylation susceptibility can be accurately modeled at the segment level, achieving up to 0.75 in AUC prediction accuracy in a 10-fold cross validation study. The two-step iterative segment modeling algorithm successfully identified optimal segments that can be modeled as a logistic regression model using a set of k-mers. Our model further shows the significance of certain k-mers for the mixture model, which can potentially highlight DNA sequence features (k-mers) of differentially methylated promoter CpG island sequences in different cells and tissues, including malignancies. As only 3 or 4 bp patterns were used in previous modeling studies of DNA methylation susceptibility, this is the first report to show that k-mer modeling can be performed using k-mers of up to 6 bp without loss of modeling accuracy.

List of abbreviations used

• AUC: area under the ROC curve; • DNA: deoxyribonucleic acid; • MP: methylation-prone; • MR: methylation-resistant; • RF: random forest; • YY: Youngik Yang; • SK: Sun Kim; • KN: Ken Nephew.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YY designed the computational framework, conducted the simulations, and wrote the manuscript. KN gave critical input on the biological discussion of this work and drafted the manuscript. SK led the project, designed the algorithm and tests, and drafted the manuscript.

Acknowledgements and funding

This work was supported by NIH U54 CA11300-02 (Interrogating Epigenetic Changes in Cancer Genomes) to SK and KN and by Korea National Research Foundation 0543-20110016 to SK.

This article has been published as part of