Abstract
Background
Whole genome sequencing of bisulfite converted DNA (‘methylCseq’) method provides comprehensive information of DNA methylation. An important application of these whole genome methylation maps is classifying each position as a methylated versus nonmethylated nucleotide. A widely used current method for this purpose, the socalled binomial method, is intuitive and straightforward, but lacks power when the sequence coverage and the genomewide methylation level are low. These problems present a particular challenge when analyzing sparsely methylated genomes, such as those of many invertebrates and plants.
Results
We demonstrate that the number of sequence reads per position from methylCseq data displays a large variance and can be modeled as a shifted negative binomial distribution. We also show that DNA methylation levels of adjacent CpG sites are correlated, and this similarity in local DNA methylation levels extends several kilobases. Taking these observations into account, we propose a new method based on Bayesian classification to infer DNA methylation status while considering the neighborhood DNA methylation levels of a specific site. We show that our approach has higher sensitivity and better classification performance than the binomial method via multiple analyses, including computational simulations, Area Under Curve (AUC) analyses, and improved consistencies across biological replicates. This method is especially advantageous in the analyses of sparsely methylated genomes with low coverage.
Conclusions
Our method improves the existing binomial method for binary methylation calls by utilizing a posterior odds framework and incorporating local methylation information. This method should be widely applicable to the analyses of methylCseq data from diverse sparsely methylated genomes. BisClass and example data are provided at a dedicated website (http://bibs.snu.ac.kr/software/Bisclass webcite).
Keywords:
DNA methylation; Bayes classifier; Local DNA methylation level; MethylCseqBackground
DNA methylation is a prevalent epigenetic modification of genomic DNA with fundamental functional consequences on developmental processes, regulation of gene expression and diseases [1,2]. Accurately inferring the level of DNA methylation at a specific nucleotide in the genome is a critical step toward elucidating the molecular mechanisms of regulation via DNA methylation. A method widely gaining popularity is the whole genome sequencing of bisulfite converted genomic DNA, often referred to as ‘methylCseq’ (also referred to as ‘BSseq’ elsewhere). This method is based upon the particular chemical properties of DNA methylation to ‘protect’ cytosines from converting to uracils by sodium bisulfite [3]. Specifically, during the sodium bisulfite conversion process, nonmethylated cytosines are changed to uracils, which then change to thymine after PCR amplification. Consequently, following a sodium bisulfite treatment, nonmethylated cytosines should be read as thymines while methylated cytosines should remain as cytosines.
Compared to microarraybased methods, the methylCseq method is powerful in a multitude of ways. In addition to the fact that it can provide information on every nucleotide in the genome, a notable strength of this method is that it can be applied to analyses of nonmodel species where predefined microarrays (such as beadchip) are not readily available. For this reason, methylCseq has been instrumental in the recent surge of genomic DNA methylation analyses from diverse taxa, in particular from many invertebrates (e.g., [46]). These studies show that invertebrate genomes generally exhibit very different patterns of DNA methylation compared to those of mammalian genomes. The most significant difference is that invertebrate genomes tend to be much more sparsely methylated than mammalian genomes. For example, the mean level of DNA methylation for individual CpGs in the honey bee genome is <1% [7,8], which is far lower than that in the human genome (60 ~ 90% [9,10]). Even relatively heavily methylated genomes of some aquatic species such as the freshwater snail Biomphalaria glabrata or the pacific oyster Crassostrea gigas harbor only a few percent of methylated cytosines [11]. Similarly, plant genomes appear to be generally much more sparsely methylated than mammalian genomes. For example, only a few percent of cytosines are methylated during the early stages of Populous floral development [12].
Analyzing such sparsely methylated genomes presents unique technical challenges. In heavily methylated species such as mammals, the measure of interest is usually the fraction of methylated reads (‘C’ reads) in the total number of reads per site, the socalled ‘fractional DNA methylation’ [1315]. In sparsely methylated genomes, these values are typically very small. Moreover, these values are heavily influenced by errors associated with the conversion and sequencing processes (see below). For these reasons, it is often important to determine whether a specific position has any methylation or not. In other words, a binary classification of methylated versus nonmethylated cytosines is critical to evaluate the distribution of DNA methylation and different levels of DNA methylation [4,8,16,17]. In principle, this should be simple: cytosines covered by any number of ‘C’ reads should be considered methylated. However, in reality, this step is not straightforward due to the nature of chemistry underlying the MethylCseq method. Specifically, the sodium bisulfite conversion step is not perfect, and includes both i) the possibility of nonconversion (nonmethylated C is not properly converted to U/T), leading to an overestimation of actual DNA methylation (Figure 1A), as well as ii) overconversion (methylated C is also converted), leading to an underestimation of actual DNA methylation (Figure 1A) [3]. Consequently it is necessary to take into account these technical errors for a binary classification of a specific nucleotide. In particular, these errors can occur at rates comparable to the actual methylation levels in some genomes. Despite these wellknown and substantial technical issues, methods to efficiently account for these imperfections are surprisingly rare. The most widely used method is the socalled binomial method[8,17]. However, this method has some shortcomings when the genomic methylation levels and the coverage of specific site are low (see below).
Figure 1. Potential errors and biases of methylCseq and binomial method. (A) Errors associated with the methylCseq method. Nonmethylated Cs may not be completely converted (nonconversion error, nonmethylated C remains as C). In addition, methylated Cs may undergo conversion (overconversion error, methylated C converts to T). (B) Reduced power of the binomial test in sparsely methylated genomes and low coverage. The Yaxis indicates FDRcorrected qvalues from the binomial test, calculated following the equation (2) in the main text. The Xaxis indicates the percentiles of pvalues, which is equivalent to the whole genome methylation levels. Four cases are shown, including when a specific cytosine is covered by a single ‘C’ read (filled circles), one ‘C’ and one ‘T’ reads (crosses), one ‘C’ and two ‘T’ reads (open triangles) and two ‘C’ reads (filled triangles). The fractional methylation levels of these four cases are all substantial, 100%, 50%, 33% and 100%, respectively. However, in sparsely methylated genomes, many of these sites will have qvalue > 0.05 and will be classified as ‘unmethylated’. For example, when only a single ‘C’ read is available (line with filled circles), despite the fact that the read itself indicates a 100% methylation, it will be designated as unmethylated (qvalues > 0.05) unless the overall methylation level of the genome is greater than 4%. In another case, when a C is covered by one ‘C’ Read and two ‘T reads (line with open triangles), the fractional methylation level of such a position is 33%. However, such a site will be called as ‘unmethylated’ unless the overall level of methylation in the genome is 12% or higher.
Here, we propose a new method, the Bisulfitesequencing data classification method (BisClass). This method takes the prior methylation distribution into account to infer methylation status in the framework of Bayesian probabilistic models, which is known to minimize classification errors in the presence of a known alternative hypothesis [18]. In addition to utilizing a Bayesian classification scheme, we take into account the fact that DNA methylation levels of adjacent sites are correlated (Figure 2). Consequently, including information on DNA methylation levels of the genomic neighborhood improves our ability to correctly infer the DNA methylation status. We demonstrate that BisClass alleviates the problems of the binomial method and improve sensitivity and accuracy using extensive simulations as well as analyses of actual methylCseq data.
Figure 2. Properties of methylCseq coverage and spatial correlation of CpG methylation level. (A) Quantilequantile (QQ) plot between observed coverage and theoretical coverage which is from a shifted negative binomial distribution. (B) Spatial correlation plot of a honeybee methylome from Herb et al. [7]. (C) QQ plot between observed pvalues from KolmogorovSmirnov (KS) test and theoretical pvalues from a null distribution. The KS test was used to detect the spatial correlation between selected CpG sets, consisting of one CpG per window. Detailed procedure is explained in the Additional file 1. The green region implies 95% confidence interval of theoretical ordered pvalues. The overlap between the observed Pvalues indicates that the data follows the null distribution. (D) Spatial correlation plot of an Arabidopsis methylome (methylCseq data from GSM276809, [29]). (E) Smoothed methylation level using triangle kernel in scaffold 1.1, for three samples. Xaxis is physical location and Yaxis is methylation level. Red lines represent average methylation fractions calculated from whole CpG methylomes.
Additional file 1. Estimation of error rates using ExpectationMaximization (EM) algorithm.
Format: DOCX Size: 28KB Download file
Methods
Pitfalls of the binomial method
We first describe the widely used binomial method in some detail. In this method, the probability that a nonmethylated C remains as C, or the ‘nonconversion’ error (which we will refer to as p_{0}), is used to infer whether the observed methylation signal is more likely to have arisen by chance. Specifically, the methylation status of a site is determined under a binomial distribution where p_{0} is used as the success rate. The null hypothesis is that the site is not methylated, and the Pvalue for this null hypothesis is:
where k is the number of methylated reads at the site of interest, and N is the total number of reads at this site. The resulting Pvalues are further corrected for multiple testing, typically by the false discovery rate (FDR) [19]. The main parameter p_{0} is determined either by examining nonmethylated portions of the genome (such as repetitive regions in insect genomes or chloroplast genomes in plants, e.g., [8,17]) or from ‘spikedin’ lambda genomic DNA (e.g., [13]). This approach is intuitive and straightforward.
However, the power of the binomial method is weak when the number of reads (N) is small (i.e., equation (1) above). Moreover, when combined with the correction for multiple testing with FDR, the power of the binomial method is particularly reduced at low coverage sites in sparsely methylated genomes. This reduction is because the FDRcorrected qvalue of the i^{th} site is calculated as
where p_{i} is the binomial Pvalue for a specific site i. Since the binomial Pvalues are limited by the number of reads (N) for that site, Pvalues from lowcoverage sites (low N) will have moderate ranks at most. Consequently, in sparsely methylated genomes, even if a site is truly methylated, if the number of reads is small, the Pvalue of such site will not be ranked sufficiently low to be classified as methylated after FDR correction. Figure 1B demonstrates this phenomenon using specific examples. For example, a site covered by a single ‘C’ read (the line with filled circles, Figure 1B) will not be classified as ‘methylated’ with the binomial method in genomes with the overall methylation levels typical of hymenopteran insects (i.e., < 4%). Likewise, a site with 50% methylation with coverage of two (the line with crosses, Figure 1B) will also be classified as nonmethylated in sparsely methylated genomes. Consequently, the binomial method may produce a high number of false negatives in lowly methylated genomes. To avoid this problem, some studies suggest using sites that are covered by at least two [20], or four [17] reads. However, since the number of reads typically has a large variance, even with a moderate coverage sequence data, substantial numbers of cytosines are covered by single reads (Table 1), making it impractical to remove positions with a small number of reads. For example, if we remove sites with fewer than four reads, almost 50% of data are discarded in representative methylCseq datasets in honey bee (Table 1).
Table 1. Properties of the methylCseq data sets used in this study
Bayes classification using methylation level as prior probabilities
To overcome the aforementioned problems in the binomial method, here we propose to use a Bayesian probabilistic model to infer methylation status. The posterior probability of methylation status is determined based upon the product of prior distribution of methylation and the likelihood of specific reads aligned to a site. Specifically, the posterior distribution of methylation is given as.
, where M is a random variable representing methylation status (m for methylated, nm for nonmethylated). R = {R_{1}, R_{2,..,} R_{N}} is the set of sequence reads mapped to a site. If a sample consists of N number of CpGs and i^{th} CpG has n_{i} reads, R_{i} denotes a set of reads assigned in i^{th} CpG and R_{ij} denotes j^{th} read of i^{th} CpG (i = 1, …, N and j =1, …, n_{i}). In addition, likelihood P (R_{i}M) is given as the product of P (R_{ij}M) s. π (M) is the prior information on DNA methylation.
Derivation of P (RM)
The likelihood P (R_{i}M) can be calculated by explicitly incorporating the errors associated with the inference of methylation. The main source of errors for nonmethylated sites is the nonconversion rate (denoted as p_{0}, Figure 1A). If there is no additional error, the probability of obtaining a C read in nonmethylated site is equivalent to the nonconversion rate p_{0}. Likewise, the probability of obtaining a C read in methylated site is 1 (overconversion rate), which we denote as p_{1} (Figure 1A). There may be additional errors occurring during sequencing process. We define the sequencing error (ϵ) as the probability of being misread from other nucleotide (For example, reading C read as T read or vice versa).
Consequently, our observation likelihood P (R_{i}M) consists of the following distributions according to the methylation status.
Since sequencing errors are confounded with p_{0} or p_{1} in reality, we will regard p_{0}′ and p_{1}′ as p_{0} and p_{1} in this article, respectively. The parameters p_{0} or p_{1} are inferred from data using the ExpectationMaximization (EM) algorithm [21]. The details of this calculation are shown in the Additional file 1.
Incorporating local methylation information to improve inference
We demonstrate below that methylated cytosines are heterogeneously distributed and locally clustered in different species (Figure 2). For example in the honeybee genome, some regions exhibit >70fold higher methylation levels compared to other regions (Figure 2E). We take this observation into account and incorporate local methylation levels into the methylation prior to improve classification accuracy. Since methylated cytosines are heterogeneously distributed and locally clustered, the use of local methylation information would be useful. Since some regions may have extreme methylation values, it might be also useful to include information on the global methylation level. Here, we propose using a weighted average of local and global methylation levels to produce a more robust estimation of posterior odds. Specifically, if we denote the global methylation level as π_{1}^{G} and the local methylation level as π_{1}^{L}, the combined methylated prior, π_{1}^{C}, can be represented as π_{1}^{G} × (1  w) + π_{1}^{L} × w. The weight parameter, w, can decide how much local versus global methylation levels can be included in the prior. This factor can have any value between 0 and 1. In our analyses we used 0.5, to treat local and global information equally. In our experience, using the weight factor of 0.5 produced good AUC (Area Under Curve), sensitivity and low error rate compared with other weight factor values for honey bee data (Additional file 2). Nevertheless, in this implementation of BisClass the users can choose any arbitrary value of the weight factor.
Additional file 2. Using highconfidence CpG sites (coverage ≥7) and sampling one read for each site, we examined the AUC, sensitivity, and 1specificity of different kernels and weight factors. The results indicate that the three kernels tried (Triangle, Gaussian, and Laplace) provide similarly high sensitivity and acceptable 1specificity. Gaussian kernel performs slightly worse with respect to sensitivity.
Format: DOCX Size: 1.5MB Download file
The global methylation level can be estimated from the observed numbers of C and T reads, as an extension of the widely used ‘fractional methylation levels’ [1315]. Details are presented in the Additional file 1. In order to estimate local methylation levels (denoted as ), we additionally use the kernel function which adjusts the weight of a specific function to consider distance from the site which is to be determined. For a kernel function K (d), d is the physical distance from a site which is of interest. Then can be estimated as weighted average through kernel function, as follows:
and are estimates of p_{0} and p_{1}, respectively. L is the number of reads for a specific site, and F is the number of ‘C’ reads divided by the total number of reads (Additional file 1), and is equivalent to the commonly used ‘fractional methylation’ measures [1315]. k = 1, 2, …, K denotes the index of CpGs in a window. The kernel function K (d) can be many types of functions which decreases as d increases. In our analyses, we chose to use the triangle kernel which decreases linearly for d ≤ d_{0} and zero for d > d_{0}. In addition, we define a window around the considered site as the region whose kernel weights are not zero. The window size, d_{0}, can be arbitrarily selected. We also define K (0) = 0 to exclude the focal cytosine. Our approach is very flexible, as the width and the kernel type can be easily changed according to the properties of each dataset. We selected the triangle kernel because it is similar to be observed patterns of spatial correlation between methylation levels of adjacent CpG sites (Figure 2B). Applying alternative kernels such as Gaussian or Laplace provided similar results (Additional file 2). The width of kernel in our analyses was determined as the point where the spatial correlation decreases to below 0.2, which is approximately 1.5 kb in the honey bee data (Figure 2B).
Posterior odds
After following the above steps, the posterior odds for i^{th} CpG can be constructed as:
If the value of a specific site is larger than some criteria, it will be classified as methylated. We propose using 19 as the criterion (meaning that the probability of being classified as methylated is 19 times bigger than that of being classified as nonmethylated). This criterion also means that the probability of being falsely classified as methylated is smaller than 0.05 at the site [22]. Consequently this is equivalent to the FDRcorrected qvalue < 0.05, as typically used in the binomial test.
Results and discussion
Features of MethylCseq data with emphasis on honey Bee
In this section we present analyses of actual bisulfitesequencing data that are pertinent to our proposed method. Honey bee is one of the first invertebrates for which the methylCseq method has been applied. The usage of the methylCseq method has been crucial to elucidating the importance of DNA methylation on gene regulation in honey bee, including its role in the differentiation of castes [8], behavioral differentiation of worker bees [7], and alternative splicing [23,24]. We examined two recent methylCseq datasets of honey bee, one from the brains of worker and queen bees [8], the other from brains of six forager and six nurse bees [7]. All data have been mapped to the assembly 2.0 using BSmap [25].
As reported previously in the original studies, mean fractional methylation levels are extremely low, between 0.3 ~ 0.5% for all cytosines, and 0.5% ~ 0.9% for CpGs ( in Table 2). The mean coverage in these data sets ranges between 2.65 and 7.24 (Table 1) and the variance of read depths is quite high (Table 1). The distribution of coverage follows a shifted negative binomial distribution with similar mean and variance as observed (Figure 2A). An important consequence of this is that most of the data (~50%) are covered by fewer than four reads and a substantial portion of the data are covered by only a single read (Table 1).
Table 2. Methylation classification using the binomial and BisClass methods
Correlated levels of local DNA methylation
Methylated cytosines are not randomly distributed along the genome. DNA methylation levels of nearby cytosines are correlated; for example, in a forager sample from Herb et al. [7], the correlation coefficient between two CpGs 100 bps apart is 0.5 (Figure 2B). The correlation decreases as the distance between two cytosines increases, and this pattern is more pronounced for CpGs than nonCpGs (Figure 2B). Covariation of DNA methylation of adjacent cytosines extends to several kilobases (Figure 2B and 2C). We observed similar trends in multiple species analyzed. For example, in Arabidopsis, a similar pattern is observed (Figure 2D, also see [26]). A similar level of spatial correlation has been also observed in the human genome [27]. When examined in detail, methylated cytosines in the honey bee are locally clustered in the genome (Figure 2E), with several regions in the chromosome exhibiting elevated levels of DNA methylation (Figure 2E). Importantly, this pattern and the locations of methylated clusters are consistent across different biological replicates (Figure 2E), indicating that the spatial correlation is not caused by technical noises, but reflects the inherent pattern in the genomic distributions of DNA methylation in these species.Together with the results in the above section, we show that a substantial portion of the genome is covered by very few reads, the overall level of methylation is low, and that local methylation levels are correlated. As discussed above and seen in the Figure 1, such aspects of data render the binomial method prone to high false negative rates. Consequently, we propose BisClass as a practical alternative to the commonly used binomial method of classification. In the next section, we show comprehensive simulation results based upon the observed parameters of the data, indicating that BisClass outperforms the binomial method.
Improved sensitivity and accuracy of methylation calling by Bisclass
We performed extensive simulation to compare the performance of BisClass to the binomial method. We generated methylCseq data for a genome of 100,000 cytosines, with the mean coverage ranging between 3× to 9×. The numbers of total reads at each site were generated from a shifted negative binomial distribution with the whole genome coverage as the mean and three times the mean as the variance, similar to the typical methylCseq dataset in honey bee (Table 1, Figure 2A). The selected parameters p_{0} and p_{1}, as well as the total methylation levels are also similar to those observed in the empirical data (Table 2). We also examined the effects of each parameter when they were slightly greater than the observed values. The weight parameter we used is 0.5, to consider global information and local information equally. To examine the effect of DNA methylation clustering, we generated two types of genomes. The first is a genome where methylated CpGs are uniformly distributed. In the second type, DNA methylation is concentrated in 1/10 of the genome in a 10× intensity compared to whole genome methylation level. We generated 100 replicates for each parameter combination. Local information was obtained from the 200 nearest cytosines (which is equivalent to considering CpGs with 3000 bps of a specific site in the honey bee methylation data).
We then compared classification results with the true status and calculated the sensitivity as the proportion of sites classified as methylated when they are truly methylated (Figure 3). The higher the sensitivity, the lower the rate of false negatives. In genomes where DNA methylation occurs uniformly (‘homogeneous’), both the binomial method and BisClass provide similar results across almost all settings (purple and green bars filled with diagonal lines in Figure 3). We note that the binomial methods in clustered genomes and homogenous genomes are statistically equivalent, which is apparent in the simulation results. Sensitivities are low when the sequence coverage is low, and increase with sequence coverage. In the nonhomogenous, clustered genomes, BisClass (solid green bar) outperforms the binomial method and exhibit much higher sensitivity (therefore lower false negatives) than the binomial method (solid purple bar, Figure 3). While BisClass displays higher sensitivities compared to the binomial method in a variety of settings, the difference is most pronounced when the coverage is low. In addition, the difference between BisClass and the binomial method is large when the ratios between the two error rates (p_{0} and p_{1}) are high and the whole genome methylation level is low.We also examined the incidences of misclassification. Because the proportions of methylation and nonmethylation sites are not balanced, a direct comparison between accuracy measures is difficult to perform. For this, we define ‘1specificity’ as the ratio of the number of misclassified nonmethylated sites to the number of true methylated sites. The resulting plots (Figure 4) show that all methods have acceptably low error rates (less than five percent of true methylated sites).
Figure 3. Comparison of sensitivities of BisClass and the binomial method using simulated data. Sensitivities are evaluated in a variety of parameter settings and plotted in (A)(H). Purple bars and green bars indicate the results from the binomial method and the BisClass, respectively. Bars with diagonal lines indicate the results from homogeneous methylomes and solid bars indicate those from regionally clustered methylomes.
Figure 4. Comparison of misclassification rates of nonmethylated CpGs via the BisClass and the binomial method using simulated data. 1specificities are evaluated in a variety of parameter settings and plotted in (A)(H). The Yaxis indicates the ratio of the number of misclassified nonmethylated CpGs to the total number of methylated CpGs. Purple bars and green bars are the results from the binomial method and the BisClass, respectively. Bars with diagonal lines imply the results from homogeneous methylomes and solid bars imply those from regionally clustered methylomes.
These simulation results demonstrate that, with the cutoff comparable to FDRcorrected qvalue < 0.05, BisClass exhibits a greater sensitivity and a comparable specificity compared to the binomial method. Overall, BisClass has a greater accuracy (calculated as the sum of (proportion of methylated sites) x sensitivity and (proportion of nonmethylated sites) x specificity) than the binomial method. To illustrate this further we evaluated the Area Under Curve (AUC) measure of the ROC (Receiver operating characteristic) under identical simulation settings, which is expected to provide a comprehensive comparison because it summarizes both sensitivity and specificity across all possible cutoff values [28]. This analysis (Additional file 3) demonstrates that the AUC values of BisClass are larger than those of the binomial method, especially in settings where the sequence coverage is low and DNA methylation occurs heterogeneously, i.e., settings closely resembling the observed patterns in the actual methylCseq data (Tables 1 and 2, Figure 2). Together these results indicate that BisClass provides superior results compared to the binomial method.
Additional file 3. Comparison of the AUC measures in simulated data sets. Parameter settings of the simulation are identical with those in the Figures 3 and 4 in the main text. AUC is generally higher for the BisClass compared to the Binomial method.
Format: DOCX Size: 1.7MB Download file
Application of Bisclass to MethylCSeq data
We applied the BisClass to the aforementioned honey bee data sets. We first estimated the parameters p_{0} and p_{1} using the EM algorithm. The results are shown in Table 2; all data had very low p_{0}, indicating that the error rates due to nonconversion are small. Importantly, the p_{0} estimated from EM are highly similar to the values provided by the authors using experimental methods (Table 2). The estimates of p_{1} values are around 70% for honey bee data sets. These are much lower than the estimate from the human genome (Table 2). The underlying cause for this discrepancy needs to be studied in future experiments.
The genomewide mean DNA methylation levels are inferred from the estimated p_{0} and p_{1} (Additional file 1). These are highly similar to, but slightly lower than, the fractional methylation levels ( in Table 2). Intuitively, because the nonconversion rate (p_{0}) is substantial and on par with the mean methylation levels (Table 2), the fractional methylation levels at the face value could overestimate the actual methylation levels. On the other hand, the fact that there may exist substantial levels of overconversion (1p_{1}) indicates that ignoring the effect of overconversion can lead to underestimate the overall methylation levels. For instance, if we assume p_{1} = 0.95 (near perfect conversion), the estimated global methylation level is much lower than fractional methylation (Table 2).
It is interesting to note that in the human data, the rate of overconversion (1p_{1}) is much lower than in the honey bee data. Nevertheless, due to the overconversion effectively underestimating the actual methylation levels, the observed fractional methylation levels may be underestimates of the true methylation levels in the human genome. Again, if we assume a better overconversion rate, the estimated global methylation level is closer to the observed fractional methylation levels (Table 2). Additional file 1 includes more detailed discussions on how these errors can affect estimation of DNA methylation differently in sparsely and heavily methylated genomes.
We then evaluated posterior odds of each site to classify each site as methylated or nonmethylated. Local information is obtained from 3 kb adjacent to the focal CpG site (1.5 kb on each side), and the weight parameter used is 0.5. The numbers of methylated and nonmethylated CpGs are shown in Table 2. In honeybee samples, BisClass detects on average 10% more methylated CpGs compared to the binomial method (Table 2). To determine whether this increase is due to high false positives or due to the improved inference, we investigated the difference between callings provided by the binomial method and the Bisclass methods further by several different approaches.
First, we found that many of these mCpGs detected by Bisclass are sites that are covered by a single C read that occur in highly methylated regions (Additional file 4). This improved detection is because while the binomial method cannot recognize any mCpGs covered by only a single read (e.g., Additional file 4), BisClass can provide methylation calling if that position occurs near other methylated CpGs. We demonstrate this property using two examples recovered from the data. The first example is the gene (GB16479) from two honey bee MethylCseq data sets (Figure 5). In this data, four cytosines cluster in a region with high overall methylation levels (the fractional methylation level of a 1000 bps encompassing these four sites is ~ 0.9 in both samples). In sample A, the four cytosines were covered by only single reads, all ‘C’s. In sample B, the same four cytosines were covered by two ‘C’ reads. The binomial method calls all cytosines in the first sample as ‘unmethylated’, while calling all four cytosines in the second sample as ‘methylated’. This example demonstrates the pitfalls of the binomial method clearly: two samples with exactly same qualitative information (100% ‘C’ reads in both cases) are classified as opposite directions due to the low sample size. BisClass, on the other hand, classified all four cytosines as ‘methylated’ for both cases. In the second example (Figure 6), we show the distribution of reads mapped to the locus GB 13135 in Herb et al. [7]. There are twelve samples in this data (six forager bees and six nurse bees). In the Forager 1 sample, the third and fifth positions are covered by single C reads. binomial method will call these as nonmethylated (Figure 6B). However, since these sites occur in a heavily methylated region, BisClass calls both of these sites as methylated (Figure 6B). In other samples, these sites are covered by more than one read. For example, in the Forager 6 sample, both positions third and five are covered by seven C reads, and consequently called as methylated CpGs. The similarity between different biological replicates indicates that using local information improves methylationcalling accuracy. FDRcorrected qvalues and posterior odds for each position of this locus are provided in the Additional file 5.
Additional file 4. Histogram of mCpG counts detected using the BisClass and the Binomial method. Red and blue bars are the results from the BisClass and the Binomial method, respectively. Xaxis indicates the coverage of each site and the Yaxis indicates the sum of methylated CpG counts in the 12 samples in Herb et al. [7].
Format: DOCX Size: 1.1MB Download file
Figure 5. The GB 16479 locus exhibits qualitatively identical information yet opposite methylation calling under the binomial method. Data are from unpublished methylCseq experiments of two honey bee individuals from the Yi lab, and are available upon request. All reads mapped to the four CpGs are ‘C’ reads (indicating 100% methylation). However, the binomial method provides a different methylation calls for these two samples. Specifically, the binomial calls all CpGs in the sample A as nonmethylated (white dots), and all CpGs in the sample B as methylated (black dots). BisClass correctly identifies identical methylation features in the two replicates.
Figure 6. Contrasting methylationcalling results of the GB 13135 locus in Herb et al. [[7]] data by the two methods. (A) The numbers of ‘C’ reads (brown) and ‘T’ reads (orange) in 8 CpG positions of GB13135. (B) Classification results following the binomial method (qvalue < 0.05) and the BisClass method (Odds ≥ 19). CpGs classified as methylated are shown as black dots and those classified as nonmethylated are shown as white dots. Sites with no read are marked as X. BisClass provides results that are more consistent across the biological replicates.
Additional file 5. qvalues and odds of 12 honeybee samples in GB13135 which is displayed in Figure 6.
Format: DOCX Size: 25KB Download file
Second, we did the following experiments to directly assess the difference between the binomial method and BisClass when the numbers of reads is reduced. We assumed that we could distinguish methylated and nonmethylated positions in coveragerich CpG sites. We selected CpGs with over 7 coverage from the honey bee scaffold 1.1. There were 9300 CpGs that satisfied this criterion. For these coveragerich sites, we considered those with < 10% ‘C’ reads as unmethylated, and > 30% ‘C’ reads as methylated. We then generated a new methylseq data set by randomly selecting only a single read from these sites, thereby artificially reducing the coverage. We then used the binomial method and BisClass for methylation calling. Since we have information on the true methylation status, we can directly assess the false positives and false negatives from this experiment. We also performed the same experiments for the coverages of two and three reads. Each experiment was repeated 100 times. The results of these analyses, shown in Additional file 6, demonstrate that BisClass is superior in these low coverage sites in the real data.
Additional file 6. Comparison of three accuracy measures (AUC, sensitivity and specificity) evaluated from the confirmation analyses. We used high coverage CpG sites and then reduced their coverages to 1 and analyzed how well each method performs. using reduced coverage honeybee data: the Xaxis indicates the number of reads. Definitions of sensitivity and specificity are identical with those used in the Figures 3 and 4. Violet bars imply results of the Binomial method and green bars imply results of the BisClass.
Format: DOCX Size: 170KB Download file
Third, we examined biological consistency across different methylCseq data sets. We compared the calling results across the biological replicates offered by Herb et al. [7]. BisClass yields methylation callings that are more consistent among biological replicates. First, the coefficient of variation (CV) of methylated CpG counts in 12 samples from BisClass (0.067) is less than half of the CV from the binomial method (0.150). Second, we calculated pairwise correlations of gene methylation levels between samples in each subtype (foragers and nurses). Correlations between individuals are much greater for BisClass than those via the binomial method (Figure 7). Based on the biological facts that methylation pattern are similar between individuals in the same species (e.g., Figure 2E), the observed higher correlations implies more realistic classification of DNA methylation via BisClass. We also note that in the binomial method, pairwise correlations are highly sensitive to the coverage levels. Specifically, nurse samples have more variable coverage than forager samples (Table 1), and the calling via binomial method is highly variable, in contrast to the more stable methylation calling from BisClass.
Figure 7. Correlations between biological replicates are higher in the BisClass calling compared to the binomial calling. The left panel represents the pairwise correlations between the methylation statuses of biological replicates in the forager samples from Herb et al. [7] data. The right panel represents the pairwise correlations between the nurse samples from the same study.
Conclusions
The development of the methylCseq method has propelled genomewide evaluation of DNA methylation in diverse genomes across the tree of life. Due to the nextgeneration sequencing nature of methylCseq, the information content at each position varies greatly. Given such constraints, statistical methods that can perform robustly, even when sequence coverage is low, are desired. The existing binomial method is prone to errors in low coverage sites, particularly in sparsely methylated genomes. Our approach solves this problem by explicitly incorporating local DNA methylation levels in a Bayesian framework. This is based upon the observation that methylated sites are locally clustered. By utilizing both global and local methylation information, we can obtain more biologically consistent and relevant information. BisClass is particularly wellsuited in the analyses of sparsely methylated genomes such as insect genomes.
Availability of supporting data
All datasets used in our study can be found in Gene Expression Omnibus (GEO) site (http://www.ncbi.nlm.nih.gov/geo webcite) using below accession IDs.
Honeybee DNA methylation data from reference [7]: GSE36650.
Honeybee DNA methylation data from reference [8]: GSE56399.
Human brain DNA methylation data from reference [15]: GSE37202.
Arabidopsis DNA methylation data from reference [29]: GSM276809.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
IH, TP, and SY conceived and designed the basic algorithms for BisClass. IH implemented and optimized the package. XY mapped the methylCseq data and performed some statistical tests. TP and SY supervised all aspects of the project. IH, TP and SY wrote and revised the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant (2012R1A3A2026438) and by the Bio & Medical Technology Development Program of the NRF grant (2013M3A9C4078158) to TP and by the Abell Faculty Development Fellowship from School of Biology of Georgia Tech and NSF grants (BCS1317195 and MCB0950896) to SVY. We thank Dr. Isabel Mendizabal and Dr. Thomas Keller for valuable comments and suggestions to the manuscript.
References

Jones PA: Functions of DNA methylation: islands, start sites, gene bodies and beyond.

Suzuki MM, Bird A: DNA methylation landscapes: provocative insights from epigenomics.

Grunau C, Clark SJ, Rosenthal A: Bisulfite genomic sequencing: systematic investigation of critical experimental parameters.

Gao F, Liu XS, Wu XP, Wang XL, Gong D, Lu H, Song Y, Wang J, Du J, Liu S, Han X, Tang Y, Yang H, Jin Q, Zhang X, Liu M: Differential DNA methylation in discrete developmental stages of the parasitic nematode Trichinella spiralis.

Hunt BG, Glastad K, Yi SV, Goodisman MAD: Patterning and regulatory associations of DNA methylation are mirrored by histone modifications in insects.

Wang X, Wheeler D, Avery A, Rago A, Choi JH, Colbourne JK, Clark AG, Werren JH: Function and Evolution of DNA Methylation in Nasonia vitripennis.

Herb BR, Wolschin F, Hansen KD, Aryee MJ, Langmead B, Irizarry R, Amdam GV, Feinberg AP: Reversible switching between epigenetic states in honeybee behavioral subcastes.

Lyko F, Foret S, Wolf S, Falckenhayn C, Maleszka R: The honey bee epigenomes: differential methylation of brain DNA in queens and workers.

Zeng J, Nagrajan HK, Yi SV: Fundamental diversity of human CpG islands at multiple biological levels.

Ziller MJ, Gu H, Muller F, Donaghey J, Tsai LTY, Kohlbacher O, De Jager PL, Rosen ED, Bennett DA, Bernstein BE, Gnirke A, Meissner A: Charting a dynamic DNA methylation landscape of the human genome.

Gavery MR, Roberts SB: Predominant intragenic methylation is associated with gene expression characteristics in a bivalve mollusc.

Vining KJ, Pomraning KR, Wilhelm LJ, Priest HD, Pellegrini M, Mockler TC, Freitag M, Strauss SH: Dynamic DNA cytosine methylation in the Populus trichocarpa genome: tissuelevel variation and relationship to gene expression.

Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, TontiFilippini J, Nery JR, Lee L, Ye Z, Ngo QM, Edsall L, AntosiewiczBourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolution show widespread epigenomic differences.

Zemach A, McDaniel IE, Silva P, Zilberman D: Genomewide evolutionary analysis of eukaryotic DNA methylation.

Zeng J, Konopka G, Hunt BG, Preuss TM, Geschwind D, Yi SV: Divergent wholegenome methylation maps of human and chimpanzee brains reveal epigenetic basis of human regulatory evolution.

Becker C, Hagmann J, Muller J, Koenig D, Stegle O, Borgwardt K, Weigel D: Spontaneous epigenetic variation in the Arabidopsis thaliana methylome.

Calarco JP, Borges F, Donoghue MT, Van Ex F, Jullien PE, Lopes T, Gardner R, Berger F, Feijo JA, Becker JD, Martienssen RA: Reprogramming of DNA methylation in pollen guides epigenetic inheritance via small RNA.

Devroye L, Györfi L, Lugosi G: A probabilistic theory of pattern recognition. New York: SpringerVerlab; 1996.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Dinh HQ, Dubin M, Sedlazeck FJ, Lettner N, Mittelsten Scheid O, von Haeseler A: Advanced methylome analysis after bisulfite deep sequencing: an example in Arabidopsis.

Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data via the EM Algorithm.

Storey JD: The positive false discovery rate: a bayesian interpretation and the qvalue.

Foret S, Kucharski R, Pellegrini M, Feng S, Jacobsen SE, Robinson GE, Maleszka R: DNA methylation dynamics, metabolic fluxes, gene splicing, and alternative phenotypes in honey bees.

LiByarlay H, Li Y, Stroud H, Feng S, Newman TC, Kaneda M, Hou KK, Worley KC, Elsik CG, Wickline SA, Jacobsen SE, Ma J, Robinson GE: RNA interference knockdown of DNA methyltransferase 3 affects gene alternative splicing in the honey bee.

Xi Y, Li W: BSMAP: whole genome bisulfite sequence MAPping program.

Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE: Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning.

Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, Haefliger C, Horton R, Howe K, Jackson DK, Kunde J, Koenig C, Liddle J, Niblett D, Otto T, Pettett R, Seemann S, Thompson C, West T, Rogers J, Olek A, Berlin K, Beck S: DNA methylation profiling of human chromosomes 6, 20 and 22.

Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms.

Lister R, O’Malley RC, TontiFilippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly Integrated SingleBase Resolution Maps of the Epigenome in Arabidopsis.