Northwestern University Biomedical Informatics Center (NUBIC), NUCATS, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA

Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA

The Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL 60611, USA

Center for Genetic Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA

Abstract

Background

High-throughput profiling of DNA methylation status of CpG islands is crucial for understanding the epigenetic regulation of genes. The microarray-based Infinium methylation assay by Illumina is one platform for low-cost high-throughput methylation profiling. Both Beta-value and M-value statistics have been used as metrics to measure methylation levels. However, there are no detailed studies of how the two are related or of their respective strengths and limitations.

Results

We demonstrate that the relationship between the Beta-value and M-value methods is a Logit transformation, and show that the Beta-value method has severe heteroscedasticity for highly methylated or unmethylated CpG sites. In order to evaluate the performance of the Beta-value and M-value methods for identifying differentially methylated CpG sites, we designed a methylation titration experiment. The evaluation results show that the M-value method provides much better performance in terms of Detection Rate (DR) and True Positive Rate (TPR) for both highly methylated and unmethylated CpG sites. Imposing a minimum threshold of difference can improve the performance of the M-value method but not the Beta-value method. We also provide guidance for how to select the threshold of methylation differences.

Conclusions

The Beta-value has a more intuitive biological interpretation, but the M-value is more statistically valid for the differential analysis of methylation levels. Therefore, we recommend using the M-value method for conducting differential methylation analysis and including the Beta-value statistics when reporting the results to investigators.

Background

Methylation of cytosine bases in DNA CpG islands is an important epigenetic regulation mechanism in organ development, aging and various disease states

To estimate the methylation status, the Illumina Infinium assay utilizes a pair of probes (a methylated probe and an unmethylated probe) to measure the intensities of the methylated and unmethylated alleles at the interrogated CpG site

Results

Definition of Beta-value and M-value

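The Beta-value is the ratio of the methylated probe intensity and the overall intensity (sum of methylated and unmethylated probe intensities). Following the notation used by the Illumina methylation assay, the Beta-value for the $i^{th}$ interrogated CpG site is defined as:

$$\mathrm{Beta}_i = \frac{\max(y_{i,methy}, 0)}{\max(y_{i,unmethy}, 0) + \max(y_{i,methy}, 0) + \alpha} \qquad (1)$$

where $y_{i,methy}$ and $y_{i,unmethy}$ are the intensities measured by the $i^{th}$ methylated and unmethylated probes, respectively. To avoid negative values after background adjustment, any negative values will be reset to 0. Illumina recommends adding a constant offset $\alpha$ (here, $\alpha = 100$) to the denominator.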

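The M-value is calculated as the log2 ratio of the intensities of the methylated probe versus the unmethylated probe, as shown in Equation 2:

$$M_i = \log_2\left(\frac{\max(y_{i,methy}, 0) + \alpha}{\max(y_{i,unmethy}, 0) + \alpha}\right) \qquad (2)$$

where $y_{i,methy}$ and $y_{i,unmethy}$ are the methylated and unmethylated probe intensities and $\alpha$ is a small constant offset added to both intensities.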

Here we slightly modified the definition given in
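The two definitions can be illustrated with a minimal Python sketch. The function names are hypothetical, and the M-value offset of 1 is an assumption chosen for illustration; only the Beta-value offset of 100 is stated in the text.

```python
import math


def beta_value(methy, unmethy, alpha=100.0):
    """Beta-value: methylated intensity over total intensity plus an offset."""
    m = max(methy, 0.0)    # negative intensities are reset to 0
    u = max(unmethy, 0.0)
    return m / (m + u + alpha)


def m_value(methy, unmethy, alpha=1.0):
    """M-value: log2 ratio of methylated to unmethylated intensity.

    alpha is an assumed small offset that keeps the ratio finite when an
    intensity is 0; the value 1.0 is illustrative.
    """
    m = max(methy, 0.0)
    u = max(unmethy, 0.0)
    return math.log2((m + alpha) / (u + alpha))


# Example with made-up probe intensities for a single CpG site.
print(beta_value(4500, 500))
print(m_value(4500, 500))
```

Both functions clamp negative intensities to 0 before the ratio is taken, mirroring the background-adjustment rule described for Equation 1.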

Relationship between Beta-value and M-value

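For Illumina methylation data, typically more than 95% of interrogated CpG sites have intensities ($y_{i,unmethy} + y_{i,methy}$) larger than 1000 (in our evaluation dataset, 99.8% of interrogated CpG sites had intensities higher than 1000). Therefore, the relatively small offset value (i.e., 100) in the denominator of Equation 1 has negligible effect on the Beta-value for most interrogated CpG sites. Similarly, the offset in Equation 2 has negligible effect on the M-value. Ignoring both offsets, the relationship between the Beta-value and the M-value is:

$$\mathrm{Beta}_i = \frac{2^{M_i}}{2^{M_i} + 1}; \qquad M_i = \log_2\left(\frac{\mathrm{Beta}_i}{1 - \mathrm{Beta}_i}\right) \qquad (3)$$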

Equation 3 indicates that the relationship is a logistic function (shown as a base 2 logarithm instead of natural logarithm). Figure

**The relationship curve between M-value and Beta-value**.
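The logit relationship between the two statistics can be checked numerically. The following is a minimal sketch that ignores the offsets, as the text does; the function names are hypothetical.

```python
import math


def beta_to_m(beta):
    """Base-2 logit: convert a Beta-value in (0, 1) to an M-value."""
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m):
    """Base-2 logistic: convert an M-value back to a Beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Print a few points on the curve, including a round trip back to Beta.
for beta in (0.05, 0.2, 0.5, 0.8, 0.95):
    m = beta_to_m(beta)
    print(f"Beta={beta:.2f}  M={m:+.3f}  back to Beta={m_to_beta(m):.2f}")
```

The curve is close to linear in the middle range and flattens toward Beta-values of 0 and 1, so equal steps in M correspond to progressively smaller steps in Beta at the extremes.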

Histograms of Beta-value and M-value

Figure

**The histograms of Beta-value (left) and M-value (right) (27578 interrogated CpG sites in total)**.

The distribution of standard deviation across different methylation levels

Many statistical methods used in high-throughput data analysis, such as canonical linear models or ANOVA, assume the data is

**The mean and standard deviation relations of technical replicates. Beta-value (left) and M-value (right)**.
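One way to see why the spread of Beta-values can depend on the methylation level is a first-order (delta-method) approximation. The sketch below assumes, purely for illustration, that the standard deviation is roughly constant on the M-value scale; that assumption and the function name are not taken from the figure.

```python
import math


def approx_beta_sd(beta, sd_m):
    """Approximate SD on the Beta scale implied by a given SD on the M scale.

    From Beta = 2^M / (2^M + 1), the slope is
    dBeta/dM = ln(2) * Beta * (1 - Beta).
    """
    return math.log(2) * beta * (1.0 - beta) * sd_m


# With a fixed (hypothetical) SD of 0.3 on the M scale, the implied SD on
# the Beta scale is largest at Beta = 0.5 and shrinks toward 0 and 1.
for beta in (0.02, 0.2, 0.5, 0.8, 0.98):
    print(beta, round(approx_beta_sd(beta, sd_m=0.3), 4))
```

Under this assumption the Beta-value variance is compressed at both ends of its range, which is the pattern one would expect from a bounded statistic.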

Performance comparison between Beta and M-values

Evaluation dataset

Titration data has been widely used to evaluate the performance of new methods for analyzing mRNA expression microarrays

As shown in Figure

Beta-value: low (0, 0.2), middle [0.2, 0.8] and high (0.8, 1).

M-value: low (-Inf, -2), middle [-2, 2] and high (2, Inf).
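The grouping above can be written as a small helper. This is a sketch with a hypothetical function name; the cut points are the ones listed above, and the two scales are consistent because a Beta-value of 0.2 or 0.8 corresponds to an M-value of -2 or 2 under the logit relationship.

```python
def methylation_group(value, scale="beta"):
    """Assign a CpG site to the low, middle or high methylation group."""
    if scale == "beta":
        low_cut, high_cut = 0.2, 0.8
    elif scale == "m":
        low_cut, high_cut = -2.0, 2.0
    else:
        raise ValueError("scale must be 'beta' or 'm'")
    if value < low_cut:
        return "low"
    if value > high_cut:
        return "high"
    return "middle"  # boundaries belong to the closed middle interval


print(methylation_group(0.05, "beta"))
print(methylation_group(-0.4, "m"))
print(methylation_group(3.1, "m"))
```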

Define differentially methylated CpG sites based on correlation

If an examined CpG site has a significant methylation difference between Sample A and B, its methylation profile should be correlated with the titration profile shown in Table

Design of the methylation titration experiment

| **% mix of A and B for each sample** | **Mix1** | **Mix2** | **Mix3** | **Mix4** | **Mix5** |
|---|---|---|---|---|---|
| **A** | 100 | 90 | 75 | 50 | 0 |
| **B** | 0 | 10 | 25 | 50 | 100 |
| **N_{tech}*** | 2 | 2 | 1 | 1 | 2 |

* N_{tech} represents the number of technical replicates
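The correlation criterion can be sketched as follows, using the percentage of Sample A in each mix from the titration design as the titration profile. The cut-off of 0.9 on the absolute Pearson correlation and the function names are hypothetical, chosen only to make the example concrete.

```python
import math

# Percentage of Sample A in Mix1..Mix5, taken from the titration design.
TITRATION_A = [100, 90, 75, 50, 0]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def follows_titration(profile, min_abs_r=0.9):
    """True if a per-mix methylation profile tracks the titration profile.

    min_abs_r is an assumed threshold, not a value taken from the paper.
    """
    return abs(pearson(TITRATION_A, profile)) >= min_abs_r


# Made-up profiles (one value per mix, replicates already averaged).
print(follows_titration([0.90, 0.82, 0.70, 0.50, 0.10]))
print(follows_titration([0.50, 0.52, 0.49, 0.51, 0.50]))
```

The absolute value is used so that sites more methylated in Sample B than in Sample A are treated the same way as the reverse.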

Performance comparison based on differential methylation analysis

One of the major statistical paradigms in expression microarray analysis has been the "Fold change-ranking with a non-stringent p-value cutoff"

Following a similar logical framework, we first used a simple t-test to compare two technical replicates of Sample A and two technical replicates of Sample B, and required a differentially methylated CpG site to have p-value < 0.05. We then separated these filtered CpG sites into the three analysis groups listed in the "Evaluation dataset" subsection: low (2221 CpG sites for Beta-value; 2794 CpG sites for M-value), middle (6855 CpG sites for Beta-value; 6179 CpG sites for M-value) and high (457 CpG sites for Beta-value; 625 CpG sites for M-value) methylation analysis groups. In each analysis group, we sorted the CpG sites in decreasing order based on their absolute methylation difference between Sample A and B, i.e., $|\bar{A}_i - \bar{B}_i|$, where $\bar{A}_i$ and $\bar{B}_i$ are the mean Beta-values or M-values of Samples A and B at the $i^{th}$ CpG site. We then evaluated the performance of each method by selecting the top-ranked CpG sites as an evaluation set and calculating its True Positive Rate (TPR), defined as the percentage of TP CpG sites in the evaluation set, i.e., $|TP \cap CpG_{detected}| / |CpG_{detected}|$, where $CpG_{detected}$ represents the CpG sites included in the evaluation set. We also calculated the Detection Rate (DR) for each evaluation set, where DR was defined as the percentage of detected TP CpG sites among all TP CpG sites, i.e., $|TP \cap CpG_{detected}| / |TP|$.
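The ranking and the two rates can be illustrated with a short sketch. The function names, site identifiers and values are hypothetical; the TPR and DR are computed as set ratios following the verbal definitions above.

```python
def rank_by_abs_difference(mean_a, mean_b):
    """Sort site ids by decreasing |mean(A) - mean(B)|."""
    return sorted(mean_a, key=lambda s: abs(mean_a[s] - mean_b[s]), reverse=True)


def tpr_and_dr(ranked_sites, true_positives, n):
    """TPR and DR for the evaluation set made of the top n ranked sites."""
    if n < 1 or not true_positives:
        raise ValueError("need n >= 1 and a non-empty TP set")
    detected = set(ranked_sites[:n])
    hits = len(detected & set(true_positives))
    tpr = hits / len(detected)        # fraction of detected sites that are TPs
    dr = hits / len(true_positives)   # fraction of all TPs that were detected
    return tpr, dr


# Made-up mean M-values for four sites in Samples A and B.
mean_a = {"cg1": 3.0, "cg2": 0.5, "cg3": -2.0, "cg4": 1.0}
mean_b = {"cg1": -1.0, "cg2": 0.4, "cg3": 0.5, "cg4": 0.2}
ranked = rank_by_abs_difference(mean_a, mean_b)
print(ranked)
print(tpr_and_dr(ranked, {"cg1", "cg3", "cg9"}, n=2))
```

Sweeping `n` from 1 to the size of the analysis group traces out a TPR versus DR curve for one method and one methylation group.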

**Performance comparisons of Beta- and M-value in the range of low, middle and high methylation levels based on the relationship of 1 - Detection Rate versus True Positive Rate**.

Refinement of the basic differential methylation analysis

Similar to other hybridization techniques, there is an inherent level of variability associated with sample preparation, sample loading, the microarrays and the detectors. To address this variability, it is very common to add a "minimum difference threshold" to filter out CpG sites with little difference between two biological conditions. We next evaluated the performance of the Beta-value and M-value statistics when a minimum difference threshold is included in addition to the p-value requirement.

After imposing a difference threshold, the identified differentially methylated CpG sites will have p-values < 0.05 and have the mean methylation level difference between A and B samples larger than the difference threshold. Figure

**Performance comparisons of Beta and M-value based on the True Positive Rate (TPR) and Detection Rate (DR) at different thresholds of methylation difference**. (A) TPR versus threshold of difference of Beta-value; (B) TPR versus threshold of difference of M-value; (C) DR versus threshold of difference of Beta-value; (D) DR versus threshold of difference of M-value.
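The combined selection rule (p-value below 0.05 and mean difference above a threshold) can be sketched as a simple filter. The record layout, function name and threshold values below are hypothetical and only illustrate the rule.

```python
def select_sites(sites, min_diff, p_cutoff=0.05):
    """Keep sites with p < p_cutoff and |mean(A) - mean(B)| > min_diff."""
    return [
        s["id"]
        for s in sites
        if s["p"] < p_cutoff and abs(s["mean_a"] - s["mean_b"]) > min_diff
    ]


# Made-up per-site summaries on the M-value scale.
sites = [
    {"id": "cg1", "p": 0.001, "mean_a": 3.0, "mean_b": -1.0},
    {"id": "cg2", "p": 0.030, "mean_a": 0.5, "mean_b": 0.3},
    {"id": "cg3", "p": 0.200, "mean_a": -2.0, "mean_b": 1.5},
]
print(select_sites(sites, min_diff=0.0))  # p-value filter only
print(select_sites(sites, min_diff=1.0))  # p-value plus difference threshold
```

Raising `min_diff` shrinks the selected set, so the TPR and DR at each threshold can be recomputed on the filtered list to reproduce the kind of comparison shown in panels (A) to (D).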

Figure

Discussion

The Beta-value method has already been widely used to calculate methylation levels, and it is the manufacturer-recommended method for analyzing Illumina Infinium HumanMethylation27 BeadChip microarrays. The M-value method has been widely used in expression microarray analysis, and has been used to calculate methylation levels in some methylation microarray analyses

To compare the performance of the Beta- and M-value methods in identifying differentially methylated CpG sites, we designed a methylation titration experiment. As we do not know the 'true' methylated CpG sites, we defined a set of True Positives (TPs) based on high levels of correlation between the methylation and titration profiles. It is important to note that some true differentially methylated CpG sites may not be included in this set of TPs; at the same time, some false positives may also be included in the TPs. Fortunately, although a small number of false positives or false negatives will affect the estimation of TPRs and DRs, it does not affect the overall performance comparison between the two methods (we ran simulations that randomly added or removed 10% of the TPs, and found that the performance difference between Beta- and M-values was consistent with the curves shown in Figure

In microarray differential analysis, adding a difference (or fold-change) threshold is another common practice and effective way to improve the TPR. However, due to the severe

Conclusions

The Beta-value method has a direct biological interpretation - it corresponds roughly to the percentage of a site that is methylated. This makes the Beta-value very attractive when modeling the underlying biological effect. However, this interpretation is an approximation

Methods

Titration Samples

Similar to the titration design using GoldenGate methylation chips by Bibikova et al

DNA Methylation Profiling using Illumina Infinium BeadChip Microarrays

The DNA samples were prepared following the guidelines suggested by the manufacturer (Illumina, Inc.), and then measured with the Illumina Infinium HumanMethylation27 BeadChip, which measures 27578 CpG sites. The HumanMethylation27 BeadChip contains a pair of methylated and unmethylated probes designed for each CpG site. All experiments were conducted following the manufacturer's protocols by the Genomics Core at Northwestern University. The Illumina BeadChips were scanned with an Illumina BeadArray Reader and then preprocessed by the Illumina GenomeStudio software. Raw data have been deposited in the NCBI GEO database under accession number GSE23789.

We used the Bioconductor

Authors' contributions

PD and SML conceived the idea of this paper. PD conducted all data analysis and drafted the manuscript. LH and SML supervised the methylation project. CH participated in all discussions of data analysis and manuscript revisions. SML, PD, LH and WAK designed the titration experiment. XZ performed the titration experiment. All authors participated in the project at different stages, discussed the results and commented on the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We appreciate the very constructive critique and insightful comments of the reviewers. This work was supported in part by the NIH award 1RC1ES018461-01 to LH. PD, SML and WAK acknowledge the support of P30CA060553 and UL1RR025741. We would like to thank Vivi Frangidakis for conducting the Illumina BeadChip experiments and Leming Shi for discussing the "FC-ranking" paradigm. We would also like to acknowledge other participants in the "DNA Methylation Alterations in Response to Pesticides Exposure" project meetings for their input and support: Hehuang Xie, Min Wang, Yue Yu and Marcelo Bento Soares.