Abstract
Background
DNA methylation profiles differ among disease types and, therefore, can be used in disease diagnosis. In addition, largescale whole genome DNA methylation data offer tremendous potential in understanding the role of DNA methylation in normal development and function. However, due to the unique feature of the methylation data, powerful and robust statistical methods are very limited in this area.
Results
In this paper, we proposed and examined a new statistical method to detect differentially methylated loci for case control designs that is fully nonparametric and does not depend on any assumption for the underlying distribution of the data. Moreover, the proposed method adjusts for the age effect that has been shown to be highly correlated with DNA methylation profiles. Using simulation studies and a real data application, we have demonstrated the advantages of our method over existing commonly used methods.
Conclusions
Compared to existing methods, our method improved the detection power for differentially methylated loci for case control designs and controlled the type I error well. Its applications are not limited to methylation data; it can be extended to many other case–control studies.
Keywords:
Nonparametric method; Onesided test; Combining pvalueBackground
Essential for normal development and an influence on a variety of processes related to DNA integrity and function, DNA methylation plays an important role for epigenetic gene regulation in both development and disease [1]. It is associated with processes including genomic imprinting, Xchromosome inactivation, suppression of repetitive elements, and carcinogenesis [24]. Aberrant DNA methylation, such as hypomethylation of oncogenes and hypermethylation of tumor suppressor genes, leads to genomic instability and tumorigenesis [510].
With the development of highthroughput platforms, genomewide analysis of largescale DNA methylation patterns and their impacts on gene regulation have received a significant amount of attention. We are proposing an ageadjusted nonparametric method to detect differentially methylated loci that can account for age effects that has advantages over existing methods the limitations of which we explain next. Typically, methylation levels in Illumina methylation assays are quantified in terms of the βvalue calculated from the intensity ratio of methylated (M) to unmethylated (U) alleles. Specifically, where M and U are the intensities of red and green dyes, respectively, for the GoldenGate and VeraCode Methylation assays, or the signals A and B, respectively, for the Illumina assay. The striking feature of the βvalues is that they are continuous and range from 0 (totally unmethylated) to 1 (fully methylated).
With more and more DNA methylation data generated from the highthroughput DNA methylation platforms, powerful and efficient statistical methods to handle these complex data are becoming highly demanded. One of the important research topics in this area is to detect differentially methylated loci in case and control studies. The commonly used methods for this purpose are the Student’s ttest and linear regression. Recently, a number of modelbased approaches have been proposed in the literature. Siegmund [11] introduced a Bernoullilognormal mixture model to perform clustering analysis on methylation data generated using MethyLight. Houseman [12] proposed a βmixture model to classify different tissue types using a recursivepartitioning algorithm for highdimensional data from Illumina arrays. Wang [13] developed a likelihood based uniformnormalmixture model to select differentially methylated loci between case and control groups using the Illumina array. The basic idea of Wang [13] is to describe the data using a threecomponent mixture model and treat completed methylated, unmethylated, and partially methylated loci differently. Wang [13] showed that, compared to the Student’s ttest under some situations, their method increases detection power [13]. However, the aforementioned methods assume that the methylation profiles follow some special distributions that are known in terms of a finite number of parameters. Obviously if the underlying true distribution is far from the proposed ones, such assumptions will lead to biased results in practice.
Another complexity of the methylation study comes from the presence of other potential confounders such as age effects. As shown in [1417], age is by far the strongest demographic risk factor for cancer, and there is substantial evidence that aging affects DNA methylation of specific loci, including cancerrelated genes. Based on these observations, it is necessary to adjust age effects in the analysis of detecting differentially methylated loci. If the relationship between the methylation and age is more complex than a linear one, a traditional linear regression leads to inaccurate results. To solve this problem, Chen [16] proposed a method by first dividing the samples into several age groups and then combined the pvalues obtained from each individual group together to form a new test. This method has been shown to increase the power in contrast to the traditional ttest without age adjustment or regression model with age adjusted linearly. However, within each group, a pvalue is obtained from a simple ttest that is less robust and thus leaves room for improvement.
In this paper, we propose and examine a novel means to detecting differentially methylated loci and, that is, an ageadjusted nonparametric method that can account for age effects, given that substantial evidence exists that aging affects DNA methylation of specific loci, including cancerrelated genes. Our method stems from the rankbased nonparametric methods that focus on a comparison of two entire empirical distribution functions rather than only two means. More specifically, we first divide the subjects into several age groups; then for each group, a nonparametric test is performed on each locus, and an individual pvalue is reported. An overall pvalue for each locus is estimated through combining all the individual pvalues computed previously for that locus in different age groups. Our method does not depend on any distribution assumption but rather adjusts for age effects in an efficient way. We demonstrate the powerfulness of our method using both simulated and real datasets.
Methods
Assume all the subjects are from K different age groups. In the k^{th} group, k = 1,…,K, there are n_{1k} control subjects and n_{2k} case subjects. For each DNA methylation marker, let y_{ijk,}i = 1,…,n_{jk}, j = 1,2, k = 1,…,K, denote the observation of βvalue for the i^{th} subject in j^{th} treatment (1 for control and 2 for case) and k^{th} age group. To adjust the age influence on the methylation level, the linear model takes the form
Where a_{j} and b_{k} are regression coefficients and ε_{ijk} follows a i.i.d normal distribution. The normality assumption is clearly invalid for the βvalues which have limited range between 0 and 1. Moreover, the relationship between the βvalue and age is likely to be more complicated than a linear one. Therefore more powerful and robust nonparametric methods are desirable.
Here we propose a new nonparametric method which does not depend on any distribution assumption and meanwhile allows for the adjustment of covariates. We process as follows. For each age group k (k = 1,…,K), we test the difference between the control group y_{i1k}(i = 1,…,n_{1k}) and the case group y_{i2k}(i = 1,…,n_{2k}). Our goal is to test whether or not the two methylation groups follow the same distribution. Toward this end, the Wilcoxon rank sum test is a useful tool when there are reasons to believe that the outcome variables of interest may fail certain distributional assumptions required for parametric methods. However, as discussed in Baumgartner [18], Wilcoxon rank sum test is not suitable for situations where the expected values of the two populations are close to each other. To overcome this problem, they proposed a more powerful nonparametric test to handle the general twosided twosample problem [18]. Neuhaeuser further extended the twosided twosample test to a onesided test that can detect if one population is stochastically larger than the other [19]. Our pvalue calculation for each age group is based on Neuhaeuser’s onesided test whose statistics can be explicitly formulated as
where
Here G_{i}, i = 1,2,…, n_{1k} and H_{j},j = 1,2,…,n_{2k} are the ranks of the samples from the k^{th} case and control groups, respectively. Due to the intractable asymptotic distribution for the test statistics B, it is hard to find a close form for the relationship between pvalue and B. We will use numerical fit to approximate this relationship. We first obtain the empirical distribution of B based on 10^{7} permutations and then fit the distribution function piecewise exponentially to obtain the empirical relationship. The pvalue for the k^{th} age group can be calculated according to this empirical formula.
As a consequence, we have K pvalues from the leftsided test, denoted by p_{lk}(k = 1,…,K), and K pvalues from the rightsided test, denoted by p_{rk} = 1p_{lk}(k = 1,…,K). Then combining K leftsided pvalues together gives a statistic
Similarly, combining K rightsided pvalues together gives a statistic
Under the null hypothesis of no difference between the two treatment groups, p_{lk} and p_{rk} are uniform [0,1] random variables for k = 1,…,K. Therefore, according to Fisher [20], both T_{l} and T_{r} will follow a chisquare distribution with degree of freedom 2 K. We define a new variable:
then we have [21]
Thus, for small α, we can approximate the pvalue of T by its upper bound 2α as
More details can be found in [2123]. For large α, we will fit a formula empirically through permutation. We call the above proposed method “combined BaumgartnerWeißSchindler (BWS) test”.
Results and discussion
Empirical fit of the pvalue formula for onesided BWS test
The asymptotic distribution function of B is complex and in practice permutation method is often used. The permutation results depend on the sample size. But as mentioned in Baumgartner [18], for a twosided test, the asymptotic distribution can be approximated by the permutation method quite well even with a small sample size (as small as 10). We first derive the empirical formula to fit the asymptotic distribution using the permutation method. Toward this end, we sample two subpopulations from the same distribution (e.g. standard normal), each of which has a sample size of 30. Then, the whole populations are permuted 10^{7} times, and a onesided BWS test statistic B is calculated for each permuted sample. Then we fit the empirical cumulative distribution of B using a piecewise exponential function to arrive at the following empirical formula
The node points we used here are 1.5 and 9. We find that the choice of node points has very little influence on the final analysis results for both simulated and real data. Note that the sample size we used for deriving this formula is 30. We have also tried some other choices and found that the results are quite similar. Thus, we recommend the above formula to be used in practice for problems with a sample size larger than 10.
Simulation results
The first simulation settings are for the evaluation of the type I error rate for the proposed method. For the purposes of comparison, we also include the results from the combined ttest proposed in [16], linear regression and combined Wilcoxon methods for all simulated and real datasets. We assume that there are 6 age groups, and each group includes 100 subjects, 50 controls and 50 cases. For each age group, we also assume the βvalues follow a threecomponent mixture distribution as in Wang [13]. Let τ_{1} and τ_{2} be the two threshold values. The first and the second components are uniform distributions and . The third component is a truncated normal distribution . The probabilities for a measurement to fall into these three components are π_{1}, π_{2}, π_{3} respectively. Under the null hypothesis, the two treatment groups are sampled from the same distribution. The mean of the truncated normal distribution is taken to be μ + k*δ_{μ} for the k^{th} age group. The simulation is repeated 10000 times. Table 1 lists the proportion of times that null is rejected using the four different methods under different parameter settings. Table 1 shows that the nominal type I error rate of 0.05 is well controlled by all methods.
Table 1. Estimated Type I error rates at significant level 0.05 based on the four methods under different parameter settings of the threecomponent mixture distributions with τ_{1} = 0.1, τ_{2} = 0.9 and δ_{μ} = 0.05
The second simulation settings are for assessing the power of the proposed method under alternative hypothesis, i.e. the case and control subjects are sampled from different distributions. We still assume that the βvalues follow the similar threecomponent mixture distributions. Two scenarios are considered here. In the first scenario, we use different means for different treatment groups. More specifically, we let the truncated normal mean for the control sample be constant μ, but for the case sample vary as µ + k * δ_{μ} for the k^{th} age group. We replicate the simulation 10000 times for each scenario. The power is defined as the proportion of times that the pvalue is less than 0.05. The first two rows in Table 2 list the results for this scenario. In the second scenario, we let the two threshold values vary as τ_{1} + kδ_{τ} and τ_{2}kδ_{τ} for the k^{th} case group but keep them constants as τ_{1} and τ_{2} for all control groups. The results are listed in the last two rows of Table 2. For the first scenario, the mean values are different for the case and control groups. As expected, all four methods have increasing powers as the signal increases. For the second scenario, the expected values are the same, but the variances are different between the two treatment groups. In this situation, our proposed method is more powerful in detecting the difference than the other three methods. Therefore, the proposed method can detect not only the location difference but also the scale difference between the two distributions.
Table 2. Estimated powers of the four methods at significant level 0.05 under different parameter settings for the threecomponent mixture distributions with τ_{1} = 0.1, τ_{2} = 0.9
The third settings assume that the βvalues for both the case and control subjects follow the betadistributions. Let s_{1} = s_{2} = 4. For the case group, the βvalues are sampled from a betadistribution Beta(s_{1} + δ, s_{2} − δ). For the control group, the βvalues are sampled from a betadistribution Beta(5s_{1} + δ, 5s_{2} − δ). Here δ takes six different values, 3λ, 2λ, λ, 0.5λ, λ, and 1.5λ, one for each age group.
Based on the above setting, it can be easily shown that the mean of the distribution for the case group is while the mean of the distribution for the control group is The mean difference between the two treatment groups is which will increase with δ. Table 3 lists the empirical powers of four methods for different λ values. In all situations, the most powerful method is combined DWS. Since the variance of two distributions are different, even in situation where δ = 0, the power of the proposed method can still reach 0.67, whereas the other three methods have no power at all. As λ decreases, the mean differences become smaller and make it more difficult to distinguish between the two treatment groups. As expected, the powers decrease for all methods as λ decreases. For small λ, the power from the combined DWS method is much bigger than those from the other three methods. The performances of the combined ttest and combined Wilcoxon methods are quite similar, and both are much better than the linear regression method.
Table 3. Change of the power with λ for four different methods when the distributions are Beta(s_{1} + δ, s_{2} − δ) and Beta(5 s_{1} − δ, 5 s_{2} + δ) for the case and control groups; δ takes the values of 3λ, 2λ, λ, 0.5λ, λ, and 1.5λ for the six age groups and s_{1} = s_{2} = 4
Real data results
We also applied our proposed method to the United Kingdom Ovarian Cancer Population Study (UKOPS) [15] to select differentially methylated loci between ovarian cancer cases and healthy controls. The data were created using Illumina Infinium Human Methylation27 Beadchip and downloaded from the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo webcite) under the accession number GSE19711. There were 274 control samples and 131 pretreatment case samples, and the number of loci was 27578. For the data quality control, we removed 29 patients (15 controls and 14 treatments) with BS conversion efficiency value < 4000 or coverage rate < 95% [15]. For each treatment group, the samples were divided into 6 age groups (50–55, 55–60, 60–65, 65–70, 70–75, and 75 and over). We further removed 12 patients in the pretreatment group whose ages exceeded 78 since there were no such patients in the control group. The final sample size for each individual group is the same as the one listed in Table 2 of Chen [16] except that the “75 and over” group has 13 pretreatment samples instead of 25. This dataset was analyzed by both Wang [13] and Chen [16] in their papers. Wang [13] did not consider the age effect, and only 96 cases and 136 controls were included in their analysis after further screening; while Chen [16] included the 12 patients with ages exceeding 78; thus their results are different from ours even though the same method was used.
Figure 1 shows the scatter plots for each locus based on the negative log10 pvalues derived from four different methods. Figure 1 (a) plots the results for the combined DWS test and the linear regression method, Figure 1 (b) plots the results for the combined DWS test and combined ttest, Figure 1 (c) plots the results for the combined DWS test and combined Wilcoxontest. From Figure 1 (a), it can be seen that most of the loci with small pvalue in the linear regression tend to have small pvalue in our proposed method as well. However, there exist many loci whose pvalues are large in the linear regression but small and significant in the combined DWS test, i.e., those points in Figure 1 (a) with xvalues close to zero but yvalues large than 3. This indicated that our proposed method is more powerful than the linear regression method in detecting the differentially methylated loci. Similar conclusions can be drawn from Figure 1 (b) and Figure 1 (c) for the comparison with the combined ttest and combined Wilconxon test respectively.
Figure 1. Scatter plots for negative log10 pvalues based on different methods. The left panel is for the comparison between combined DWS and linear regression. The middle panel is for the comparison between combined DWS and combined ttest. The right panel is for the comparison between combined DWS and combined Wilcoxon test.
Table 4 lists the number of the loci detected by each of the four tests based on four different significant levels, 10^{3}, 10^{4}, 10^{5}, 10^{6}. Clearly at the same significant level, in terms of the number of significant loci detected, the most powerful method is combined DWS, the next one is combined Wilcoxon test, and then combined ttest and linear regression. Table 4 also reports the numbers of significant loci obtained by pairs of the proposed test and each of the other three methods for given significance levels. For example, when the cutoff pvalue is 10^{3}, the combined BWS, the linear regression, the combined ttest, and the combined Wilcoxon obtained 3387, 2038, 2754, and 3143 significant loci, respectively. However, among those 3387 loci that have pvalues less than 10^{3} using the combined BWS method, there were 1884, 2659, and 3081 loci overlapping with the linear regression, the combined ttest, and the combined Wilcoxon methods, respectively. In other words, there were 1503, 728, and 306 significant loci were only obtained by the proposed test, but not by the linear regression, the combined ttest, and the combined Wilcoxon, respectively. In contrast, at the same significance level 10^{3}, there were only 154, 95, and 62 loci whose pvalues from the new method are larger than 10^{3} but less than 10^{3} from the linear regression, the combined ttest, and the combined Wilcoxon, respectively. Therefore most of the loci detected by the other three methods are also detected by the proposed method. However, for the same cutoff level, there were many loci that were significant in the proposed method but not in the other three methods, a point that clearly demonstrated the advantages of our method over those three.
Table 4. Number of loci with pvalues less than the given cutoff significance levels from different methods
Discussion
To study whether or not the proposed method can control type I error rate as well, we created pseudo case and control samples. The way we did this was to first randomly divide the original control subjects into two parts for each age group. Then we put one part into the new pseudocontrol group and the other one into the new pseudocase group. The distribution of pvalues from applying the proposed method to this new case–control data set is shown in Figure 2 (a). It is very close to uniform distribution, and this finding is further confirmed by the qqplot against the uniform [0,1] distribution as illustrated in Figure 2(b). Therefore, our method increased the detection power while it simultaneously controlled the type I error rate.
Figure 2. Test the distribution of pvalues by applying the proposed method to a newly created case–control data based on the samples from the original control group. The left panel is for the histogram and the right is a qqplot against the uniform [0,1] distribution.
In this paper we chose different cutoff pvalues to compare the performance of the proposed test with others. We did not consider the multiple testing issue, which is an important but difficult task for this area [4] and other areas where a large number of correlated variables are tested simultaneously [2428]. The traditional correction methods for multiple comparisons, such as Bonferroni correction, are inappropriate here as they are too conservative due to the fact that many loci from the same gene are highly positively correlated. To account for the correlations among loci, methods based on the concept of “effective number” may be adopted [29].
There are many ways to combine pvalues from independent tests [20,3032]. In this paper, we chose to use Fisher test due to its robustness. It is possible, however, that other methods may be more powerful under some certain conditions. The combined pvalue method used in this paper is based on the assumption that the test for every individual age group is independent from each other. However, if this assumption is questionable, the current combined pvalue method needs to be modified such that it can handle the correlations among the individual tests as well. This is another research topic we will pursue in a future paper.
Conclusions
In case–control methylation studies, the underlying distribution of the βvalues is rarely known in advance, and clearly the normality assumption is not valid. Parametric models can be powerful if the assumptions for the models are valid, but they can also lead to biased results if the underlying true distribution is far deviated from the imposed ones. Thus, it is desirable to choose a powerful yet robust statistical test that does not depend on any underlying distribution assumption. In this article we proposed and examined a rankbased nonparametric method to detect differentially methylated loci. Through simulation, we showed that our proposed method is as powerful as the linear regression, ttest and Wilcoxon rank sum test methods if the mean differences between the two treatment groups are large. However, our method substantially outperformed the others in situations where the mean differences between the two groups were small while the variance differences were large.
Note that the age effects are strongly associated with methylation, and the ignoring age effect will cause a loss of power or a large number of false positives. Another advantage of the proposed method over many existed ones is that it combined the nonparametric test together with age adjustment. Our next goal was to generalize our method to adjust for more confounders other than the age such that it can have a much wider application in methylation research.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HH devised the basic idea, analyzed the data and drafted the manuscript; ZC devised the basic idea and analyzed the data. XH participated in the study design, discussion and edit the manuscript. All authors read and approve the final manuscript.
Acknowledgements
The authors would like to thank Ms. Kim Lawson for her professional editorial assistance, which highly improved the presentation of this paper. ZC also thanks the support from the faculty research funds awarded by the School of Public Health at the Indiana University Bloomington.
References

Baylin SB, Ohm JE: Epigenetic gene silencing in cancer–a mechanism for early oncogenic pathway addiction?
Nat Rev Cancer 2006, 6(2):107116. PubMed Abstract  Publisher Full Text

Kuan PF, Wang S, Zhou X, Chu H: A statistical framework for Illumina DNA methylation arrays.
Bioinformatics 2010, 26(22):2849. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Gräf S, Tomazou EM, Bäckdahl L, Johnson N, Herberth M: An integrated resource for genomewide identification and analysis of human tissuespecific differentially methylated regions (tDMRs).
Genome Res 2008, 18(9):15181529. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuan PF, Chiang DY: Integrating prior knowledge in multiple testing under dependence with applications to detecting differential DNA methylation.
Biometrics 2012. Publisher Full Text

Feinberg AP, Tycko B: The history of cancer epigenetics.
Nat Rev Cancer 2004, 4(2):143153. PubMed Abstract  Publisher Full Text

Jabbari K, Bernardi G: Cytosine methylation and CpG, TpG (CpA) and TpA frequencies.
Gene 2004, 333:143149. PubMed Abstract  Publisher Full Text

Jones PA, Baylin SB: The fundamental role of epigenetic events in cancer.
Nat Rev Genet 2002, 3(6):415428. PubMed Abstract  Publisher Full Text

Kulis M, Esteller M: DNA methylation and cancer.
Adv Genet 2010, 70:2756. PubMed Abstract  Publisher Full Text

Laird PW: Principles and challenges of genomewide DNA methylation analysis.
Nat Rev Genet 2010, 11(3):191203. PubMed Abstract  Publisher Full Text

Xu GL, Bestor TH, Bourc'his D, Hsieh CL, Tommerup N, Bugge M, Hulten M, Qu X, Russo JJ, ViegasPéquignot E: Chromosome instability and immunodeficiency syndrome caused by mutations in a DNA methyltransferase gene.
Nature 1999, 402(6758):187191. PubMed Abstract  Publisher Full Text

Siegmund KD, Laird PW, LairdOffringa IA: A comparison of cluster analysis methods using DNA methylation data.
Bioinformatics 2004, 20(12):18961904. PubMed Abstract  Publisher Full Text

Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK: Modelbased clustering of DNA methylation array data: a recursivepartitioning algorithm for highdimensional data arising as a mixture of beta distributions.
BMC Bioinformatics 2008, 9(1):365. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Wang S: Method to detect differentially methylated loci with case–control designs using Illumina arrays.
Genet Epidemiol 2011, 35(7):686694. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Christensen BC, Houseman EA, Marsit CJ, Zheng S, Wrensch MR, Wiemels JL, Nelson HH, Karagas MR, Padbury JF, Bueno R: Aging and environmental exposures alter tissuespecific DNA methylation dependent upon CpG island context.
PLoS Genet 2009, 5(8):e1000602. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Teschendorff AE, Menon U, GentryMaharaj A, Ramus SJ, Weisenberger DJ, Shen H, Campan M, Noushmehr H, Bell CG, Maxwell AP: Agedependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer.
Genome Res 2010, 20(4):440446. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen Z, Liu Q, Nadarajah S: A new statistical approach to detecting differentially methylated loci for case control Illumina array methylation data.
Bioinformatics 2012, 28(8):11091113. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen Z, Huang H, Liu J, Ng HKT, Nadarajah S, Huang X, Deng Y: Detecting differentially methylated loci for Illumina Array methylation data based on human ovarian cancer data.
BMC Med Genomics 2013, 6(Suppl 1):S9. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Baumgartner W, Weiß P, Schindler H: A nonparametric test for the general twosample problem.
Biometrics 1998, 54:11291135. Publisher Full Text

Neuhäuser M: Exact tests for the analysis of case–control studies of genetic markers.
Hum Hered 2002, 54(3):151156. PubMed Abstract  Publisher Full Text

Fisher RA: Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd; 1932.

Owen AB: Karl Pearson's metaanalysis revisited.
Ann Statist 2009, 37(6B):38673892. Publisher Full Text

Chen Z, Ng HKT: A robust method for testing association in genomewide association studies.
Hum Hered 2012, 73(1):2634. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen Z: A new association test based on Chi‐square partition for case‐control GWA studies.
Genet Epidemiol 2011, 35(7):658663. PubMed Abstract  Publisher Full Text

Chen Z, Huang H, Ng HKT: Testing for association in case–control genomewide association studies with shared controls.
Stat Methods Med Res 2013.
First published on February 1, 2013
Publisher Full Text 
Chen Z, Huang H, Ng HKT: Design and analysis of multiple diseases genomewide association studies without controls.
Gene 2012, 510(1):8792. PubMed Abstract  Publisher Full Text

Chen Z, Liu J, Ng HKT, Nadarajah S, Kaufman HL, Yang JY, Deng Y: Statistical methods on detecting differentially expressed genes for RNAseq data.
BMC Syst Biol 2011, 5(Suppl 3):S1. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Chen Z, McGee M, Liu Q, Kong YM, Huang X, Yang JY, Scheuermann RH: Identifying differentially expressed genes based on probe level data for GeneChip arrays.
Int J Comput Biol Drug Des 2010, 3(3):237257. PubMed Abstract  Publisher Full Text

Chen Z, Liu Q, McGee M, Kong M, Huang X, Deng Y, Scheuermann RH: A gene selection method for GeneChip array data with small sample sizes.
BMC Genomics 2011, 12(Suppl 5):S7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Chen Z, Liu Q: A New approach to account for the correlations among single nucleotide polymorphisms in genomewide association studies.
Hum Hered 2011, 72(1):19. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chen Z: Is the weighted z‐test the best method for combining probabilities from independent tests?
J Evol Biol 2011, 24(4):926930. PubMed Abstract  Publisher Full Text

Chen Z, Nadarajah S: Comments on ‘Choosing an optimal method to combine p‐values’ by Sungho Won, Nathan Morris, Qing Lu and Robert C. Elston, Statistics in Medicine 2009; 28: 1537–1553.
Stat Med 2011, 30(24):29592961. PubMed Abstract  Publisher Full Text

Lancaster H: The combination of probabilities: an application of orthonormal functions.
Austral J Statist 1961, 3:2033. Publisher Full Text