Abstract
With the availability of very dense genomewide maps of markers, multiple testing has become a major difficulty for genetic studies. In this context, the falsediscovery rate (FDR) and related criteria are widely used. Here, we propose a finite mixture model to estimate the local FDR (lFDR), the FDR, and the false nondiscovery rate (FNR) in variancecomponent linkage analysis. Our parametric approach allows empirical estimation of an appropriate null distribution. The contribution of our model to estimation of FDR and related criteria is illustrated on the microarray expression profiles data set provided by the Genetic Analysis Workshop 15 Problem 1.
Background
In the context of genetic studies for which highdensity genetic maps are now widely available, a major multiple testing problem arises due to the large number of statistical tests that are performed simultaneously. In a recent study, Morley et al. [1] analysed microarray geneexpression data together with genomewide single nucleotidepolymorphism (SNP) genotyping in 14 threegeneration families to localize the genetic determinants underlying geneexpression variability (data provided for Genetic Analysis Workshop 15 (GAW 15) Problem 1). For the genomewide linkage analysis, the authors calculated a nonparametric HasemanElston statistic and used the genomewide significance thresholds proposed by Lander and Kruglyak [2] to identify linked loci. Thus, they controlled the classical familywise error rate (FWER), i.e., the probability of falsely rejecting at least one null hypothesis.
Although the FWER is the oldest extension of the classical type I error rate, FWERbased procedures are often too conservative, particularly when numerous hypotheses are being tested [3]. As an alternative and less stringent error criterion, Benjamini and Hochberg introduced, in their seminal paper published in 1995 [4], the falsediscovery rate (FDR), which is defined as the expected proportion of false discoveries among all discoveries (here, a discovery refers to a rejected null hypothesis). The opposing criterion, the false nondiscovery rate (FNR), corresponds to the expected proportion of false nondiscoveries among all the nonrejected null hypotheses [5].
More recently, Efron et al. introduced the local FDR (lFDR) [6], which can be interpreted as a variant of the Benjamini and Hochberg's FDR that gives each tested null hypothesis its own "measure of significance". While the FDR is defined for a whole rejection region, the lFDR is defined as the probability that a null hypothesis is true conditional on a particular value of the test statistic. As discussed by Efron [7], the local nature of the lFDR is advantageous for interpreting results from individual test statistics. Moreover, the FDR can be estimated directly from the lFDR [6].
Efron proposed an empirical Bayes' procedure [7,8] to estimate the lFDR without any assumption about the distribution under the alternative hypothesis. From this procedure, only an upper bound estimate can be obtained for the lFDR and, indirectly, a lower bound for the FNR. One important feature of this approach is that it considers an empirical rather than theoretical null distribution. Indeed, as noted by Efron, these distributions may be very different and strong arguments support using the empirical one in genetic studies for which extensive data are available [5].
In this work, and for variancecomponent linkage analysis, we introduced a twocomponent mixture model based approach that allows estimation of lFDR, FDR, and FNR. We illustrate the contribution of our model to the analysis of real GAW15 data. Our results highlight the importance of correctly estimating the null distribution through the proposed mixture model based approach.
Methods
Consider the variancecomponent linkage analysis between a particular phenotype (here, the expression level of a defined gene) and a specific marker. The null hypothesis of no linkage (additive genetic variance due to the studied quantitative trait locus (QTL) equals zero) is tested by comparing the likelihood of this restricted model with that of a model in which the variance is estimated. Under the null hypothesis, the theoretical asymptotic distribution of the likelihoodratio statistic X is a 50:50 mixture of a χ^{2 }and a point mass at 0 [9]. When testing n markers, n likelihoodratio statistics X_{i}, (i = 1,...,n) are available, with each X_{i }following either the null or the alternative distribution.
For modeling of the marginal distribution of X, we consider the following twocomponent mixture model, in which the marginal cumulative distribution F_{X }of X is:
where ω_{c }is the mixing proportions for the c components (c = 1, 2; ω_{c }∈ [0, 1]; ω_{1 }+ ω_{2 }= 1). Here, c = 1 corresponds to the null hypothesis component and c = 2 to the alternative hypothesis component, respectively. The parameter θ ∈ [0, 1] is the weight of the point mass at 0 for the null hypothesis component.
In this model, the conditional distributions F_{c}(xα_{c}, β_{c}) are gamma distributions with parameters α_{c }and β_{c}, where α_{c }is the mean and α_{c}/β_{c }the variance of the distribution. Here, we impose that α_{1 }<α_{2}.
As discussed in the Background, the empirical distribution under the null hypothesis can be very different from the theoretical distribution [8]. Therefore, we decided to not consider theoretical values (θ = 1/2, α_{1 }= 1, and β_{1 }= 1/2) for the first component distribution parameters but rather to estimate them. For the second component, we used a gamma distribution, which represents a convenient and parsimonious way to model the nonnull distribution.
Parameters of interest are inferred by sampling from their joint posterior distributions using Monte Carlo Markov chain (MCMC) samplers implemented in WinBUGS software [9]. All results presented correspond to 25,000 sweeps of MCMC algorithms following a burnin period of 25,000 sweeps (period required to achieve algorithm stability). Convergence is checked by visual inspection of the curve of the plots for the different parameters of the mixtures.
For each marker, the posterior probabilities of belonging to the null hypothesis can be estimated directly from the algorithm output, using empirical averages. These probabilities are natural estimates of the lFDR for each marker. They can be used to compute modelbased estimates of the observed FDR and FNR (conditionally to the data) [10,11].
Results
We started from the cell intensity files (*.CEL) obtained from the GeneChip^{® }Human Genome Focus Array Hgfocus [12] that provide geneexpression measurements of 8794 probe sets for 276 samples. We normalized and summarized those measurements using the robust multiarray average (RMA) method proposed by Irizarry et al. [13]. A multipoint variancecomponent linkage analysis was performed with MERLIN [14] on the normalized phenotypes using all 194 individuals from the 14 Centre d'Etude du Polymorphisme Humain (CEPH) families and the 2819 autosomal SNP data. Using the proposed mixture model, we then estimated the lFDR at each marker, and FDR and FNR. Here, we present only the results obtained for the following 10 genes discussed in the article by Morley et al. [1]: CHI3L2, DDX17, PSPHL, IL16, HOMER1, ALG6, CBR1, TNFRSF11A, TGIF, and DSCR2.
Table 1 gives the estimated parameters of the twocomponent mixture model for the expression of each of the 10 genes (phenotypes). The estimated values of the null distribution parameters differed markedly from the theoretical values. For the 10 selected genes, the maximal differences between the theoretical and empirical values were: 0.11 for θ (PSPHL), 1.96 for α_{1 }(DDX17), and 1.08 for β_{1 }(ALG6). For example, Figure 1 illustrates the histogram distribution of the (nonnull) observed likelihoodratio statistic X, and superimposed theoretical null hypothesis, marginal and null hypothesis densities estimated from the mixture model for the DDX17 gene. The marked difference between the theoretical and estimated null distributions strongly supports the use of the estimated null distribution rather than the theoretical one. As noted by Efron [8], these differences can substantially affect any simultaneous inference (including FDR estimation and FWER control). It is worth noting that when the FWER is controlled at 5% with a classical Bonferroni procedure, the pvalues for the DDX17 gene calculated from the theoretical null distribution yielded 52 significant results, while the pvalues calculated from the estimated null distribution gave only 13 significant results. In this example, considering the theoretical null distribution clearly tended to overestimate the number of significant results.
Table 1. Estimated parameters of the twocomponent mixture model for each of the ten genes analyzed
Figure 1. Histogram distribution of the (nonnull) observed likelihood ratio statistic, theoretical null hypothesis density, and marginal and null hypothesis densities estimated from the mixture model for the DDX17 gene.
Summary statistics calculated from the full output of the MCMC algorithm (after discarding the burnin samples) provide information on the posterior probabilities of belonging to the null hypothesis component. Using these estimates, probabilistic classification of the data (in terms of discoveries and nondiscoveries) can be obtained concomitantly with the estimations of FDR and FNR [10,11]. Herein, we decided to consider as discoveries (linkage) the markers with posterior probabilities below a threshold value, which can be different for each phenotype and was chosen to ensure 5% FDR. Figure 2 shows the estimated posterior probabilities (equivalent to the lFDR) along the 22 chromosomes for the 10 phenotypes. Meanwhile, the estimated FNR ranged from 23% (PSPHL) to 28% (HOMER1) (data not shown). The selected markers with an lFDR estimate below the defined threshold are plotted in red. These selected markers differed substantially from those obtained by Morley et al. [1]. For example, we found multiple cisacting and transacting regulators for DDX7 and IL16, while Morley et al. [1] found only cisacting regulators for these genes.
Figure 2. Estimated posterior probabilities (lFDR) for the 10 selected genes along the 22 chromosomes. Significant results at FDR threshold 0.05 are plotted in red.
However, it is difficult to directly compare the two approaches because the selection strategies rely on completely different criteria. Moreover, it is worth noting that while the Bonferroni procedure depends on the order of the pvalues, our procedure depends on the order of the posterior probability (lFDR) values, and the two can be completely different.
Conclusion
Herein we described a mixture model based approach to estimate FDR, FNR, and lFDR in the context of variance component linkage analyses. This approach allows the selection process to take into account both false positives and false negatives. Moreover, it provides an estimate of the empirical null distribution, which is a key component for any simultaneous inference procedure.
Indeed, in many situations, the empirical null distribution deviates from the theoretical one [8], leading to incorrect statistical inferences and resulting decisions. Traditional estimating methods in linkage analysis used simulation approaches in which marker alleles were randomly dropped from the genealogies. When markers are numerous or pedigrees are complex, that method can become very burdensome, with computations requiring several days of running time. New genetic studies for which large amounts of data are available open new opportunities by allowing the estimation of appropriate null and alternative densities without resorting to simulations. Hence, our approach is much easier to handle because examination of each of the different phenotypes analysed required less than 1 hour of computer time. It is important to note that this approach can be extended by incorporating different null distribution parameters for a set of phenotypes in a single model. In conclusion, we think that new insights on linkage analysis using genomewide technologies might emerge from a mixture modelbased approach.
Competing interests
The author(s) declare that they have no competing interests.
Acknowledgements
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/1?issue=S1.
References

Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genomewide variation in human gene expression.
Nature 2004, 430:743747. PubMed Abstract  Publisher Full Text

Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results.
Nat Genet 1995, 11:241247. PubMed Abstract  Publisher Full Text

Hochberg Y, Tamhane A: Multiple Comparison Procedures. New York: John Wiley & Sons, Inc; 1987.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Genovese CR, Wasserman L: Operating characteristics and extensions of the false discovery rate rrocedure.
J R Stat Soc Ser B 2002, 64:499518. Publisher Full Text

Efron B, Tibshirani R, Storey J, Tusher V: Empirical Bayes analysis of a microarray experiment.
J Am Stat Assoc 2001, 96:11511160. Publisher Full Text

Efron B: Local false discovery rates. [http://wwwstat.stanford.edu/~brad/papers/False.pdf] webcite

Efron B: Largescale simultaneous hypothesis testing: the choice of a null hypothesis.
J Am Stat Assoc 2004, 99:96104. Publisher Full Text

Spiegelhalter D, Thomas A, Best N, Lunn D: WinBUGS User manual Version 1.4.1. [http://www.mrcbsu.cam.ac.uk/bugs] webcite

Newton MA, Noueiry A, Sarkar D, Ahlquist P: Detecting differential gene expression with a semiparametric hierarchical mixture method.
Biostatistics 2004, 5:155176. PubMed Abstract  Publisher Full Text

McLachlan GJ, Bean RW, BenTovim Jones L: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays.
Bioinformatics 2006, 22:16081615. PubMed Abstract  Publisher Full Text

Affymetrix website [http://www.affymetrix.com] webcite

Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data.
Biostatistics 2003, 4:249264. PubMed Abstract  Publisher Full Text

Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees.
Nat Genet 2002, 30:97101. PubMed Abstract  Publisher Full Text