
Kerfdr: a semi-parametric kernel-based approach to local false discovery rate estimation

Abstract

Background

The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, at the same time, to the multiple-testing problem. As an alternative to the overly conservative Family-Wise Error Rate (FWER), the False Discovery Rate (FDR) has emerged over the last ten years as more appropriate to handle this problem. One drawback of the FDR, however, is that it is attached to a given rejection region for the considered statistics, attributing the same value to statistics close to the boundary of that region and to those far from it. The local FDR has therefore been proposed recently to quantify the specific probability for a given null hypothesis to be true.

Results

In this context we present a semi-parametric approach based on kernel estimators, which we apply to different kinds of high-throughput biological data such as patterns in DNA sequences, gene expression and genome-wide association studies.

Conclusion

The proposed method has several practical advantages over existing approaches: it accommodates complex heterogeneities under the alternative hypothesis, it can take prior information into account (from expert judgment or previous studies) through a semi-supervised mode, and it deals with truncated distributions such as those obtained from Monte-Carlo simulations. This method has been implemented and is available through the R package kerfdr via the CRAN or at http://stat.genopole.cnrs.fr/software/kerfdr.

Background

Multiple-testing problems occur in many bioinformatics studies where we consider a large set of biological objects (genes, SNPs, DNA patterns, etc.) and want to test a null hypothesis H for each object. Typically, H may be 'the expression level of the gene is not affected by the treatment' or 'the pattern is as frequent as expected in the observed DNA sequence'. The control of the number of false positives, i.e. falsely rejected hypotheses, is the crucial issue in multiple testing. To this end, several error rates, such as the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR), have emerged and various strategies to control these criteria have been developed (see [1] for a review).

In the last decade, the FDR criterion introduced in [2] has received the greatest attention, as it is less conservative than the FWER. The FDR is defined as the mean proportion of false positives among the list of rejected hypotheses. It is therefore a global criterion that cannot be used to assess the reliability of a specific hypothesis, i.e. that of a given gene, SNP or pattern.

More recently, a strong interest has been devoted to the local version of the FDR, called 'local FDR' [3] and denoted hereafter ℓFDR. The idea is to quantify the probability for a given null hypothesis to be true. Although many different strategies have been designed to estimate the ℓFDR, some of them based on the estimation of the FDR itself [4], most rely on a mixture model assumption [5], which is a general and statistically convenient framework: the score (test statistic, p-value) on which the testing procedure is based follows a mixture distribution depending on the unobserved status of the hypothesis (true or false). Different approaches have been proposed: fully parametric [6-9], semi-parametric [10], Bayesian [11, 12] or empirical Bayes [3].

The semi-parametric approach developed by [10] uses the knowledge of the distribution f0 of the score under the null hypothesis to provide a flexible non-parametric estimate of the distribution under the alternative hypothesis (denoted f1). However, some important questions remain partially addressed, or not addressed at all, in this reference.

In this paper we provide an implementation of the method with several important and practical generalizations. The Results and Discussion Section recalls the theoretical framework underlying our method, the properties of the estimation algorithm as well as the main steps of its implementation.

Performance is then studied via simulations and compared to other existing methods. Finally, applications to various bioinformatic data sets, such as gene expression, DNA sequence patterns and genome-wide associations, are carried out and presented to the reader.

Results and discussion

Semi-parametric mixture model

Our estimation of the local FDR (ℓFDR) relies on the semi-parametric mixture model proposed in [10]. We have at our disposal n hypotheses {H_i}_{i=1,...,n} that we want to test. Suppose that an unknown proportion π0 of them are true nulls. For each hypothesis, we define a random variable H_i that equals 0 if it is under H0 (true null hypothesis) and 1 under H1 (false null). For each H_i, we compute a score denoted by X_i (a p-value for example). We assume that these scores are independent and identically distributed, with mixture distribution

$$f(x) = \pi_0 f_0(x) + \pi_1 f_1(x), \qquad (1)$$

where π1 = 1 - π0 denotes the proportion of false null hypotheses, f0 denotes the probability density function (pdf) of the scores under H0 and f1 the pdf of the scores under H1. Note that f0 is completely specified. For instance, if X_i is the p-value of a Student statistic, f0 is the uniform distribution on [0, 1]. If any transformation (probit or log) is applied, f0 remains completely known. On the contrary, f1 always needs to be estimated, and so does π0.

In our framework, the ℓFDR is defined as the probability that H_i = 0 given the observed value x_i of the score X_i:

$$\ell\mathrm{FDR}(x_i) \overset{\mathrm{def}}{=} \tau_i = \Pr[H_i = 0 \mid X_i = x_i] = \frac{\pi_0 f_0(x_i)}{f(x_i)}.$$

This quantity may be interpreted as a measure of how likely it is that the hypothesis at hand would be falsely rejected.

Since f1 is unknown, we use the following (non-parametric) kernel estimator for a given bandwidth h > 0

$$\hat{f}_1(x) = \left[\sum_{i=1}^{n} \frac{H_i}{h}\, k\!\left(\frac{x - X_i}{h}\right)\right] \Big/ \left(\sum_{j=1}^{n} H_j\right), \qquad (2)$$

in which we replace the unknown H_i's by their conditional expectations E[H_i | X_i] = Pr[H_i = 1 | X_i] = 1 - τ_i.

These expectations are themselves estimated thanks to

$$\hat{\tau}_i = \hat{\pi}_0 f_0(x_i) / \hat{f}(x_i), \qquad (3)$$

where π̂0 is a given estimator of the unknown proportion and f̂(x) = π̂0 f0(x) + (1 - π̂0) f̂1(x). Thus, we obtain

$$\hat{f}_1(x) = \left[\sum_{i=1}^{n} \frac{1 - \hat{\tau}_i}{h}\, k\!\left(\frac{x - X_i}{h}\right)\right] \Big/ \left(n - \sum_{j=1}^{n} \hat{\tau}_j\right). \qquad (4)$$

As the estimates (3) of the τ_i's and the estimate (4) of f1 depend on each other, we alternate their computation until convergence, which is proved in [10].
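For illustration, a minimal R sketch of this alternating scheme is given below, assuming a Gaussian kernel, a fixed bandwidth h and a pre-estimated π0 (all assumptions of the sketch; the actual kerfdr implementation replaces the quadratic inner sum by an FFT-based convolution, see Methods):

    # Minimal sketch, not the package code: f0 is the known null density
    # (e.g. dunif for raw p-values), pi0 a pre-estimated null proportion.
    lfdr_iterate <- function(x, f0, pi0, h, n_iter = 50) {
      tau <- rep(pi0, length(x))       # initial guess for Pr[H_i = 0 | x_i]
      for (it in seq_len(n_iter)) {
        w <- 1 - tau                   # weights of the alternative component
        # weighted Gaussian-kernel estimate of f1, Eq. (4)
        f1 <- sapply(x, function(xi) sum(w * dnorm((xi - x) / h)) / (h * sum(w)))
        tau <- pi0 * f0(x) / (pi0 * f0(x) + (1 - pi0) * f1)   # Eq. (3)
      }
      tau
    }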

Implementation

The method may require applying a transformation to the sample of p-values (optional), estimating the proportion of null hypotheses (π0), determining an optimal value for the bandwidth (h) used in the kernel estimator and computing the estimate of f1. These technical points are further developed and discussed in the Methods section.

Moreover, the corresponding R package allows simple and straightforward use. For instance, for a given sample of p-values pv, the command try = kerfdr(pv) returns the estimates of π0 and of the ℓFDR in try$pi0 and try$localfdr respectively. In addition, the running time is very short thanks to an efficient implementation using convolution through fast Fourier transforms, and a list of customizable options, such as the choice of π0, h or the kernel function, is available for more advanced users. The complete R code and a pseudo-R code of kerfdr are available on the webpage.
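A minimal usage example (only kerfdr(pv), try$pi0 and try$localfdr are documented above; the simulated p-values are purely illustrative):

    library(kerfdr)
    pv  <- c(rbeta(100, 0.5, 10), runif(900))   # illustrative: 10% signal, 90% null
    try <- kerfdr(pv)
    try$pi0        # estimated proportion of true null hypotheses
    try$localfdr   # estimated local FDR for each p-value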

Practical generalizations

Semi-supervised cases

Prior information is actually available in many experiments. Among all the null hypotheses to be tested, some are known to be true (control genes in microarray experiments) while others are known to be false (test genes in spike-in settings). Such knowledge is easily taken into account in the estimation procedure described previously: the τ_i's that are known a priori are kept fixed throughout the steps of the algorithm. They contribute to the estimation of f1 in Eq. (4), but are not updated in Eq. (3).
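As an illustration, the iterative sketch given earlier can be adapted as follows (a hypothetical variant written from this description, not the package's internal code):

    # 'fixed' flags hypotheses of known status; 'tau_fixed' holds their known
    # local FDR values (1 for known true nulls, 0 for known false nulls).
    lfdr_semisup <- function(x, f0, pi0, h, fixed, tau_fixed, n_iter = 50) {
      tau <- rep(pi0, length(x))
      tau[fixed] <- tau_fixed
      for (it in seq_len(n_iter)) {
        w  <- 1 - tau                  # known hypotheses still contribute to Eq. (4)
        f1 <- sapply(x, function(xi) sum(w * dnorm((xi - x) / h)) / (h * sum(w)))
        tau_new <- pi0 * f0(x) / (pi0 * f0(x) + (1 - pi0) * f1)
        tau[!fixed] <- tau_new[!fixed] # Eq. (3) leaves the known tau_i's untouched
      }
      tau
    }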

Truncation

Let us now suppose that we have at hand data truncated to an interval I = [a, b]. By 'truncated', we mean that the support of the p-value distribution is strictly smaller than [0, 1]. For instance, if B denotes the number of Monte-Carlo simulations, p-values smaller than 1/B are often truncated to 0. How does this affect our method?

In order to deal with densities, the restrictions of f0, f1 and f to I need to be normalized. Denoting by q0, q1 and q the corresponding normalization factors, the mixture definition gives:

$$q = \int_I f(x)\,dx = \pi_0 \underbrace{\int_I f_0(x)\,dx}_{q_0} + \pi_1 \underbrace{\int_I f_1(x)\,dx}_{q_1}$$

Unlike q0, q1 cannot be easily computed since f1 is unknown. Fortunately, we can estimate q from a sample X_1,..., X_n of non-truncated data using

$$\hat{q} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{X_i \in I}$$

from which we derive

$$\hat{q}_1 = \frac{\hat{q} - \pi_0 q_0}{\pi_1}$$

One should note that this estimator does not necessarily belong to [0, 1]. To overcome this, we replace its value by 0 if q̂1 < 0 and by 1 if q̂1 > 1.

For example, if the p-values are estimated through Monte-Carlo using B = 500 simulations, the smallest non-null p-value is 1/B = 0.002 and I = [0.002, 1.000]. Let us assume that among a set of n = 1,000 p-values, 54 are equal to 0, π0 = 0.9 and π1 = 0.1. We hence have q̂ = (n - 54)/n = 946/1000 and, as q0 = 1 - 1/B = 499/500 = 0.998, we easily get q̂1 = (0.946 - 0.9 × 0.998)/0.1 = 0.478.
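In R, this worked example reduces to a few lines:

    B <- 500; n <- 1000; n0 <- 54            # n0 p-values truncated to 0
    pi0 <- 0.9; pi1 <- 0.1
    q_hat  <- (n - n0) / n                   # 0.946
    q0     <- 1 - 1 / B                      # 0.998
    q1_hat <- (q_hat - pi0 * q0) / pi1       # 0.478
    q1_hat <- min(max(q1_hat, 0), 1)         # clip to [0, 1] as described above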

Simulation study

A comparison with other ℓFDR estimation methods is provided in [10]. It shows that the semi-parametric approach we propose performs as well as the empirical Bayes approach [13] and the Gaussian mixture model [8] when the distributions f1 and f0 are well separated. However, it outperforms them in more difficult situations, especially in terms of stability. We focus here on the particular cases described below (semi-supervised and truncation) that are not handled by the aforementioned methods.

Simulation design

We simulated sets of p-values according to the mixture model (1), where f0 is the uniform distribution over [0, 1]. We considered 4 different proportions of false null hypotheses (1 - π0 = 0.01, 0.05, 0.1 and 0.3) and 2 different means μ for the p-values coming from the alternative distribution f1 (μ = 0.01 and 0.001). f1 is either an exponential distribution with mean μ (rate 1/μ) or a uniform distribution over [0, 2μ]. The exponential distribution can provide values greater than one, so a beta distribution as used in [6] could appear more appropriate; however, this event occurs very rarely for the chosen values of μ. For each of the 4 × 2 × 2 = 16 configurations, S = 500 samples of size n = 1,000 were generated.
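The following R sketch shows how one such sample can be generated (our reading of the design above; clipping the rare exponential draws above 1 back to 1 is an assumption of the sketch):

    simulate_pvalues <- function(n = 1000, pi1 = 0.05, mu = 0.01,
                                 shape = c("exponential", "uniform")) {
      shape <- match.arg(shape)
      h1 <- rbinom(n, 1, pi1) == 1               # latent status of each hypothesis
      p  <- runif(n)                             # null p-values ~ U[0, 1]
      if (shape == "exponential") {
        p[h1] <- rexp(sum(h1), rate = 1 / mu)    # mean mu under the alternative
      } else {
        p[h1] <- runif(sum(h1), 0, 2 * mu)       # mean mu as well
      }
      pmin(p, 1)                                 # assumption: clip rare draws > 1
    }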

For each proportion π0 and distribution f1, the ℓFDR τ_i of the i-th p-value has a theoretical expression that can be computed. Denoting by τ̂_i^s the local FDR estimate of the i-th p-value for simulation s (s = 1,..., S), the performance of the method is assessed by means of the root mean square error

$$\mathrm{RMSE}(\pi_0, f) = \frac{1}{S}\sum_{s}\sqrt{\frac{1}{n}\sum_{i}\left(\hat{\tau}_i^{\,s} - \tau_i\right)^2}.$$

The smaller the RMSE, the better the performance.
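In R, with tau_hat an S × n matrix of estimates (one row per simulated sample, an assumed layout) and tau the vector of true values, the criterion reads:

    rmse <- function(tau_hat, tau) {
      mean(apply(tau_hat, 1, function(t_s) sqrt(mean((t_s - tau)^2))))
    }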

Semi-supervised

To see how prior information improves the estimation of the ℓFDR, we randomly select some hypotheses for which the status is considered known. The proportion κ of these hypotheses is fixed, so that the true value of their local FDR is also known (and equal to either 0 or 1). Figure 1 shows that even a small proportion (κ = 1% or 5%) of known hypotheses significantly improves the ℓFDR estimation.

Figure 1: Semi-supervised. Root Mean Square Error (RMSE) between the true local FDR τ and its estimates as a function of the proportion 1 - π0 (log-log scale). Proportion of known hypotheses: κ = 0 (dotted), 1% (cross), 5% (asterisk), 10% (circle) and 50% (square). Top: exponential shape for f1. Bottom: uniform shape. Left: μ = 0.001. Right: μ = 0.01. The variance of the RMSE lies between 1e-4 and 5e-4 with 500 simulations.

Truncation

For purposes of comparison, we truncate p-values at a given threshold p* (p* = 10^-2, 10^-3) and compare the generalized method, which takes the truncation into account, with the naive one, in terms of the RMSE criterion. In Figure 2, the original non-truncated p-values provide a reference that cannot be outperformed. We see that the correction improves the quality of the estimates, especially when the truncation is severe (p* = 10^-2), and that the corrected estimates can be almost as good as the best achievable.

Figure 2: Truncation. Root Mean Square Error (RMSE) between the true local FDR τ and its estimates as a function of the proportion 1 - π0 (log-log scale). Truncation: p* = 0 (untruncated: asterisk), 10^-3 (circle), 10^-2 (cross). Estimation: naive (dotted), corrected (solid). Top: exponential shape for f1. Bottom: uniform shape. Left: μ = 0.001. Right: μ = 0.01. The variance of the RMSE lies between 1e-4 and 5e-4 with 500 simulations.

Applications

Gene expression data

As a first illustration, we apply our method to the classical example of Hedenfalk [14], in which the expression levels of n = 3,226 genes are studied. The aim is to compare patients with two different breast cancers: BRCA1 (7 patients) and BRCA2 (8 patients), corresponding to two different gene mutations predisposing to the disease. We use the modified t-test statistic proposed in [15], which avoids false positives due to bad variance estimates.

Applying our method, we obtain a proportion of null genes of π̂0 = 66.4%, which is consistent with the proportion estimated in [8] (π̂0 = 65%). Figure 3 displays the estimated densities: although the proportion of modified genes is quite high (1 - π̂0 = 33.6%), the local FDR is lower than 1% for only 5 genes, and below 5% for only 69. This shows that the local FDR is an efficient tool to reduce the type-I error-rate in difficult cases.

Figure 3: Gene expression: estimated densities for the Hedenfalk dataset. The expression levels of n = 3,226 genes for 7 BRCA1 and 8 BRCA2 patients (corresponding to two different gene mutations predisposing to the disease) are studied [14]; p-values are computed using the modified t-test statistic proposed in [15].

The choice of the bandwidth is known to be a crucial step in density estimation problems. In this example, we selected a bandwidth of 0.27. To check the influence of this choice on the results, we tried several values of h between 0.20 and 0.35. Figure 4 shows that the estimated local FDR is not sensitive to this choice.

Figure 4: Gene expression: sensitivity of local FDR estimates to the choice of the bandwidth. h takes the values 0.20 (dotted), 0.27 (dashed) and 0.35 (solid); local FDR values are given in log10 scale.

DNA sequence patterns

It is well known that most biological patterns in DNA sequences have unusual frequencies due to selection mechanisms. It is hence natural to search for new functional patterns among those whose number of occurrences is statistically significant. To do so, it is classical to adopt a testing framework where the null hypothesis is that the DNA sequence is generated according to an order-m0 Markov model (the parameters of this Markov model are usually estimated on the observed sequence).

We consider here the complete genome of the pathogenic bacterium Mycoplasma genitalium (575 kb), on which we estimate an order m = 3 homogeneous Markov model. For each of the 4^6 = 4,096 oligomers (DNA words) of length 6, we compute the exact expectation E[N] and standard deviation √V[N] of its frequency N, from which we derive the z-score:

$$Z = \frac{N^{\mathrm{obs}} - \mathbb{E}[N]}{\sqrt{\mathbb{V}[N]}} \underset{H_0}{\sim} \mathcal{N}(0, 1)$$

where N^obs is the observed frequency of the oligomer in the genome.

Thanks to a simple CLT argument, the distribution of Z is approximately standard Gaussian under the null hypothesis. It is hence possible to use this approximation either by working directly with the z-scores or by computing the two-sided p-value associated with each observation:

$$p\text{-value} = \Pr[\mathcal{N}(0,1) < -|Z|] + \Pr[\mathcal{N}(0,1) > +|Z|]$$
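By symmetry of the Gaussian distribution this equals 2Φ(-|Z|), e.g. in R:

    z <- 2.5                                     # an illustrative z-score
    p <- pnorm(-abs(z)) + pnorm(abs(z), lower.tail = FALSE)
    all.equal(p, 2 * pnorm(-abs(z)))             # TRUE: the two tails are equal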

The natural approach is to estimate the densities from the p-values (Figure 5), where all the 'exceptional' oligomers (under- and over-represented) accumulate on the left side of the resulting density. But the flexibility of our method allows us to perform the estimation directly on the z-scores (Figure 6), taking into account their bimodal distribution under H1 and distinguishing the oligomers that are under-represented (on the left side of the resulting density) from those that are over-represented (on the right side). While both strategies provide the same estimate of the proportion of 'null' oligomers (π̂0 = 57.3%), the ℓFDR estimates differ noticeably, in particular for the oligomers that are over-represented (data not shown).

Figure 5: Patterns in DNA sequences: estimated densities for all 4,096 oligomers of size 6 using p-values. We consider the complete genome of the pathogenic bacterium Mycoplasma genitalium (575 kb); for each of the 4^6 = 4,096 oligomers of length 6, we compute the exact expectation E[N] and standard deviation √V[N] of its frequency N, from which we derive the z-score and the corresponding p-value.

Figure 6: Patterns in DNA sequences: estimated densities for all 4,096 oligomers of size 6 using z-scores. This is the same dataset as in Figure 5, except that the local FDR is estimated directly from the z-scores instead of the p-values, resulting in a bimodal density for f1.

Quality control in genome-wide association studies

In association studies, deviations from Hardy-Weinberg equilibrium (HWE) can be due to inbreeding, population stratification or selection. They can also be a symptom of poor genotyping quality, for instance because of a tendency to miscall heterozygous genotypes as homozygous [16]. As a result, testing for HWE has often been proposed as a data quality check, with the aim of discarding loci that deviate from the equilibrium. Testing for deviations from HWE can be carried out using the Pearson chi-square statistic (X_HW), which quantifies the distance between the observed genotype proportions and those expected under the equilibrium.

Here, the HWE test is applied to the controls of a genome-wide case-control dataset on multiple sclerosis from France (Rennes). The dataset consists of 74,067 Single Nucleotide Polymorphisms (SNPs). Since the usual chi-square approximation can be poor when genotype counts are low, p-values are computed via Monte-Carlo simulations (B = 10,000 simulations), which represents a typical case of truncation of the p-values that fall below the level of precision given by the number of simulations.
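For concreteness, a simplified parametric-bootstrap version of such a Monte-Carlo HWE test is sketched below (our illustration, not necessarily the procedure used on these data); genotype counts are given as (nAA, nAa, naa) and the returned p-value has granularity 1/B:

    hwe_mc_pvalue <- function(counts, B = 10000) {
      x2 <- function(obs) {                      # Pearson chi-square of one table
        n <- sum(obs)
        p <- (2 * obs[1] + obs[2]) / (2 * n)     # estimated allele frequency
        e <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)
        sum((obs - e)^2 / e)
      }
      n     <- sum(counts)
      p_hat <- (2 * counts[1] + counts[2]) / (2 * n)
      sims  <- rmultinom(B, n, c(p_hat^2, 2 * p_hat * (1 - p_hat), (1 - p_hat)^2))
      mean(apply(sims, 2, x2) >= x2(counts))     # Monte-Carlo p-value
    }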

Applying our method, we obtain a proportion of null SNPs of π̂0 = 99.44%. Figure 7 displays the estimated densities, showing a large overlap between the two distributions f0 and f1. With a threshold of 1%, 29 SNPs would be declared to deviate from HWE, and up to 537 with a threshold of 5%. These quantities rise to 454 and 576 respectively when the local FDR is estimated in the naive way (not accounting for the truncation). Consequently, and in line with our simulations, this application highlights an inflation of excluded SNPs when existing information about a truncation is not taken into account in the estimation procedure.

Figure 7: Association studies: estimated densities for the Hardy-Weinberg test applied to a set of 74,067 SNPs. DNA was genotyped using a 100 K Affymetrix chip; the algorithm used for making genotype calls has been described previously [30]. The local FDR is computed from the p-values of a Hardy-Weinberg equilibrium test applied to each SNP. Note that f0 almost perfectly overlaps f since π0 is close to 1.

Conclusion

A simple computational approach to the local FDR considers a two-component normal mixture model for the observed empirical distribution f, where the null distribution f0 is the standard normal and the alternative distribution f1 is a normal density with unspecified mean and variance. But the reliability of this approach obviously depends on how well the proposed two-component normal mixture approximates the real distribution.

Our semi-parametric approach does not assume any constrained alternative distribution and is hence much more flexible. Nonetheless, it requires a complete specification of the null distribution, of the a priori proportion of true null hypotheses (π0) and of the bandwidth (h), for which efficient estimation methods have been developed. The performance of the approach compared to existing methods was assessed in a preceding publication [10], which showed its advantages in difficult situations where the distributions f0 and f1 are not well separated. We focused here on the implementation of the approach and on two useful extensions: the possibility to use prior information in the estimation procedure (semi-supervised mode) and the ability to handle truncated distributions such as those generated by the Monte-Carlo estimation of p-values. Our simulations showed that this information can significantly improve the quality of the estimates. As an illustration, we analyzed three high-throughput biological datasets concerning gene expression, DNA sequence patterns and genome-wide association studies. The corresponding R package, available at http://stat.genopole.cnrs.fr/software/kerfdr, is fast thanks to fast Fourier transforms, straightforward to use, and proposes customizable options to advanced users.

Finally, most local FDR estimation procedures derived from the Benjamini and Hochberg framework, including our approach, assume that the p-values testing true null hypotheses are independent observations. While this may well be the case for patterns, in practice this assumption does not hold for all genes or SNPs. A proposed solution is to cluster highly correlated genes (or SNPs) together and to represent a cluster by a single gene or a linear combination of the associated genes [8]. These approaches also generally assume that p-values testing true null hypotheses are continuous and uniform over [0, 1]. These issues are likely to remain active fields of research in the near future.

Methods

Probit or logarithm transformations

While it is obviously possible to work directly with a sample of p-values (in this case, f0 is simply the uniform density over [0, 1]), this option is seldom used in practice. This comes from the fact that most H1 p-values are concentrated near 0 while the H0 ones are uniformly distributed between 0 and 1. Working with the raw p-values would hence favor the estimation of f0 over that of f1, which is precisely the opposite of our goal. To overcome this problem, it is classical to introduce a transformation that allows us to "zoom" in on the interesting part of the distribution. We consider two such transformations:

Probit transformation

X = probit(P) = Φ^{-1}(P)

where P is a p-value and Φ is the cumulative distribution function of the standard normal distribution. If P ~ U([0, 1]), X follows a standard normal distribution and

$$f_0(x) = \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

Logarithmic transformation

X = log10(P)

If P ~ U([0, 1]), then -log(10) × X follows an exponential distribution with parameter 1, and we easily get that

$$f_0(x) = \begin{cases} \log(10) \times e^{\log(10)\,x} & \text{if } x \leq 0 \\ 0 & \text{otherwise} \end{cases}$$

Two assets of this transformation are that it gives more weight to small p-values and that it is easier to interpret than the probit transformation (X = -2 corresponds to P = 10^-2, X = -5 to P = 10^-5).
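Both transformations and their null densities f0 are one-liners in R:

    pv <- runif(5)                 # illustrative p-values
    x_probit  <- qnorm(pv)         # probit transform; f0 is the standard normal
    f0_probit <- dnorm(x_probit)
    x_log  <- log10(pv)            # log10 transform
    f0_log <- log(10) * 10^x_log   # equals log(10) * exp(log(10) * x) for x <= 0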

Estimation of π0

For all 0 ≤ λ ≤ 1 we have

$$q = \Pr[X \geq T(\lambda)] = \pi_0 \underbrace{\int_{T(\lambda)}^{+\infty} f_0(x)\,dx}_{q_0} + \pi_1 \underbrace{\int_{T(\lambda)}^{+\infty} f_1(x)\,dx}_{q_1}$$

where T is either the probit or the log10 function. We hence get

$$\pi_0 = \frac{q - q_1}{q_0 - q_1}$$

We have q0 = 1 - λ, but q1 is unknown. We notice that the higher λ, the closer q1 is to 0. As we can estimate q from a sample X_1,..., X_n by

$$\hat{q} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{X_i \geq T(\lambda)}$$

we obtain the following (conservative) estimator:

$$\hat{\pi}_0 = \frac{\hat{q}}{1 - \lambda}$$

which satisfies π0 = π̂0 + O(q1).

It is therefore necessary to find a tradeoff between the magnitude of the error O(q1) (lowest for λ = 1.0) and the quality of the estimate q̂ (best for λ = 0.0).

Storey [17] first proposed to use λ = 0.5, which appears to be a good choice in most cases.
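On the p-value scale, the estimator takes one line of R:

    # Storey-type estimator: fraction of p-values above lambda,
    # rescaled by the null mass q0 = 1 - lambda.
    pi0_storey <- function(pv, lambda = 0.5) mean(pv > lambda) / (1 - lambda)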

Determination of the bandwidth

Regarding the choice of the bandwidth, our first approach consists in selecting h as if we were applying a kernel estimation over the whole sample.

For that matter, the literature proposes many methods already implemented in R: biased and unbiased cross-validation (bcv and ucv), methods based on the estimation of derivatives from [18] (sj-ste for solve-the-equation and sj-dpi for direct plug-in) and two simple heuristics for the special case of Gaussian kernels: nrd0 from [19] (page 48) and nrd from [20].
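These selectors are directly available in base R's stats package, e.g. (the sample x is illustrative):

    x <- qnorm(runif(1000))        # probit-transformed p-values, here all null
    c(nrd0 = bw.nrd0(x), nrd = bw.nrd(x), ucv = bw.ucv(x), bcv = bw.bcv(x),
      sj_ste = bw.SJ(x, method = "ste"), sj_dpi = bw.SJ(x, method = "dpi"))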

Estimation of f1: Convolution and Fast Fourier Transforms

If we have an observed sample x_1,..., x_n with weights τ_1,..., τ_n, we get for all x

$$\hat{f}_1(x) = \frac{1}{h}\sum_{i=1}^{n} \frac{\tau_i}{\tau}\, K\!\left(\frac{x - x_i}{h}\right)$$

where τ = ∑_i τ_i and K denotes the kernel function.

The naive computation of all the f̂1(x_i) values has quadratic complexity. Fortunately, [21] introduced an algorithm (later modified by [22]) based on the Fast Fourier Transform (FFT, see [23], chapter 12) that performs the same computation with a far more efficient linear complexity (see [23], chapter 13, for more details on fast discrete convolution through FFT).
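In R, a weighted estimate of this form can be obtained with the FFT-based density() function (a sketch; the package's own convolution code may differ in its handling of weights and grids):

    weighted_kde <- function(x, tau, h) {
      # density() bins the data and convolves with the kernel via FFT;
      # weights tau_i / tau reproduce the weighted sum above.
      density(x, bw = h, kernel = "gaussian", weights = tau / sum(tau))
    }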

kerfdr and discrete p-values

In developing their original FDR-control procedure, Benjamini and Hochberg [2] assumed that p-values testing true null hypotheses are independent observations from a continuous uniform distribution over [0, 1]. A large family of succeeding methods, to which kerfdr belongs, requires the same conditions. However, how the performance of these methods is affected when the assumptions of continuity or uniformity are violated has seldom been considered, contrary to the assumption of independence (see [24] and [25] for instance). Discrete p-values, which are becoming more frequent in practice as categorical genomic data such as Single-Nucleotide-Polymorphism, Comparative-Genomic-Hybridization and Copy-Number-Variation data become more widely available, clearly violate the assumption of uniformity and introduce instability into FDR-like and local FDR estimates.

In kerfdr, π0 and the shape of f0 are parameters of the method. Since, with discrete p-values, correct estimators of π0 and f0 are tricky to obtain with the classical methods included in the package, it remains feasible to use methods more adapted to each situation, such as those proposed by [26-29], to pre-compute π0 and/or f0 before running kerfdr and thus minimize the problems generated by discrete p-values. However, how exactly our algorithm behaves in this context still has to be investigated, along with its extension to dependent data.

For instance, in Figure 7, the short decrease in local FDR observed for the p-values near 1 should be interpreted as a nuisance effect that can occur due to the more severe discreteness of p-values near 1 (here computed by Monte-Carlo simulations) and should hence be ignored by the user.

Availability and requirements

Project name: kerfdr

Project home page: http://stat.genopole.cnrs.fr/software/kerfdr

Operating system: platform independent

Programming language: R

License: GNU GPL

References

1. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18: 71-103. 10.1214/ss/1056397487.

2. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B. 1995, 57: 289-300.

3. Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J Amer Statist Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.

4. Aubert J, Bar-Hen A, Daudin JJ, Robin S: Determination of the regulated genes in microarray experiments using local FDR. BMC Bioinformatics. 2004, 5: 125.

5. McLachlan G, Peel D: Finite Mixture Models. 2000, Wiley.

6. Allison DB, Gadbury G, Heo M, Fernandez J, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Comput Statist and Data Analysis. 2002, 39: 1-20. 10.1016/S0167-9473(01)00046-9.

7. Liao JG, Lin Y, Selvanayagam ZE, Weichung JS: A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics. 2004, 20 (16): 2694-2701. 10.1093/bioinformatics/bth310.

8. McLachlan G, Bean R, Ben-Tovim Jones L: A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics. 2006, 22: 1608-1615. 10.1093/bioinformatics/btl148.

9. Pounds S, Morris SW: Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003, 19: 1236-1242. 10.1093/bioinformatics/btg148.

10. Robin S, Bar-Hen A, Daudin JJ, Pierre L: A semi-parametric approach for mixture models: Application to local false discovery rate estimation. Comput Statist and Data Analysis. 2007, 51: 5483-5493. 10.1016/j.csda.2007.02.028.

11. Broët P, Lewin A, Richardson S, Dalmasso C, Magdelenat H: A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics. 2004, 20: 2562-2571. 10.1093/bioinformatics/bth285.

12. Newton MA, Noueiry A, Sarkar D, Ahlquist P: Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004, 5: 155-176. 10.1093/biostatistics/5.2.155.

13. Efron B: Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Amer Statist Assoc. 2004, 99: 96-104. 10.1198/016214504000000089.

14. Hedenfalk I, Duggan D, Chen YD, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, Wilfond B, Borg A, Trent J: Gene expression profiles in hereditary breast cancer. New Engl Jour Medicine. 2001, 344: 539-548. 10.1056/NEJM200102223440801.

15. Delmar P, Robin S, Daudin JJ: VarMixt: efficient variance modelling for the differential analysis of replicated gene expression data. Bioinformatics. 2005, 21 (4): 502-508. 10.1093/bioinformatics/bti023.

16. Balding DJ: A tutorial on statistical methods for population association studies. Nature Reviews Genetics. 2006, 7: 781-791. 10.1038/nrg1916.

17. Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B. 2002, 64 (3): 479-498.

18. Sheather SJ, Jones MC: A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B. 1991, 53 (3): 683-690.

19. Silverman BW: Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability. 1986, Chapman and Hall.

20. Scott DW: Multivariate density estimation. 1992, Wiley, New York.

21. Silverman BW: Kernel density estimation using the fast Fourier transform. Journal of the Royal Statistical Society: Series C. 1982, 31: 93-99.

22. Jones MC, Lotwick HW: A remark on algorithm AS 176. Kernel density estimation using the fast Fourier transform. Journal of the Royal Statistical Society: Series C. 1984, 33: 120-122.

23. Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C. 1997, Cambridge University Press.

24. Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001, 29: 1165-1188. 10.1214/aos/1013699998.

25. Pounds S: Estimation and control of multiple testing error rates for microarray studies. Briefings in Bioinformatics. 2006, 7: 25-36. 10.1093/bib/bbk002.

26. Gilbert P: A modified false discovery rate multiple-comparisons procedure for discrete data, applied to human immunodeficiency virus genetics. Applied Statistics. 2005, 54: 143-158.

27. Pounds S, Cheng C: Robust estimation of the false discovery rate. Bioinformatics. 2006, 22: 1979-1987. 10.1093/bioinformatics/btl328.

28. Ferreira J: The Benjamini-Hochberg method in the case of discrete test statistics. International Journal of Biostatistics. 2007, 3: Article 11.

29. Forner K, Lamarine M, Guedj M, Dauvillier J, Wojcik J: Universal false discovery rate estimation methodology for genome-wide association studies. Human Heredity. 2008, 65: 183-194. 10.1159/000112365.

30. Matsuzaki H, Dong S, Loi H, Di X, Liu G: Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nature Methods. 2004, 1: 109-111. 10.1038/nmeth718.


Acknowledgements

The authors would like to thank Merck-Serono in the person of Jérôme Wojcik for allowing the use of the genome-wide association dataset.

Author information


Correspondence to Mickael Guedj or Gregory Nuel.

Additional information

Authors' contributions

MG: most of the writing, management of the R package (CRAN), application to genome-wide association data. AC: estimation of π0, writing. SR: simulation study, application to gene expression data. GN: the kerfdr algorithm (based on FFT convolution), extension of the mixture model to truncated data, application of kerfdr to patterns in DNA sequences.


Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Cite this article

Guedj, M., Robin, S., Celisse, A. et al. Kerfdr: a semi-parametric kernel-based approach to local false discovery rate estimation. BMC Bioinformatics 10, 84 (2009). https://doi.org/10.1186/1471-2105-10-84
