Abstract
Background
An important goal of wholegenome studies concerned with single nucleotide polymorphisms (SNPs) is the identification of SNPs associated with a covariate of interest such as the casecontrol status or the type of cancer. Since these studies often comprise the genotypes of hundreds of thousands of SNPs, methods are required that can cope with the corresponding multiple testing problem. For the analysis of gene expression data, approaches such as the empirical Bayes analysis of microarrays have been developed particularly for the detection of genes associated with the response. However, the empirical Bayes analysis of microarrays has only been suggested for binary responses when considering expression values, i.e. continuous predictors.
Results
In this paper, we propose a modification of this empirical Bayes analysis that can be used to analyze highdimensional categorical SNP data. This approach along with a generalized version of the original empirical Bayes method are available in the R package siggenes version 1.10.0 and later that can be downloaded from http://www.bioconductor.org webcite.
Conclusion
As applications to two subsets of the HapMap data show, the empirical Bayes analysis of microarrays cannot only be used to analyze continuous gene expression data, but also be applied to categorical SNP data, where the response is not restricted to be binary. In association studies in which typically several ten to a few hundred SNPs are considered, our approach can furthermore be employed to test interactions of SNPs. Moreover, the posterior probabilities resulting from the empirical Bayes analysis of (prespecified) interactions/genotypes can also be used to quantify the importance of these interactions.
Background
Wholegenome experiments comprise data of hundreds of thousands of single nucleotide polymorphisms (SNPs), where a SNP is the most common type of genetic variations that occurs when at a single base pair position different base alternatives exist in a population. SNPs are typically biallelic. Therefore, SNPs can be interpreted as categorical variables having three realizations: the homozygous reference genotype (if both chromosomes show the more frequent variant), the heterozygous genotype (if one chromosome shows the more frequent, and the other the less frequent variant), and the homozygous variant genotype (if both bases explaining the SNP are of the less frequent variant).
Since SNPs can alter the risk for developing a disease, an important goal in studies concerned with SNPs is the identification of the SNPs that show a distribution of the genotypes that differs substantially between different groups (e.g., cancer vs. noncancer). Detecting such SNPs requires methods that can cope with this vast multiple testing problem in which hundreds of thousands of hypotheses are tested simultaneously. Naturally, the value of a statistic appropriate for the considered testing situation and the corresponding pvalue are computed for each variable, where in the case of SNPs Pearson's χ^{2}statistic is an appropriate test score. These raw pvalues are then adjusted for multiple comparisons such that a Type I error rate is strongly controlled at a prespecified level of significance α.
The classical example for a Type I error rate is the familywise error rate
where V is the number of false positives, i.e. the number of rejected null hypotheses that are actually true – or in biological terms, the number of SNPs found by the procedure to differ between groups that actually do not differ between the groups. This error rate is strongly controlled at a level α so that Prob(V ≥ 1) ≤ α by approaches such as the Bonferroni correction or the procedures of Westfall and Young [1]. An overview on such methods is given in [2]. In [3], procedures for controlling this and other error rates are compared in an application to gene expression data.
In classical multiple testing situations in which rarely more than 20 hypotheses are tested simultaneously, it is reasonable to keep down the probability of one or more false positives. However, in the analysis of data from wholegenome studies, hundreds of thousands of SNPs are considered simultaneously. Moreover, a few false positives are acceptable in such experiments as long as their number is small in proportion to the total number R of rejected null hypotheses, i.e. identified SNPs. This situation for which the familywise error rate might be too conservative is thus similar to the multiple testing problem in studies concerned with gene expression data. In the analysis of such DNA microarray data, another error rate, namely the false discovery rate
proposed by Benjamini and Hochberg [4], has hence become popular which in turn is a reasonable choice in the analysis of highdimensional SNP data.
Apart from adjusting pvalues, there also exist other approaches for adjusting for multiple comparisons such as the significance analysis of microarrays (SAM [5]) and the empirical Bayes analysis of microarrays (EBAM [6]) that have been developed particularly for the analysis of gene expression data.
In the original versions of both SAM and EBAM, a moderated tstatistic is computed. In SAM, the observed values of this test score are then plotted against the values of the statistic expected under the null hypothesis of no difference between the two groups, and a gene is called differentially expressed if the point representing this gene in this QuantileQuantile plot is far away from the diagonal. In EBAM, the density f of the observed values z of the moderated tstatistic is modeled by a mixture of the density f_{1 }of the differentially expressed genes and the density f_{0 }of the not differentially expressed genes, i.e. by
where π_{1 }and π_{0 }= 1  π_{1 }are the prior probabilities that a gene is differentially expressed or not, respectively. Following Efron et al. [6], a gene having a zvalue of z* is detected to be differentially expressed if the posterior probability
for being differentially expressed is larger than or equal to 0.9.
In [7], a generalized version of the SAM algorithm is presented, whereas in [8,9] SAM is adapted for categorical data such as SNP data.
In the following section, we first present a generalized EBAM algorithm. Then, we propose an adaption of EBAM enabling the analysis of categorical data. As computing the values of the test statistic for all SNPs individually would be very timeconsuming, we further suggest an approach based on matrix algebra that allows to compute all values simultaneously. Afterwards, EBAM for categorical data is applied, on the one hand, to two subsets of the highdimensional SNP data from the HapMap project [10], and on the other hand, to simulated data that mimic data from a typical association study in which several ten SNPs are considered. In the latter application, it is also shown how EBAM can be applied to identify SNP interactions associated with the response, and how it can be used to specify the importance of prespecified SNP interactions.
Methods
Generalized EBAM algorithm
In Algorithm 1, a generalized version of the empirical Bayes analysis of microarrays (EBAM [6]) is presented. This algorithm makes use of the fact that for a given rejection region Γ, the FDR can be estimated by
where z_{i }is the observed value of the test statistic Z_{i }for variable i = 1 ⋯ m, π_{0 }is the prior probability that a gene is not differentially expressed – or more generally, that a variable is not associated with the response – and (#{Z_{i }∈ Γ}) is the number of values expected under the null hypothesis to fall into Γ [11].
Several procedures have been suggested to estimate the prior probability π_{0}[6,11,12]. Efron et al. [6], e.g., propose to use a narrow interval around z = 0, and to estimate π_{0 }by the ratio of the number of observed zvalues in to the number of zvalues that are expected under the null hypothesis to fall into . However, the narrower , the more instable is this estimate. To stabilize this estimate, we use the procedure of Storey and Tibshirani [12] in which a natural cubic spline h with three degrees of freedom is fitted through the data points
where
and q_{λ }denotes the λ quantile of the (estimated) null distribution. The estimate of π_{0 }is then given by
Algorithm 1 (Generalized EBAM Procedure)
Let X be an m × n matrix comprising the values of m variables and n observations, y be a vector of length n composed of the values of the response for the n observations, and B be the number of permutations.
1. For each variable i = 1, ..., m, compute the value z_{i }of a statistic appropriate for testing if the values of this variable are associated with the response.
2. If the null density f_{0}, is known, use a density estimation procedure to obtain and compute . Otherwise, estimate the ratio ϕ = f_{0}/f directly by
(a) determining the m permuted zvalues z_{ib }for each permutation b = 1, ..., B of the n values of the response,
(b) binning the m observed and mB permuted zvalues into an appropriate number of intervals,
(c) fitting a logistic regression model with repeated observations through these intervals using an appropriate regression function.
3. Estimate π_{0 }by the procedure of Storey and Tibshirani [12].
4. For each variable i, compute the posterior probability
5. Order the observed zvalues to obtain z_{(1) }≤ ... ≤ z_{(m)}, and set
6. For a prespecified probability Δ or a set of appropriate values for Δ,
(a) set , and compute the upper cutoff c_{U }by
(b) set , and compute the lower cutoff c_{L }by
(c) call all variables i with significant, where denotes the complement of the rejection region Γ_{Δ},
(d) estimate the FDR of Γ_{Δ }by
where
The original version of EBAM is of course a special case of Algorithm 1: Efron et al. [6] compute the moderated tstatistic
for each gene i = 1, ..., m, where d_{i }is the difference of the groupwise mean expression values and s_{i }is the corresponding standard deviation such that d_{i}/s_{i }is the ordinary tstatistic. The fudge factor a_{0 }is computed by the quantile of the m standard deviations that leads to the largest number of genes called differentially expressed in a standardized EBAM analysis (see [6] for details on this standardized analysis). Since the null distribution of (1) is unknown, the response is permuted repeatedly to generate mB permuted zvalues. Efron et al. [6] then bin the m observed and mB permuted zvalues into 139 intervals. Treating the observed scores as successes and the permuted values as failures, a logistic regression model is fitted through the binned data points using a natural cubic spline with five degrees of freedom as regression function. For details on this logistic regression, see Remark (D) in [6].
Algorithm 1 also comprises the approach used by Efron and Tibshirani [13] to test twogroup gene expression data with Wilcoxon rank statistics.
The main difference between Algorithm 1 and the original version of EBAM is that Efron et al. [6] call all genes differentially expressed that have a posterior probability larger than or equal to Δ = 0.9, whereas we only call a variable i with _{1}(z_{i}) ≥ Δ significant if there is no other variable with a more extreme zvalue (a larger zvalue if z_{i }> 0, or a smaller zvalue if z_{i }< 0) that has a posterior probability less than Δ. This approach that is comparable to the proceeding in SAM, therefore, ensures that all variables with a zvalue exceeding some threshold are called significant, whereas in the original version of EBAM it might happen that a variable is not called significant, even though it has a more extreme zvalue than some of the identified variables.
Another difference is that Efron et al. [6] consider one fixed posterior probability, namely Δ = 0.9, for calling genes differentially expressed, whereas we allow both to prespecify one probability Δ and to consider a set of reasonable values for Δ. The latter again is similar to the SAM procedure in which the number of genes called differentially expressed and the estimated FDR is determined for several values of the SAM threshold, and then the value is chosen that provides the best balance between the number of identified genes and the estimated FDR. This approach can be helpful when the detection of interesting variables is just an intermediate aim, and the actual goal of the analysis is, e.g., the construction of a classification rule. In such a case, prespecifying the value of Δ might work poorly, as this might lead to either a too small number of identified variables, or a too high FDR. For an example of this proceeding in the context of the empirical Bayes analysis, see the application of EBAM for categorical data to the HapMap data set.
EBAM for categorical data
We now assume that our data consist of m categorical variables each exhibiting C levels denoted by 1, ..., C, and n observations each belonging to one of R groups denoted by 1, ..., R. If these variables are SNPs, C = 3.
A statistic appropriate for testing each of the m categorical variables if its null distribution differs between the R groups is Pearson's χ^{2}statistic
where n_{rc }and _{rc }are the observed number of observations and the number of observations expected under the null hypothesis in group r = 1, ..., R, respectively, showing level c = 1, ..., C.
Since the small denominator problem [5,6,14], which is the reason for adding the fudge factor a_{0 }to the denominator of the ordinary tstatistic in (1), does not show up in this case, it is not necessary to add a fudge factor to the denominator of (2). Therefore, Algorithm 1 can be applied to SNPs – or to any other type of (genetic) categorical data – by employing Pearson's χ^{2}statistic as test score.
In EBAM, it is assumed that all variables follow the same null distribution. In the permutation based approach of Algorithm 1, this, e.g., means that not only the B permuted zvalues corresponding to a particular variable, but all mB permutations of all m variables are considered in the estimation of the null distribution of this variable. Normally, this is an advantage in the analysis of highdimensional data [6,15]. In the analysis of categorical data, this, however, might lead to a loss of a large number of variables, as only variables showing the same number of levels can be considered together in an EBAM analysis.
Approximation to χ^{2}distribution
Since the null distribution of (2) can be approximated by a χ^{2}distribution with (R  1)(C  1) degrees of freedom, only the density f of the observed test statistics needs to be estimated. This can be done by applying a (nonparametric) kernel density estimator to the observed zvalues [16]. However, the standard kernels are typically symmetric such that negative values of z will have a positive estimated density, even though f(z) = 0 for z < 0. A solution to this problem is to use asymmetric kernels that only give nonnegative values of z a positive density [17,18]. Another solution, which we will use, is a semiparametric method proposed by Efron and Tibshirani [19].
In the first step of this procedure, a histogram of the observed zvalues is generated. To obtain a reasonable number of bins for the histogram, we employ the onelevel bin width estimator of Wand [20]. Although other bin width estimators such as the approaches of Scott [21] or of Freedman and Diaconis [22] lead to different bin widths, the densities resulting from the method of Efron and Tibshirani [19] are virtually identical. The approach of Sturges [23], however, which is, e.g., the default method for estimating the number of bins in the R function hist, typically leads to a much too small number of intervals when considering large numbers of observations [24], and is therefore an inappropriate procedure in our application.
In the second step of the procedure of Efron and Tibshirani [19], a Poisson regression model is fitted in which the midpoints of the bins are used as explanatory variables, and the numbers of observations in the intervals are the values of the response. As most of the SNPs are assumed to show the same distribution in the different groups, the density f of the observed zvalues typically looks similar to the null density f_{0}, but has a heavier right tail (see Figure 1). We therefore use a natural cubic spline with three degrees of freedom as regression function if (R  1)(C  1) ≤ 2. For (R  1)(C  1) ≥ 3, a natural cubic spline with five degrees of freedom would be a reasonable regression function. However, in functions such as the R function ns for generating the basis matrix of the spline, the inner knots by default are given by the 20%, 40%, 60%, and 80% quantile of the midpoints of the bins. These inner knots work well for symmetric densities. But the χ^{2}distribution is asymmetric – in particular for a small value of the degrees of freedom. If (R  1)(C  1) ≥ 3, we hence specify the inner knots directly by centering them around the mode and not around the median. The inner knots are thus given by the 0.4q_{M}, 0.8q_{M}, 1  0.8(1  q_{M}), and 1  0.4(1  q_{M}) quantile of the midpoints, where q_{M} is the quantile of the midpoints that corresponds to the mode estimated by the midpoint of the bin of the histogram containing the most observations. If there is more than one bin showing the largest number of observations, then the smallest of the corresponding midpoints is used as estimate. Other mode estimators such as the halfrange mode [25,26] might lead to better estimates than this ad hoc methods, but the estimation of f is typically only slightly influenced by the choice of the mode estimator.
Figure 1. Densities of the test scores in the analyses of the HapMap data. On the left hand side, the histograms and the estimated densities (marked by red lines) of the values of Pearson's χ^{2}statistic of the SNPs from the two subsets of the HapMap data (upper panel: JPT vs. CHB, lower panel: all four HapMap populations) are shown. The cyan line marks the estimated density when the inner knots are centered around the median in the natural cubic spline used in the density estimation. On the right hand side, the estimated densities (again, marked by red lines) and the corresponding null densities (black lines) are displayed.
In Figure 2, the estimated densities of four χ^{2}distributions with different degrees of freedom resulting from the application of this procedure to 100,000 values randomly drawn from the respective χ^{2}distribution are displayed, where the inner knots are centered, on the one hand, around the mode (red lines), and on the other hand, around the median (cyan lines). This figure reveals that the former leads to a better estimation than using the standard inner knots. In fact, the densities estimated using the former approach are very similar to the true densities.
Figure 2. Estimating the density of the χ^{2}distribution. For different degrees of freedom, the true (black line) and the estimated density (red line) of the χ^{2}distribution are shown, where the density is estimated by applying the procedure of Efron and Tibshirani [19] to 100,000 values randomly drawn from the χ^{2}distribution. The cyan line marks the estimated density when the inner knots of the natural cubic spline are centered around the median in the df ≥ 3 case.
Having estimated f, is determined, and the remaining steps 3 to 6 of Algorithm 1 are processed.
Permutation based estimation of the null density
If the assumptions for the approximation to the χ^{2}distribution are not met [27], the null density f_{0 }also has to be estimated. In this case, we calculate the ratio directly by permuting the group labels B times, computing the mB permuted zvalues, dividing these scores and the m observed zvalues into intervals, and fitting a logistic regression model through the binned data points. Similar to the application of the procedure of Efron and Tibshirani [19] (see previous section), the estimation of ϕ does not depend on the number of intervals used in the binning as long as this number is not too small or too large. We therefore follow Efron et al. [6], and split the observed and permuted zvalues into 139 intervals. Since the rejection region is onesided when considering Pearson's χ^{2}statistic as test score, a natural cubic spline with three degrees of freedom is used as regression function.
Implementation
Wholegenome studies comprise the genotypes of hundreds of thousands of SNPs for each of which the value of Pearson's χ^{2}statistic (2) has to be computed. Since calculating these values onebyone is very timeconsuming, we employ matrix algebra for determining all the scores simultaneously.
Assume that we have given an m × n matrix X in which each row corresponds to a categorical variable exhibiting the levels 1, ..., C, and a vector y comprising the group labels 1, ..., R of the n observations represented by the columns of X.
Firstly, C m × n indicator matrices X^{(c) }for the C levels are constructed by setting the elements of these matrices to
i = 1, ..., m, j = 1, ..., n. Furthermore, an n × R matrix Y with entries y_{jr }= I(y_{j }= r) is built in which each column represents one of the R group labels. Then, we set
and
c = 1, ..., C, where 1_{n }is a vector of length n consisting only of ones, so that the ith column and rth row of the m × R matrices N^{(c) }and ^{(c) }comprise the observed and the expected number of observations, respectively, that belong to the rth group and show the cth level at the ith variable. Afterwards, the m × R matrices
are determined by elementwise matrix calculation, i.e. by setting
Finally, the vector z comprising the value of Pearson's χ^{2}statistic for each of the m variables is given by
If the permutation based version of EBAM for categorical data is used, then not "just" m, but m(B + 1) zvalues have to be computed. Again, matrix algebra can help to speed up computation by considering all B permutations at once, or – if the number of variables or permutations is too large – subsets of the B permutations.
For this, suppose that L is a B × n matrix in which each row corresponds to one of the B permutations of the n group labels. If the B × n indicator matrices L^{(r)}, r = 1, ..., R, are defined analogously to X^{(c)}, then the m × B matrix Z^{0 }= {z_{ib}} containing the mB permuted zvalues can be determined by
where is the rth column of , ⊗ is the symbol for the Kronecker product, and * and the fraction line denote elementwise matrix calculation.
Processing time
To evaluate how much the matrix calculation procedure presented in the previous section can speed up the computation in comparison to an individual determination of Pearson's χ^{2}statistic, both approaches are applied to several numbers of variables. In Table 1, the resulting processing times are summarized. This table shows that employing matrix algebra leads to an immense reduction of time needed for computation – in particular if the number m of variables is large. If, e.g., 100,000 variables are considered, it takes just 6.2 seconds to determine the values of Pearson's χ^{2}statistic when employing matrix calculation, but more than 4.5 minutes when calculating the values onebyone.
Table 1. Comparison of computation times (in seconds) on an AMD Athlon XP 3000+ machine with one GB of RAM for both the matrix algebra based calculation and the individual determination of the values of Pearson's χ^{2}statistic for different numbers of variables and observations. Each of the m variables can take C = 3 levels, and each of the n observations belongs to one of R = 2 classes.
Note that the main reason for this immense reduction in computation time is not that the matrix calculation approach is algorithmically less complex than an individual computation, but that the implementation of this approach makes essential use of the way how vectorization and matrix multiplication are implemented in R [28].
Results
To exemplify that EBAM can be used to analyze highdimensional categorical data, it is first applied to two subsets of the genotype data from the International Hapmap Project [10]. Afterwards, it is shown how EBAM can be employed to identify SNP interactions associated with the response in association studies, and to quantify the importance of genotypes. R code for reproducing the results of all analyses performed in this section is available in Additional file 1.
Additional file 1. scriptEBAMSNP.R. This file that can be opened either in R or in any txteditor contains the R code that has been used to generate the results presented in this paper.
Format: R Size: 16KB Download file
Application to HapMap data
In the International HapMap Project, millions of SNPs have been genotyped for each of 270 people from the four populations Japanese from Tokyo (abbreviated by JPT), Han Chinese from Beijing (CHB), Yoruba in Ibadan, Nigeria (YRI), and CEPH (Utah residents with ancestry from northern and western Europe, abbreviated by CEU).
About 500,000 of these SNPs have been measured using the Affymetrix GeneChip Mapping 500 K Array Set that consists of two chips. In this paper, we focus on the BRLMM (Bayesian Robust Linear Models with Mahalanobis distance) genotypes [29] of the 262,264 SNPs from one of these chips, namely the Nsp array (see [30] for these genotypes).
JPT vs. CHB
Since we are mainly interested in casecontrol studies, or more generally in binary responses, EBAM is applied to the 45 JPT and the 45 CHB to detect the SNPs that show a distribution that differs substantially between these two population. Another reason is that both the JPT are unrelated, and the CHB are unrelated, whereas the other two populations consist each of 30 trios each of which is composed of genotype data from a mother, a father and their child.
Since in EBAM it is assumed that all variables follow the same null distribution, only SNPs showing the same number of genotypes are considered in the same EBAM analysis. Moreover, the current implementation of EBAM in the R package siggenes cannot handle missing values such that either missing genotypes have to be imputed, or SNPs with missing genotypes have to be removed prior to the EBAM analysis. Therefore, 54,400 SNPs showing one or more missing genotypes and 75,481 SNPs for which not all three genotypes are observed at the 90 persons are excluded from the analysis leading to a data set composed of the genotypes of 132,383 SNPs.
Using an AMD Athlon XP 3000+ machine with one GB of RAM on which Windows XP is installed, an application of EBAM to this data set takes 11.62 seconds if the null density f_{0 }is approximated by the χ^{2}density with two degrees of freedom, whereas it takes about 182 seconds if f_{0 }is estimated using 100 permutations.
In the upper left panel of Figure 1, a histogram and the estimated density of the observed test scores is displayed. For many of the SNPs the assumptions for an approximation to the χ^{2}distribution might not be met [27], as some of the expected numbers in the corresponding contingency table are smaller than 5. We therefore prefer not to use the approximation to the χ^{2}distribution, but the permutation based approach of EBAM for categorical data.
Employing the threshold Δ = 0.9 as suggested by Efron et al. [6], i.e. calling all SNPs significant that have a posterior probability of being significant larger than or equal to 0.9, leads to the identification of 193 SNPs with an estimated FDR of 0.08.
It is, however, also possible to use EBAM similarly to SAM [5,7]. For this, assume that we aim, on the one hand, to control the FDR at a level of about 0.05, and on the other hand, to identify about 200 SNPs for further analyses with, e.g., discrimination methods [9,31] such as logic regression [32]. In Table 2, the numbers of detected SNPs and the corresponding FDRs are summarized for six reasonable values of Δ. This table reveals that it is not possible to attain both goals simultaneously, as calling 200 SNPs significant would lead to an FDR larger than 0.08, whereas controlling the FDR at 0.05 would result in the identification of about 42 SNPs. This table also shows that Δ = 0.90 (or Δ = 0.91) provides a good tradeoff between the two goals. Hence, Δ = 0.90 will be also a good choice here if EBAM is used similarly to SAM.
Table 2. Estimated FDRs and numbers of identified SNPs for several values of the threshold Δ.
A list of the 193 SNPs with a posterior probability of being significant larger than or equal to 0.9 along with links to dbSNP [33] is available in the Additional file 2. Besides the zvalues and the posterior probabilities _{1}(z), this file also contains an estimate for the local FDR for each SNP [6]. Contrary to the FDR employed to quantify the overall accuracy of a list of variables, the local FDR proposed by Efron et al. [6] is a variablespecific measure that can be estimated by
Additional file 2. ebam.jpt.chb.html. This htmlfile contains information about the significant SNPs found in the EBAM analysis of JPT vs. CHB.
Format: HTML Size: 49KB Download file
Multiclass case
EBAM for categorical variables is not restricted to binary responses. It, e.g., can also be used to identify the SNPs showing a distribution that differs strongly between the four HapMap populations.
For this analysis, the most obvious dependencies are removed by excluding the child from each of the 60 trios such that 45 JPT, 45 CHB, 60 YRI, and 60 CEU are considered. Again, all SNPs for which at least one of the 210 values are missing (104,872 SNPs), or for which not all three genotypes are observed (14,273 SNPs), are excluded from the analysis resulting in a data set composed of the genotypes of 143,119 SNPs. In the lower right panel of Figure 1, the estimated density of the zvalues of these SNPs and the estimated null density are displayed. This figure reveals that a huge number of these SNPs exhibit a distribution that differs substantially in at least one of the populations. In fact, 131,336 SNPs show a posterior probability _{1}(z) larger than or equal to 0.9, whereas 33,101 SNPs even have a posterior probability of 1.
To examine which of the populations are responsible for this huge number of significant SNPs, we perform a twoclass EBAM analysis for each pair of the four HapMap populations. In Table 3, the numbers of SNPs exhibiting a posterior probability _{1}(z) ≥ 0.9 are summarized for all these analyses. This table reveals that only JPT and CHB show a small number of SNPs that differ between these two populations. In all other twoclass comparisons, a huge number of SNPs are called significant, where CEU differs the most from the other populations. These results do not seem to be that surprising, since JPT and CHB are both populations from Asia, whereas the other two populations come from two other continents.
Table 3. Numbers of significant SNPs found in pairwise EBAM analyses of the four HapMap populations.
Identification of interactions
When considering complex diseases, e.g., sporadic breast cancer, it is assumed that not individual SNPs, but interactions of SNPs have a high impact on the risk of developing the disease [34,35]. In such a case, it would therefore be of interest to also test interactions of SNPs. However, in wholegenome studies in which the number m of SNPs is in the tens or even hundreds of thousands, it would take – depending on the order of the interactions – hours, days or even weeks to compute the test scores for all pway interactions comprised by the m variables. For strategies on testing twoway interactions comprised by data from a simulated wholegenome study on a cluster of computers and their computation times, see [36]. Here, we focus our interest on the EBAM analysis of interactions of SNPs from association studies such as the GENICA study [9,37] in which typically several ten SNPs are examined.
For the simulation of such a study, data for 50 SNPs and 1,000 observations are generated by randomly drawing the genotypes 1 (for the homozygous reference), 2 (heterozygous), and 3 (homozygous variant) for each SNP S_{i}, i = 1 ,..., 50, where the minor allele frequency of the SNP is chosen uniformly at random from the interval [0.25, 0.4]. Afterwards, the casecontrol status y is randomly drawn from a Bernoulli distribution with mean Prob(Y = 1), where
l
such that the probability of being a case is 62.25% if SNP S_{6 }is not of the homozygous reference genotype and SNP S_{7 }is of this genotype.
In the left panel of Figure 3, the result of the application of EBAM to these 50 SNPs is displayed. This figure shows that S_{6 }is the only SNP with a posterior probability larger than or equal to 0.9, and thus the only SNP called significant. This figure also reveals that S_{7 }shows the eighth largest zvalue with a posterior probability of 0.313. If, however, the m(m  1)/2 = 1,225 twoway interactions of the m = 50 SNPs are considered, then the interaction of S_{6 }and S_{7 }shows the by far largest zvalue (see right panel of Figure 3). Most of the other features found to be significant are interactions of S_{6 }with another SNP. In this analysis, not all 1,225, but 1,224 of the twoway interactions are included, since one of the interactions shows only seven of the nine genotypes comprised by the respective two SNPs, and is thus excluded from the EBAM analysis of interactions showing all nine genotypes.
Figure 3. EBAM analysis of the simulated data. Scatter plots of the posterior probabilities vs. the zvalues resulting from the applications of EBAM to both the simulated SNPs themselves (left panel) and the twoway interactions comprised by these SNPs (right panel). Red points mark SNPs or SNP interactions called significant by EBAM, as their posterior probability is larger than or equal to 0.9 (dashed line).
This analysis is repeated several times using different simulated data sets each generated randomly with the above settings. In each of the applications of EBAM to the individual SNPs, either one of S_{6 }and S_{7}, or both are identified to be significant. Rarely, also other SNPs show a posterior probability larger than 0.9. In all of the analyses of the twoway interactions, the interaction of S_{6 }and S_{7 }is detected to be the most important one.
Measuring the importance of genotypes
EBAM cannot only be used to detect interesting variables or interactions. The posterior probabilities estimated by EBAM can also be employed to quantify the importance of features found by other approaches such as logicFS [38].
Logic regression [32] – which is employed as base learner in logicFS – is an adaptive regression and classification procedure that searches for Boolean combinations of binary variables associated with the response. Since this method has shown a good performance in comparison to other discrimination [9,39] and regression [40,41] approaches, a bagging [42] version of logic regression is used in logicFS to identify interactions of SNPs that are potentially interesting, i.e. associated with the response. While some of the found genotypes/interactions, that are of a similar form as the one intended to be influential for the disease risk in the previous section, have a high impact on the disease risk, others are only found at random by logicFS. It is therefore necessary to quantify the importance of the detected genotypes.
Since logic regression and thus logicFS can only handle binary predictors, each SNP has to be split into (at least) two binary dummy variables. We follow [32,38] and code each SNP S_{i}, i = 1, ..., m, by
S_{i1}: "S_{i }is not of the homozygous reference genotype."
S_{i2}: "S_{i }is of the homozygous variant genotype."
such that S_{i1 }codes for a dominant and S_{i2 }for a recessive effect. The genotype intended to be influential in the simulated data set described in the previous section can thus also be specified by the logic expression
where ^{C }denotes the complement of a binary variable with outcome true or false, and ⋀ represents the ANDoperator.
Contrary to the previous section in which each of the distributions of the values of the 3^{p }levels comprised by the respective combination of p of the m SNPs is tested whether it differs between groups of persons, EBAM is here applied to conjunctions, i.e. ANDcombinations, of binary variables with outcome true or false which are in turn binary variables such that genotypes of different orders, i.e. combinations of genotypes of different numbers of SNPs, can be considered together in the same EBAM analysis.
Applying the single tree approach of logicFS, see [38], with 50 iterations to the data set composed of the 100 dummy variables coding for the 50 simulated SNPs from the previous section leads to the detection of 84 potentially interesting interactions. For each of these genotypes which are conjunctions of one to four binary variables, the importance is then determined by the posterior probability estimated by EBAM. The importances, however, should not be quantified using the same data set on which the genotypes are identified, as it is very likely that almost any of the found genotypes is called significant, since it already has shown up as potentially interesting. In fact, if EBAM is applied to the 84 genotypes evaluated on the data set on which they were detected, 70 of them are called significant using Δ = 0.9 and 15 show a posterior probability of 1 (see left panel of Figure 4). While these 15 genotypes are composed of and one or two other binary variables, 32 of the genotypes called significant do neither contain S_{6 }nor . Moreover, two genotypes exist that exhibit a larger zvalue than .
Figure 4. EBAM applied to the genotypes identified by logicFS. Scatter plots of the posterior probabilities vs. the zvalues resulting from the applications of EBAM to the genotypes found in an application of logicFS to the simulated data. On the left hand side, the results of the application of EBAM to the data set on which the genotypes are found is shown, whereas on the right hand side, an independent data set is used to test the genotypes. Red points mark SNPs called significant by EBAM using Δ = 0.9 (dashed line).
It is therefore more appropriate to test the found genotypes on an independent data set. Thus, a new (test) data set is randomly generated as described in the previous section. Afterwards, the values of the 84 detected genotypes for the observations from the new data set are computed, and EBAM is applied to these values.
The same 15 genotypes as in the application to the original data set show a posterior probability of 1, where is found to be the genotype with the largest zvalue. The other three genotypes also called significant using Δ = 0.9 either contain or S_{61}. All the other genotypes not intended to have an impact on the disease risk, but called significant in the application to the data set on which they were found show a posterior probability less than 0.9, and thus are not called significant anymore in the application to the test data set.
Again, this analysis is repeated several times with different training and test data sets leading to similar results in each of the applications.
Conclusion and Discussion
Using the Bayesian framework to adjust for multiple comparisons is an attractive alternative to adjusting pvalues – in particular if the data are highdimensional. Thus, Efron et al. [6] have suggested an empirical Bayes analysis of microarrays (EBAM) for testing each gene if its mean expression value differs between two groups with a moderated tstatistic.
In this paper, we have proposed an algorithm that generalizes this procedure. This algorithm comprises the original EBAM analysis of Efron et al. [6] as well as the EBAM analysis based on Wilcoxon rank sums [13], and allows for other types of EBAM analyses in other testing situations. For this, it is only necessary to choose an appropriate test statistic, and, if the null density is known, a method for estimating the density of the observed test scores. The EBAM approach for categorical data proposed in this paper is one example for such an analysis. Another example would be to use an Fstatistic for performing an EBAM analysis of continuous data (e.g., gene expression data) when the response shows more than two levels. In this case, the zvalues of the genes would be given by the values of the Fstatistic, and the density of the observed zvalues might be estimated by the procedure of Efron and Tibshirani [19] if an Fdistribution with appropriate degrees of freedom is assumed to be the null distribution.
The generalized EBAM algorithm along with functions for using (moderated) tstatistics (one and twoclass, paired and unpaired, assuming equal or unequal group variances), (moderated) Fstatistics and Wilcoxon rank sums is implemented in the R package siggenes version 1.10.0 and later that can be downloaded from the webpage [43] of the BioConductor project [44] (see also the section Availability and requirements).
siggenes version 1.11.7 and later also contains a function for the EBAM analysis of categorical data proposed in this paper. Note that siggenes 1.10.× already comprises a preversion of this function. The main difference between these versions is the estimation of the density f of the observed test scores: While in siggenes 1.10.× the default version of the R function ns is used to generate the basis matrix for the natural cubic spline that is employed in the estimation of f, the inner knots of this spline are centered around the mode (and not the median) in siggenes 1.11.7 and later which leads to a better estimate of f as Figure 2 shows.
To exemplify how EBAM for categorical data can be applied to SNP data from wholegenome studies, it has been used to analyze two subsets of the HapMap data. In the first application aiming to identify SNPs showing a distribution that differs substantially between JPT and CHB, 193 of the 132,383 considered SNPs show a posterior probability larger than or equal to 0.9, and are therefore called significant by EBAM, where the estimated FDR of this set of SNPs is 0.08.
The number of identified SNPs and the corresponding FDR resulting from this EBAM analysis are identical to the results of the application of SAM to this HapMap data set [9] when the same permutations of the group labels are used in both methods. This is due to the fact that both EBAM and SAM employ the same approach to estimate the FDR. Moreover, the same set of SNPs is identified by both methods, since the same nonnegative test statistic is used in both applications. Virtually the same applies to the usage of the qvalues [11,12] as implemented, e.g., in John Storey's R package qvalue. For example, each of the 193 SNPs found by EBAM exhibit a qvalue less than or equal to 0.08.
In the second application to the HapMap data set in which all four populations are considered, most of the 143,119 SNPs show a distribution that differs substantially in at least one of the four groups. This huge number of differences does not seem to be that surprising, as the four HapMap populations come from three different continents. Pairwise EBAM analyses of the four populations show that CEU is the population that differs the most from the other populations. Again, a SAM analysis would lead to the same estimated FDR as the EBAM analysis if the same number of SNPs is identified, where this set of significant variables will contain the same SNPs in both analyses.
An advantage of EBAM over other approaches is that it not only estimates the FDR for a set of detected variables, but also naturally provides a variablespecific estimate for the probability that a variable is associated with the response.
The two applications to the HapMap data, however, also reveal two restrictions of the EBAM procedure. Since in EBAM it is assumed that all variables follow the same null distribution, a large number of SNPs have to be removed prior to both analyses, as these SNPs either exhibit missing values or only show (one or) two of the three genotypes. A solution to the former problem would be to replace the missing genotypes using imputation methods such as KNNcatImpute [45] or – when considering Affymetrix SNP chips – to employ genotype calling algorithms such as RLMM [46] or CRLMM [47] that allow to obtain genotypes for all SNPs.
An idea to solve the second problem is to perform two EBAM analyses – one for the SNPs showing only two genotypes, and one for the SNPs with data available for all three genotypes. Having computed the posterior probabilities for the two sets of SNPs separately and called all SNPs significant that exhibit a posterior probability of being significant larger than or equal to Δ in any of the analyses, a combined FDR needs to be estimated for both analysis, since we are interested in one estimate for the FDR of all detected SNPs. How such a combined estimate of the FDR can be obtained is an open question that will be part of future research.
EBAM cannot only be used to test individual categorical variables such as SNPs, but can also be applied to interactions of these variables.
However, two problems occur when considering interactions. The first problem is that pway interactions have to be tested. Although the functions implemented in siggenes allow to split the variables into subsets, an EBAM analysis of interactions in highdimensional data is not feasible in a reasonable amount of time. It is thus restricted to data from association studies in which several ten to a few hundred SNPs are considered.
The second problem is the empty cell problem: The number of observations available in a study is limited such that when considering pway interactions of SNPs some of the 3^{p }cells of the pdimensional contingency tables of some of the interactions will be empty leading to features with different numbers of categories and thus with different null distributions. Hence, EBAM cannot be applied to all of these features at once. In the analysis of the twoway interactions from the simulated data set, e.g., one interaction exhibits values only for seven of the nine genotypes comprised by two SNPs. This interaction therefore has to be removed from the EBAM analysis.
The abovementioned idea of performing separate EBAM analyses for variables with different numbers of levels and computing a combined FDR might not be ideal in the case of interactions, as many different numbers of level could exist. In such a situation, a better solution is not to consider the pway interactions as variables with 3^{p }categories, but to test each of the 3^{p }genotypes comprised by p SNPs that are observed at at least a particular number of persons. Furthermore, it might make sense to include the complements of the genotypes, as, e.g., "Not the homozygous reference genotype" corresponds to a recessive effect of a SNP. This, however, would increase the multiple testing problem by a factor of up to 6^{p }such that a filtering prior to the EBAM analysis might be advisable/necessary.
Boulesteix et al. [48] propose a multiple testing procedure for the identification of the combination of genotypes in a prespecified subset of (interacting) SNPs that shows the largest association with the response. Another solution to this multiple testing problem that does not require a prespecification of a subset of SNPs has been described in this paper: Firstly, a search algorithm such as logicFS is used to identify potentially interesting genotypes, where these genotypes can be composed of the genotypes from any of the SNPs considered in the study. Afterwards, the detected genotypes are tested on an independent data set using EBAM, where the posterior probability of being significant resulting from this EBAM analysis can be interpreted as an importance measure for the genotypes. For this analysis, it is not necessary that all genotypes are composed of the genotypes of the same number of SNPs, as they are coded as binary variables. Quantifying the importance of (combinations of) binary variables is implemented in the R packages logicFS version 1.7.6 and later [49].
Availability and requirements
Project name: siggenes – Multiple testing using SAM and Efron's empirical Bayes approach
Project home page: http://bioconductor.org/packages/2.1/bioc/html/siggenes.html webcite (for siggenes 1.12.0)
Operating system(s): Platform independent
Programming language: R
Licence: Free for noncommercial use
Any restrictions to use by nonacademics: See the licence in the siggenes package
Abbreviations
CEPH – Utah residents with ancestry from northern and western Europe (CEU). Han Chinese from Beijing (CHB). Empirical Bayes Analysis of Microarrays (EBAM). False Discovery Rate (FDR). Japanese from Tokyo (JPT). Significance Analysis of Microarrays (SAM). Single Nucleotide Polymorphism (SNP). Yoruba in Ibadan, Nigeria (YRI).
Authors' contributions
HS had the idea to generalize EBAM and to adapt EBAM to SNPs, implemented the software, and wrote the paper. KI was involved in the development of EBAM for categorical data and the design of the applications. Both authors read and approved the final manuscript.
Acknowledgements
Financial support of the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of Complexity in Multivariate Data Structures") is gratefully acknowledged. The authors also would like to thank the reviewers for their helpful comments.
References

Westfall PH, Young SS: Resamplingbased multiple testing: examples and methods for pvalue adjustments. New York, NY: Wiley; 1993.

Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing.

Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98:51165124. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment.

Schwender H, Krause A, Ickstadt K: Identifying interesting genes with siggenes.

Schwender H: Modifying microarray analysis methods for categorical data – SAM and PAM for SNPs. In Classification – The Ubiquitous Challenge. Weihs C, Gaul W edition. Springer, Heidelberg; 2005:370377.

Schwender H: Statistical analysis of genotype and gene expression data. PhD thesis. University of Dortmund, Department of Statistics; 2007.

The International HapMap Consortium: The International HapMap Project.
Nature 2003, 426:789796. PubMed Abstract  Publisher Full Text

Storey JD, Tibshirani R: Statistical significance of genomewide studies.
Proc Natl Acad Sci USA 2003, 100:94409445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays.
Genet Epidemiol 2002, 23:7086. PubMed Abstract  Publisher Full Text

Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.
Stat Appl Genet Mol Biol 2004, 3:Article 3. PubMed Abstract

Storey JD, Tibshirani R: SAM thresholding and false discovery rates for detecting differential gene expression in DNA microarrays. In The Analysis of Gene Expression Data: Methods and Software. Edited by Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. Springer, New York; 2004:272290.

Silverman BW: Density estimation for statistics and data analysis. London: Chapman and Hall; 1986.

Chen SX: Probability density functions estimation using gamma kernels.

Scaillet O: Density estimation using inverse and reciprocal inverse Gaussian kernels.

Efron B, Tibshirani R: Using specially designed exponential families for density estimation.

Freedman D, Diaconis P: On the histogram as a density estimator: L_{2 }theory.

Scott DW: Multivariate density estimation: theory, practice, and visualization. New York: Wiley; 1992.

Bickel DR: Robust estimators of the mode and skewness of continuous data.

Hedges SB, Shah R: Comparison of mode estimation methods and application in molecular clock analysis.
BMC Bioinformatics 2003, 4:31. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cochran WG: Some methods for strengthening the common χ^{2 }tests.

R Development Core Team: [http://www.Rproject.org] webcite
R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2007.
ISBN 3900051070

Affymetrix: BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500 k array set. Tech rep, Affymetrix, Santa Clara, CA; 2006.

Affymetrix – Mapping 500 k genotype calls on 270HapMap samples [http:/ / www.affymetrix.com/ support/ technical/ sample_data/ 500k_hapmap_genotype_data.affx] webcite

Schwender H, Zucknick M, Ickstadt K, Bolt HM: A pilot study on the application of statistical classification procedure to molecular epidemiological data.

The single nucleotids polymorphism database (dbSNP) [http://www.ncbi.nlm.nih.gov/projects/SNP] webcite

Garte S: Metabolic susceptibility genes as cancer risk factors: time for a reassessment?
Cancer Epidemiol Biomarkers Prev 2001, 10(12):12331237. PubMed Abstract  Publisher Full Text

Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect.
Am J Hum Genet 2002, 70:461471. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Marchini J, Donnely P, Cardon RC: Genomewide strategies for detecting multiple loci that influence complex diseases.
Nat Genet 2005, 37:413416. PubMed Abstract  Publisher Full Text

Justenhoven C, Hamann U, Pesch B, Harth V, Rabstein S, Baisch C, Vollmert C, Illig T, Ko Y, Brüning T, Brauch H: ERCC2 genotypes and a corresponding haplotype are linked with breast cancer risk in a German population.
Cancer Epidemiol Biomarker Prev 2004, 13(12):20592064. PubMed Abstract  Publisher Full Text

Schwender H, Ickstadt K: Identification of SNP interactions using logic regression.
Biostat 2008, 9(1):187198. PubMed Abstract  Publisher Full Text

Ruczinski I, Kooperberg C, LeBlanc M: Exploring interactions in highdimensional genomic data: an overview of logic regression, with applications.

Kooperberg C, Ruczinski I, LeBlanc M, Hsu L: Sequence analysis using logic regression.
Genet Epidemiol 2001, 21(Suppl 1):S626S631. PubMed Abstract

Witte JS, Fijal BA: Introduction: analysis of sequence data and population structure.

BioConductor project [http://www.bioconductor.org] webcite

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. [http://genomebiology.com/2004/5/10/R80] webcite
Genome Biol 2004, 5:R80. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schwender H, Ickstadt K: Imputing missing genotypes with k nearest neighbors. Tech rep., Collaborative Research Center 475, Department of Statistics, University of Dortmund; 2008.

Rabbee N, Speed TP: A genotype calling algorithm for Affymetrix SNP arrays.
Bioinformatics 2006, 22:712. PubMed Abstract  Publisher Full Text

Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls for highdensity oligonucleotide SNP array data.
Biostat 2007, 8(2):485499. PubMed Abstract  Publisher Full Text

Boulesteix AL, Strobl C, Weidinger S, Wichmann HE, Wagenpfeil S: Multiple testing for SNPSNP interactions.
Stat Appl Genet Mol Biol 2007., 6(37) PubMed Abstract

logicFS version 1.8.0 [http://bioconductor.org/packages/2.1/bioc/html/logicFS.html] webcite