Abstract
Background
Complex binary traits are influenced by many factors including the main effects of many quantitative trait loci (QTLs), the epistatic effects involving more than one QTLs, environmental effects and the effects of geneenvironment interactions. Although a number of QTL mapping methods for binary traits have been developed, there still lacks an efficient and powerful method that can handle both main and epistatic effects of a relatively large number of possible QTLs.
Results
In this paper, we use a Bayesian logistic regression model as the QTL model for binary traits that includes both main and epistatic effects. Our logistic regression model employs hierarchical priors for regression coefficients similar to the ones used in the Bayesian LASSO linear model for multiple QTL mapping for continuous traits. We develop efficient empirical Bayesian algorithms to infer the logistic regression model. Our simulation study shows that our algorithms can easily handle a QTL model with a large number of main and epistatic effects on a personal computer, and outperform five other methods examined including the LASSO, HyperLasso, BhGLM, RVM and the singleQTL mapping method based on logistic regression in terms of power of detection and false positive rate. The utility of our algorithms is also demonstrated through analysis of a real data set. A software package implementing the empirical Bayesian algorithms in this paper is freely available upon request.
Conclusions
The EBLASSO logistic regression method can handle a large number of effects possibly including the main and epistatic QTL effects, environmental effects and the effects of geneenvironment interactions. It will be a very useful tool for multiple QTLs mapping for complex binary traits.
Keywords:
QTL mapping; Binary traits; Epistatic effects; Bayesian shrinkage; Logistic regressionBackground
Quantitative traits are usually influenced by multiple quantitative trait loci (QTLs), environmental factors and their interactions [1]. Although complex binary traits only show binary phenotypic variation different from the continuous variation in quantitative traits, they do not follow a simple Mendelian pattern of inheritance and also have a polygenic basis similar to that of quantitative traits. Therefore, like QTL mapping for quantitative traits, mapping for complex binary traits aims to identify multiple genomic loci that are associated with the trait and to estimate the genetic effects of these loci possibly including any of the following effects: main effects, genegene interactions (epistatic effects) and effects of geneenvironment interactions.
A number of statistical methods have been developed to identify QTLs for binary traits in experimental crosses. SingleQTL mapping methods [28] analyze the association between each individual genetic locus and the trait independently. However, singleQTL mapping method can only find the main effect of a QTL and cannot detect epistatic effects involving more than one locus. Moreover, it has been shown both theoretically and empirically that multipleQTL methods can improve power in detecting QTLs and eliminate possible biases in the estimates of QTL locations and genetic effects introduced by a singleQTL model [9,10]. Therefore, several multiple QTL mapping for binary traits have been developed. These include Bayesian methods [1116] that rely on Markov Chain Monte Carlo (MCMC) simulation to infer a multiple QTL threshold model for binary, ordinal or longitudinal traits, the multipleinterval mapping (MIM) methods [17,18] that use the expectationmaximization (EM) algorithm to infer the threshold model or a generalized linear model for binary or ordinal traits, and a method [19] that employs a probability model based on classical transmission genetics to identify binary trait loci. However, all these methods require large computation when the QTL model includes a relatively large number of loci.
Since hundreds or thousands of genomic loci or markers are usually genotyped and involved in QTL mapping studies, including all these markers and their possible interactions in a single model for multipleQTL mapping leads to a huge number of model variables, typically much larger than the sample size. This not only entails huge computation that is not affordable to existing QTL mapping methods mentioned earlier but also may reduce power of detection and/or increase false discover rate. Two techniques have been proposed to handle the second problem: variable or model section and shrinkage. More specifically, model selection using the Akaike or Bayesian information criterion (AIC or BIC) and variable selection based on stepwise logistic regression and the BIC have been proposed in [19] and [18], respectively, to restrict the model space and to reduce the number of variables included in the model. Bayesian shrinkage method that was first applied to QTL mapping for continuous traits [15,2024] has also been used in QTL mapping for binary traits [15,22], which employed MCMC for the inference of the QTL model. To reduce computational burden, more efficient methods [25,26] were developed to infer Bayesian QTL models for binary traits. Another wellknown method for shrinking variables is the least absolute shrinkage and selection operator (LASSO) [27], which has been applied to QTL mapping for continuous traits [24] and investigated for genomewide association study (GWAS) of complex diseases [28,29].
Recently, we developed an efficient empirical Bayesian LASSO (EBLASSO) algorithm for multipleQTL mapping for continuous traits, which is capable of handling a large number of markers and their interactions simultaneously [30]. In this paper, we extend the linear Bayesian LASSO model [23,30,31] to logistic regression to map multiple QTLs for binary traits. We consider a threelevel and a twolevel hierarchical model for the prior distributions of the regression coefficients. Building on the EBLASSO algorithm [30], we develop efficient empirical Bayesian algorithms to infer the Bayesian LASSO logistic regression model with two different priors. We then use simulations to compare the performance of our EBLASSO with that of five other QTL mapping methods for binary traits, that include the LASSOlogistic regression [27,28], the HyperLasso [25], the Bayesian hierarchical generalized linear models (BhGLM) [26], the relevant vector machine (RVM) [32,33], and the singleQTL mapping method based on logistic regression. Simulation results show that our EBLASSO offers best overall performance among the examined methods in terms of power of detection and false positive rate. Analysis of a real dataset with our algorithms also identifies several QTLs.
Methods
Logistic regression model
Let y_{i} = 0 or 1 denote the binary trait of the ith sample of n individuals in a study. Let us define y = [y_{1}, y_{2}, ···, y_{n}]^{T} as the binary phenotypes for all the n individuals. The probability of observing y_{i} = 1 is written as p_{i} = Pr(y_{i} = 1), i = 1, ···, n, which are further collected into a vector p = [p_{1}, p_{2}, ···, p_{n}]^{T}. Suppose that m genetic markers of these n individuals are genotyped and the genotype of marker j of individual i is x_{ij}. Taking main and epistatic effects of all markers into consideration, we have the following logistic regression model for multipleQTL mapping:
where β_{j} and β_{jj’} are regression coefficients, and logit(p_{i}) is defined as:
The widely adopted Cockerham genetic model [34] will be used in this paper. For a backcross design, the Cockerham model assigns −0.5 and 0.5 to x_{ij} for two possible genotypes at marker j. For an intercross (F_{2}) design, there are two possible main effects named additive and dominance effects. The Cockerham model defines the values of the additive effect as −1, 0 and 1 for the three genotypes and the values of the dominance effect as −0.5 and 0.5 for homozygotes and heterozygotes, respectively. For simplicity, we only consider additive effects in (1), although the methods developed in this paper are also applicable to the model with dominance effects.
Let us define x_{Gi} = [x_{i1}, x_{i2}, ⋯, x_{im}]^{T}, and
From (3), we can express p_{i} as follows:
Note that there are k = 1 + m(m + 1)/2 unknown regression coefficients in (1) or (3). Typically, we have k ≫ n. If dominance effects of the markers are considered, k is even larger. Simultaneously estimation of all possible genetic effects with k ≫ n is a challenging problem. However, we would expect that most elements of β are zeros and thus we have a sparse model. We will exploit this sparsity to develop efficient methods to infer β.
Prior distributions for the regression coefficients
We assign a noninformative uniform prior to β_{0}, i.e., p(β_{0}) ∝ 1. For β_{i}, i = 1, 2,···, k, we will consider two hierarchical models for the prior distribution. The first model is the one used in both the linear Bayesian LASSO model [23,30,31] and the HyperLasso logistic regression [25,35], which has three levels. At the first level, β_{i}, i = 1, 2,···, k, follows an independent normal distribution with mean zero and variance σ_{i}^{2}: β_{i} ∼ N(0, σ_{i}^{2}). At the second level, σ_{i}^{2}, i = 1, 2, ···, k, follows an independent exponential distribution with probability density function p(σ_{i}^{2}) = λexp(−λσ_{i}^{2}), with parameter λ > 0. At the third level, λ follows a gamma distribution gamma(a, b) with a shape parameter a and an inverse scale parameter b. We name the logistic regression model with this normalexponentialgamma (NEG) prior as BLASSONEG.
The threelevel hierarchical model has two hyperparameters a and b. While these two parameters give much flexibility of adjusting the degree of shrinkage, significant computation is required for cross validation to properly choose their values. To reduce the computational burden of cross validation, we will also consider a model with only the first two levels in the BLASSONEG. This two level hierarchical model only has one hyperparameter λ to be adjusted, and thus, it requires less computation. We name the logistic regression model with the twolevel prior as BLASSONE. We will next develop two empirical Bayes (EB) algorithms to infer these two models.
Empirical Bayesian algorithm for the BLASSONEG model (EBLASSONEG)
Using the EB approach [36], we will first estimate σ_{i}^{2}, i = 1, 2,⋯, k, from the data, and then find the posterior distribution of β based on the estimated σ_{i}^{2}. As shown in [30], the prior distribution of σ_{i}^{2} can be found as
Let us define
where
Let α_{i} = 1/σ_{i}^{2}, i = 1, 2,⋯, k, and collect them in a vector
The gradient g and Hessian matrix H of
where B_{MAP} is obtained with
If we postulate a linear model
Therefore, in the ith iteration we form the following linear model:
where ŷ and the distribution of ε are defined earlier, but β follows the threelevel BLASSONEG model. Then based on the linear model (9), we
use the EBLASSO algorithm
[30] to get a new estimate of A, which replaces Â obtained in the (i1)th iteration. The iteration process goes on until certain convergence criterion
is satisfied, which gives the final estimate of A and Laplace approximation
Algorithm 1 (EBLASSONEG)
1. Initialization: choose a > −1.5 and b > 0. The first basis in the model, x_{j}, is identified by
2. While the convergence criteria are not satisfied
3. Given Â, use the Newton–Raphson method to find
4. Calculate
5. Apply the EBLASSO algorithm [30] to linear model (9) to update Â.
6. End while
7. Output
Note that the algorithm starts with a logistic regression model with only one variable and then iteratively adds variables with a finite α_{i} to the model. The number of variables in the model k_{m} is typically much smaller than the total number of possible variables k. Since we only need to calculate the gradient g and the Hessian matrix H for the k_{m} variables in the model, the computation required in step 3 and in the calculation of Σ in (8) is relatively small. Moreover, the EBLASSO algorithm in step 5 is very efficient due to the fact that the variance components can be estimated in a closed form and other algorithmic techniques as discussed in [30].
The convergence criteria in Algorithm 1 are defined as: 1) no effect can be added to or delete from the model, 2) the likelihood change between two consecutive iterations is less than a prespecified small value and 3) the total change of α between two consecutive iterations is less than a prespecified small value. In step 5, the EBLASSO algorithm [30] needs to be modified slightly. First, since the noise covariance is known as B^{1}, we do not estimate noise covariance. Second, since the mean of ŷ in (9) is zero, we do not need to estimate it. Third, we use the formula Σ = (X^{T}BX + A)^{− 1} to update the covariance of β.
The values of hyperparameters (a, b) are determined by cross validation as will be described in the Result section. The
first basis in the initialization step is determined using the method in LASSOlogistic
regression
[28]. The initial value
Suppose that
We next derive an efficient EB algorithm for the BLASSONE model, which simplifies the hyperparameter selection, since we only need to determine the optimal value of one hyperparameter λ using cross validation.
Empirical Bayesian algorithm for the BLASSONE model (EBLASSONE)
The prior distribution of σ_{i}^{2}, i = 1, 2, …, k or equivalently α_{i}, i = 1, 2, …, k is used only in step 5 of the EBLASSONEG algorithm. Because the EBLASSONE model uses a different prior distribution from the one used in the BLASSONEG model, we will derive a new formula for estimating A in each iteration. Then the EBLASSONE algorithm uses the same steps in the EBLASSONEG except that the new formula is used in step 5 to find Â.
Suppose that the postulated linear model after step 4 is
where
where
where
Additional file 1. Derivation of equation (12). Replicates 2 and 3 for the simulations with only main effects.
Format: PDF Size: 1.2MB Download file
This file can be viewed with: Adobe Acrobat Reader
The EBLASSONE algorithm has the same steps as those in Algorithm 1 but with the following
two modifications. First, we need to choose a value for parameter λ instead of a and b in step 1, which can be done using cross validation. In the LASSOlogistic regression
[28], an upper bound of λ was estimated to be
Note that since the EBLASSONEG model for logistic regression considered in this paper uses the same prior as the one used by the EBLASSO for linear regression considered in [30], the EBLASSONEG algorithm is essentially a combination of Laplace approximation and the EBLASSO algorithm [30]. The EBLASSONE model considered here however uses a prior different from the one used by the EBLASSO [30]. Therefore, we employ (12) to modify the EBLASSO algorithm in [30], and then the EBLASSONE algorithm is a combination of Laplace approximation and the modified EBLASSO algorithm.
Results
Simulation study
A total of m = 481 genetic markers were simulated to be evenly spaced on a large chromosome of 2400 centiMorgan (cM) with an interval of d = 5 cM, which gives rise to a correlation of R = e^{2d} = 0.9048 since the Haldane map function [40] was assumed. The simulated population was an F_{2} family derived from cross of two inbred lines. The dummy variable for the three genotypes, A_{1}A_{1}, A_{1}A_{2} and A_{2}A_{2} of individual i at marker j was defined as x_{ij} = 1, 0, 1, respectively. Two simulation setups were employed. In the first setup, 20 markers were QTLs with main effects but without interactions. In the second setup, 10 main and 10 epistatic effects were simulated; a marker could have both main and epistatic effects, while two markers involving in an interaction effect did not necessarily have main effects. The QTLs were selected randomly with varying distances (5 cM  500 cM) and effect sizes (in the range between −1.28 and 2.19). Note that QTLs were assumed to be coincided with markers in both simulation setups. If QTLs are not on markers, they may still be detected since correlation between a QTL and a nearby marker is high, although a slightly larger sample size may be needed to give the same power of detection.
EBLASSONEG and EBLASSONE algorithms were implemented in C and could be called from the R environment [41], and thus QTL mapping with these two algorithms were carried out in R. To compare the performance of our algorithms with that of other relevant algorithms, we also analyzed the simulated data with the following five QTL mapping methods: the LASSO regularized logistic regression implemented in the glmnet package [42], the HyperLasso [25], the BhGLM method [26], the RVM [32,33], and the single QTL logistic regression test using the glm procedure in R. Simulation results from these methods are presented next.
Simulation result for the dataset with only main effects
The genotypes of m = 481 markers of n = 500 individuals were generated using the procedure described earlier. Twenty markers
were chosen as QTLs; their IDs and effect sizes are given in Table
1. Let x_{i} be a 20 × 1 vector containing the genotypes of 20 QTLs of individual i, and β contain corresponding effect sizes. Then probability p_{i} was calculated from
Table 1. True and estimated effects for the simulated data with main effects
The average log likelihood (denoted as logL) from tenfold cross validation was used to select the optimal value of the hyperparameter(s) in the EBLASSONE, the EBLASSONEG, the LASSO. Specifically, the dataset was first divided into 10 subsets. Nine subsets were used as the training data to estimate model parameters and the log likelihoods of the remaining testing data were calculated using the estimated parameters. This process was repeated ten times until every subset had been tested. The logL was the average of all the likelihoods obtained from 10 testing datasets.
For the EBLASSONE, we first calculated λ_{max} as described earlier. We then chose a set of values for λ decreasing from λ_{max} to 0.001 at a step of 0.35 on the logarithmic scale. Tenfold cross validation using this set of values identified an optimal value denoted as λ_{1.} We next zoomed in the interval of length 0.01 centering at λ_{1}, and performed cross validation using ten more values equally spaced in the interval. This procedure identified the largest logL at the optimal λ = 0.050. A summary of results for several values of λ and the corresponding logL was given in Table 2. Using the optimal values of λ, the EBLASSONE detected 11 true and 2 false positive effects that have pvalue ≤ 0.05. The estimated sizes of the true effects and their standard errors were given in Table 1.
Table 2. Crossvalidations of the EBLASSONE, EBLASSONEG and LASSO for the simulation with only main effects
The optimal values of parameters a and b in the EBLASSONEG were obtained with cross validation in three steps. In the first step, a = b = 0.001, 0.01, 0.1, 1 were examined and a pair (a_{1}, b_{1}) corresponding to the largest logL was obtained. In the second step, b was fixed at b_{1} and a was chosen from the set [−0.5, 0.4, 0.3, 0.2, 0.1, 0.01, 0.01, 0.05, 0.1, 0.5, 1], which yielded an a_{2} corresponding to the largest logL. Note that when fixing one of the two parameters, the degree of shrinkage is a monotonic function of the other parameter. In the third step, a = a_{2} was fixed and b was varied from 0.01 to 10 with a step size of one for b > 1 and a step size of one on the logarithmic scale for b < 1. This threestep procedure identified the optimal pair of parameters that maximized logL at (a, b) = (0.01, 6). A summary of results for several pairs of a and b and the corresponding logL was given in Table 2. The whole dataset was then analyzed with the optimal parameters, which identified 11 true and 1 false positive effects that have pvalue ≤ 0.05. The estimated sizes of true effects and their standard errors were depicted in Table 1.
For the LASSOlogistic regression, the cross validation procedure in the glmnet package [42] automatically identified a maximum value λ_{max} for λ that gave one nonzero effect and then λ was decreased from λ_{max} to λ_{min} = 0.001λ_{max} with a step size of [log(λ_{max}) − log(λ_{min})]/100 on the logarithmic scale. The largest λ that yielded a logL within one standard error of the maximum logL was determined as the optimal value. This gave the optimal values λ = 0.0257. The whole dataset was then analyzed using the optimal λ to estimate QTL effects, which identified 40 markers with nonzero regression coefficients. The LASSO only estimates regression coefficients without giving a confident interval or a pvalue for the estimates. If we counted all nonzero coefficients as detected effects and considered them as one effect if they were in 20 cM from a QTL, the LASSO yielded 14 true positive effects and 12 false positive effects. Moreover, the LASSO typically gives a biased estimate of nonzero coefficient toward zero. To reduce the false positive rate and the bias, we refitted an ordinary logistic regression model with the markers selected by the LASSO. Among those markers with a pvalue ≤ 0.05 in the refitted model, 6 markers were true positive effects 4 were false positive effects. A summary of results from cross validation for the EBLASSONE, the EBLASSONEG and LASSO was given in Table 2. The estimated sizes of true effects and their standard errors were depicted along with all 20 true effects in Table 1.
The HyperLasso employed the same Bayesian NEG hierarchical prior for the marker effects and estimated the posterior modes using a numerical algorithm [25]. Hoggart et al. did not propose a cross validation procedure to determine the values of a and b but suggested a range from 0.01 to 10 for a and gave a formula to calculate b from the level of the typeI error controlled at α. In our simulations, we used three values for a and four values for α as listed in Table 3. The values of b calculated from α using the method in [25] is also included in Table 3. Similar to the LASSO, the HyperLasso outputs a point estimate of β without a confidence interval or a pvalue. Therefore, we refitted the markers selected by the HyperLasso with ordinary logistic regression and identified markers with a pvalues ≤ 0.05 as QTLs. The number of effects identified with different values of a and b are presented in Table 3. The best results in Table 3 include 10 true and 1 false positive effects. We would emphasize here that these best results may not be achievable in the analysis of real data because the optimal values of a and b cannot be determined. The estimated sizes of true effects and their standard errors for the best results in Table 3 were depicted along with all 20 true effects in Table 1.
Table 3. Summary of results of the HyperLasso for the simulated data with only main effects
The BhGLM method [26] employed a twolevel Bayesian hierarchical prior for the marker effects: the ith entry of β follows a normal distribution N(0, σ_{i}^{2}) and σ_{i}^{2} obeys an inverse χ^{2} distribution Inv  χ^{2}(υ_{i}, τ_{i}^{2}), and it used the EM algorithm to estimate the posterior mode of the Bayesian QTL model. The default value for hyperparameters ν_{i} and τ_{i} are 0.01 and 10^{4}, respectively. The variance of regression coefficients in BhGLM method was treated as missing data and estimated in the Estep of the EM algorithm. The pvalue of each nonzero effect was calculated from the tdistribution of one degree of freedom using the tstatisrics calculated from the estimated regression coefficient and corresponding variance. We examined 15 different pairs of values for ν_{i} and τ_{i} as listed in Table 4. The values ν_{i} = 10^{3}, 10^{4} and 10^{5} all gave the best result which includes 9 true and 0 false positive effects. The corresponding estimated sizes of the true effects and their standard errors were depicted in Table 1.
Table 4. Summary of results of the BhGLM for the simulated data with only main effects
The RVM for classification [33] assumed a uniform prior for the variance of regression coefficients, and thus it did not involve parameter selection. We utilized the MATLAB implementation of the RVM from the authors [33] to analyze the datasets and identified 35 nonzero effects with a pvalue ≤ 0.05. Among these effects, we have 17 true and 18 false positives. The estimated sizes of true effects and their standard errors were depicted in Table 1. The false positive rates were much higher that the EBLASSONEG, EBLASSONE, LASSO, and HyperLasso. When we reduced the number of false positive effects to three, at a level slightly higher than that of our EBLASSONEG and EBLASSONE, by reducing the cutoff for the pvalue, we obtained 10 true positive effects, which were smaller than the number of true effects identified with our EBLASSONEG and EBLASSONE.
Single QTL mapping with logistic regression was performed using the glm procedure in R [41]. After Bonferroni correction, effects with a pvalue ≤ 0.05/481 = 1.04 × 10^{4} were considered as significant, which identified 8 true and 25 false positive effects. The estimated sizes of true effects and their standard errors were depicted in Table 1. The small pvalue cutoff used by Bonferroni correction was expected to yield a small false positive rate. However, the single QTL mapping method with Bonferroni correction still gave much more false positive effects than other methods. If we had used another popular permutation technique for multiple test correction, we effectively employed a larger pvalue cutoff. Although this could increase power of detection, it would also increase the false positive rate. To see this, we increased the cutoff for the pvalue to 6 × 10^{4} so that the number of true positive effects detected was increased to 11, at a level same as that of our EBLASSONEG and EBLASSONE methods, but then the number of false positive effects was increased to 27.
As shown in Table 1, the EBLASSONE and the EBLASSONEG identified more true effects than other four methods except RVM, and yielded a number of false positive effects comparable to those of the LASSO, the HyperLasso and the BhGLM but much smaller than those of the RVM and the singleQTL mapping method. Note that the false negative rate can be easily calculated from 20 simulated true effects and the true positive effects detected by each method. While the EBLASSONE and the EBLASSONEG offered similar performance in terms of power of detection and the false positive rate, several true effects were detected by either of them, which implies that the power of detection could be improved if the results of two methods were combined. Similar observations can be seen from the simulation results for two more independent replicates described in Additional file 1: Table S1 and Table S2.
It is well known that the LASSO typically selects only one variable among a set of highly correlated variables. This phenomenon is indeed observed from the results in Tables 1, Additional file 1: Tables S1 and S2 for two pairs of highly correlated markers, (72,73) and (181,182). It turns out that all methods compared except the single QTL mapping method have the same problem, although the problem with the EBLASSONE tends to be less severe. The LASSO is also known to bias the regression coefficients toward zero. This is observed from the results in Table 1, Additional file 1: Tables S1 and S2. Since the EBLASSONE uses the same prior distribution as the LASSO, it exhibits the same trend. However, the RVM inflated all the detected effects likely due to its small degree of shrinkage. On the other hand, the EBLASSONEG, the HyperLasso and the BhGLM tend to detect only one of the two highly correlated effects with an inflated effect size.
All simulations were performed on a personal computer (PC) with a 3.4 GHz Intel PentiumD CPU and 2Gb memory running Windows XP, except that the HyperLasso was ran on an IBM BladeCenter cluster including computing nodes with 2.6 GHz Xeon or 2.2 GHz Opteron CPU running Linux. The speeds of the EBLASSONEG, the LASSO and the HyperLasso are comparable and faster than the other methods. The speeds of the EBLASSONE, the BhGLM and the singleQTL mapping method are comparable.
Simulation results for the model with main and epistatic effects
In the second simulation setup, 10 main and 10 epistatic effects were simulated. The
genotypes of m = 481 markers of n = 1,000 individuals were generated using the procedure described earlier. Locations
and effects of the QTLs and QTL pairs are shown in Table
5. Some of the markers had only main or epistatic effect, while the others had both
main and epistatic effects. The status of the binary trait of each individual was
generated using the logistic regression model as described in the first simulation
setup. The QTL model contained a total of
Table 5. True and estimated effects for the simulated data with main and epistatic effects
The same crossvalidation procedures described earlier were performed to choose the optimal values of the hyperparameters for the EBLASSONE, the EBLASSONEG and LASSO, and the results for several values of hyperparameters are presented in Table 6. Optimal values of the hyperparameters were then used to analyze the whole dataset. The number of true and false positive effects identified and the estimated effect sizes of the detected true effects are given in Table 5.
Table 6. Crossvalidations of the EBLASSONE, EBLASSONEG and LASSO for the simulation with main and epistatic effects
For the HyperLasso, we again examined 12 pairs of values for hyperparameters a and b as listed in Table 7, and identified a = 0.1 and b = 4.9 × 10^{4} as the optimal values that gave best tradeoff between the true and false positive effects. We also used a twolocus logistic regression model logit(p_{i}) = β_{0} + β_{1}x_{i} + β_{2}x_{j} + β_{3}x_{i} ⋅ x_{j}, i = 1, ⋯, m1, j > i, to test the epistatic effect of locus i and j. The logistic regression model was fitted with the glm procedure in R [41]. Effects with a pvalue ≤ 0.05/k = 4.31 × 10^{7} were considered as significant. Detailed QTL mapping results for the HyperLasso and the twolocus test are also given in Table 5.
Table 7. Summary of results of the HyperLasso for the simulated data with main and epistatic effects
As shown in Table 5, the EBLASSONE and the EBLASSONEG detected the same or a larger number of true effects but a smaller number of false positive than the other methods, which clearly demonstrates that our EBLASSONE and EBLASSONEG offer the best overall performance in terms of power of detection and the false positive rate. The LASSO is the fastest, while the EBLASSONEG and the HyperLasso are the second and the third fastest. Similar observations were obtained from two more independent simulations, as shown in the results for replicates 2 and 3 whose details were presented in the Additional file 1.
Real data analysis
We used a mouse data published by Masinde et al.[43] as an example to test our methods. The trait was wound healing speed of mice measured as a binary trait, fast healer denoted by 1 and slow healer denoted by 0. There were 633 F_{2} mice derived from the cross of a faster healer inbred line (MRL/MPj) and a slow healer inbred line (SJL/J). At age 3 weeks, each F_{2} mouse was punched a 2mm hole in the lower cartilaginous part of each ear using a metal ear puncher. The fast healer mice completely healed in 21 days after ear punch (complete closure of the holes) while the slow healer mice remained open for the holes after 21 days of ear punch. Some of the F_{2} mice healed partially and these mice were phenotypically coded as 1 if the holes were < 0.7 mm and 0 if the holes were > 0.7 mm. This dataset consisted of the genotypes of 119 markers across the mouse genome from 633 samples. Samples with more than 10% of missing markers or with missing phenotype were removed, resulting in a 532 × 119 genotype matrix with 3.28% missing values. These missing genotypes were inferred from neighboring markers. Total number of possible effects is k = 7141.
We carried out QTL mapping for this dataset using the EBLASSONE, the EBLASSONEG, the LASSO and the HyperLasso, since simulation results presented earlier show these four methods offer better performance than the other methods. Tenfold cross validation for the EBLASSONE, the EBLASSONEG and the LASSO were performed with the same procedures used in simulation studies to obtain optimal values of the hyperparameters. For the EBLASSONE, λ = 0.4 was determined as the optimal value, which resulted in 7 main and 4 epistatic effects with a pvalue ≤ 0.05. For the EBLASSONEG, (a, b) = (0.01, 0.5) were the optimal values which resulted in 4 main and 4 epistatic effects with a pvalue ≤ 0.05. For the LASSO, λ = 0.0715 was the optimal value, which yielded 13 nonzero effects. Refitting an ordinary logistic regression model with these 13 effects to the data identified 7 main and 3 epistatic effects with a pvalue ≤ 0.05. When α = 0.05/k and 0.01/k, the HyperLasso only identified 1 or 2 main effects; when α = 0.05, it identified more than 40 effects. On the other hand, when a = 0.1 and α = 0.01, it identified 8 main and 17 epistatic effects, all having a pvalue ≤ 0.05, which seemed a more reasonable result than the ones obtained with other values of a and α. Therefore, these 25 effects were regarded as the effects identified by the HyperLasso. The effects identified by at least 3 of the 4 methods are listed in Table 8, which contain 7 main and 3 epistatic effects.
Table 8. Results for the real data obtained with EBLASSONE, EBLASSONEG, LASSO and HyperLasso
Masinde et al.[43] previously identified 10 main effects using a singleQTL mapping method, and 8 epistatic effects using twoway ANOVA. Among the 7 main effects and 3 epistatic effects identified in our analysis, 6 of the main effects are identical to those identified by Masinde et al. but all 3 epistatic effects are different from the ones identified by Masinde et al. One of the markers involved in an epistatic effect (D4mit31) identified in our analysis was a main effect identified by Masinde et al. Since our QTL model considers the main and epistatic effects jointly, it may account for both effects more reliably, comparing with the singleQTL mapping approach and the two way ANOVA used by Masinde et al. Therefore, the three epistatic effects identified in our analysis may be novel effects that are worth further experimental investigation.
Some of the identified QTLs are in positions close to the genes that are upregulated in expression profiles obtained during the inflammation stage of wound healing [44]. Loci D3mit217 and D9mit270 were identified as main effects in both our analysis and the study of Masinde et al. It turns out that D3mit217 (chr3, 34.7 cM) is close to genes calgranulin A (chr3, 43.6 cM), CD53 (chr3, 50.5 cM), and small prolinerich protein 1A (chr3, 45.2 cM), and that D9mit270 (chr9, 41.5 cM) is close to gene annexin A2 (chr9, 37.0 cM). Locus D9mit182 was identified as a main effect in our study but not identified as an effect in the study of Masinde et al. It was found that D9mit182 (chr9, 53.6 cM) is close to chemokine receptor 2 (chr9, 71.9 cM). Among the loci in the three epistatic effects we identified, D11mit242 (chr11, 31.7 cM) is close to chemokine (CC motif) ligase 4 (chr11, 47.6 cM) and chemokine (CC motif) ligase 6 (chr11, 41.5 cM). Genes related to growth factors are known to play an important role in wound healing [43,45]. D7mit246 and D17mit176 are involved in the epistatic effects we identified; and D7mit246 is 5.0 cM away from the fibroblast growth factor receptor 2 (FGFR 2), and D17mit176 is 12.2 cM away from the vascular endothelial growth factor (VEGF).
Discussion
Our EBLASSONEG algorithm is based on a Bayesian logistic regression model that uses the same threelevel hierarchical prior for the regression coefficients as the one used in the Bayesian LASSO linear regression model [23,30,31], the Bayesian hyperLasso linear regression model [35] and the HyperLasso logistic regression model [25]. The HyperLasso of Hoggart et al.[25] uses a numerical algorithm to estimate the mode of the posterior distribution, whereas our EBLASSONEG first estimates the variance of the regression coefficients and then find an approximation of the posterior distribution of the regression coefficients based on the estimated variance. As shown in our simulation study, our EBLASSONEG offers better performance than the HyperLasso of Hoggart et al. in terms of power of detection, false positive rate and speed, especially when the number of possible effects is very large in QTL models with both main and epistatic effects.
The LASSOlogistic regression was applied to GWAS to identify genomic loci associated with complex disease [28,29], and it can be directly employed in QTL mapping for binary traits as shown in our simulation study and real data analysis. The LASSOlogistic regression particularly implemented with the glmnet algorithm [42] is very efficient. Our EBLASSONE algorithm and the LASSOlogistic regression essentially employ the same twolevel prior for the regression coefficients. However, the major difference is that the LASSOlogistic regression estimates the mode of the posterior distribution, whereas our EBLASSONE algorithm first estimates the variances of the regression coefficients and then finds the posterior distributions of the regression coefficients. In our simulation study, we demonstrated that both our EBLASSONE and EBLASSONEG algorithms outperform the LASSOlogistic regression in terms of power of detection and false positive rate, although their speed is lower than that of the LASSO. The good performance of our EBLASSONE and EBLASSONEG may be due to the fact that our model inference using the variance and posterior distribution of the regression coefficients provides more information than the point estimation of the regression coefficients yielded by the LASSOlogistic regression.
Another prior distribution commonly employed in Bayesian shrinkage is the mixture of normal and inverseχ^{2} distributions as used in the Bayesian linear regression model for continuous traits [23,24,46] and the generalized linear model for continuous or binary traits in the BhGLM method [26]. The BhGLM method uses the EM algorithm to estimate the mode of the posterior distribution treating the variance of regression coefficients as missing data. As shown in our simulations for QTL models with only main effects, our EBLASSONEG and EBLASSONE offer power of detection better than and a false positive rate comparable to the BhGLM method. The speed of the EBLASSONEG is much higher than that of the BhGLM method, while the speed of the EBLASSONE is comparable to that of the BhGLM method. However, the BhGLM method was not able to handle a QTL model with 115,921 variables and 1,000 samples in our simulation, whereas our EBLASSONEG and EBLASSONE completed the analysis of this model within 5 min and 35 min, respectively.
Our EBLASSONEG and EBLASSONE use the same Laplace’s method originally proposed in [38] as the one used in the RVM for classification [32,33]. However the prior distributions used by three methods are different. Although both the uniform prior and the inverse gamma prior for the variance of regression coefficients were considered in [32,33], the more efficient RVM algorithm [33] employs the uniform prior. The uniform prior does not have any hyperparameter and lacks flexibility of adjusting the degree of shrinkage that our EBLASSONEG and EBLASSONE enjoy. The uniform prior of the RVM causes at least two problems. First, it often does not provide sufficient degree of shrinkage in multiple QTL mapping that includes a very large number of possible effects, which results in a large number of false positive, as observed in our simulations in this paper and in the simulation study for QTL mapping for continuous traits in [30,46]. Second, because of the relatively low degree of shrinkage, the regression model usually contains a relatively large number of nonzero regression coefficients, which reduces the speed of the algorithm, as seen in our simulations.
Our EBLASSONEG and EBLASSONE estimate the variance of regression coefficients iteratively. In each iteration, Laplace’s method is first used to obtain an approximation of the posterior distribution, which results in an equivalent linear regression model. Then, the EBLASSONEG uses the EBLASSO algorithm we developed in [30] to estimate the variance. Therefore, the EBLASSONEG essentially is a combination of Laplace’s method and the EBLASSO algorithm. However, the EBLASSO does not consider the prior of the EBLASSONE, and thus, we derive a novel formula in (12) and modify the EBLASSO algorithm to estimate the variance for EBLASSONE. The EBLASSONEG has two hyperparameters, whereas the EBLASSONE has only one hyperparameter, which simplifies crossvalidation for selecting the optimal value of the hyperparameter. Moreover, simulation studies demonstrated that the EBLASSONE identified some QTLs that were not detected by the EBLASSONEG, suggesting that combination of the two methods can lead to increased detection power.
The full Bayesian methods [1113,15,47] use the threshold model that employs a latent liability variable to link the binary trait with the QTLs and then apply MCMC simulation to infer the model. It is well known that MCMC simulation requires very large computation for the model with a relatively large number of variables. Therefore, these fully Bayesian methods may not be computationally efficient for QTL mapping with both main and epistatic effects from a relatively large number of QTLs.
The MIM methods [17,18] for mapping binary trait loci either does not employ any variable selection technique [17] or uses stepwise logistic regression and BIC to select variables [18]. Hence, it is difficult for them to deal with a QTL model with a relatively large number of loci and epistatic effects due to their demanding computation burden and possible large false discovery rate. The probability model in [19] for binary trait mapping entails a model space exponentially increasing with the number of loci. Although the BIC is used for model selection, its computational complexity increases dramatically with the number of loci.
We demonstrated that our EBLASSONE and EBLASSONEG can easily handle a model with 115,922 variables. If much more variables are involved, in e.g. GWAS or QTL mapping with high order interactions, our methods can be combined with a variable screening method such as the sure independence screening (SIS) [48,49] to facilitate computation. It is not difficult to extend our EBLASSONE and EBLASSONEG algorithms to QTL mapping for ordinal traits. We can first apply Laplace’s method to get an approximately equivalent linear model by deriving the gradient and the Hessian matrix of the posterior distribution and then apply our efficient EBLASSO algorithm to infer the linear model. Since the multinomial logistic regression model includes more variables than the binary logistic regression model, it will require more computation.
Conclusions
We have developed two algorithms named EBLASSONEG and EBLASSONE for the inference of Bayesian logistic regression models for multiple QTL mapping. While the EBLASSONEG is an extension of the EBLASSO [30] to handle binary traits, the simulations demonstrate that the EBLASSONEG algorithm provides superior performance in terms of power of detection and false positive rate, comparing with five other existing methods. Moreover, the EBLASSONEG is faster than four other methods tested but only slower than the LASSO algorithm. The hierarchical prior in the EBLASSONE in this paper was not considered in the EBLASSO [30]. Here we derive a novel formula given in equation (12) to estimate the variance of regression coefficients. Our simulations show that the EBLASSONE provides comparable or better power of detection and false positive rate comparing with five other existing methods, and the power of detection could be improved if the results are combined with that of the EBLASSONEG. In summary, our EBLASSONE and EBLASSONEG algorithms provide an efficient tool for QTL mapping for binary traits involving a large number of both main and epistatic effects, and they can be extended to QTL mapping for ordinal traits.
Abbreviations
AIC: Akaike information criterion; BIC: Bayesian information criterion; BhGLM: Bayesian hierarchical generalized linear model; BLASSO: Bayesian LASSO; EB: Empirical Bayes; EBLASSO: Empirical Bayesian LASSO; EM: Expectationmaximization; GWAS: Genomewide association study; LASSO: Least absolute shrinkage and selection operator; MAP: Maximum a posterior; MCMC: Markov Chain Monte Carlo; MIM: Multipleinterval mapping; NE: Normal exponential prior; NEG: Normal exponential gamma prior; PC: Personal computer; QTL: Quantitative trait locus; RVM: Relevant vector machine.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AH participated in the design of the algorithms, developed the computer programs, performed the simulations and data analyses, and drafted the manuscript. SX participated in the development of the algorithms, designed simulation study, participated in the data analyses and helped to draft the manuscript. XC conceived of and coordinated the study, developed the algorithms, participated in the data analyses and drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgments
This work was supported by the National Science Foundation (NSF) under NSF CAREER Award no. 0746882 to XC and by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 20073530018285 to SX.
References

Falconer DS, Mackay TFC: Introduction to Quantitative Genetics. 4th edition. Boston: AddisonWesley; 1996.

Hackett CA, Weller JI: Genetic mapping of quantitative trait loci for traits with ordinal distributions.
Biometrics 1995, 51(4):12521263. PubMed Abstract  Publisher Full Text

Xu S, Atchley WR: Mapping quantitative trait loci for complex binary diseases using line crosses.
Genetics 1996, 143(3):14171424. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rao S, Xu S: Mapping quantitative trait loci for ordered categorical traits in fourway crosses.
Heredity 1998, 81(2):214224. PubMed Abstract  Publisher Full Text

Xu S, Yi N, Burke D, Galecki A, Miller RA: An EM algorithm for mapping binary disease loci: application to fibrosarcoma in a fourway cross mouse family.
Genet Res 2003, 82(2):127138. PubMed Abstract  Publisher Full Text

Xu C, Zhang YM, Xu S: An EM algorithm for mapping quantitative resistance loci.

Xu C, Li Z, Xu S: Joint mapping of quantitative trait loci for multiple binary characters.
Genetics 2005, 169(2):10451059. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Deng W, Chen H, Li Z: A logistic regression mixture model for interval mapping of genetic trait loci affecting binary phenotypes.
Genetics 2006, 172(2):13491358. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Haley CS, Knott SA: A simple regression method for mapping quantitative trait loci in line crosses using flanking markers.
Heredity 1992, 69(4):315324. PubMed Abstract  Publisher Full Text

Martínez O, Curnow RN: Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers.

Yi N, Xu S: Bayesian mapping of quantitative trait loci for complex binary traits.
Genetics 2000, 155(3):13911403. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yi N, Xu S: Mapping quantitative trait loci with epistatic effects.
Genet Res 2002, 79(2):185198. PubMed Abstract

Yi N, Xu S, George V, Allison DB: Mapping multiple quantitative trait loci for ordinal traits.
Behav Genet 2004, 34(1):315. PubMed Abstract

Yi N, Banerjee S, Pomp D, Yandell BS: Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits.
Genetics 2007, 176(3):18551864. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huang H, Eversley CD, Threadgill DW, Zou F: Bayesian multiple quantitative trait loci mapping for complex traits using markers of the entire genome.
Genetics 2007, 176(4):25292540. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yang R, Li J, Wang X, Zhou X: Bayesian functional mapping of dynamic quantitative traits.
Theor Appl Genet 2011, 123(3):483492. PubMed Abstract  Publisher Full Text

Chen Z, Liu J: Mixture generalized linear models for multiple interval mapping of quantitative trait loci in experimental crosses.
Biometrics 2009, 65(2):470477. PubMed Abstract  Publisher Full Text

Li J, Wang S, Zeng ZB: Multipleinterval mapping for ordinal traits.
Genetics 2006, 173(3):16491663. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Coffman CJ, Doerge RW, Simonsen KL, Nichols KM, Duarte CK, Wolfinger RD, McIntyre LM: Model selection in binary trait locus mapping.
Genetics 2005, 170(3):12811297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Xu S: Estimating polygenic effects using markers of the entire genome.
Genetics 2003, 163(2):789801. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang H, Zhang YM, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S: Bayesian shrinkage estimation of quantitative trait loci parameters.
Genetics 2005, 170:465480. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hoti F, Sillanpää MJ: Bayesian mapping of genotype x expression interactions in quantitative and qualitative traits.
Heredity 2006, 97(1):418. PubMed Abstract  Publisher Full Text

Yi N, Xu S: Bayesian LASSO for quantitative trait loci mapping.
Genetics 2008, 179(2):10451055. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Xu S: An empirical Bayes method for estimating epistatic effects of quantitative trait loci.
Biometrics 2007, 63(2):513521. PubMed Abstract  Publisher Full Text

Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genomewide and resequencing association studies.
PLoS Genet 2008, 4(7):e1000130. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yi N, Banerjee S: Hierachical generalized linear models for multiple quantitative trait locus mapping.
Genetics 2009, 181:11011133. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tibshirani R: Regression shrinkage and selection via the lasso.

Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression.
Bioinformatics 2009, 25:714721. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ayers KL, Cordell HJ: SNP selection in genomewide and candidate gene studies via penalized logistic regression.
Genet Epidemiol 2010, 34(8):879891. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cai X, Huang A, Xu S: Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping.
BMC Bioinformatics 2011, 12(1):211. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Park T, Casella G: The Bayesian lasso.
J Am Stat Assoc 2008, 103(482):681686. Publisher Full Text

Tipping ME: Sparse Bayesian learning and the relevance vector machine.

Tipping ME, Faul AC: Fast marginal likelihood maximisation for sparse Bayesian models. Key West, FL: Proc 9th International Workshop on Artificial Intelligence and Statistics; 2003.

Cockerham CC: An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present.
Genetics 1954, 39(6):859882. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Griffin JE, Brown PJ: Bayesian hyperlassos with nonconvex penalization.
Aust N Z J Stat 2011, 53(4):423442. Publisher Full Text

Carlin BP, Louis TA: Bayesian methods for data analysis. 3rd edition. London/New York: Chapman & Hall/CRC; 2008.

Bishop CM: Pattern recognition and machine learning. New York: Springer; 2006.

MacKay DJC: The evidence framework applied to classification networks.
Neural Comput 1992, 4(5):720736. Publisher Full Text

Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. 2nd edition. New York: Springer; 2009.

Wu R, Ma CX, Casella G: Statistical genetics of quantitative traits: linkage, maps, and QTL. LLC: Springer Science + Business Media; 2007.

R Development Core Team: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2012.

Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent.

Masinde GL, Li X, Gu W, Davidson H, Mohan S, Baylink DJ: Identification of wound healing/regeneration Quantitative Trait Loci (QTL) at multiple time points that explain seventy percent of variance in (MRL/MpJ and SJL/J) mice F2 population.
Genome Res 2001, 11(12):20272033. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li X, Mohan S, Gu W, Baylink DJ: Analysis of gene expression in the wound repair/regeneration process.
Mamm Genome 2001, 12(1):5259. PubMed Abstract  Publisher Full Text

Kunimoto BT: Growth factors in wound healing: the next great innovation?
Ostomy Wound Manage 1999, 45(8):5664. PubMed Abstract

Xu S: An expectation maximization algorithm for the Lasso estimation of quantitative trait locus effects.

Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS: An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects.
Genetics 2007, 176(3):18651877. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fan J, Song R: Sure independence screening in generalized linear models with NPdimensionality.
Ann Stat 2010, 38(6):35673604. Publisher Full Text

Fan J, Lv J: Sure independence screening for ultrahigh dimensional feature space.
J R Stat Soc Series B Stat Methodol 2008, 70(5):849911. PubMed Abstract  Publisher Full Text  PubMed Central Full Text