Bioinformatics Unit, Institute for Crop Science, University of Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany

Abstract

Background

Genomic selection (GS) is emerging as an efficient and cost-effective method for estimating breeding values using molecular markers distributed over the entire genome. In essence, it involves simultaneously estimating the effects of all genes or chromosomal segments and combining the estimates to predict the total genomic estimated breeding value (GEBV). Accurate prediction of GEBVs is a central and recurring challenge in plant and animal breeding. The existence of a bewildering array of approaches for predicting breeding values using markers underscores the importance of identifying approaches able to predict breeding values efficiently and accurately. Here, we comparatively evaluate the predictive performance of six regularized linear regression methods (ridge regression, ridge regression BLUP, lasso, adaptive lasso, elastic net and adaptive elastic net) for predicting GEBVs using dense SNP markers.

Methods

We predicted GEBVs for a quantitative trait using a dataset of 3000 progenies of 20 sires and 200 dams, with an accompanying genome consisting of five chromosomes carrying 9990 biallelic SNP marker loci, simulated for the QTL-MAS 2011 workshop. All six methods use penalty-based (regularization) shrinkage to handle datasets with far more predictors than observations. The lasso, elastic net and their adaptive extensions further possess the desirable property of simultaneously selecting relevant predictive markers and optimally estimating their effects. The regression models were trained on a subset of 2000 phenotyped and genotyped individuals and used to predict GEBVs for the remaining 1000 progenies without phenotypes. Predictive accuracy was assessed using the root mean squared error and the Pearson correlation between predicted GEBVs and (1) the true genomic value (TGV), (2) the true breeding value (TBV) and (3) the simulated phenotypic values based on fivefold cross-validation (CV).

Results

The elastic net, lasso, adaptive lasso and adaptive elastic net all had similar accuracies but outperformed ridge regression and ridge regression BLUP in terms of both the Pearson correlation between predicted GEBVs and the true genomic value and the root mean squared error. The performance of RR-BLUP was also somewhat better than that of ridge regression. This pattern was replicated by the Pearson correlation between predicted GEBVs and the true breeding values (TBV) and the root mean squared error calculated with respect to TBV, except that accuracy was lower for all models, most markedly for the adaptive elastic net. The correlation between predicted GEBVs and the simulated phenotypic values based on fivefold CV revealed a similar pattern, except that the adaptive elastic net had lower accuracy than both ridge regression methods.

Conclusions

All six models had relatively high prediction accuracies for the simulated dataset. Accuracy was higher for the lasso-type methods than for ridge regression and ridge regression BLUP.

Introduction

Genomic selection (GS), the prediction of genomic breeding values (GEBVs) using dense molecular markers, is rapidly emerging as a key component of efficient and cost-effective breeding programs. The prediction of GEBVs is currently undertaken using multiple methods with varying degrees of complexity, computational efficiency and predictive accuracy. Comparative evaluation of the performance of the existing methods is thus essential to identify those best suited to GS and to determine when their performance is optimal. Here, we evaluate the relative performance of six regularized (penalized) linear regression models for GS: ridge regression (RR), ridge regression BLUP (RR-BLUP), the lasso, the adaptive lasso, the elastic net (ENET) and the adaptive elastic net (ADAENET).

Methods

Data

An outbred population of 1000 individuals was simulated over 1000 generations, followed by 150 individuals over 30 generations, using the LDSO software.

The marker information was stored in a matrix with the two alleles A1 and A2 at each locus coded as 1 for A1A1, -1 for A2A2 and 0 for A1A2 or A2A1.
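A minimal numpy sketch of this coding scheme (the genotype representation and function name are illustrative assumptions, not code from the study):

```python
import numpy as np

# Code biallelic SNP genotypes as +1 (A1A1), -1 (A2A2) and 0 (heterozygous),
# following the coding described above.
CODE = {("A1", "A1"): 1, ("A2", "A2"): -1, ("A1", "A2"): 0, ("A2", "A1"): 0}

def code_genotypes(genotypes):
    """Map a sequence of per-locus allele pairs to one coded marker row."""
    return np.array([CODE[g] for g in genotypes], dtype=float)

row = code_genotypes([("A1", "A1"), ("A1", "A2"), ("A2", "A2")])
# row is [1.0, 0.0, -1.0]
```

Stacking one such row per individual yields the n x p marker matrix used by all six models.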

The regularization models

The basic linear regression model used to predict GEBVs with all six regularization models is:

y = 1_n μ + Xβ + e,  (1)

where y = (y_1, ..., y_n)^T is the vector of phenotypes of the n individuals, 1_n is an n-vector of ones, μ is the overall mean, X is the n × p matrix of SNP marker genotypes, β is the vector of the regression coefficients of the markers and e is the vector of the residual errors with e ~ N(0, σ_e^2 I_n).

Ridge regression

The ridge regression estimator solves the regression problem in (1) using ℓ2-penalized least squares:

β̂(ridge) = argmin_β {||y - Xβ||_2^2 + λ||β||_2^2},  (2)

where ||y - Xβ||_2^2 = Σ_{i=1}^n (y_i - x_i^T β)^2 is the ℓ2-norm (quadratic) loss function (i.e. the residual sum of squares), x_i^T is the i-th row of X, ||β||_2^2 = Σ_{j=1}^p β_j^2 is the ℓ2-norm penalty on β, and λ ≥ 0 is the tuning (penalty, regularization or complexity) parameter, which regulates the strength of the penalty (linear shrinkage) by determining the relative importance of the data-dependent empirical error and the penalty term. The larger the value of λ, the greater the amount of shrinkage. Because the optimal value of λ depends on the data, it can be determined using data-driven methods such as cross-validation. The intercept is assumed to be zero in (2) because the phenotypes are mean-centred.
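The minimizer of (2) has the closed form (X^T X + λI)^{-1} X^T y. A minimal numpy sketch (simulated toy data, not the software or data of the study) illustrating that a larger λ shrinks the coefficient vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                       # more markers than individuals, as in GS
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                  # a few markers with real effects
y = X @ beta_true + rng.standard_normal(n)
y = y - y.mean()                     # mean-centre phenotypes (intercept = 0)

def ridge(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^-1 X'y for criterion (2)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_small, b_large = ridge(X, y, 0.1), ridge(X, y, 1000.0)
# The larger penalty shrinks the whole coefficient vector towards zero.
assert np.linalg.norm(b_large) < np.linalg.norm(b_small)
```

In practice λ would be chosen by cross-validation rather than fixed, as noted above.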

Ridge regression BLUP

Ridge regression BLUP uses the same estimator as ridge regression but estimates the penalty parameter by REML as λ = σ̂_e^2/σ̂_β^2, the ratio of the residual variance to the variance of the marker effects.

Lasso

Lasso regression methods are widely used in domains with massive datasets, such as genomics, where efficient and fast algorithms are essential. The lasso minimizes the ℓ1-penalized least squares criterion to obtain a sparse solution to the following optimization problem:

β̂(lasso) = argmin_β {||y - Xβ||_2^2 + λ||β||_1},  (3)

where ||β||_1 = Σ_{j=1}^p |β_j| is the ℓ1-norm penalty on β, which induces sparsity in the solution, and λ ≥ 0 is a tuning parameter.

The ℓ1 penalty enables the lasso to simultaneously regularize the least squares fit and shrink some components of β̂ exactly to zero, thereby performing automatic variable selection.
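The mechanism behind these exact zeros is the soft-thresholding operator S(z, λ) = sign(z)·max(|z| - λ, 0), which is the lasso solution for a single standardized predictor and the building block of coordinate-wise algorithms. A minimal numpy sketch (illustrative, not code from the study):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0): the univariate lasso solution."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, -0.5, 0.2, -4.0])
out = soft_threshold(z, 1.0)
# Components smaller in magnitude than lam are set exactly to zero;
# the rest are shrunk towards zero by lam: [2.0, 0.0, 0.0, -3.0]
```

Ridge regression, by contrast, shrinks coefficients proportionally and never sets them exactly to zero.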

An oracle procedure can estimate the subset of true parameters with zero coefficients as exactly zero with probability tending to 1; that is, it performs as well as if the true subset model were known beforehand. The lasso, however, does not possess this oracle property.

Adaptive lasso

To remedy the lasso's lack of the oracle property, the adaptive lasso estimator was proposed:

β̂(adaptive lasso) = argmin_β {||y - Xβ||_2^2 + λ Σ_{j=1}^p ŵ_j|β_j|},  (4)

where the weights ŵ_j = 1/|β̂_j|^γ (γ > 0) are built from an initial consistent estimate β̂ of β, obtained through least squares, or through ridge regression if multicollinearity is important.
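The weighted ℓ1 problem in (4) reduces to a plain lasso on rescaled columns X*_j = X_j/ŵ_j, after which the solution is back-transformed as β_j = β*_j/ŵ_j. A minimal numpy sketch verifying this reduction (toy data; the variable names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, gamma = 40, 10, 1.0
X = rng.standard_normal((n, p))
y = X[:, 0] * 2.0 + rng.standard_normal(n)

# Step 1: initial ridge estimate (preferred under multicollinearity).
lam_ridge = 1.0
beta_init = np.linalg.solve(X.T @ X + lam_ridge * np.eye(p), X.T @ y)

# Step 2: adaptive weights w_j = 1 / |beta_init_j|^gamma.
w = 1.0 / np.abs(beta_init) ** gamma

# Step 3: rescale columns; a plain lasso on (X_star, y) in beta_star,
# followed by beta = beta_star / w, solves the weighted problem (4).
X_star = X / w                       # broadcasts over columns

# Key identity: the fit and the penalty agree under beta = beta_star / w.
beta_star = rng.standard_normal(p)
beta = beta_star / w
assert np.allclose(X_star @ beta_star, X @ beta)
assert np.isclose(np.abs(beta_star).sum(), (w * np.abs(beta)).sum())
```

Any standard lasso solver can therefore be reused unchanged for the adaptive lasso.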

Elastic net

The elastic net (ENET) is an extension of the lasso that is robust to extreme correlations among the predictors. It combines the ℓ1 (lasso) and ℓ2 (ridge regression) penalties and can be formulated as:

β̂(ENET) = argmin_β {||y - Xβ||_2^2 + λ2||β||_2^2 + λ1||β||_1}.  (5)

On setting α = λ2/(λ1 + λ2), the ENET estimator (5) is seen to be equivalent to the minimizer of:

β̂(ENET) = argmin_β ||y - Xβ||_2^2, subject to P_α(β) = (1 - α)||β||_1 + α||β||_2^2 ≤ t for some t,  (6)

where P_α(β) is the ENET penalty. The ℓ1 part of the ENET does automatic variable selection, while the ℓ2 part encourages grouped selection and stabilizes the solution paths with respect to random sampling, thereby improving prediction. By inducing a grouping effect during variable selection, such that a group of highly correlated variables tends to have coefficients of similar magnitude, the ENET can select groups of correlated features when the groups are not known in advance. Unlike the lasso, which can select at most n variables when p > n, the ENET can select more than n markers.
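A useful property of (5) is that it can be rewritten as a plain lasso on augmented data with p artificial "observations" (the Zou-Hastie augmented-data construction). The numerical check below is an illustrative sketch, not code from the study:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 8
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam1, lam2 = 0.7, 1.3

# Augment X with sqrt(lam2)*I and y with p zeros; rescale by sqrt(1+lam2).
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
y_aug = np.concatenate([y, np.zeros(p)])
gamma = lam1 / np.sqrt(1 + lam2)     # lasso penalty for the augmented problem

def enet_crit(beta):
    """Naive elastic net criterion (5)."""
    r = y - X @ beta
    return r @ r + lam2 * beta @ beta + lam1 * np.abs(beta).sum()

def lasso_crit(beta_star):
    """Plain lasso criterion on the augmented data."""
    r = y_aug - X_aug @ beta_star
    return r @ r + gamma * np.abs(beta_star).sum()

# The two criteria agree for every beta under beta_star = sqrt(1+lam2)*beta,
# so solving the augmented lasso solves the elastic net.
beta = rng.standard_normal(p)
assert np.isclose(enet_crit(beta), lasso_crit(np.sqrt(1 + lam2) * beta))
```

This is why lasso algorithms (and the p > n selection limit they would otherwise impose) carry over to the ENET without modification.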

Adaptive elastic net

The adaptive elastic net (ADAENET) is a mixture of the adaptive lasso and the elastic net that confers the oracle property on the elastic net and alleviates the instability with high-dimensional data that the adaptive lasso inherits from the lasso. It combines the adaptively weighted ℓ1 penalty of the adaptive lasso with the ℓ2 regularization of the elastic net, thereby gaining selection consistency and asymptotic normality with high-dimensional data, by solving the optimization problem

β̂(ADAENET) = (1 + λ2/n){argmin_β ||y - Xβ||_2^2 + λ2||β||_2^2 + λ1* Σ_{j=1}^p ŵ_j|β_j|},  (7)

where ŵ_j = (|β̂_j(ENET)|)^(-γ) for some γ > 0 and β̂(ENET) is the elastic-net estimator in (5).

Fitting and comparing models

The entire path of solutions (in λ) for the ridge regression, lasso and elastic net models was computed using pathwise cyclical coordinate descent algorithms, computationally efficient methods for solving these convex optimization problems.
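The pathwise idea is to sweep a decreasing grid of λ values, cycling through the coordinates with soft-threshold updates and warm-starting each fit from the previous solution. The following is a simplified numpy sketch of this scheme for the lasso (using the ½·RSS loss convention; this is not the software used in the study):

```python
import numpy as np

def lasso_cd(X, y, lam, beta=None, n_iter=100):
    """Cyclical coordinate descent for min 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ beta                     # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]       # remove marker j's contribution
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * beta[j]       # add back the updated contribution
    return beta

rng = np.random.default_rng(3)
n, p = 60, 120                           # p > n, as in genomic selection
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:4] = 3.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)
y = y - y.mean()

lam_max = np.max(np.abs(X.T @ y))        # smallest lam with all-zero solution
path, beta = [], None
for lam in lam_max * np.logspace(0, -3, 10):
    beta = lasso_cd(X, y, lam, beta)     # warm start from previous lambda
    path.append(beta.copy())
```

At λ_max every coefficient is exactly zero, and markers enter the model one group at a time as λ decreases; warm starts make the whole path barely more expensive than a single fit.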

Results and discussion

Predictive accuracy, expressed as the Pearson correlation between predicted GEBVs and the true genomic values (TGV) and the root mean squared error derived from TGV, ranked the elastic net, lasso and adaptive lasso above the adaptive elastic net, ridge regression and ridge regression BLUP (Table 1).

Table 1. Accuracy of predictions of the six models

Model                  CV mean   CV min    CV max    r(TGV)    r(TBV)    RMSE(TGV)   RMSE(TBV)
Elastic Net            0.5071    0.4486    0.5308    0.9233    0.8659    2.2276      3.4618
Lasso                  0.5062    0.4466    0.5293    0.9240    0.8705    2.1642      3.5478
Adaptive Lasso         0.4951    0.4454    0.5152    0.9195    0.8759    2.0757      3.9911
RR                     0.4717    0.4050    0.5037    0.8246    0.8213    2.9046      3.3767
RR-BLUP                0.4628    0.3905    0.4951    0.8455    0.8315    2.9894      3.6487
Adaptive Elastic Net   0.4285    0.4013    0.4667    0.8968    0.8112    2.3404      4.2325

Pearson correlation between GEBVs and (1) the observed values from the 5-fold cross-validation, (2) the true expectation of the phenotypes of the 1000 non-phenotyped candidates (TGV), (3) the true expectation of the phenotypes of the progenies of the 1000 non-phenotyped candidates (TBV); and the root mean squared error with respect to TGV and TBV.

The fivefold CV also ranked the models similarly to the correlations based on TGV and TBV. Based on CV, the elastic net and lasso performed better than ridge regression, ridge regression BLUP and the adaptive extensions of lasso and the elastic net. A previous study also found that the elastic net often outperforms RR and the lasso in terms of model selection consistency and prediction accuracy

The RR and RR-BLUP penalties admit all markers into the model, resulting in a very large number of non-zero coefficients. The two ridge penalties shrink parameter estimates and will perform well when many markers have small effects, but they are less effective at forcing many predictors to vanish, as was the case for the data set simulated for the QTL-MAS 2011 workshop, and therefore cannot produce parsimonious, interpretable models containing only the relevant markers. All the models with the lasso penalty perform simultaneous automatic variable selection and shrinkage. The elastic net penalty provides a compromise between the lasso and ridge penalties and has the effect of averaging markers that are highly correlated and then entering the averaged marker into the model. Because ridge regression retains all markers, its non-zero coefficients are far smaller than the coefficients produced by the other methods.

If the number of markers (p) greatly exceeds the number of phenotyped individuals (n), as is typical in genomic selection, ordinary least squares cannot be used and some form of regularization becomes essential.

The six methods we considered are closely related to many other regularized statistical learning procedures, many of which are also promising for GS. Examples of such models include boosted ridge regression and bridge regression, which replaces the ℓ1-penalty with a more general ℓq-penalty.

The presence of epistatic interactions, nonlinear effects or non-independent observations may lower the performance of the regularized linear models. In such cases, performance may be enhanced by using lasso-type models that allow for interactions between predictors and correlated observations.

Conclusions

All six models are additive; they performed well for the simulated dataset and may be expected to perform similarly well for traits where additive effects predominate and epistasis is less important.

List of abbreviations used

ADAENET: Adaptive Elastic Net; CV: Cross-Validation; ENET: Elastic Net; GS: Genomic Selection; GEBV: Genomic Estimated Breeding Value; GWAS: Genome-Wide Association Study; lasso: least absolute shrinkage and selection operator; RR: Ridge Regression; RR-BLUP: Ridge Regression Best Linear Unbiased Prediction; SNP: Single Nucleotide Polymorphism; TBV: True Breeding Value; TGV: True Genomic Value.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JOO conceived the study, conducted the statistical analysis and drafted the manuscript.

TSS participated in data preparation, analysis, and writing of the manuscript. HPP participated in discussions that helped improve the manuscript and oversaw the project. All the authors read and approved the manuscript.

Acknowledgements

The German Federal Ministry of Education and Research (BMBF) funded this research within the AgroClustEr "Synbreed - Synergistic plant and animal breeding" (Grant ID: 0315526).

This article has been published as part of