Bioinformatics Unit, Institute of Crop Science, University of Hohenheim, Fruwirthstrasse 23, 70599 Stuttgart, Germany

Abstract

Background

Accurate prediction of genomic breeding values (GEBVs) requires numerous markers. However, predictive accuracy can be enhanced by excluding markers with no effects or with inconsistent effects among crosses that can adversely affect the prediction of GEBVs.

Methods

We present three approaches for pre-selecting markers prior to predicting GEBVs using four BLUP methods: ridge regression and three spatial models. Model performance was evaluated using 5-fold cross-validation.

Results and conclusions

Ridge regression and the spatial models gave essentially similar fits. Pre-selecting markers was evidently beneficial since excluding markers with inconsistent effects among crosses increased the correlation between GEBVs and true breeding values of the non-phenotyped individuals from 0.607 (using all markers) to 0.625 (using pre-selected markers). Moreover, extension of the ridge regression model to allow for heterogeneous variances between the most significant subset and the complementary subset of pre-selected markers increased predictive accuracy (from 0.625 to 0.648) for the simulated dataset for the QTL-MAS 2010 workshop.

Background

Genomic selection (GS) is a method for predicting breeding values on the basis of a large number of molecular markers.

We compared different methods for selecting the most relevant markers for GS. Genomic breeding values (GEBVs) were estimated using different BLUP methods and different numbers of pre-selected markers. Besides ridge regression (RR), spatial models were also used. The best model was selected using cross-validation (CV).

Methods

Data

A simulated dataset of 3226 individuals in five generations, generated for the QTL-MAS 2010 workshop, was analysed. A total of 2326 individuals belonging to the first four generations were phenotyped and genotyped with 10031 SNP markers. A further 900 individuals in the fifth generation were genotyped but had no phenotypic records. We focus here only on the quantitative trait. A SNP was included in the analysis only if its minor allele frequency exceeded 2.5%. This resulted in the exclusion of 461 SNPs, leaving 9570 SNPs for analysis.

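The marker covariate $z_{ik}$ for the $i$-th individual and the $k$-th marker, for a biallelic SNP with alleles $A_1$ and $A_2$, was set to 1 for $A_1A_1$, -1 for $A_2A_2$ and 0 for $A_1A_2$. Covariates were stored in a matrix $\mathbf{Z} = \{z_{ik}\}$.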
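The coding rule and the minor allele frequency filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the function names, and the assumption that alleles arrive labelled 1 (for the first allele) and 2 (for the second), are hypothetical.

```python
def marker_covariate(allele_a, allele_b):
    """Code a biallelic genotype as 1 (A1A1), 0 (A1A2) or -1 (A2A2).

    Assumes alleles are labelled 1 (A1) and 2 (A2); this labelling is a
    hypothetical input format, not one stated in the text.
    """
    if allele_a not in (1, 2) or allele_b not in (1, 2):
        raise ValueError("alleles must be labelled 1 or 2")
    return 3 - (allele_a + allele_b)


def minor_allele_frequency(column):
    """Minor allele frequency of one marker from its -1/0/1 covariates."""
    # Each individual carries (z + 1) copies of allele A1 out of 2.
    freq_a1 = sum(z + 1 for z in column) / (2 * len(column))
    return min(freq_a1, 1.0 - freq_a1)


def filter_markers(z_matrix, threshold=0.025):
    """Return the indices of markers whose MAF exceeds the threshold."""
    n_markers = len(z_matrix[0])
    kept = []
    for k in range(n_markers):
        column = [row[k] for row in z_matrix]
        if minor_allele_frequency(column) > threshold:
            kept.append(k)
    return kept
```

Here `z_matrix` is a list of rows, one per individual, so `filter_markers` returns the column indices to retain; with the 2.5% threshold of the text, a monomorphic marker is always dropped.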

Pre-selection of SNPs

We tested the effect of each SNP on the quantitative trait using three different methods.

Method 1

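Each SNP was tested using a linear regression, as in Macciotta et al.:

$$y_i = \mu + \beta_k z_{ik} + e_i$$

where $y_i$ is the phenotypic value of the $i$-th individual, $\mu$ is the intercept, $z_{ik}$ is the marker covariate, $\beta_k$ is the regression coefficient of the $k$-th marker and $e_i$ is the residual.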
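A single-marker regression of this kind can be sketched as follows. The sketch fits an ordinary least-squares line of phenotype on the marker covariate and returns the slope with its t statistic. Ranking markers by the absolute t statistic is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def marker_t_statistic(z, y):
    """Slope and t statistic from regressing phenotypes y on covariate z."""
    n = len(y)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    szz = sum((zi - z_mean) ** 2 for zi in z)
    if szz == 0.0 or n < 3:
        # Monomorphic marker or too few records: no test is possible.
        return 0.0, 0.0
    szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    slope = szy / szz
    intercept = y_mean - slope * z_mean
    rss = sum((yi - intercept - slope * zi) ** 2 for zi, yi in zip(z, y))
    std_err = math.sqrt(rss / (n - 2) / szz)
    if std_err == 0.0:
        # Perfect fit: the statistic is unbounded unless the slope is zero.
        return slope, (math.inf if slope != 0.0 else 0.0)
    return slope, slope / std_err


def rank_markers(z_matrix, y, n_select):
    """Indices of the n_select markers with the largest |t| (assumed rule)."""
    n_markers = len(z_matrix[0])
    scores = []
    for k in range(n_markers):
        column = [row[k] for row in z_matrix]
        _, t_stat = marker_t_statistic(column, y)
        scores.append((abs(t_stat), k))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [k for _, k in scores[:n_select]]
```

Each marker is tested on its own, so the cost grows linearly with the number of markers.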

Method 2

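Each SNP was analysed for consistency among crosses using the model

$$y_{ic} = \mu + \beta_k z_{ik} + \gamma_c + \delta_{ck} z_{ik} + e_{ic}$$

where $y_{ic}$ is the phenotypic value of the $i$-th individual in the $c$-th cross, $\mu$ is the intercept, $z_{ik}$ is the marker covariate, $\gamma_c$ is the effect of the $c$-th cross, $\delta_{ck}$ is the cross-specific deviation of the effect of the $k$-th marker from its main effect $\beta_k$, and $e_{ic}$ is the residual.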

Method 3

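Each SNP was analysed for consistency among generations using the model

$$y_{ig} = \mu + \beta_k z_{ik} + \tau_g + \theta_{gk} z_{ik} + e_{ig}$$

where $y_{ig}$ is the phenotypic value of the $i$-th individual in the $g$-th generation, $\mu$ is the intercept, $z_{ik}$ is the marker covariate, $\tau_g$ is the effect of the $g$-th generation, $\beta_k$ is the main effect of the $k$-th marker, $\theta_{gk}$ is the generation-specific deviation from that main effect, and $e_{ig}$ is the residual.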

The

GEBVs estimation

The genotypic effect was estimated using the following linear mixed model:

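$$y_i = \mu + g_i + e_i$$

where $y_i$ is the phenotypic value of the $i$-th individual, $\mu$ is the intercept, $g_i$ is the genotypic value and $e_i$ is the residual.

The genotypic value ($g_i$) was modelled as

$$g_i = \sum_{k=1}^{n} z_{ik} u_k$$

where $z_{ik}$ is the marker covariate and $u_k$ is the random effect of the $k$-th of $n$ markers, with a common variance for all markers.

This model was extended to incorporate heterogeneous variances between the $a$ most significant markers and the remaining $n - a$ pre-selected markers:

$$g_i = \sum_{k=1}^{a} z_{ik} u_k + \sum_{k=a+1}^{n} z_{ik} w_k$$

where the markers are ordered by decreasing significance, and the effects $u_k$ of the $a$ most significant markers and the effects $w_k$ of the remaining markers have separate variance components.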

The regression coefficients were predicted by best linear unbiased prediction (BLUP) and the variance components estimated by restricted maximum likelihood (REML). For each fitted model we obtained BLUPs for $g_i$.
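For fixed variance ratios, the BLUPs of the marker effects in ridge regression solve a penalised system of normal equations. The sketch below illustrates this, and shows how a per-marker penalty list covers both the common-variance model and a two-group model with heterogeneous variances. It rests on stated assumptions: the penalties (residual variance divided by marker-effect variance) are supplied by the user rather than estimated by REML as in the study, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular system; increase the penalties")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_blup(z, y, penalties):
    """Return (intercept, marker effects) for given per-marker penalties."""
    n_ind, n_mark = len(z), len(z[0])
    # Centring removes the unpenalised intercept from the equations.
    col_means = [sum(row[k] for row in z) / n_ind for k in range(n_mark)]
    y_mean = sum(y) / n_ind
    zc = [[row[k] - col_means[k] for k in range(n_mark)] for row in z]
    yc = [yi - y_mean for yi in y]
    # Penalised normal equations: (Z'Z + diag(penalties)) u = Z'y.
    lhs = [[sum(zc[i][j] * zc[i][k] for i in range(n_ind))
            for k in range(n_mark)] for j in range(n_mark)]
    for k in range(n_mark):
        lhs[k][k] += penalties[k]
    rhs = [sum(zc[i][k] * yc[i] for i in range(n_ind)) for k in range(n_mark)]
    effects = solve_linear_system(lhs, rhs)
    intercept = y_mean - sum(m * u for m, u in zip(col_means, effects))
    return intercept, effects


def genotypic_values(z, effects):
    """Predicted genotypic value of each individual: sum_k z_ik * u_k."""
    return [sum(zk * uk for zk, uk in zip(row, effects)) for row in z]


def two_group_penalties(n_markers, a, penalty_top, penalty_rest):
    """Penalties for markers ordered by decreasing significance (assumed)."""
    return [penalty_top] * a + [penalty_rest] * (n_markers - a)
```

Passing the same penalty for every marker gives ordinary ridge regression. A smaller penalty for the first `a` markers shrinks their effects less, which is the practical consequence of giving them a larger variance component.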

Spatial models

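We considered different models for the variance of $\mathbf{g}' = (g_1, g_2, \ldots)$, the vector of genotypic values: RR and spatial models in which the covariance between two individuals is a function of their genetic distance.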

**Empirical semivariogram of the QTL-MAS 2010 dataset and theoretical models (Quadratic, Linear, Gaussian and Exponential) fitted by weighted least squares.** Genotypic covariance models are of the form $\boldsymbol{\Gamma} = \{f(d_{ii'})\}$, where $d_{ii'}$ is the genetic distance between individuals $i$ and $i'$ and $f$ is the Quadratic, Linear, Gaussian or Exponential function.

Cross-validation

A 5-fold cross-validation (CV) was performed to evaluate model performance. All phenotyped individuals were included in the CV, except those in the first generation. Overall, 75 crosses (full-sib families) were included. The dataset was randomly split into 5 subsamples, each of which contained 15 crosses. In each CV round, the phenotypic records of one of the five subsamples were held out and used as a validation set. Each subsample was used as a validation set exactly once.
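Splitting by cross rather than by individual keeps all full sibs of a family on the same side of each split. A minimal sketch of such a split is shown below. The function names, the seed and the round-robin assignment after shuffling are illustrative assumptions, not details taken from the study.

```python
import random


def assign_crosses_to_folds(cross_ids, n_folds=5, seed=0):
    """Randomly assign each cross to one fold, with balanced fold sizes."""
    crosses = sorted(set(cross_ids))
    rng = random.Random(seed)
    rng.shuffle(crosses)
    # Dealing the shuffled crosses out in turn gives 15 crosses per fold
    # when 75 crosses are split into 5 folds.
    return {cross: idx % n_folds for idx, cross in enumerate(crosses)}


def cv_splits(cross_ids, n_folds=5, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    fold_of = assign_crosses_to_folds(cross_ids, n_folds, seed)
    for fold in range(n_folds):
        validation = [i for i, c in enumerate(cross_ids) if fold_of[c] == fold]
        training = [i for i, c in enumerate(cross_ids) if fold_of[c] != fold]
        yield training, validation
```

Here `cross_ids` holds the cross label of each individual, in the same order as the phenotype and marker data, so the yielded indices can be used to subset both.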

Two measures of accuracy were used: the mean Pearson correlation between the GEBVs and the observed values over the 5 validation sets, and the mean Pearson correlation between the GEBVs and the true breeding values (TBVs) of the non-phenotyped individuals of the fifth generation.
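The accuracy measure is the Pearson correlation averaged over the CV rounds. A minimal sketch, with hypothetical function names, is:

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def mean_accuracy(predictions_per_fold, targets_per_fold):
    """Average the per-fold correlations between predictions and targets."""
    correlations = [
        pearson_correlation(pred, target)
        for pred, target in zip(predictions_per_fold, targets_per_fold)
    ]
    return sum(correlations) / len(correlations)
```

For the CV measure, the targets are the held-out phenotypes of each validation set. For the TBV measure, each fold's model predicts the same non-phenotyped individuals, so the same vector of true breeding values serves as the target in every round.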

All mixed models were fitted by REML using the SAS MIXED procedure, and the theoretical semivariograms were fitted using the SAS NLIN procedure.

Results

A high correlation was established between the semivariance and the genetic distance between pairs of individuals (see the empirical semivariogram figure).

Selection of different genetic covariance models using Pearson correlations between GEBVs and observed values in the validation sets (CV), and between GEBVs and TBVs for non-phenotyped individuals (TBV). Considered were either all markers (n = 9570) or n markers pre-selected by method 2.

| n | Ridge Regression CV | Ridge Regression TBV | Gaussian CV | Gaussian TBV | Exponential CV | Exponential TBV | Linear CV | Linear TBV |
|---|---|---|---|---|---|---|---|---|
| 9570 | 0.530 | 0.607 | 0.530 | 0.600 | 0.530 | 0.607 | Did not converge | Did not converge |
| 500 | 0.570 | 0.599 | 0.569 | 0.596 | 0.572 | 0.599 | 0.572 | 0.596 |
| 1000 | 0.583 | 0.623 | 0.583 | 0.614 | 0.583 | 0.620 | 0.584 | 0.614 |
| 2000 | 0.579 | 0.625 | 0.580 | 0.614 | 0.582 | 0.621 | 0.582 | 0.614 |
| 3000 | 0.576 | 0.617 | 0.577 | 0.608 | 0.580 | 0.615 | 0.580 | 0.608 |

Pre-selection of markers was evidently beneficial, with methods 1 and 2 achieving similar predictive accuracies and outperforming method 3 (see the figure of mean Pearson correlations between GEBVs and TBVs).

**Mean Pearson correlation between GEBVs and TBVs for non-phenotyped individuals**. GEBVs were estimated by ridge regression.

Moreover, the extended model with heterogeneous variances between lowly and highly significant markers increased accuracy (see the table of marker combinations).

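Selection of different combinations of the number of markers pre-selected by method 2 (n) and the number of most significant markers assigned a separate variance (a), using Pearson correlations between GEBVs and observed values in the validation sets (CV), and between GEBVs and TBVs for non-phenotyped individuals (TBV).

| n | a | CV | TBV |
|---|---|---|---|
| 1000 | 0 | 0.583 | 0.623 |
| 1000 | 5 | 0.582 | 0.625 |
| 1000 | 10 | 0.586 | 0.632 |
| 1000 | 50 | 0.587 | 0.635 |
| 1000 | 100 | 0.586 | 0.637 |
| 1000 | 250 | 0.584 | 0.630 |
| 2000 | 0 | 0.579 | 0.625 |
| 2000 | 5 | 0.580 | 0.628 |
| 2000 | 10 | 0.588 | 0.640 |
| 2000 | 50 | 0.589 | 0.645 |
| 2000 | 100 | 0.590 | 0.648 |
| 2000 | 250 | 0.588 | 0.640 |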

Overall, RR with 2000 markers selected by method 2 and allowing for heterogeneous variances among the 100 most significant and the remaining 1900 markers gave the most accurate prediction of GEBVs for the fifth generation.

Discussion

We have evaluated how pre-selection of markers influences predictive accuracy in GS using RR and its spatial extensions based on genetic distances. The spatial models differed in the theoretical model used to describe the empirical semivariogram among genotypes as a function of the genetic distance between them. All fitted theoretical semivariogram models were remarkably similar within the range of the observed semivariogram values, and so were their predictions. Further study is therefore needed to decide whether modelling genetic covariances using non-linear spatial models is beneficial compared to RR, especially for non-additive genetic effects.

Our results reinforce findings of other studies suggesting that pre-selecting markers may enhance predictive accuracy.

The extended model with two variance components for the markers increased predictive accuracy because it better approximated the simulated genetic model, in which a few QTLs had different variances. Heterogeneous variance models may not always perform better, however. In particular, simulating many QTLs with small effects may lower the performance of models allowing for heterogeneous variances among individual markers.

Conclusions

Pre-selection of markers was beneficial and increased predictive accuracy from 0.607 to 0.625. Partitioning markers into two groups with heterogeneous variances further increased accuracy up to 0.648 for the simulated dataset.

Competing interests

The authors declare no competing interests.

Authors' contributions

TSS participated in the design of the study, performed all analyses and drafted the manuscript. JOO helped draft the manuscript and interpret the results. HPP conceived the study, participated in its design, and helped in the final editing of the manuscript.

Acknowledgements

We are thankful to the reviewers for constructive comments. This research was funded by AgReliant Genetics and the German Federal Ministry of Education and Research (BMBF) within the AgroClustEr “Synbreed – Synergistic plant and animal breeding” (Grant ID: 0315526).

This article has been published as part of