Settore Genetica e Biotecnologie, AGRIS-Sardegna, Olmedo 07040, Italy

Abstract

Background

The least absolute shrinkage and selection operator (LASSO) can be used to predict SNP effects. This operator has the desirable feature of including in the model only a subset of explanatory SNPs, which can be useful in both QTL detection and GWS studies. LASSO solutions can be obtained by the least angle regression (LARS) algorithm. The main difficulty with this procedure is defining the best constraint (

Methods

The first (strategy 1) was based on 1,000 cross-validations carried out by randomly splitting the reference population (2,000 individuals with performance records) into two halves. The value of

Results

The size of the subset of selected SNPs was 46, 189 and 64 for the classical approach, strategy 1 and strategy 2, respectively. The classical approach and strategy 2 gave similar results and indicated quite clearly the regions where QTL with additive effects were located. Strategy 1 confirmed these regions and added further positions, which gave a less clear picture. Correlations between the GEBVs estimated with the three strategies and the TBVs in progeny without phenotypes were 0.9237, 0.9000 and 0.9240 for the classical approach, strategy 1 and strategy 2, respectively.

Conclusions

This suggests that the Cp-type selection criterion is a valid alternative to cross-validation for defining the best constraint when selecting subsets of predictive SNPs with the LASSO-LARS procedure.

Background

One way to estimate SNP (single nucleotide polymorphism) effects is to use the least absolute shrinkage and selection operator (LASSO) approach
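For reference, the LASSO estimate in its standard constrained form can be written as below. The notation is an assumption chosen to follow common usage and is not necessarily the authors' own: $\mu$ is an overall mean, $n$ is the number of phenotyped individuals, $p$ is the number of SNPs and $t$ is the constraint on the sum of absolute SNP effects.

```latex
\hat{\beta} = \arg\min_{\mu,\,\beta}
  \sum_{i=1}^{n} \Bigl( y_i - \mu - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```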

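where $y_i$ is the phenotype of the $i$-th individual, $x_{ij}$ is the genotype of the $i$-th individual at the $j$-th SNP, and $\beta_j$ is the effect of the $j$-th SNP.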

The LASSO problem can be solved by quadratic programming

In this study we propose two alternative strategies to define the

Methods

Data

A simulated data set of 3,220 individuals generated for the 15th QTL-MAS workshop was used. The first generation consisted of 220 founders (20 males and 200 females). The second generation consisted of 3,000 individuals organized in 20 sire families of 150 individuals each and 200 dam families of 15 individuals each. The dam families were nested within the sire families. The genome consisted of five chromosomes. Each chromosome was 1 Morgan long and carried 1,998 evenly distributed SNPs. Genotypes were available for all individuals. Phenotypes were available only for 2,000 progeny (2/3 of each sire and dam family), which represented the reference population. The remaining 1,000 progeny had genotypes but no phenotypes and represented the candidate population.
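The counts in this description can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the constant and function names are hypothetical.

```python
# Population and marker counts as stated in the data description.
N_SIRES, N_DAMS = 20, 200
PROGENY_PER_SIRE, PROGENY_PER_DAM = 150, 15
N_CHROMOSOMES, SNPS_PER_CHROMOSOME = 5, 1998
N_PHENOTYPED = 2000


def design_summary():
    """Derive the totals implied by the family and genome structure."""
    founders = N_SIRES + N_DAMS
    progeny_by_sire = N_SIRES * PROGENY_PER_SIRE
    progeny_by_dam = N_DAMS * PROGENY_PER_DAM
    # The same progeny are counted once by sire family and once by dam family.
    if progeny_by_sire != progeny_by_dam:
        raise ValueError("sire and dam family sizes are inconsistent")
    return {
        "founders": founders,
        "progeny": progeny_by_sire,
        "individuals": founders + progeny_by_sire,
        "dams_per_sire": N_DAMS // N_SIRES,
        "snps": N_CHROMOSOMES * SNPS_PER_CHROMOSOME,
        "candidates": progeny_by_sire - N_PHENOTYPED,
    }


if __name__ == "__main__":
    for key, value in design_summary().items():
        print(f"{key}: {value}")
```

The derived totals (3,220 individuals, 9,990 SNPs, 1,000 candidates) agree with the figures quoted elsewhere in the paper.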

LASSO-LARS classical strategy

At each cross-validation replication the reference population was randomly split into training (T) and validation (V) samples of equal size. This strategy corresponded to that suggested by Usai
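The random splitting step can be sketched as follows. This is a minimal illustration, assuming individuals are identified by integer ids; the function names and the seed are hypothetical, and the LARS fit on each training sample is deliberately left out.

```python
import random


def split_reference(ids, rng):
    """Randomly split ids into training (T) and validation (V) halves."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def cross_validation_splits(ids, n_replicates, seed=0):
    """Yield one independent (T, V) split per cross-validation replicate."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    for _ in range(n_replicates):
        yield split_reference(ids, rng)


if __name__ == "__main__":
    reference_ids = range(2000)  # 2,000 phenotyped individuals
    for training, validation in cross_validation_splits(reference_ids, 3):
        print(len(training), len(validation))
```

With 2,000 reference individuals, each replicate yields 1,000 training and 1,000 validation individuals.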

Strategy 1

The only difference with respect to the classical strategy was that here the best
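One way to read this strategy, based on the 5% threshold reported in the Results, is to count how often each SNP is selected across cross-validation replicates and to keep the SNPs selected in more than 5% of them. The sketch below assumes each replicate is represented by a list of selected SNP indices; the function name and input layout are hypothetical.

```python
from collections import Counter


def frequent_snps(selected_per_replicate, min_frequency=0.05):
    """Return SNPs selected in more than min_frequency of the replicates."""
    n_replicates = len(selected_per_replicate)
    counts = Counter()
    for selected in selected_per_replicate:
        counts.update(set(selected))  # count each SNP at most once per replicate
    return sorted(
        snp for snp, count in counts.items()
        if count / n_replicates > min_frequency
    )


if __name__ == "__main__":
    # Toy example: SNP 7 is always selected, SNP 42 in 10% of replicates,
    # SNP 99 in 1% of replicates.
    replicates = [[7, 42]] * 10 + [[7]] * 89 + [[7, 99]]
    kept = frequent_snps(replicates)
    print(kept, len(kept))
```

The length of the returned list plays the role of the subset size taken as the constraint under this reading.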

Strategy 2

In this strategy the best $t$ was taken at the $k$-th LARS step that minimized a Cp-type selection criterion
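The exact expression is not recoverable from the text. For orientation, the Cp-type statistic usually associated with LARS has the general form below, where $n$ is the number of phenotyped individuals, $\hat{y}_i^{(k)}$ is the prediction at step $k$ and $df_k$ is the degrees of freedom at that step, commonly approximated by the number of active SNPs. The notation is an assumption, not necessarily the authors' own.

```latex
C_p(k) = \frac{\sum_{i=1}^{n} \bigl( y_i - \hat{y}_i^{(k)} \bigr)^{2}}{\sigma_e^{2}} - n + 2\,df_k
```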

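where $y_i$ and $\hat{y}_i$ are the observed and predicted phenotypes of the $i$-th individual and $\sigma_e^{2}$ is the residual variance.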
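A Cp-type profile over LARS steps can be computed directly from the residual sums of squares along the path. The sketch below assumes the general form RSS / sigma_e^2 - n + 2 * df, with df approximated by the number of active SNPs; this is an assumption, since the authors' exact formula is not recoverable here. The function names and the toy numbers are hypothetical.

```python
def cp_profile(rss_by_step, n_active_by_step, n_obs, sigma2_e):
    """Cp-type value at each LARS step: RSS / sigma2_e - n_obs + 2 * n_active."""
    return [
        rss / sigma2_e - n_obs + 2 * n_active
        for rss, n_active in zip(rss_by_step, n_active_by_step)
    ]


def best_step(cp_values):
    """Index of the step with the smallest Cp-type value."""
    return min(range(len(cp_values)), key=cp_values.__getitem__)


if __name__ == "__main__":
    # Toy path: the residual sum of squares drops quickly, then flattens.
    rss = [100.0, 60.0, 50.0, 49.0]
    n_active = [0, 1, 2, 3]
    cp = cp_profile(rss, n_active, n_obs=10, sigma2_e=10.0)
    step = best_step(cp)
    print(cp, step, n_active[step])
```

Because the criterion needs only one pass along a single LARS path, no repeated refitting on data splits is involved.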

GEBV estimation

Once the best

Results

Best t definition

The number of SNPs selected at each cross-validation replication ranged from 15 to 92 and was on average 46. This value was taken as best ^{2 }was 0.298. Among the 9,990 available SNPs, only 2,169 were selected at least once over all cross-validations and 189 were selected in more than 5% of the replications. The latter value was taken as best

**Cp-type selection criterion profile for increasing number of active SNPs**.

QTL mapping

Figure

**Comparison of SNP effects estimated by the classical strategy, strategy 1 and strategy 2**. SNP frequency of occurrence. True QTL positions.

GEBV estimation

The candidate population GEBV accuracies corresponding to the three t estimation strategies are shown in Table

Genomic breeding value (GEBV) accuracy (r) and regression coefficient (b) of true breeding value (TBV) on GEBV for the three tested strategies

| **Strategy** | **r(TBV,GEBV)** | **b(TBV,GEBV)** |
|---|---|---|
| Classical | 0.9237 | 1.2512 |
| Strategy 1 | 0.9000 | 1.0220 |
| Strategy 2 | 0.9240 | 1.1877 |
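The two statistics in the table can be computed from paired TBV and GEBV values as a Pearson correlation and the slope of the regression of TBV on GEBV. The sketch below is a plain-Python illustration with a hypothetical function name and toy data, not the authors' code.

```python
from math import sqrt


def accuracy_and_slope(tbv, gebv):
    """Return r(TBV, GEBV) and the slope b of the regression of TBV on GEBV."""
    n = len(tbv)
    mean_t = sum(tbv) / n
    mean_g = sum(gebv) / n
    s_tg = sum((t - mean_t) * (g - mean_g) for t, g in zip(tbv, gebv))
    s_tt = sum((t - mean_t) ** 2 for t in tbv)
    s_gg = sum((g - mean_g) ** 2 for g in gebv)
    r = s_tg / sqrt(s_tt * s_gg)  # Pearson correlation
    b = s_tg / s_gg               # cov(TBV, GEBV) / var(GEBV)
    return r, b


if __name__ == "__main__":
    # Toy data: TBV is exactly twice GEBV, so r = 1 and b = 2.
    gebv = [1.0, 2.0, 3.0, 4.0]
    tbv = [2.0, 4.0, 6.0, 8.0]
    print(accuracy_and_slope(tbv, gebv))
```

A slope above 1, as in all three rows of the table, indicates that the GEBVs are less dispersed than the TBVs.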

Discussion

Our results demonstrated that LASSO-LARS performs well in estimating the effects of SNPs associated with QTL with additive effects. The detection of QTL with other modes of action was rather poor. However, the results suggest the presence of the imprinted QTL and of the first epistatic QTL. The second epistatic QTL was missed, since LASSO-LARS selects only the SNPs that underlie the main portion of the variability explained by both QTL. Concerning the choice of the best constraint for LASSO-LARS, the classical strategy and strategy 2, although based on different procedures, gave very similar results. This suggests that a valid estimate of the best constraint can be obtained without cross-validation, with a large saving in computing time. Indeed, while the cross-validation procedure took 3 hours and 35 minutes, strategy 2 took just 8 seconds. Nevertheless, the current data set did not allow us to verify whether the constraint estimation based on Cp-type minimization can overcome the underestimation of

Conclusions

We conclude that the strategy based on the Cp-type selection criterion is a valid alternative to cross-validation for defining the best constraint when selecting subsets of predictive SNPs with the LASSO-LARS procedure.

List of Abbreviations used

GBLUP: Genomic Best Linear Unbiased Prediction; GEBV: Genomic Breeding Values; GWS: Genome-Wide Selection; LARS: Least Angle Regression; LASSO: Least Absolute Shrinkage and Selection Operator; QTL: Quantitative Trait Locus; REML: REstricted Maximum Likelihood; SNP: Single Nucleotide Polymorphism; TBV: True Breeding Value.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MGU, AC and SC carried out the analyses and drafted the manuscript. All the authors have read and contributed to the final text of the manuscript.

Acknowledgements

Research funded by the program APQ "Attivazione del Centro Biodiversità al servizio dell'allevamento" of Regional Government of Sardinia.

This article has been published as part of