Department of Animal Science and Center for Integrated Animal Genomics, Iowa State University, Ames, Iowa 50011, USA

Abstract

Background

Bayesian methods allow prediction of genomic breeding values (GEBVs) using high-density single nucleotide polymorphisms (SNPs) covering the whole genome, with effective shrinkage of SNP effects through appropriate priors. In this study we applied a modification of the well-known BayesA and BayesB methods to estimate the proportion of SNPs with zero effects (π) and a common variance for non-zero effects. The method, termed BayesCπ, was used to predict the GEBVs of the last generation of the QTLMAS2010 data. The accuracy of GEBVs from various methods was estimated by the correlation with phenotypes in the last generation with phenotypes (generation 4). The methods were BayesCπ and BayesB with different π values, both with and without polygenic effects, and best linear unbiased prediction using an animal model with a genomic or numerator relationship matrix. Positions of quantitative trait loci (QTLs) were identified based on the variances of GEBVs for windows of 10 consecutive SNPs. We also proposed a novel approach to set significance thresholds for claiming QTL in this specific case by using pedigree-based simulation of genotypes. All analyses were focused on detecting and evaluating QTL with additive effects.

Results

The accuracy of GEBVs was highest for BayesCπ, but the accuracy of BayesB with π equal to 0.99 was similar to that of BayesCπ. The accuracy of BayesB dropped with a decrease in π. Including polygenic effects in the model had only marginal effects on the accuracy and bias of predictions. The number of QTL identified was 15 when based on a stringent 10% chromosome-wise threshold and increased to 21 when a 20% chromosome-wise threshold was used.

Conclusions

The BayesCπ method without polygenic effects was identified as the best method for the QTLMAS2010 dataset because it had the highest accuracy and the least bias. The significance criterion based on the variance of 10-SNP windows allowed detection of more than half of the QTL, with few false positives.

Background

Genomic prediction of breeding values of individuals is based on a large number of SNPs that cover the whole genome at high density. Each QTL is expected to be in linkage disequilibrium (LD) with at least one SNP because of the high marker density; hence, the effects of all QTL are expected to be captured by SNPs.

The availability of genome-wide SNP panels enables detection of statistical associations between a trait and any SNP through a genome-wide association study (GWAS), enhancing the possibility of mapping QTL across the genome.

Against this background, in this study we aimed to: (i) identify the Bayesian approach that most accurately predicts GEBV for the QTLMAS2010 data; (ii) develop a new criterion based on the 10-SNP window variance for QTL detection to concentrate signals from high density SNP panels; and (iii) set significance thresholds for the window variance criterion to claim QTL when pedigree relationships exist among individuals.

Methods

Dataset

The simulated dataset was provided in advance of the 14^{th} European QTL-MAS Workshop

Predicting GEBVs

Four methods were used and compared for estimation of the marker effects and GEBV: BayesB

where _{j}_{j}

where _{ij}

Method G-BLUP fitted all SNPs in the model, assuming that every SNP explained an equal proportion of the total genetic variance. Model BayesCπ was a modification of model BayesB of Meuwissen _{a}_{j}) have a common variance, i.e. _{a} and scale

Effects of SNPs were estimated using the phenotypes and genotypes of individuals in the first three generations (training), which were then used to predict GEBVs of individuals in the fourth generation (validation) to evaluate the accuracy of GEBVs of the marker-based methods. The method giving the highest correlation of GEBVs with phenotypes in the validation population was used to predict the GEBVs of the fifth generation, for which only SNP genotypes but no phenotypes were available. For the fifth generation predictions, the first four generations were used to estimate SNP effects.
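The validation step described above can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes SNP effects have already been estimated from the training generations by each method, that genotypes are coded as allele counts, and that the function and argument names (`pick_best_method`, `effects_by_method`) are hypothetical.

```python
import numpy as np

def pick_best_method(genotypes, phenotypes, generation, effects_by_method,
                     validation_generation=4):
    """Rank methods by the correlation of GEBV with phenotype in validation.

    genotypes:  (n_individuals, n_snps) array of allele counts (assumed coding)
    phenotypes: (n_individuals,) array
    generation: (n_individuals,) array of generation numbers
    effects_by_method: dict mapping a method name to its (n_snps,) estimated
        SNP effects, assumed to come from the training generations only
    """
    val = generation == validation_generation
    scores = {}
    for name, effects in effects_by_method.items():
        gebv = genotypes[val] @ effects  # GEBV = sum of genotype x effect
        scores[name] = float(np.corrcoef(gebv, phenotypes[val])[0, 1])
    best = max(scores, key=scores.get)
    return best, scores
```

The method returned as `best` would then be refitted on the first four generations before predicting the fifth, as described in the text.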

Detecting QTL

The parameter that was used for QTL detection was the variance of the GEBV of chromosome segments comprised of 10 adjacent SNPs, which we termed windows. First, SNP effects and variances were estimated using individuals in the first four generations by BayesCπ, as described above. The GEBV for the 10-SNP window

and the variance of this prediction was calculated across individuals in the first four generations. For 1-SNP windows, this method is equivalent to calculating SNP variance as _{j} . Windows with variance of GEBVs above a predefined threshold were identified as QTL regions. Significant windows that overlapped were considered to identify the same QTL if there was only one variance peak among the SNPs covered by them. The variance for each window was graphically presented against genomic location of the SNP on the x-axis. Within each selected region, the SNP with the largest variance was used to quantify the position and variance of the QTL.
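As a hedged sketch of the window criterion: the GEBV of a window for one individual is taken here as the sum, over the SNPs in the window, of the genotype code times the estimated SNP effect, and the window variance is the variance of that sum across individuals. The code assumes overlapping (sliding) windows along a single chromosome and allele-count genotype coding; `window_variances` is an illustrative name.

```python
import numpy as np

def window_variances(genotypes, snp_effects, window_size=10):
    """Variance across individuals of the GEBV of each sliding SNP window.

    genotypes:   (n_individuals, n_snps) allele counts for one chromosome
    snp_effects: (n_snps,) estimated SNP effects (e.g. posterior means)
    Returns an array of length n_snps - window_size + 1; entry k is the
    variance for the window covering SNPs k .. k + window_size - 1.
    """
    contributions = genotypes * snp_effects       # per-SNP part of the GEBV
    csum = np.cumsum(contributions, axis=1)
    csum = np.hstack([np.zeros((genotypes.shape[0], 1)), csum])
    # Difference of cumulative sums gives the GEBV of every window at once.
    window_gebv = csum[:, window_size:] - csum[:, :-window_size]
    return window_gebv.var(axis=0, ddof=1)
```

With `window_size=1` the function returns the sample variance of each SNP's genotype codes multiplied by the squared SNP effect.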

The threshold for the window variance for declaring presence of a QTL was determined by deriving the distribution of the window variance in data simulated under the null hypothesis of no LD between QTL and SNPs. Three strategies were used to generate data sets without LD between QTL and SNPs but using the original phenotypes, so as to maintain the distribution of phenotypes. The first strategy was simply to permute phenotypes against SNP genotypes across individuals in the training data. This strategy maintains LD relationships among SNPs in the original data but breaks all pedigree relationships and prevents SNPs from accounting for polygenic effects in the permuted data, in contrast to what happens in real data from pedigree populations _{e}
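The first (permutation) strategy can be sketched in a few lines. This is an illustration under assumptions: `permuted_dataset` is a hypothetical helper, and the refitting of the model on the permuted data is left to whichever estimation program is used.

```python
import numpy as np

def permuted_dataset(genotypes, phenotypes, seed=0):
    """Shuffle phenotypes across individuals, leaving genotypes untouched.

    LD among SNPs is preserved because the genotype matrix is unchanged,
    and the phenotype distribution is preserved because the values are
    only reordered. The link between genotype and phenotype is broken.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(phenotypes))
    return genotypes, phenotypes[order]
```

Refitting the model on the returned pair and recomputing the window variances gives one draw from the null distribution under this strategy.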

To account for multiple testing across a chromosome, significance levels for the window variance were adjusted by dividing desired chromosome-wise type I error rates by the effective number of loci (_{e}_{e}
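A Bonferroni-type threshold of this kind can be sketched as below. The sketch assumes the effective number of loci is supplied by the user (its formula is not reproduced here) and that null window variances come from one of the simulated or permuted data sets; the function and parameter names are illustrative.

```python
import numpy as np

def window_variance_threshold(null_variances, target_alpha, effective_loci):
    """Upper quantile of null window variances after a Bonferroni-type split.

    null_variances: window variances obtained under the null hypothesis
    target_alpha:   desired overall type I error rate (e.g. 0.10 or 0.20)
    effective_loci: effective number of independent loci (assumed given)
    """
    per_test_alpha = target_alpha / effective_loci
    return float(np.quantile(null_variances, 1.0 - per_test_alpha))
```

A window in the real data would then be declared significant when its variance exceeds the returned value; a smaller `target_alpha` gives a higher threshold.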

Results and discussion

Accuracy of GEBV prediction

The accuracy of GEBVs was estimated in three ways: (i) the correlation of GEBVs with phenotypes divided by the square root of heritability (estimated from the full dataset with pedigree relationships using ASREML

Prediction accuracy of GEBV, correlation of GEBV with TBV, correlation of GEBV with genotypic value (

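| Method | Polygenic effects | Correlation of GEBV with y^{1} | Correlation of GEBV with TBV | Correlation of GEBV with genotypic value | Regression on GEBV of y | Regression on GEBV of TBV | Regression on GEBV of genotypic value |
|---|---|---|---|---|---|---|---|
| **P-BLUP** | | 0.545 | 0.410 | 0.538 | 1.156 | 1.003 | 1.005 |
| **G-BLUP** | No Poly | 0.746 | 0.610 | 0.753 | 1.006 | 0.949 | 0.895 |
| **G-BLUP** | Poly | 0.737 | 0.597 | 0.752 | 0.961 | 0.898 | 0.863 |
| **BayesB, π = 0.75** | No Poly | 0.781 | 0.632 | 0.776 | 1.018 | 0.950 | 0.892 |
| **BayesB, π = 0.75** | Poly | 0.778 | 0.628 | 0.783 | 0.984 | 0.916 | 0.873 |
| **BayesB, π = 0.95** | No Poly | 0.788 | 0.640 | 0.787 | 1.023 | 0.960 | 0.901 |
| **BayesB, π = 0.95** | Poly | 0.784 | 0.634 | 0.793 | 0.983 | 0.916 | 0.875 |
| **BayesB, π = 0.99** | No Poly | 0.793 | 0.646 | 0.795 | 1.031 | 0.967 | 0.909 |
| **BayesB, π = 0.99** | Poly | 0.790 | 0.636 | 0.797 | 0.981 | 0.911 | 0.872 |
| **BayesCπ** | No Poly | 0.796 | 0.650 | 0.800 | 1.011 | 0.952 | 0.895 |
| **BayesCπ** | Poly | 0.796 | 0.642 | 0.804 | 0.989 | 0.921 | 0.880 |
| **BayesCπ gen 5**^{2} | No Poly | – | 0.679 | 0.894 | – | 0.959 | 0.965 |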

Results are based on training on the first three generations and validation on generation 4 using P-BLUP, G-BLUP, BayesB with different π's, and BayesCπ, and without (No Poly) and with (Poly) polygenic effects.

^{1}Calculated as correlation of phenotype (y) with GEBV, divided by the square root of estimated heritability.

^{2}Training on the first 4 generations and predicting generation 5.
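The two summary statistics reported in this table can be computed as in the sketch below: accuracy as the correlation of phenotype with GEBV divided by the square root of heritability (footnote 1), and bias as the regression coefficient of a target (phenotype, TBV, or genotypic value) on GEBV. The function names are illustrative, and the heritability estimate is assumed to be supplied.

```python
import numpy as np

def accuracy_from_phenotypes(gebv, phenotypes, heritability):
    """Correlation of phenotype with GEBV, scaled by sqrt(heritability)."""
    r = np.corrcoef(gebv, phenotypes)[0, 1]
    return float(r / np.sqrt(heritability))

def regression_on_gebv(target, gebv):
    """Slope of target regressed on GEBV: cov(target, GEBV) / var(GEBV).

    A slope near 1 indicates little bias in the scale of the GEBV.
    """
    cov = np.cov(target, gebv)
    return float(cov[0, 1] / cov[1, 1])
```

A slope above 1 means the GEBV are under-dispersed relative to the target, and a slope below 1 means they are over-dispersed.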

The simulated QTLMAS2010 dataset had 30 biallelic additive QTL, 2 pairs of epistatic QTL and 3 paternally imprinted QTL. The QTL in each epistatic pair were close together and behaved as a single multi-allelic additive QTL. Each of the epistatic QTL pairs and the imprinted QTL had the same effect as the largest additive QTL. The genotypic value of an individual was the sum of the genotypic values expressed in the phenotype at each of the QTL, but the TBV also accounted for the imprinting effects that the individual had on its progeny. Thus, the TBV could deviate considerably from the genotypic value because the imprinted QTL had large effects. In this study, all marker-based methods fitted only additive effects of SNPs, derived from the regression of phenotype on SNP genotype, which includes the effect of the imprinted QTL. As a result, as shown in Table

The accuracy of P-BLUP was lowest among all methods, as expected. Method G-BLUP, which always fitted all SNPs in the model, had lower accuracy than BayesB and BayesCπ. The Bayesian methods had quite similar accuracies, but BayesCπ tended to be the most accurate. Methods that fitted fewer SNPs performed better than those that fitted more. This might be explained by the fact that under the marker density of QTLMAS2010 data (measured as average r^{2}=0.22 between adjacent markers on chromosome 1, following Calus

The posterior mean of π in BayesCπ was 0.988, that is, on average 124 SNPs were fitted in the model, which was similar to that of BayesB when π = 0.99 (Table

Average number of SNPs (#SNP) fitted in the model, estimated variance components, and estimated heritability (Heritability).

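| Method | Polygenic effects | #SNP | Marker variance | Polygenic variance | Genetic variance^{1} | Residual variance | Total variance | Heritability |
|---|---|---|---|---|---|---|---|---|
| **True value**^{2} | | – | – | – | 51.76 | 51.76 | 103.52 | 0.500 |
| **P-BLUP** | | – | – | 54.44 | 54.44 | 48.68 | 103.12 | 0.528 |
| **G-BLUP** | No Poly | 10031 | 44.54 | – | 44.54 | 54.84 | 99.38 | 0.448 |
| **G-BLUP** | Poly | 10031 | 38.53 | 12.09 | 50.62 | 49.04 | 99.66 | 0.508 |
| **BayesB, π = 0.75** | No Poly | 2508 | 44.28 | – | 44.28 | 54.08 | 98.36 | 0.450 |
| **BayesB, π = 0.75** | Poly | 2508 | 39.05 | 11.06 | 50.11 | 48.32 | 98.43 | 0.509 |
| **BayesB, π = 0.95** | No Poly | 502 | 43.96 | – | 43.96 | 54.16 | 98.12 | 0.448 |
| **BayesB, π = 0.95** | Poly | 502 | 38.05 | 12.80 | 50.85 | 47.59 | 98.44 | 0.517 |
| **BayesB, π = 0.99** | No Poly | 100 | 43.44 | – | 43.44 | 54.58 | 98.02 | 0.443 |
| **BayesB, π = 0.99** | Poly | 100 | 37.43 | 12.35 | 49.78 | 48.30 | 98.09 | 0.508 |
| **BayesCπ** | No Poly | 124 | 45.68 | – | 45.68 | 53.63 | 99.31 | 0.460 |
| **BayesCπ** | Poly | 80 | 40.21 | 10.33 | 50.54 | 48.58 | 99.12 | 0.510 |
| **BayesCπ gen 5**^{3} | No Poly | 92 | 47.13 | – | 47.13 | 53.48 | 100.61 | 0.468 |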

Results are based on training on the first three generations and validation on generation 4 using P-BLUP, G-BLUP, BayesB with different π’s, and BayesCπ, and without (No Poly) and with (Poly) polygenic effects.

^{1}Total genetic variance = marker variance + polygenic variance.

^{2}Total QTL variance = residual variance = 51.76 in the QTLMAS2010 dataset.

^{3}Training on the first 4 generations.
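Two columns of this table follow from simple arithmetic, sketched here as a check. The heritability is taken as genetic variance divided by the sum of genetic and residual variance, and the expected number of SNPs fitted under a fixed π is (1 − π) times the total number of SNPs (10031, the number fitted by G-BLUP). The helper names are illustrative.

```python
def heritability(genetic_variance, residual_variance):
    """Ratio of genetic variance to total (genetic + residual) variance."""
    return genetic_variance / (genetic_variance + residual_variance)

def expected_snps_fitted(n_snps, pi):
    """Expected number of SNPs with non-zero effect when a fraction pi is zero."""
    return (1.0 - pi) * n_snps

# G-BLUP without polygenic effects: 44.54 / (44.54 + 54.84)
print(round(heritability(44.54, 54.84), 3))
# BayesB with pi = 0.75, 0.95, 0.99 on 10031 SNPs
print([round(expected_snps_fitted(10031, p)) for p in (0.75, 0.95, 0.99)])
```

For BayesCπ, plugging the posterior mean π = 0.988 into the same formula gives about 120 SNPs, close to the reported average of 124.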

The bias of GEBV was evaluated based on the departure from unity of the regression coefficients of phenotype, TBV, and genotypic value on GEBV in the validation data (Table

Model BayesCπ without polygenic effects was applied to obtain the GEBVs of the final generation (5), with training on the first four generations, because it gave high accuracy and small bias of GEBV when trained on the first three generations. Results at the bottom of Table

Estimated variances

Variance components estimated by the different models are shown in Table

QTL Mapping

Several parameters estimated by BayesCπ can be used to identify QTL regions, for instance, the absolute estimated effects of SNPs, the posterior inclusion probabilities (model frequencies) of SNPs, and the genetic variances explained by SNPs. Many Bayesian QTL mapping studies have applied model frequency or its derivatives as criteria to detect QTL ^{2}=0.22 between adjacent markers on chromosome 1) and the effect of a single QTL could be spread over multiple SNPs. This results in too many signals in model frequency which could increase the probability of false positives and false negatives. To address this problem, we accumulated the effects of adjacent SNPs together into a genomic window. A window size of 10 was used in this study and the variance of GEBV of each 10-SNP window was used as the criterion to detect QTL. Several windows that shared the same SNP with a large effect were considered to identify the same QTL region. Within each region, because windows were overlapping, the window with the highest variance of GEBV was used and the SNP within this window that explained the largest proportion of genetic variance was used to denote the position of the QTL (Figure

**Variances of GEBVs of 10-SNP windows across the genome.** Data sets were generated by permutation (Permuted dataset), simulation with linkage equilibrium in founders (LE simulation dataset), and simulation with initial linkage disequilibrium (LD simulation dataset). The bottom panel shows window variances obtained for the original QTLMAS 2010 dataset (Original dataset), as well as the locations and variances of true QTLs, along with their mode of inheritance (Additive = additive QTL, Epistatic = epistatic QTL, Imprinted = imprinted QTL). Horizontal lines show the 10% (solid) and 20% (dashed) chromosome-wise thresholds for window variance derived from the LD simulation.

Results of the three strategies to set significant thresholds are summarized in Table

Variance components estimated from datasets generated by permutation, simulation with linkage equilibrium in founders (LE simulation), and simulation with initial linkage disequilibrium (LD simulation), and thresholds for 10-SNP window variances based on 10% and 20% chromosome-wise type I error rates.

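| Method | Genotypic variance | Residual variance | Total variance | Window variance threshold, 10% | Window variance threshold, 20% |
|---|---|---|---|---|---|
| Permutation | 3.15 | 98.59 | 101.74 | 0.0011 | 0.0009 |
| LE simulation | 20.83 | 79.84 | 100.67 | 0.0204 | 0.0094 |
| LD simulation | 17.14 | 83.40 | 100.55 | 0.1645 | 0.0887 |
| Original^{1} | 47.13 | 53.48 | 100.61 | – | – |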

^{1}Estimated from the original QTLMAS2010 dataset using BayesCπ, training on the first 4 generations.

The threshold allowing a 10% chromosome-wise type-I error rate detected 13 QTLs of which 2 were false positives (Figure

Adjustment for multiple testing used a Bonferroni-type correction based on an estimate of the effective number of independent tests conducted. A more appropriate adjustment would be to replicate the simulation multiple times and take the highest window variance within each replicate. This replication procedure would resemble the method based on permutation tests proposed by Churchill
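The replication idea mentioned above can be sketched as a maximum-statistic threshold. This is an illustration only, not something carried out in the study: it assumes a matrix of null window variances with one row per simulated replicate, and `max_statistic_threshold` is a hypothetical name.

```python
import numpy as np

def max_statistic_threshold(replicate_window_variances, alpha):
    """Threshold from the distribution of per-replicate maxima.

    replicate_window_variances: (n_replicates, n_windows) array of window
        variances, each row from one data set simulated under the null
    alpha: desired type I error rate over all windows tested jointly
    """
    maxima = np.max(replicate_window_variances, axis=1)
    return float(np.quantile(maxima, 1.0 - alpha))
```

Because each replicate contributes only its largest window variance, the resulting threshold controls the error rate over all windows in the scanned region without needing an estimate of the number of independent tests.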

The window variance calculated using the sum of model-averaged SNP effects within a specific window will tend to underestimate the true QTL variance because of the shrinkage of SNP effects by BayesCπ and the incomplete LD between SNPs and QTL. Estimation of the variance of a window can be improved by computing the variance based on the sampled window effects from each sample of the MCMC chain, which are less shrunk than the posterior mean of the window effects.
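The difference between the two estimators can be illustrated with a short sketch. It assumes the MCMC samples of SNP effects for one window are stored as an array with one row per sample; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def window_variance_from_samples(window_genotypes, effect_samples):
    """Compare two estimates of the variance of a window's GEBV.

    window_genotypes: (n_individuals, n_snps_in_window) allele counts
    effect_samples:   (n_samples, n_snps_in_window) sampled SNP effects
    Returns (variance of the posterior-mean window GEBV,
             mean over samples of the per-sample window variance).
    """
    sampled_gebv = window_genotypes @ effect_samples.T  # (n_ind, n_samples)
    var_of_mean = sampled_gebv.mean(axis=1).var(ddof=1)
    mean_of_var = sampled_gebv.var(axis=0, ddof=1).mean()
    return float(var_of_mean), float(mean_of_var)
```

Because the variance is a convex function of the vector of window GEBVs, the first value can never exceed the second; the two coincide only when all samples give the same window GEBVs up to a constant shift.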

Although grouping SNPs into windows is effective for concentrating signals, it also has several drawbacks. First, if two QTL fall into the same region, the window variance would likely detect them as one QTL; for example, additive QTL11 and QTL12 were detected as a single QTL (Figure

The use of windows in this study is fundamentally different from the use of haplotypes to detect QTL, although both use combinations of adjacent markers. An alternative would be to construct haplotypes from two or more adjacent SNP alleles and estimate haplotype effects using Bayesian methods. Villumsen

Conclusions

In this simulated dataset, BayesCπ slightly outperformed BayesB in the accuracy of predicting GEBV, but the accuracy of BayesB was similar to that of BayesCπ when its π was set close to the posterior mean of π from BayesCπ. The prediction accuracy of TBV was lower than that of genotypic values. Window variance allowed detection of most large QTLs but had insufficient power to detect the small QTLs. Since the model only captured additive effects of QTLs, each epistatic QTL-pair was detected as one multi-allelic additive QTL and the two imprinted QTLs were not detected. The results expose the need for advanced statistical approaches to address more complicated patterns of genetic effects that exist in real data.

List of abbreviations used

GEBV: Genomic Estimated Breeding Value; SNP: Single Nucleotide Polymorphism; QTL: Quantitative Trait Locus; LD: Linkage Disequilibrium; GWAS: Genome-wide Association Study; BLUP: Best Linear Unbiased Prediction; EBV: Estimated Breeding Value; MCMC: Markov Chain Monte Carlo; LE: Linkage Equilibrium; TBV: True Breeding Value

Authors’ contributions

XS carried out the analysis and drafted the manuscript. DH contributed to programming BayesCπ, developed the program for pedigree-based simulation, and helped to interpret the results. RLF contributed to programming BayesCπ and helped to interpret the results. DJG contributed to programming BayesCπ and helped to interpret the results. JCMD was the overall coordinator of the project, developed the method to set thresholds, and helped to interpret the results and draft the manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

Support for this research was provided by USDA AFRI Competitive Grants No. 2007-35205-17862 and 2010-65205-20341 from the National Institute of Food and Agriculture. Organizers of the 2010 QTLMAS workshop are acknowledged for providing the data.

This article has been published as part of