Research Unit Genetics and Biometry, Leibniz Institute for Farm Animal Biology (FBN), Wilhelm-Stahl-Allee 2, 18196 Dummerstorf, Germany
Abstract
Background
Molecular marker information is commonly used to draw inferences about the relationship between genetic and phenotypic variation. Genetic effects are often modelled as additively acting marker allele effects, although the true mode of biological action may differ from this simple assumption. One way to better understand the genetic architecture of complex traits is to include intra-locus (dominance) and inter-locus (epistasis) interactions of alleles, in addition to the additive genetic effects, when fitting a model to a trait. Several Bayesian MCMC approaches exist for the genome-wide estimation of genetic effects with high accuracy of genetic value prediction. Including pairwise interactions for thousands of loci would probably exceed the capacity of such a sampling algorithm, because millions of effects would have to be estimated simultaneously, leading to months of computation time. Alternative solving strategies are therefore required when epistasis is studied.
Methods
We extended a fast Bayesian method (fBayesB), which was previously proposed for a purely additive model, to include nonadditive effects. The fBayesB approach was used to estimate genetic effects on the basis of simulated datasets. Different scenarios were simulated to study the loss of prediction accuracy when epistatic effects were modelled but not simulated, and vice versa.
Results
If 23 QTL were simulated to cause additive and dominance effects, both fBayesB and the conventional MCMC sampler BayesB yielded similar results in terms of accuracy of genetic value prediction and bias of variance component estimation based on a model including additive and dominance effects. Applying fBayesB to data with epistasis, accuracy could be improved by 5 percentage points when all pairwise interactions were modelled as well. The accuracy decreased by more than 20 percentage points if genetic variation was spread over 230 QTL. In this scenario, accuracy based on modelling only additive and dominance effects was generally superior to that of the complex model including epistatic effects.
Conclusions
This simulation study showed that the fBayesB approach is convenient for genetic value prediction. Jointly estimating additive and nonadditive effects (especially dominance) has a reasonable impact on the accuracy of prediction and on the proportion of genetic variation assigned to the additive genetic source.
1 Background
Molecular marker information is commonly used to draw inferences about the relationship between genetic and phenotypic variation in various species, e.g. humans
Different approaches are available to model additive and nonadditive genetic effects. Under the aspect of QTL detection, a genome scan can be carried out to uncover genetic effects using, for example, a variance component method
The objective of this study is to explore the impact of nonadditive effects on the prediction of genetic values in a livestock population. The aim is an improved estimation of additive effects and a better prediction of genetic values when additive and nonadditive effects are jointly included in the model fitted to a trait. Since methods that aim to estimate nonadditive effects in arbitrary populations are just emerging, it is especially important to validate such approaches with simulations. This study therefore focuses on methodological aspects and assembles facts that will help to interpret results obtained with practical data in future work. We consider additive, dominance and pairwise epistatic effects captured by biallelic markers spread over the whole genome. The details of statistical modelling are presented in the first part of the paper. We extend the fast Bayesian method (fBayesB), which was developed under pure additivity
2 Methods
2.1 Statistical model
For the statistical analysis of genetic effects in a Bayesian framework, a hierarchical model is constructed similar to that of Meuwissen
This model is set up in the way of an F_{∞} model
This work relies on two assumptions. Firstly, linkage equilibrium (LE) between the different markers is assumed. Then genotypic effects at different loci are independently distributed and the estimation strategy does not depend on the order of markers. Secondly, in order to avoid the estimation of covariance components at intralocus investigations, the additive genetic value and the dominance genetic value are assumed uncorrelated at each locus, i.e. Cov(
The second and third column of
where
where the design matrices
To obtain numerical stability in later calculations, coefficients of the main genetic effects are additionally standardised. Let
Now the hierarchical structure of M1 can be characterised by the following prior distributions
In a second step, the pairwise epistatic effects are modelled. The genetic effect caused by an interaction between locus
As an example,
2.2 Parameter estimation
The essence of the fBayesB approach is the iterative conditional expectation (ICE) algorithm, which is described in detail by Meuwissen
We carry out
where
Set
Now the conditional expectation was determined analytically in Meuwissen
With
The Θ
We introduce a slight modification to fBayesB as we update the estimated residual variance components in each iteration
Then
otherwise the iterations stop at
Eventually, as a consequence of the standardisation, the genetic variance components are estimated as
The suitability of the statistical models M1 and M2 is compared among the different simulated scenarios in terms of accuracy, which is the empirical correlation between predicted and simulated
When studying only the main genetic effects via M1, the results of fBayesB are compared with BayesB
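The iterative scheme of this subsection can be illustrated with a minimal sketch. It is not the authors' implementation: the conditional expectation used by fBayesB is given analytically in the cited work and is not reproduced here, so the sketch swaps in the posterior mean under a simple spike-and-slab prior (point mass at zero with probability 1 - gamma, normal slab with variance tau2). The names `shrink`, `ice`, `simulate_example`, `gamma`, `tau2` and the convergence threshold are assumptions made for illustration. The sketch keeps two features described above: effects are updated one at a time given all others, and the residual variance is re-estimated in every iteration.

```python
import math
import random


def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def shrink(y_hat, s2, gamma, tau2):
    """Posterior mean of one effect given its least-squares estimate y_hat
    (sampling variance s2) under an ASSUMED spike-and-slab normal prior."""
    slab = gamma * normal_pdf(y_hat, s2 + tau2)
    spike = (1.0 - gamma) * normal_pdf(y_hat, s2)
    weight = slab / (slab + spike)  # posterior probability of a nonzero effect
    return weight * tau2 / (tau2 + s2) * y_hat


def ice(X, y, gamma=0.1, tau2=1.0, sigma2_e=1.0, tol=1e-8, max_iter=200):
    """Iterate conditional expectations over all effects until they stabilise."""
    n, p = len(X), len(X[0])
    g = [0.0] * p
    resid = list(y)  # residuals for the starting values g = 0
    xtx = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(max_iter):
        change = 0.0
        for j in range(p):
            # least-squares estimate of effect j, all other effects held fixed
            y_hat = sum(X[i][j] * resid[i] for i in range(n)) / xtx[j] + g[j]
            new = shrink(y_hat, sigma2_e / xtx[j], gamma, tau2)
            delta = new - g[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            change += delta * delta
            g[j] = new
        sigma2_e = sum(r * r for r in resid) / n  # residual variance update
        if change < tol:
            break
    return g, sigma2_e


def simulate_example(seed=1, n=200, p=20):
    """Toy data: standardised biallelic genotypes, three nonzero effects."""
    rng = random.Random(seed)
    truth = [0.0] * p
    truth[2], truth[7], truth[15] = 1.0, -1.0, 0.8
    cols = []
    for _ in range(p):
        col = [float((rng.random() < 0.5) + (rng.random() < 0.5)) for _ in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((c - mean) ** 2 for c in col) / n)
        cols.append([(c - mean) / sd for c in col])
    X = [[cols[j][i] for j in range(p)] for i in range(n)]
    y = [sum(X[i][j] * truth[j] for j in range(p)) + rng.gauss(0.0, 0.5)
         for i in range(n)]
    mean_y = sum(y) / n
    return X, [v - mean_y for v in y], truth
```

Calling `ice(*simulate_example()[:2])` returns the estimated effects and the residual variance for the toy data. Because no sampling is involved, a run of this kind takes a fixed, small number of sweeps over the effects, which is the property that makes the approach attractive when millions of interaction terms are considered.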
2.3 Simulation study
Data generation
The simulated population is built up in such a way that it reflects a realistic dairy cattle population. We applied a mutation-drift model and simulated a population with an effective population size of 100 animals and 52 273 single nucleotide polymorphisms (SNPs) on a 30 Morgan genome (in the style of the Illumina Chip BovineSNP50 and based on Btau4.0
Mean (
23-QTL scenario
230-QTL scenario
additive × additive
additive × dominance
dominance × additive
dominance × dominance
Scale of genetic effects
For convenience, the phenotypes were simulated on the basis of an F_{∞} model, but the genetic effects were estimated on the orthogonal scale. We employed the equivalence between the representations of genotypic values in (1) and (2) to obtain the translation between scales
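The idea of translating between scales can be sketched for a single biallelic locus. The equations (1) to (3) of the paper are not reproduced here, so the sketch uses the textbook one-locus parameterisation under Hardy-Weinberg equilibrium as an assumption: F_∞ genotypic values mu + a, mu + d, mu - a, and orthogonal codes (2q, q - p, -2p) for the additive and (-2q², 2pq, -2p²) for the dominance part. The paper's own coding may differ in sign convention, and it additionally standardises the coefficients, which is omitted. All function names are hypothetical.

```python
def orthogonal_from_f_infinity(a, d, p):
    """Translate F-infinity effects (a, d) of one locus to the orthogonal scale.

    p is the frequency of the allele whose homozygote has value mu + a;
    Hardy-Weinberg equilibrium is assumed.
    """
    q = 1.0 - p
    alpha = a + (q - p) * d                 # additive (allele substitution) effect
    delta = d                               # dominance effect
    shift = (p - q) * a + 2.0 * p * q * d   # change of the intercept
    return alpha, delta, shift


def genotypic_values_f_infinity(mu, a, d):
    """Genotypic values for (A1A1, A1A2, A2A2) on the F-infinity scale."""
    return [mu + a, mu + d, mu - a]


def orthogonal_codes(p):
    """Additive and dominance codes for (A1A1, A1A2, A2A2)."""
    q = 1.0 - p
    w = [2.0 * q, q - p, -2.0 * p]
    v = [-2.0 * q * q, 2.0 * p * q, -2.0 * p * p]
    return w, v


def genotypic_values_orthogonal(mu, a, d, p):
    """The same genotypic values, rebuilt from the orthogonal representation."""
    alpha, delta, shift = orthogonal_from_f_infinity(a, d, p)
    w, v = orthogonal_codes(p)
    return [mu + shift + alpha * wi + delta * vi for wi, vi in zip(w, v)]
```

Both representations give identical genotypic values for any allele frequency, while the additive and dominance codes are uncorrelated under Hardy-Weinberg proportions. This is why the additive effect on the orthogonal scale depends on both a and d as soon as p differs from 0.5.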
If epistasis was simulated, the genetic effects on the orthogonal scale were determined for all locus combinations
which directly led to the epistatic effects on the orthogonal scale. Due to the standardisation step in (3), the derived epistatic effect had to be multiplied by the corresponding scaling term. As an example for
Note that the order of loci (either
Hyperparameters and other settings
The parameter
In this study, we involved prior knowledge about the proportion of nonzero effects of the genetic variation source
Furthermore, to limit the number of iterations, we chose
In BayesB the main genetic effects were estimated simultaneously over the whole genome. A hyperparameter
Outline of data analysis
To begin with, we used every 10th marker (
3 Results
On average 567 loci per dataset had MAF ≤ 0.01. These loci were omitted, but loci deviating from HWE (on average one locus per dataset) were not excluded from the analysis. The average LD between adjacent SNPs was
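The marker filtering described in this paragraph can be sketched as follows. Genotypes are assumed to be coded as 0, 1 or 2 copies of one allele. The HWE check shown is a plain Pearson chi-square statistic with one degree of freedom, chosen only for illustration since the text does not state which test was used. The function names and the dictionary layout are hypothetical.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele from genotype codes 0, 1, 2."""
    return sum(genotypes) / (2.0 * len(genotypes))


def minor_allele_frequency(genotypes):
    p = allele_frequency(genotypes)
    return min(p, 1.0 - p)


def hwe_chi_square(genotypes):
    """Pearson chi-square statistic (1 df) for deviation from HWE proportions."""
    n = len(genotypes)
    p = allele_frequency(genotypes)
    q = 1.0 - p
    expected = {0: n * q * q, 1: 2.0 * n * p * q, 2: n * p * p}
    observed = {k: genotypes.count(k) for k in (0, 1, 2)}
    return sum((observed[k] - expected[k]) ** 2 / expected[k]
               for k in (0, 1, 2) if expected[k] > 0.0)


def filter_by_maf(markers, threshold=0.01):
    """Omit markers with MAF <= threshold; markers maps name -> genotype list."""
    return {name: g for name, g in markers.items()
            if minor_allele_frequency(g) > threshold}
```

In line with the text, only the MAF criterion removes markers in this sketch; the HWE statistic is computed for inspection but not used for exclusion.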
The differences between fBayesB and BayesB on the basis of M1 are compared. Table
Average estimated variance components (standard deviation in brackets) and average accuracy

Simulation without epistasis
Method   Model  additive  dominance  add×add  add×dom  dom×add  dom×dom  residual  accuracy
BayesB   M1     0.743     0.035      -        -        -        -        0.775     0.980
                (0.578)   (0.039)                                        (0.605)
fBayesB  M1     0.742     0.035      -        -        -        -        0.752     0.978
                (0.579)   (0.039)                                        (0.587)
fBayesB  M2     0.748     0.039      0.008    0.007    0.007    0.008    0.638     0.959
                (0.583)   (0.041)    (0.013)  (0.016)  (0.014)  (0.017)  (0.484)

Simulation with epistasis
Method   Model  additive  dominance  add×add  add×dom  dom×add  dom×dom  residual  accuracy
BayesB   M1     1.313     0.158      -        -        -        -        2.721     0.785
                (0.681)   (0.131)                                        (0.874)
fBayesB  M1     1.310     0.161      -        -        -        -        2.619     0.781
                (0.687)   (0.132)                                        (0.845)
fBayesB  M2     1.338     0.193      0.299    0.138    0.065    0.057    1.811     0.833
                (0.688)   (0.142)    (0.215)  (0.111)  (0.071)  (0.070)  (0.598)

*23-QTL scenario with 5 227 markers and
The additive and dominance effects were estimated equally well with both BayesB and fBayesB. As an example, Figure
Estimates of genetic effects if epistasis was absent in the 23-QTL scenario. (A) Additive and (B) dominance effects for a single dataset via M1 using fBayesB. Filled circles were plotted for each estimated effect > 10^{-4}. Single accuracy of genetic value prediction was 0.946.
M1 and M2 results are compared to study the impact of including or not including pairwise epistatic effects on the accuracy of predicting the genetic values in the test generations. As an example, Additional file
The figure shows estimates of genetic effects and location if epistasis was present in the 23-QTL scenario: (a) additive, (b) dominance, (c) additive × additive and (d) additive × dominance effects for a single dataset with M2 using fBayesB. Filled circles were plotted for each estimated effect
Average ratio of additive genetic variance to total genetic variance*

Simulation without epistasis
Model  23-QTL scenario  230-QTL scenario
M1     0.953            0.810
M2     0.918            0.581
       0.948            0.945

Simulation with epistasis
Model  23-QTL scenario  230-QTL scenario
M1     0.884            0.773
M2     0.626            0.401
       0.613            0.648

*fBayesB was used in both QTL scenarios with 5 227 markers and
The results obtained so far are based on
Average accuracy of genetic value prediction depending on broad-sense heritability

Simulation without epistasis
Model
M0     0.958  0.940  0.859  0.786
M1     0.978  0.953  0.844  0.774
M2     0.959  0.897  0.640  0.748

Simulation with epistasis
Model
M0     0.741  0.707  0.581  0.618
M1     0.781  0.736  0.582  0.621
M2     0.833  0.718  0.339  0.598

*fBayesB was used in the 23-QTL scenario with 5 227 markers. M0 includes only additive genetic effects, M1 includes additive and dominance effects, M2 includes additive, dominance and pairwise epistatic effects. In case of "best 10%" the accuracy of additive genetic value prediction was determined based on the 10% of animals with the best predicted additive genetic value.
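The two accuracy measures used in this table can be sketched as follows, assuming that accuracy is the Pearson correlation between predicted and simulated genetic values and that "best" means the highest predicted value. The function names and the rounding rule for the size of the selected group are illustrative choices, not taken from the paper.

```python
import math


def pearson(x, y):
    """Empirical (Pearson) correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def accuracy(predicted, simulated):
    """Accuracy of prediction over all animals."""
    return pearson(predicted, simulated)


def accuracy_best_fraction(predicted, simulated, fraction=0.10):
    """Accuracy among the animals with the highest predicted values.

    Assumption: the selected group holds round(fraction * n) animals,
    but never fewer than two, so that a correlation is defined.
    """
    k = max(2, int(round(fraction * len(predicted))))
    order = sorted(range(len(predicted)),
                   key=lambda i: predicted[i], reverse=True)[:k]
    return pearson([predicted[i] for i in order],
                   [simulated[i] for i in order])
```

The accuracy within the selected group can differ markedly from the overall accuracy, because the range of genetic values among the top animals is much narrower than in the whole population.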
To show that additionally modelling nonadditive genetic effects is beneficial when such effects were simulated, we compared the accuracy of genetic value prediction based on M1 with the accuracy obtained from a conventional model including only additive genetic effects, called M0. Except for constellations with
In a further step, we studied the consequences when the genetic variation was spread over a multitude of loci and compared the results obtained with BayesB and fBayesB. Furthermore, the 230-QTL scenario was compared with the outcomes of fBayesB in the 23-QTL case. When epistasis was not simulated in the 230-QTL scenario, the highest accuracy of genetic value prediction was obtained with M1, see Table
Average estimated variance components (standard deviation in brackets) and average accuracy

Simulation without epistasis
Method   Model  additive  dominance  add×add  add×dom  dom×add  dom×dom  residual  accuracy
BayesB   M1     0.631     0.056      -        -        -        -        0.652     0.860
                (0.204)   (0.035)                                        (0.180)
fBayesB  M1     0.699     0.165      -        -        -        -        0.413     0.760
                (0.207)   (0.065)                                        (0.132)
fBayesB  M2     0.732     0.304      0.036    0.065    0.068    0.074    0.170     0.608
                (0.214)   (0.112)    (0.028)  (0.036)  (0.042)  (0.046)  (0.066)

Simulation with epistasis
Method   Model  additive  dominance  add×add  add×dom  dom×add  dom×dom  residual  accuracy
BayesB   M1     0.949     0.215      -        -        -        -        1.968     0.585
                (0.250)   (0.067)                                        (0.266)
fBayesB  M1     0.920     0.267      -        -        -        -        1.567     0.543
                (0.197)   (0.080)                                        (0.282)
fBayesB  M2     1.277     0.910      0.171    0.275    0.296    0.305    0.493     0.340
                (0.230)   (0.269)    (0.086)  (0.106)  (0.127)  (0.126)  (0.257)

*230-QTL scenario with 5 227 markers and
Estimates of genetic effects if epistasis was absent in the 230-QTL scenario. (A) Additive and (B) dominance effects for a single dataset via M1 using fBayesB. Filled circles were plotted for each estimated effect > 10^{-4}. Single accuracy of genetic value prediction was 0.814.
The more QTL were simulated, the lower the observed accuracy. If ten times as many QTL were responsible for the genetic variation, the accuracy of prediction decreased by about 22 to 24 percentage points based on M1 and by 35 to 49 percentage points based on M2. Since the distances between QTL were smaller than in the 23-QTL scenario, we could expect that LD between loci contributed to the bias of the estimated variance components. For that reason we calculated the empirical variances obtained from the predicted effect-specific genetic values in the validation set, where the epistatic contribution was collected in one component. Table
Comparison of empirical variances of predicted genetic values and genetic variance components estimated under LE*

Simulation without epistasis
                  23-QTL scenario                  230-QTL scenario
Model             additive  dominance  epistatic   additive  dominance  epistatic
M1  empirical     0.743     0.035      -           0.711     0.163      -
M2  empirical     0.749     0.038      0.030       0.805     0.338      0.278

Simulation with epistasis
                  23-QTL scenario                  230-QTL scenario
Model             additive  dominance  epistatic   additive  dominance  epistatic
M1  empirical     1.309     0.161      -           0.981     0.266      -
M2  empirical     1.332     0.192      0.554       1.442     1.112      1.277

*fBayesB was used in both QTL scenarios with 5 227 markers and
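The empirical variances in this table are variances of predicted effect-specific genetic values across the validation animals. A minimal sketch is given below, assuming that each genetic component (additive, dominance, epistatic) has its own design matrix and vector of estimated effects; the dictionary layout and the function names are hypothetical.

```python
def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def component_values(design, effects):
    """Predicted genetic values of one component: design matrix times effects."""
    return [sum(x * g for x, g in zip(row, effects)) for row in design]


def empirical_variances(designs, effects):
    """Variance of predicted values per component.

    designs and effects are dicts keyed by component name, for example
    "additive", "dominance" and "epistatic".
    """
    return {name: sample_variance(component_values(designs[name], effects[name]))
            for name in designs}
```

Unlike a variance component obtained by summing locus-specific contributions under the LE assumption, the variance of the predicted values also contains the covariances between loci, so a difference between the two quantities points to a contribution of LD.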
Next we used the genome-wide SNP information in the statistical analysis (
Finally, in the real data example, we regarded
Estimated variance components for the real data example*

Model  additive  dominance  add×add  add×dom  dom×add  dom×dom  residual
M0     0.169     -          -        -        -        -        0.405
M1     0.171     0.030      -        -        -        -        0.378
M2     0.174     0.046      0.000    0.000    0.026    0.000    0.303

*Variance components for each source of genetic variation:
The fBayesB approach was applied to public data on a heterogeneous stock of mice. Genetic effects were estimated based on the different models including only additive effects (M0), additive and dominance effects (M1), additive, dominance and pairwise epistatic effects (M2).
4 Discussion
4.1 Hyperparameters and convergence
When we investigated the influence of a varying proportion of genetic to phenotypic variance on genetic value prediction in the 23-QTL scenario, we observed that fBayesB did not fulfil the convergence criterion in all situations. In the extreme case with M2 and
4.2 Proportion of nonzero effects
A preliminary study could show that the choice of the hyperparameter
4.3 Reduction of model dimensionality
SNP density continues to increase; soon whole-genome sequences will be used for statistical analysis
In order to keep as many parameters as required in the statistical model, one could apply a filtering procedure. The significance of putative nonzero effects might be determined, for example, via a stochastic variable selection approach (SVS). In the field of genomic selection, which is based only on additive effects, an SVS implementation of Meuwissen and Goddard
Dimensionality can also be reduced nonparametrically. As an example, a subset of SNPs may be selected via filtering based on entropy information and wrapping using a naive Bayesian classifier
4.4 Nonadditive effects
This study has shown that the inclusion of dominance effects in genetic value prediction improved accuracy compared to purely additive models (Table
In general, and as confirmed in our investigations, parametric methods have difficulty identifying and estimating epistatic effects. One reason is that the orthogonal decomposition of genetic effects only leads to proper results under idealised conditions (LE, absence of mutation and selection etc.) which are violated in practice
Once gene interactions are discovered, they may be used for mate allocation in livestock breeding, where individuals are mated to achieve favourable nonadditive gene combinations to further increase genetic gain
4.5 Number of simulated QTL
An increase in the number of QTL was accompanied by a reduction in the quality of fBayesB for genetic value prediction. fBayesB was able to identify only the biggest QTL effects in the simulated scenarios, in which (nearly) the same amount of genetic variation was spread over 23 or 230 QTL. Thus, effect size in the 230-QTL scenario was roughly one-tenth of that in the 23-QTL case. This complicated the identification of genetic effects in general and, in particular, of nonadditive effects, which contributed very little to the genetic variance when compared with additive effects. Many tiny effects were estimated with BayesB, even if genetic variation was caused by few QTL with large effects. In both QTL scenarios, accuracy of genetic value prediction was at a high level with BayesB. It may be more realistic to assume that most livestock traits are influenced by many loci and therefore best results can be expected with BayesB.
5 Conclusion
This simulation study showed that the fast Bayesian method (fBayesB) is convenient for genetic value prediction. It requires only a fraction of the computing time of the conventional MCMC approach BayesB and also enables the estimation of pairwise interactions.
The number of simulated QTL, the proportion of genetic to phenotypic variance and the number of SNPs in the statistical analysis influenced the accuracy of genetic value prediction and the bias of variance component estimation. Both methods obtained similar results when few QTL with additive and dominance effects were simulated; the maximum accuracy was 98%. As expected, best results were obtained on the basis of the true model corresponding to the simulated scenario, but the loss of accuracy due to using the incorrect model was limited to 2 to 5 percentage points. If many QTL were responsible for genetic variation, accuracy decreased by about 22 to 49 percentage points with fBayesB compared to the few-QTL scenario, depending on the model. Accuracy based on modelling only additive and dominance effects was generally superior to that of the complex model, whether or not epistasis was simulated, and an additional gain of 4 to 10 percentage points in accuracy was observed with BayesB. To sum up, existing approaches for genome-wide estimation of additive genetic effects can easily and robustly be extended by dominance effects to improve accuracy of genetic value prediction and to get further insight into the genetic architecture. In this simulation study, the inclusion of dominance was more important than involving all pairwise interactions, which did not improve prediction in general.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
DW implemented the statistical methods, carried out the analysis and wrote the manuscript. NM simulated the datasets and contributed to the data analysis. NR raised the initial question, advised on the research and suggested improvements to the manuscript. All authors have read and approved the final manuscript.
Acknowledgements
This study is part of the FUGATO project "Bovine Integrative Bioinformatics for Genomic Selection (BovIBI)" with financial support of the German Federal Ministry of Education and Research (BMBF).