Genomic selection is an effective tool for animal and plant breeding, allowing effective individual selection without phenotypic records through the prediction of genomic breeding value (GBV). To date, genomic selection has focused on a single trait. However, actual breeding often targets multiple correlated traits, and, therefore, joint analysis taking into consideration the correlation between traits, which might result in more accurate GBV prediction than analyzing each trait separately, is suitable for multi-trait genomic selection. This would require an extension of the prediction model for single-trait GBV to multi-trait case. As the computational burden of multi-trait analysis is even higher than that of single-trait analysis, an effective computational method for constructing a multi-trait prediction model is also needed.
We described a Bayesian regression model incorporating variable selection for jointly predicting GBVs of multiple traits and devised both an MCMC iteration and variational approximation for Bayesian estimation of parameters in this multi-trait model. The proposed Bayesian procedures with MCMC iteration and variational approximation were referred to as MCBayes and varBayes, respectively. Using simulated datasets of SNP genotypes and phenotypes for three traits with high and low heritabilities, we compared the accuracy in predicting GBVs between multi-trait and single-trait analyses as well as between MCBayes and varBayes. The results showed that, compared to single-trait analysis, multi-trait analysis enabled much more accurate GBV prediction for low-heritability traits correlated with high-heritability traits, by utilizing the correlation structure between traits, while the prediction accuracy for uncorrelated low-heritability traits was comparable or less with multi-trait analysis in comparison with single-trait analysis depending on the setting for prior probability that a SNP has zero effect. Although the prediction accuracy with varBayes was generally lower than with MCBayes, the loss in accuracy was slight. The computational time was greatly reduced with varBayes.
In genomic selection for multiple correlated traits, multi-trait analysis was more beneficial than single-trait analysis and varBayes was much advantageous over MCBayes in computational time, which would outweigh the loss of prediction accuracy caused by the approximation procedure, and is thus considered a practical method of choice.
Keywords:Genomic selection; Multiple traits; Bayesian regression; MCMC iteration; Variational approximation
A huge number of genome-wide polymorphisms have recently been elucidated in livestock and crops with the development of sequencing technologies. High-throughput genotyping systems, such as high-density SNP chips containing several tens or hundreds of thousands of genome-wide SNP markers and GBS (genotyping by sequence) , have become available to efficiently identify genotypes of individuals for a large number of SNPs at low cost. Consequently, genomic selection  is attracting attention as a new breeding technology utilizing the information of genome-wide dense SNP markers. In genomic selection, a model to predict genomic breeding value (GBV) based on genome-wide SNP genotype is firstly constructed using a training population consisting of individuals with both SNP genotypes and phenotypes of a trait and, subsequently, using this model, the GBVs of a trait are predicted for individuals in the tested population, which are selection candidates, based on their SNP genotypes, allowing effective individual selection to be performed without phenotypic records using the predicted GBV. Therefore, genomic selection requires the construction of prediction models being able to accurately relate genotypes of genome-wide SNPs to GBV.
In the original study of genomic selection by Meuwissen et al. , two Bayesian methods were presented for construction of prediction models, which were referred to as BaeysA and BayesB and have been used for the studies of genomic selection in their original or modified forms [3,4]. The model of BayesA is equivalent to the Bayesian shrinkage regression (BSR) model , where all SNPs are included in the prediction model as covariates and the estimates of SNP effects are shrunk by assuming a normal distribution with mean 0 and SNP-specific variance as prior distributions of SNP effects, resulting in negligible estimates for the effects of many SNPs irrelevant to phenotype, but significantly large estimates for the effects of SNPs contributing to phenotype. On the other hand, the BayesB method is regarded as a variant of Bayesian stochastic search variable selection (BSSVS) , where the prior probability, π, that a SNP has zero effect and is removed from the model was considered in the model fitting to obtain the best model explaining the phenotypes with a small number of SNP effects. A normal distribution with mean 0 and SNP-specific variance is assumed for the prior distribution of SNP effects in BayesB, as in BayesA, if SNPs are included in the model with probability 1-π and, otherwise, both the mean and variance are fixed at 0 with probability π.
In BayesA and BayesB, an additional hierarchical structure is induced for this SNP-specific variance, where an inverted chi-square distribution with degree of freedom ν and scale parameter S is adopted as the prior distribution of the variance. However, only the information of the relevant single SNP can be used for the posterior inference of this SNP-specific variance regardless of the number of genotypes or phenotypes. Due to the insufficient Bayesian learning for the variance, the degree of shrinkage for the estimates of SNP effects is largely influenced by the prior setting for ν and S in BayesA and BayesB . In BaeysB, the sparseness of the model and the prediction accuracy are also greatly affected by the prior probability, π, that a SNP has zero effect, which is treated as a known fixed value. To overcome these drawbacks of BayesA and BayesB, the modified methods such as BayesC and BayesD were proposed. In BayesC, a single variance common to all SNPs is adopted for the prior distribution of SNP effects while BayesD allows S to be inferred from the data, given ν. Furthermore, the step for inferring π, which is given as a fixed value in BayesB, can be incorporated in the procedure of BayesC and BayesD and the corresponding methods are termed BayesCπ and BayesDπ, respectively .
These Bayesian methods were mainly developed for genomic selection of a single trait. However, actual breeding of animals and plants often aims to simultaneously improve multiple correlated traits. Therefore, joint prediction of GBVs for multiple traits, taking into consideration the correlation structure between traits, is suitable for multi-trait genomic selection, which requires the extension of existing methods for single-trait GBV prediction to multi-trait case. In QTL mapping methods that use similar models to those of genomic selection, some researchers have developed multi-trait models [8-10]. Xu et al.  extended the BSR model to a multi-trait case introducing the correlation structure between traits in estimation of QTL effects and other non-genetic effects. Banerjee et al.  employed a Bayesian composite model space approach [11,12] for multi-trait QTL mapping, where a variable indicating inclusion or exclusion of each QTL was incorporated for constructing a model with the smallest possible number of QTLs, a similar approach to BSSVS adopted in BayesB, and each trait was allowed to have a different model by assuming the trait-specific effects for QTLs and non-genetic factors that were assumed to be uncorrelated between traits. Meuwissen and Goddard  proposed a multi-trait BSSVS model for QTL mapping. These methods can be applied for prediction of multi-trait GBVs in genomic selection. Calus and Veerkamp  applied the method proposed in  with some modification for multi-trait GBV prediction to compare the accuracy between single-trait prediction and multi-trait prediction in genomic selection.
The computational procedure with MCMC iteration is generally used for the Bayesian methods to estimate parameters in the models, which become complicated models in genomic selection, including a huge number of SNPs as covariates and SNP effects that are estimated as their regression coefficients. The computational burden of MCMC-based Bayesian methods, which requires a long time until convergence when estimating many parameters is huge even in single-trait GBV prediction and would be further increased in the case of multiple traits, thus hindering the MCMC procedure depending on the number of traits to be jointly analyzed. Therefore, it would be necessary to devise a solution for reducing the computational burden of Bayesian methods for multi-trait GBV prediction. So far, some non-MCMC computational procedures for Bayesian methods have been proposed in QTL mapping and genome-wide association study, including EM-algorithm  and variational approximation [15,16]. The EM-algorithm was also applied to single-trait GBV prediction, successfully reducing the computational time [17,18].
In this paper, we propose Bayesian methods for multi-trait GBV prediction, in which BSR models allowing variable selection are developed and MCMC procedures for estimating model parameters including SNP effects are described as well as a computationally cost-effective non-MCMC method using variational approximation as an alternative computational procedure. Hereafter, the Bayesian methods based on MCMC iteration and variational approximation are referred to as MCBayes and varBayes, respectively. The multi-trait Bayesian models described here include a Bayesian shrinkage regression (BSR) models that are equivalent to those adopted by BayesA and BayesD when variable selection is not conducted, and the models of BSSVS that are regarded as slightly modified versions of BayesB and BayesDπ methods when a step of variable selection step is incorporated. We develop computationally-effective variational approximation procedures to construct the posterior distributions of parameters of these Bayesian models.
Using simulated datasets consisting of genotypes of genome-wide dense SNPs and phenotypes of three correlated traits with high and low heritabilities, we investigated the differences in prediction accuracy for each trait between multi-trait analysis, where GBVs of three traits were simultaneously predicted taking the correlation structure between traits into consideration, and single-trait analysis, where each trait was separately predicted for GBV. We also evaluated the prediction accuracy of the varBayes methods in comparison with MCBayes. Moreover, we investigated the performance of multi-trait analysis in simulated data including missing phenotypes.
In this section, we describe Bayesian models for multi-trait GBV prediction and computational procedures for Bayesian estimation of the model parameters, including construction of posterior distributions of the parameters using MCMC iteration and variational approximation. Here, we consider the statistical models for BSR and BSSVS, which are shown to be equivalent to BayesD and similar to BayesDπ, respectively. In these Bayesian models concerned, BSR is regarded as a special version of BSSVS by setting π =0, where π is a prior probability that a SNP does not contribute to traits; thus, we present the multi-trait GBV prediction models in a unified fashion in terms of BSSVS.
We assume that the number of SNPs genotyped is N and a training dataset including n individuals with phenotypic records of multiple traits, where the number of traits is denoted by T, and SNP genotypes is available for estimating parameters in the prediction model. We also assume that a test dataset consists of individuals with only SNP genotypes, for each of which GBV is predicted. We denote two alleles at each SNP by 0 and 1 and three genotypes by ‘0_0’, ‘0_1’, and ‘1_1’.
Models for Bayesian stochastic search variable selection in multi-trait genomic selection
We propose the following Bayesian multi-locus linear model for the phenotypes of T traits of the ith individual, which are denoted as a Tx1 vector yi (i = 1,2,…,n), as a BSSVS model for multi-trait GBV prediction:
where b is a f × 1 vector of non-genetic effects including interceptions of the model with Xi being a T × f design matrix linking b to the ith individual; uil is a variable indicating the genotype of the ith individual at the lth SNP taking a value of −1, 0 or 1 corresponding to the genotypes, ‘0_0’, ‘0_1’, or ‘1_1’, respectively; gl is a T × 1 vector of the effects of the lth SNP on the phenotypes of T traits; γl is a variable indicating the inclusion of the lth SNP in the model with 1 or 0 depending on whether or not the lth SNP is included in the model; and ei is a T × 1 vector of residual errors following a T-variate normal distribution N(0, Σe) with 0 being a T × 1 vector of zeros and Σe being a TxT covariance matrix of ei.
Within the Bayesian framework, prior distributions are assigned to the parameters of the model (1). We assume that the priors of the elements of b are the improper uniform distribution over the possible values. The prior probabilities that γl =1 and γl =0 are presented as 1- π and π, respectively, considering the prior probability that a SNP does not contribute to the trait and is excluded from the model is π. The prior distribution of gl given γl has the form
where Σgl is a T × T matrix that is the variance and covariance matrix of gl given γl =1 and δ (0) is the delta function that concentrates a total mass at zero for all T elements of gl. We induce a hierarchical structure for Σgl by assigning an inverse Wishart distribution with degree of freedom ν and scale parameter S, denoted by IWT(ν,S), to the prior distribution of Σgl :
Although the SNP effect gl is zero and irrelevant to Σgl when γl =0 as shown in (2), we assume that Σgl is a priori distributed as given in (3) regardless of γl. We treat ν as a given fixed value but infer S from the data within the Bayesian framework. For simplicity, we assume here that S is a diagonal matrix with the kth diagonal element being sk , that is S = diag(sk) (k = 1,2,…,T), and we adopt the improper uniform distribution over positive values for a prior distribution of sk. The residual variance and covariance matrix Σe is assumed to have a uniform distribution over the positive definite matrices as a prior distribution.
Denoting these parameters of the Bayesian model collectively by θ, the prior distribution of θ by p(θ) and observed phenotypes yi and SNP genotypes uil (i = 1,2,…,n;l = 1,2,…,N) by Y and U, respectively, we can write the posterior distribution of θ, g(θ|ν,Y, U), as
Given that S is fixed, posterior distribution (4) is equivalent to that of BayesA extended to multi-trait case when π = 0 leading to γl =1 for all SNPs. When π > 0, the selection of SNPs to be included in the model is performed as in BayesB. However, it should be noted that posterior distribution (4) is not equivalent to that of the multi-trait version of BayesB, which supposes that Σgl is a zero matrix as well as gl when γl =0, while, in (4), the prior distribution of Σgl is assumed to be IWT(ν, S) regardless of γl. The form of posterior distribution (4) allows Gibbs sampling in the MCMC estimation for all parameters including gl and Σgl. The full conditional posterior distributions of parameters used in MCBayes are described in Additional file 1.
Variational approximation procedure for multi-trait Bayesian model
We adopted variational approximation as an alternative to MCMC iteration for constructing marginal posterior distributions of parameters based on joint posterior distribution (4). In the variational approximation procedure, the joint posterior distribution is approximated by the product of functions for subsets of parameters with lower dimension. Briefly, we assume that g(θ|ν, Y, U) is approximated by a function of θ, q(θ), which is factorized as q(θ) = q1(θ1) q2(θ2)…qK(θK), where θ1, θ2,… ,θK are mutually exclusive subsets of θ such that ∪ k = 1Kθk = θ and qk (θk) (k = 1,2,…,K) may generally depend on νY and U although this dependence is omitted here for simplicity. This approximating marginal posterior distribution, qk(θk), is referred to as the variational posterior of θk. The form of qk(θk) is determined such that the Kullback–Leibler divergence between g(θ|νY, U) and q(θ|q1(θ1) q2(θ2)…qK(θK), D(q||g), is minimized , where D(q||g) is defined as
It can be shown that qk(θk) is expressed as
where C is a normalized constant and E-θk[.] indicates the expectations of parameters other than θk that are calculated with respect to every other parameter’s variational posteriors except qk(θk) .
In the varBayes method that applies the variational approximation procedure to the multi-trait Bayesian model considered here, g(θ|ν,Y, U) is assumed to be approximated by a factorized function q(θ) that is written as
where we denote all of the variational posteriors for the different parameters in the right-hand side by q(.) for simplicity with the understanding that q(b), q (Σe) and so on take different forms depending on the parameters. The forms of these variational posteriors are derived from (5) in a manner similar to that used for derivation of the full conditional posteriors for the parameters in Gibbs sampling and the computational details are given in Additional file 2. In the following, we will give the variational posteriors for parameters, γl, gl, Σgl (l = 1,2,…,N), b, Σe and S.
The variational posterior for b, q(b),is a multivariate normal density with mean
and variance-covariance matrix
where E(.) indicates the expectation calculated with respect to the variational posteriors of relevant parameters, while q (Σe) is IWT (n-T-1, Se) with
from which we obtain
The variational posterior of Σgl is represented by IWT (νgl, Sgl), where
The expectation of Σgl-1 with respect to this posterior distribution is
For γl and gl, we consider joint distribution for variational posterior q(γl, gl) that is expressed as
and variance-covariance matrix
The marginal posterior distribution of γl is a binomial distribution with probabilities of γl =1 and 0 given by
The conditional distribution of gl given γl is given by
Therefore, we obtain
From (4), the variational posterior of S is a T-variate Wishart distribution with a scale matrix ΣS and degree of freedom νs, WT(νs, ΣS), where and νs = Nν + T + 1, and the expectation of S is expressed as
As outlined above, a well-known probability distribution, such as normal, inverse Wishart and so on, is assigned to the variational posterior of each parameter, which is characterized by the expectations of the functions of other parameters, taken with respect to their variational posteriors. The relationships between these expectations are given by (6), (8), (9), (13), (14), (15) and (16), from which the expectations can be calculated with numerical iterations to obtain the variational posteriors.
Moreover, the prior probability for a SNP to have zero effect, π, can be treated as a variable parameter to be inferred from the data as in BayesCπ and BayesDπ when 0 < π < 1. Here, we assume a uniform prior on 0 < π < 1 to obtain a Beta distribution, B(a,b), as a variational posterior from (4), where and , with the expectation
Accordingly, the variational posteriors of the other parameters are modified with π substituted by E(π) when π is involved in the Bayesian inference.
Treatment of missing phenotypes
In the phenotypic records for multiple traits, it is common for trait values to be partially missing in some individuals. Missing phenotypes of a trait in individuals can be inferred with the observed phenotypes of other traits in the same individuals. The step for inferring missing phenotypes can be implemented in the MCBayes and varBayes procedures as described below.
When there are missing phenotypes in an individual, the residual vector of the individual, e in model (1), is partitioned into components eo and em corresponding to observed and missing phenotypes, respectively. Following , em can be sampled from the following normal distribution,
where Σmm, Σmo and Σoo indicate the partition of the residual variance-covariance matrix Σe corresponding to em and eo. Accordingly, em is drawn with Gibbs sampling in MCBayes while it is obtained as E(Σmo)E(Σoo)-1eo* in varBayes, where, denoting the component corresponding to the missing traits by subscript ‘o’, eo* is written as
and the expectations can be calculated with the variational posteriors of Σe, b and gl . Missing phenotypes are inferred as the sum of the estimates of b, gl and em, which are used for the construction of the prediction model.
We simulated datasets to evaluate the accuracy of the predicted GBVs using the proposed Bayesian methods, MCBayes and varBayes, for multiple traits. In generating the datasets, three traits, denoted as A, B and C, were considered. The simulation of population and genome was carried out following  where a single trait was treated, while multiple traits were generated through a modified approach. The simulated population, genotypes and phenotypes are described in the following.
Populations with an effective population size of 100 were maintained by random mating for 1000 generations to attain mutation drift balance and LD between SNPs and QTLs. In generation 1001 and 1002, the population size was increased to 1000. The population in the 1001st generation was treated as a training population, where the phenotypes of three traits and SNP genotypes of the individuals were simulated and analyzed to estimate the SNP effects in the model. The phenotype of each trait for each individual in the 1001st generation was given as the sum of QTL effects over the polymorphic QTLs and environmental effects, which were sampled as described later. For simplicity, no other fixed effects were assumed. The population in the 1002nd generation was used as a test population, where the individuals were only genotyped for SNP markers without phenotypic records and GBVs of three traits were predicted for each individual using a model with SNP effects estimated based on the training population in the 1001st generation. The true breeding value (TBV) of the individual in the 1002nd generation was also simulated as the sum of QTL effects corresponding to the QTL genotype for each trait and used for evaluating the accuracy of predicted GBV, but was regarded as unknown and unavailable in the estimation of SNP effects in the models. Accuracy was measured based on the correlation between the TBV and predicted GBV, rTBV,pGBV, for each trait and regression of TBV on predicted GBV, bTBV,pGBV, was also obtained for assessing the bias of the prediction.
The genome was assumed to consist of 10 chromosomes each 100cM in length. Two scenarios were considered for the number of available SNP markers and the datasets under these two scenarios were denoted as Data I and Data II. In Data I, 101 marker loci were located every 1cM on each chromosome for a total of 1010 markers on a genome. In Data II, 1010 equidistant marker loci were located on each chromosome for a total of 10100 markers. We assumed that 100 equidistant QTLs were located on each chromosome such that a QTL was in the middle between two marker loci in both Data I and Data II. Therefore, there were a total of 1000 QTLs located on a whole genome. The mutation rates assumed per locus per meiosis were 2.5 × 10-3 and 5.0 × 10-5 for the marker locus and QTL, respectively. At least one mutation occurred in the most of the marker loci with a high mutation rate during the simulated generations. In the marker loci experiencing more than one mutation, the mutation remaining at the highest minor allele frequency (MAF) was regarded as visible, whereas the others were ignored, which resulted in the marker loci having two alleles similar to SNP markers. Although the mutation rate for QTL was assumed 2.5 × 10-5 in the simulation for a single trait conducted in , we here doubled it for generating TBVs of multiple traits for the reason as described below.
The polymorphic QTLs at which mutation occurred were used to simulate the three traits, A, B and C, the heritabilities of which, denoted by hA2, hB2 and hC2, respectively, were assumed to be hA2 = 0.8, hB2 = 0.1 and hC2 = 0.1. We divided all polymorphic QTLs into four groups, Group1, Group2, Group3 and Group4, according to the causal relationship with traits, where Group1 was a group of QTLs affecting both traits A and B, Group2 was a group of QTLs affecting only trait A, Group3 was a group of QTLs affecting only trait B and Group4 was a group of QTLs affecting only trait C. Therefore, it was assumed that QTLs of Group1 had pleiotropic effects on traits A and B while QTLs of other groups affected only a single trait. Each polymorphic QTL was randomly assigned to one of these groups, that is, to Group1, Group2, Group3 and Group4 with respective probabilities 0.4, 0.1, 0.1 and 0.4. Therefore, 80% of the QTL loci affecting trait A were shared by trait B on average and whereas all the QTL loci affecting trait C influence neither traits A nor B. In this setting of multi-trait case, the mutation rate of QTL was increased to 5.0 × 10-5 from 2.5 × 10-5, the mutation rate adopted in  for generating a single trait, to retain the number of QTL per trait.
The pleiotropic effects of each QTL in Group1 were assumed to be correlated between traits A and B with correlation coefficient of 0.9. Consequently, genetic correlations between traits A and B, between A and C and between B and C, denoted as ρGAB, ρGAC and ρGBC, respectively, were ρGAB = 0.72 and ρGAC = ρGBC = 0 on average although the values of correlation coefficients were somewhat fluctuated in each data generation. This setting of traits was adopted to investigate how the prediction accuracy was increased for a low-heritability trait (trait B) by simultaneous analysis for multiple traits including a correlated high-heritability trait (trait A).
The effects of QTL alleles were sampled from gamma distributions independently for each QTL. Pleiotropic effects of QTL alleles in Group1 were determined for traits A and B by generating two correlated gamma random variables, x and y, with the correlation coefficient of 0.9, and assigning them with positive or negative values with equal probabilities. The effects of QTL alleles in the other groups, which affected each single trait only, was determined by sampling a gamma random variable z and by similarly assigning it with the positive or negative sign. We generated these three random variables x, y and z such that their marginal distributions were the same gamma distribution with scale parameter 0.4 and shape parameter 1.66, Gamma(0.4,1.66). For obtaining correlated variables x and y, we generated three independent gamma variables, x1, x2 and x3, which were sampled from Gamma(0.36,1.66), Gamma(0.04,1.66) and Gamma(0.04,1.66), respectively, and determined the values of x and y as x = x1 + x2 and y = x1 + x3. It can be shown that x and y had a correlation coefficient of 0.9 and the same marginal distribution Gamma(0.4,1.66) .
The environmental correlation coefficients between three traits were denoted by ρEAB, ρEAC and ρEBC, respectively, and assumed as ρEAB = 0.1, ρEAC = 0.2 and ρEBC = 0.3. The environmental effects were sampled from a trivariate normal distribution with a mean vector 0 and a variance-covariance matrix RE, where
with σEA2, σEB2 and σEC2 indicating environmental variances of three traits. Environmental variance of trait A was given by σEA2 = (1/hA2-1)σGA2 with its heritability hA2 and genetic variance σGA2, which was variance of TBV of trait A, and those of traits B and C were obtained similarly. The environmental effects were added to TBVs which were given by the sum of QTL effects to determine phenotypic values of three traits for individuals in the 1001st generation.
In Data I, 100 replicated datasets were simulated while Data II consisted of 20 replicated datasets due to the larger number of SNPs. Each of replicated datasets included records of phenotypes of three traits and genotypes of SNPs for the training population (1001st generation) and only SNP genotypes for the test population (1002nd generation). To simulate the situation of missing phenotypes, we generated additional datasets by deleting the phenotypic records of some traits for some individuals in the 100 replicated training datasets of Data I. These 100 replicated datasets were referred to as Data III, where the phenotypic records of traits A, B and C were respectively deleted for individuals of i = 801–1000, individuals of i = 1-500 and individuals of i = 201-700 in 1000 individuals (i = 1-1000) of the 1001st generation of Data I. Therefore, in Data III, the prediction model for GBVs was constructed with a training dataset consisting of the phenotypes of 800, 500 and 500 individuals for traits A, B and C, respectively, where only 100 individuals (i = 701-800) had phenotypic records of all three traits, and 1000 individuals in the 1002nd generation were predicted for GBV based on the genotypes of 1010 markers. The setting of non-zero environmental correlations, i.e., ρEAB = 0.1, ρEAC = 0.2 and ρEBC = 0.3, was adopted here to assess the benefit from the estimation process of missing phenotypes implemented in multi-trait analyses for the prediction accuracy, where the information of environmental covariance between observed and missing phenotypes was utilized as in (18).
Each replicated dataset in Data I, Data II and Data III was analyzed using the proposed methods for multiple traits, MCBayes and varBayes, to construct the GBV prediction model in the1001st generation and investigate rTBV,pGBV and bTBV, pGBV for each trait in the 1002nd generation. For a comparison with conventional single-trait GBV prediction, the same datasets were also analyzed with the single-trait setting of MCBayes where three traits were separately treated without considering the correlation structure between traits.
We conducted cross-validation as well to evaluate the prediction performance within a population in the 1001st generation without the 1002nd generation since the techniques of cross-validation have commonly been used for evaluation of the accuracy in the studies of genomic selection with the actual datasets of animals [24-26] and plants [27-29], where the prediction accuracy of the model for the unobserved future samples was of concern. We applied 10-fold cross-validation. In brief, we randomly split a population of 1000 individuals in the 1001st generation into ten subpopulations with each size 100. By using a single subpopulation used as a test set and the remaining nine subpopulations as a training set to construct prediction model, the prediction accuracy and bias were assessed for the test set. This process was repeated ten times until all single subpopulations were used as test sets exactly once. We averaged the prediction accuracy and bias over ten repetitions to evaluate the prediction performance in each dataset. Because of much computational burden with MCMC analysis for repeated model constructions in cross-validation, evaluation with 10-fold cross-validation was carried out for Data I only, where mean of the prediction accuracies and biases in 100 datasets were obtained as rTBV,pGBV and bTBV, pGBV for each trait and for each prediction method. In the process of cross-validation, we also investigated the correlation between predicted GBV and the phenotypic value, ry,pGBV, in place of TBV, and regression of predicted GBV on phenotypic value, by, pGBV, for each trait.
Two settings for the prior probability that a SNP has zero effect, π, were adopted in model construction, i.e., π was fixed at 0 or π was variable over the range 0 < π <1 and inferred from the data. The values of hyperparameter ν, degree of freedom of prior inverse Wishart distribution of Σgl, were set to 5.0 and 3.2 corresponding to π =0 and 0 < π <1 after preliminary analyses to evaluate the effects of the values of ν on prediction accuracy over 3 < ν <6 although the accuracy was little affected by the values of ν.
In the MCMC iteration of MCBayes, we repeated 11000 cycles including a burn-in period of the first 1000 cycles. The values of parameters were sampled every 10 cycles to obtain the posterior means that were used to determine a prediction model for each generated dataset. In the method of varBayes, we adopted the criterion < 10-8 for convergence in numerical iteration for computing the expectations of parameters, where and are the current and previous value of the expectations of parameters. When this criterion was satisfied, the computational procedure with variational approximation was regarded as converged.
We evaluated the accuracy and bias of the predicted GBVs, rTBV,pGBV and bTBV, pGBV, obtained by the proposed methods for genomic selection of multiple traits including MCBayes and varBayes in 100 simulated datasets of Data I and Data III and in 20 datasets of Data II in comparison with the prediction accuracy with single-trait MCBayes analysis. The means of accuracies and biases of prediction evaluated using a test population in the 1002nd generation are summarized in Table 1, Table 2 and Table 3 for Data I, Data II and Data III, respectively. In all of datasets, MCBayes provided higher prediction accuracy than varBayes in multi-trait analysis (Tables 1, 2 and 3). In Data I, rTBV,pGBV obtained with multi-trait MCBayes analysis was 0.788 (0.051), 0.581 (0.103) and 0.453 (0.090) for traits A, B and C, respectively, when π was fixed at 0, where standard errors were given in parenthesis here and hereafter, while that was 0.753 (0.060), 0.580 (0.117) and 0.364 (0.137) when π was varied and inferred. In Data II with the number of SNPs increased to 10100, rTBV,pGBV with multi-trait MCBayes analysis was higher at 0.902 (0.032), 0.706 (0.103) and 0.519 (0.097) for traits A, B and C, respectively, when π was fixed at 0 while that was 0.868 (0.047), 0.731 (0.120) and 0.401 (0.182) when π was inferred. With multi-trait varBayes analysis, rTBV,pGBV was 0.754 (0.061), 0.570 (0.113) and 0.383 (0.117) for traits A, B and C with π =0 and that was 0.716 (0.070), 0.548 (0.122) and 0.347 (0.131) with π variable in Data I while that was 0.859 (0.049) , 0.656 (0.110) and 0.438 (0.074) with π =0 and 0.838 (0.061) , 0.678 (0.140) and 0.330 (0.157) with π variable in Data II. The prediction with multi-trait MCBayes was almost unbiased in Data I, where bTBV,pGBV was near 1, but was biased in Data II for traits B and C (Tables 1 and 2). The analysis with varBayes showed greater bias than with MCBayes.
Table 1. Accuracy and bias of predicted GBVs in Data I
Table 2. Accuracy and bias of predicted GBVs in Data II
Table 3. Accuracy and bias of predicted GBVs in Data III
In single-trait analysis where MCBayes was applied for a single-trait model, rTBV,pGBV was comparable with that of multi-trait analysis for high-heritability trait A while it was significantly decreased for low-heritability trait B which was highly correlated with trait A (Table 1 and Table 2). For trait C which had same heritability as trait B but was genetically independent of trait A (ρGAC = 0.0), multi-trait MCBayes analysis gave the accuracy comparable to single-trait MCBayes analysis when π =0 while, when π was varied and inferred, single-trait analysis showed considerably higher accuracy than multi-trait MCBayes and varBayes analyses as shown in Table 1 and Table 2. The prediction with single-trait analysis was almost unbiased In Data I, but that was likely to be biased in Data II with bTBV,pGBV much deviated from 1 especially when π was varied and inferred.
In Data III with missing phenotypes, which was derived from Data I by removing some phenotypic records, rTBV,pGBV was decreased for all traits in all methods due to the smaller sample size in comparison with Data I (Table 3). The rate of decrease in rTBV,pGBV was greater for traits B and C than A as the greater reduction of the sample size was made in B and C. Multi-trait MCBayes analysis provided higher prediction accuracy than multi-trait varBayes analysis for all traits except for trait C withπ varied and inferred. For trait B, multi-trait analysis showed higher prediction accuracy than single-trait analysis, but gave less accurate prediction for trait C. For the bias of predicted GBV, bTBV, pGBV, the MCBayes analysis with π =0 was almost unbiased for both multi-trait and single-trait analyses, where bTBV, pGBV ranged 0.931 to 0.998, although the standard error of bTBV, pGBV was increased for low-heritability traits B and C, for which the MCBayes analyses with π varied and varBayes showed much biased prediction.
We listed rTBV, pGBV and bTBV, pGBV obtained by cross-validation conducted in a population of the 1001st generation of Data I as well as ry,pGBV and by,pGBV, correlation between predicted GBV and phenotype and regression of phenotype on predicted GBV, in Table 4. The prediction accuracy evaluated with cross-validation was considerably higher than that evaluated with a test population in the next generation, with the prediction bias being similarly for both evaluated with cross-validation and use of a test population. The relative merits in performance of prediction between the methods were also similar for both evaluations. It was shown in Table 4 that the relational expression, rTBV,pGBV = ry,pGBV/h, held with h being a square-root of heritability and that bTBV,pGBV was almost the same as by,pGBV.
Table 4. Accuracy and bias of predicted GBVs evaluated with cross-validation in Data I
Correlation coefficients between predicted GBVs of trait A, B and C were listed in Table 5 for multi-trait MCBayes and varBayes analyses and single-trait MCBayes analysis in all datasets as well as the correlation coefficients between simulated breeding values (TBVs) of three traits in the 1002nd generation. Although the correlation of simulated breeding values between traits A and B was expected to be 0.72, the actual correlations obtained were biased upwards: 0.755 (0.133) and 0.735 (0.166) in Data I (III) and Data II, respectively. Multi-trait analysis could better estimate these correlations compared to single-trait analysis as seen in the correlation between predicted GBVs between traits A and B. For trait C which was genetically independent of A and B, the correlations between predicted GBVs for traits A and C and traits B and C did not significantly deviate from zero (Table 5).
Table 5. Correlations between predicted GBVs for trait A, B and C
In this study, we proposed Bayesian methods for simultaneously predicting GBVs for multiple traits, where two computational procedures were devised using MCMC iteration and variational approximation, referred to as MCBayes and varBayes, respectively. A Bayesian model for simultaneously analyzing multiple traits was obtained by extending a Bayesian model for single-trait genomic selection proposed by  and  to multi-trait case. We introduced the prior probability that a SNP has zero effect, π, and accordingly, MCBayes with π fixed at 0, meaning that all SNPs are included in a model as covariates, constructs a model for multi-trait GBV prediction in a similar manner to BayesD. On the other hand, the MCMC procedure of MCBayes with π variable and inferred from the data is not exactly the same as that of BayesDπ, where a prior distribution of the variance and covariance matrix of SNP effects, Σgl, is assumed to be independent of whether the SNP is included (γl = 1) or excluded(γl = 0) from the model in MCBayes, as seen in (4), while it is dependent onγl and takes different forms for γl = 1 and γl = 0 in BayesDπ. This modification allows MCMC iteration of MCBayes to be performed with only Gibbs sampling.
In the simultaneous analysis of multiple traits for constructing a GBV prediction model, the computational burden greatly increases depending on the number of analyzed traits in comparison with single-trait analysis. We developed a variational approximation procedure, varBayes, for MCBayes to reduce the computational time for multi-trait analysis. In varBayes, the joint posterior distribution of parameters was approximated by a factorized function, each component of which approximated marginal posterior distribution of each parameter and was referred to as a variational posterior. Variational posteriors were shown to be well-known distribution functions such as normal or inverse Wishart that could be derived by simple non-MCMC based numerical iteration.
In genomic selection, it is important to construct a model that enables accurate prediction for GVBs. Therefore, precise point estimation of the model parameters is more relevant rather than the construction of their posterior distributions. Accordingly, the evaluation of loss of prediction accuracy with varBayes in comparison with MCBayes would be suitable for the evaluation of approximation accuracy of varBayes. Using simulation experiments, we investigated the performance of the prediction model constructed with multi-trait analysis compared with single-trait analysis as well as the model constructed using variational approximation. Moreover, the performance of multi-trait analysis in the case of missing phenotypic records commonly occurring in the treatment of the actual data of multiple traits were evaluated based on the results of simulations. These points are discussed below including the computational time and the possible extension of prediction model considering polygenic effects.
Increase in accuracy for GBV prediction with multi-trait analysis
We evaluated the increase in prediction accuracy with multi-trait analysis in comparison with single-trait analysis using the datasets without missing phenotypes, Data I and Data II (Table 1 and Table 2). For trait A having heritability 0.8, multi-trait analysis and single-trait analysis made no difference in the prediction accuracy with MCBayes. Therefore, the advantage of multi-trait analysis over single-trait analysis in predicting GBV is negligible for high-heritability traits. However, we anticipate the increase in accuracy with multi-trait analysis for low-heritability traits utilizing correlations with high-heritability traits. Actually, for low-heritability trait B with heritability 0.1, which has highly correlated with trait A (ρGAB = 0.72), prediction accuracy was increased with multi-trait analysis. As trait C was not genetically correlated with trait A, the accuracy of predicting the GBV of C was not improved with multi-trait MCBayes analysis. The accuracy of predicted GBV of trait B was also increased with multi-trait varBayes analysis in Data I and Data II in comparison with single-trait MCBayes analysis while the prediction accuracy of trait C was lower with multi-trait varBayes analysis (Table 1 and Table 2). The genetic correlations between traits A and B were better estimated with predicted GBVs in multi-trait analysis than in single-trait analysis as shown in Table 5. Therefore, it can be concluded that low-heritability traits (heritability around 0.1) are better predicted for GBVs utilizing their correlations with high-heritability traits (heritability around 0.8), if any, with multi-trait analysis. However, the benefit of multi-trait analysis would be subtle for high-heritability traits. A similar finding was also reported in . For uncorrelated low-heritability traits such as trait C, it is likely that the prediction using multi-trait analysis with π varied and inferred is less effective in comparison with single-trait analysis. Therefore, multi-trait analysis with π fixed at 0, which allows highly accurate prediction for correlated traits while retaining prediction accuracy comparable with single-trait analysis for uncorrelated low-heritability traits, would be a suitable method of choice for multi-trait genomic selection.
Approximation accuracy of variational procedure for MCMC estimation
Generally, constructing a GBV prediction model with MCMC estimation based on genotypic records for tens of thousands of SNPs and phenotypes for hundreds of individuals requires considerable computational time even for single-trait cases. Much more computational burden would be imposed in constructing a model in multi-trait analysis, depending on the number of traits of interest. Therefore, we proposed a computationally cost-effective method, varBayes, approximating MCMC based method, MCBayes, using a variational approximation procedure.
Simulation experiments showed that the prediction accuracy was lower with varBayes than with MCBayes in multi-trait analysis but the rate of loss of accuracy was not remarkable and was less than 10 percent for traits A and B under the same setting of π while it was greater for C (Table 1 and Table 2). The prediction accuracy for trait B correlated with high-heritability trait A was still higher with multi-trait varBayes analysis than with single-trait MCBayes analysis indicating that varBayes could well utilize the information on the correlation structure in multiple traits. Actually, multi-trait varBayes analysis could better capture the genetic correlation between A and B than single-trait MCbayes analysis (Table 5).
The computational time was greatly reduced for multi-trait varBayes analysis in comparison with multi-trait MCBayes analysis. We carried out all computations using a Fortran program written to implement multiple-trait analysis on a computer having two CPUs each with a quad-core processor (Intel Xeon 2.4GHz). In Data I, where 100 replicates of datasets each including genotypes of 1010 SNPs for 1000 individuals were simulated, varBayes took only 12 minutes with π fixed and 22 minutes with π varied for 100 times of model constructions while the computational time for MCBayes was respectively 440 and 435 minutes. Therefore, the average computational time required to construct a prediction model for each dataset in Data I was less than 15 seconds for varBayes and was more than 4 minutes for MCBayes. For a larger data, Data II, that included 20 replicates of datasets each consisting of genotypes of 10100 SNPs for 1000 individuals, the computational time was 21 and 19 minutes for varBayes with π fixed and π varied, respectively, and the time for MCBayes were increased to more than 12 hours with the average computational times for each model construction being about 1 minute for varBayes and more than 30 minutes for MCBayes. In cross-validation process in which a total of one thousand repetitions of model constructions were performed in Data I, the computational time was about 3days with MCBayes while varBayes completed the analysis within only 4 hours.
Taking computational time and prediction accuracy into account, varBayes is considered a useful method for multi-trait genomic selection, which can rapidly construct a prediction model that is less accurate than that with the MCMC-based method for multi-trait analysis, but is more accurate than that with single-trait analysis for correlated traits. The usefulness of varBayes would be more remarkable for simultaneous prediction of GBVs of a large number of traits based on a huge number of SNPs where the application of an MCMC-based method might be prohibited.
Multi-trait analysis of dataset with missing phenotypes
In Data III, we simulated the datasets under the same condition as Data I except that some phenotypes were assumed to be unobserved. In short, we assumed that phenotypes of traits A, B and C were not available for 200, 500 and 500 individuals, respectively, in a total of 1000 individuals with only 100 individuals having the phenotypes of all three traits. In multi-trait analysis, missing phenotypes of individuals can be estimated with their observed phenotypes of other traits using (18), which indicates that residual effects of missing phenotypes can be restored from those of observed phenotypes. When the model fitting is successful for observed phenotypes, the residual effects of the phenotypes are well estimated by subtracting SNP effects and other fixed effects from the phenotypic effects, and those of missing phenotypes are suitably obtained by (18) utilizing the environmental correlation (covariance) between observed and missing phenotypes. Therefore, by assuming non-zero environmental correlation between traits (ρEAB = 0.1, ρEAC = 0.2, ρEBC = 0.3), the loss of prediction accuracy caused by missing phenotypes was anticipated to be less with multi-trait analysis than with single-trait analysis. We expected that, in Data III, the prediction accuracy of trait C, environmentally correlated with trait A (ρEAC = 0.2), was maintained higher in the presence of missing records for some phenotypes using multi-trait analysis in comparison with single-trait analysis. However, the rate of loss for the prediction accuracy for trait C was similar between multi-trait and single-trait analyses as seen in the comparison of the prediction accuracies between Data I and Data III (Tables 1 and 3). The utility of implementation of (18) for imputing missing phenotypes in the process of multi-trait model construction remains unclear in the settings of simulation adopted here.
Model extension by including polygenic effects
We can modify the Bayesian model (1) by including polygenic effects as follows;
where vi is a vector of polygenic effects for multiple traits and assumed to follow a multivariate normal distribution, vi ~ N(0, Σv) with Σv being a variance-covariance matrix of polygenic effects. The polygenic effects for all individuals of a training population and a test population are collectively denoted by V, then the variance-covariance matrix of V can be expressed as A ⊗ Σv, where A is additive genetic relationship matrix for analyzed individuals, computed from the information of pedigree or markers. When the low-density markers are used for the analysis, a considerable portion of genetic effects could not be captured by markers. Accordingly, if pedigree information is available, inclusion of polygenic effects estimated based on pedigree is beneficial in predicting breeding values for genomic selection. The revised model (19) can be similarly treated by MCBayes and varBayes as the model (1), where estimation steps for additional covariates and parameters, V and Σv, can be easily implemented in the procedures of Bayesian multi-trait model construction using MCMC iteration and variational approximation. When no pedigree information is available, A is computed from marker information as a genomic relationship matrix. However, the model (19) includes all available markers as covariates, resulting in redundantly using the same marker information in the model fitting. In the availability of high-density SNP markers, which is the case for some species of animals and plants currently, the genetic relationships between individuals can be well captured by markers themselves in the multi-locus model (1), the benefit from the inclusion of the polygenic effects in the model would be subtle .
In this study, we described a statistical model for Bayesian simultaneous prediction of GBVs in genomic selection targeting multiple traits and devised an MCMC-based method and a computationally cost-effective method utilizing the variational approximation procedure, referred to as MCBayes and varBayes, respectively, to estimate parameters included in the model. The results of simulation experiments showed that the multi-trait analysis that could utilize the correlation structure between traits allowed more accurate prediction of GBVs for correlated traits compared to single-trait analysis that treated each trait separately, where, for low-heritability traits correlated with high-heritability traits, the prediction accuracy for GBVs was remarkably improved with multi-trait analysis. Although the prediction accuracy with varBayes was lower than that with MCBayes in multi-trait analysis, the rate of loss in accuracy was moderate and the accuracy for correlated low-heritability traits was still higher with varBayes analysis compared to single-trait analysis. Considering the benefit of greatly reduced computational time, varBayes was considered to be a practical method for predicting GBVs in multi-trait genomic selection.
The authors declare that they have no competing interests.
TH devised Bayesian prediction methods for the genomic selection of multiple traits, developed a program for simulations and drafted the manuscript. HI assisted in developing a program and drafted the final manuscript. Both authors read and approved the final manuscript.
This study was supported by the Grant-in-Aid for Scientific Research (B) of Japan Society for the Promotion of Science (Grant No. 22380010).
J Am Stat Assoc 1993, 88:881-889. Publisher Full Text
Attias H: Inferring parameters and structure of latent variable models by variational Bayes. San Francisco, CA: Morgan-Kaufmann; 1999::21-30. [Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence]
Ann Inst Stat Math 1992, 44:97-106. Publisher Full Text
Saatchi M, McClure MC, McKay SD, Megan M, Rolf MM, Kim JW, Decker JE, Taxis TM, Chapple RH, Ramey HR, et al.: Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation.
Crossa J, Delos Campos G, Pérez P, Gianola D, Burgueño J, Araus JL, Makumbi D, Singh RP, Dreisigacker S, Yan J, et al.: Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers.
Crop Sci 2011, 51:2597-2606. Publisher Full Text
Hum Genet 2012, 76:510-523. Publisher Full Text