Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, Tjele, DK-8830, Denmark
School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599-7264, USA
Trans Ova Genetics, Sioux Center, Sioux, IA, 51250, USA
Abstract
Background
To understand the genetic architecture of complex traits and bridge the genotype-phenotype gap, it is useful to study intermediate omics data, e.g. the transcriptome. The present study introduces a method for simultaneous quantification of the contributions from single nucleotide polymorphisms (SNPs) and transcript abundances in explaining phenotypic variance, using Bayesian whole-omics models. Bayesian mixed models and variable selection models were used and, based on parameter samples from the model posterior distributions, explained variances were further partitioned at the level of chromosomes and genome segments.
Results
We analyzed three growth-related traits: Body Weight (BW), Feed Intake (FI), and Feed Efficiency (FE), in an F_{2} population of 440 mice. The genomic variation was covered by 1806 tag SNPs, and transcript abundances were available from 23,698 probes measured in the liver. Explained variances were computed for models using pedigree, SNPs, transcripts, and combinations of these. Comparison of these models showed that for BW, a large part of the variation explained by SNPs could be covered by the liver transcript abundances; this was less true for FI and FE. For BW, the main quantitative trait loci (QTLs) are found on chromosomes 1, 2, 9, 10, and 11, and the QTLs on 1, 9, and 10 appear to be expression quantitative trait loci (eQTLs) affecting gene expression in the liver. Chromosome 9 provides an example of an apparent eQTL: the genomic variance disappears, and a trimodal distribution of genomic values collapses, when gene expressions are added to the model.
Conclusions
With increased availability of various omics data, integrative approaches are promising tools for understanding the genetic architecture of complex traits. Partitioning of explained variances at the chromosome and genome-segment level clearly separated regulatory from structural genomic variation, as the areas where SNP effects respectively disappeared or remained after adding transcripts to the model. The models that include transcripts explained more phenotypic variance and were better at predicting phenotypes than a model using SNPs alone. The predictions from these Bayesian models are generally unbiased, validating the estimates of explained variances.
Background
Large amounts of genomic information generated from Single Nucleotide Polymorphism (SNP) microarrays have become available in recent years for many species
However, in the eQTL approach, associations between SNPs, transcript level, and phenotypes are analyzed individually. This is likely to lead to “missing heritability”
Bayesian variable selection (BVS) models were chosen for their ability to separate markers with large, moderate, or small effects and to locate the important regions in the genome or transcriptome, which makes them a better QTL mapping method because they produce clearer QTL signals.
The aim of this study was to explore the contributions of various sources of variation, such as population structure, SNP variants, and gene expression levels, to a set of growth-related traits (body weight, feed intake, and feed efficiency) in mice. These traits are very important, both in terms of agricultural production and for obesity in humans. Bayesian mixed models and Bayesian variable selection models were applied to model pedigree, SNPs, and/or gene expressions and to derive explained variances for these components. In addition, they were used to partition the variance explained by SNPs and gene expressions by chromosome and genome section. To validate the estimates of explained variances, the predictive ability of these models was studied using cross-validation.
Data
An M16 × ICR F_{2} population of 440 mice was available with complete records for body weight at 8 weeks (BW) and 337 records for feed intake (FI) and feed efficiency (FE), measured during the period 3 weeks to 8 weeks
Figure S3. Distribution of phenotypes of traits Body Weight including 440 animals, Feed Intake and Feed Efficiency including 337 animals each.
Methods
The most complete model used describes phenotypes y (BW, FI, or FE) by an intercept μ, environmental effects of batch and sex b, a polygenic effect based on pedigree u, regressions on SNP covariates a, regressions on gene expression covariates g, and a model residual e, as:
y = 1μ + Xb + Zu + Wa + Qg + e    (1)
where X is the design matrix for batch and sex effects, Z is a design matrix that links polygenic effects to the observed records, W is a matrix with 1806 SNP covariates, and Q is a matrix with 23,698 gene expression covariates. The SNP and gene expression covariates were centered and scaled to unit variance.
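The structure of the complete model can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, toy dimensions, SNP genotypes coded 0/1/2, sex and batch as 0/1 dummies, and, for simplicity, independent polygenic effects instead of a pedigree-based covariance. The helper name `standardize` is hypothetical.

```python
import numpy as np

def standardize(M):
    """Center each column and scale it to unit variance.

    Uses the population variance (ddof=0); which divisor the study used is not stated.
    """
    M = np.asarray(M, dtype=float)
    sd = M.std(axis=0)
    sd[sd == 0] = 1.0  # constant columns stay at zero after centering
    return (M - M.mean(axis=0)) / sd

rng = np.random.default_rng(0)
n, n_snp, n_gex = 50, 20, 30  # toy sizes, not 440 x 1806 x 23,698

W = standardize(rng.integers(0, 3, size=(n, n_snp)))  # SNP covariates (assumed 0/1/2 coding)
Q = standardize(rng.normal(size=(n, n_gex)))          # gene expression covariates
X = np.column_stack([rng.integers(0, 2, n),           # sex dummy
                     rng.integers(0, 2, n)])          # batch dummy
Z = np.eye(n)                                         # one record per animal

mu = 30.0
b = np.array([2.0, -1.0])
u = rng.normal(scale=1.0, size=n)      # assumption: independent, not pedigree-structured
a = rng.normal(scale=0.2, size=n_snp)
g = rng.normal(scale=0.2, size=n_gex)
e = rng.normal(scale=1.0, size=n)

# Phenotypes: intercept + batch/sex + polygenic + SNP + expression + residual
y = mu + X @ b + Z @ u + W @ a + Q @ g + e
```

In a real analysis u would follow the pedigree relationship structure and all effects would be estimated, not simulated.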
Based on work of
The explained variance in y from (1) is var(Zu) + var(Wa) + var(Qg) + var(e). To obtain posterior means (PMs) and posterior standard deviations (PSDs) on the explained variances for SNPs and gene expressions, var(Wa) and var(Qg) were evaluated based on the posterior samples for a and g from the MCMC, i.e., as the PM and PSD of var(Wa^{t}) values over MCMC cycles, where a^{t} is the posterior sample for a from MCMC cycle t.
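The evaluation of var(Wa^{t}) over MCMC cycles can be sketched as follows. This is an illustrative sketch under assumptions: it uses NumPy, takes the variance across records with the sample divisor (ddof=1), and the function name `explained_variance_summary` is hypothetical. The same function applies to Q and samples of g.

```python
import numpy as np

def explained_variance_summary(W, samples):
    """Posterior mean and posterior SD of var(W a^t) over MCMC cycles.

    W       : (n_records, n_covariates) covariate matrix
    samples : (n_cycles, n_covariates) posterior samples, row t holding a^t
    """
    W = np.asarray(W, dtype=float)
    samples = np.asarray(samples, dtype=float)
    fitted = samples @ W.T                      # row t holds the vector W a^t
    per_cycle = fitted.var(axis=1, ddof=1)      # var(W a^t) for each cycle t
    return per_cycle.mean(), per_cycle.std(ddof=1)

# Toy example: 3 records, 1 SNP, 3 MCMC cycles
pm, psd = explained_variance_summary([[-1], [0], [1]], [[1], [2], [3]])
```

Because the variance is computed on the fitted values W a^{t}, covariances between SNPs in LD are included automatically, which a sum of per-SNP variances would miss.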
The second model used was a Bayesian variable selection model, where the approach of George and McCulloch
where
From the posterior samples for a and g in the variable selection model, explained variances were computed and partitioned by chromosome and by genome section. The variable selection model is better suited to such a partitioning because, unlike the mixed model version, it allows for different variance contributions per SNP. The explained variances were evaluated in the same way as for the mixed model, by evaluating var(Wa^{t}) and var(Qg^{t}) over MCMC cycles t.
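Partitioning by chromosome or genome section amounts to restricting W and a^{t} to the columns of one group before taking the variance. The sketch below assumes NumPy, a sample-variance divisor, and overlapping sliding windows; whether the study used overlapping or disjoint windows is not stated, and both function names are hypothetical.

```python
import numpy as np

def partition_explained_variance(W, samples, groups):
    """Posterior mean of var(W_k a_k^t) for each group k of covariate columns.

    groups maps a label (e.g. a chromosome name) to a list of column indices.
    """
    W = np.asarray(W, dtype=float)
    samples = np.asarray(samples, dtype=float)
    result = {}
    for label, cols in groups.items():
        fitted = samples[:, cols] @ W[:, cols].T    # (n_cycles, n_records)
        result[label] = float(fitted.var(axis=1, ddof=1).mean())
    return result

def sliding_windows(n_covariates, width=5):
    """Overlapping windows of `width` adjacent covariates (assumed layout)."""
    return {f"win{start}": list(range(start, start + width))
            for start in range(n_covariates - width + 1)}

# Toy example: 3 records, 2 SNPs on different chromosomes, 2 MCMC cycles
W_toy = [[-1, 1], [0, -2], [1, 1]]
a_toy = [[1, 0], [1, 2]]
by_chr = partition_explained_variance(W_toy, a_toy, {"chr1": [0], "chr2": [1]})
windows = sliding_windows(7, width=5)
```

Group variances computed this way need not sum to the genome-wide var(Wa^{t}), because covariances between groups are left out.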
It is difficult to choose an optimal window size, as it depends on the extent of LD, marker density, and an arbitrary cutoff for what is considered important LD. In the data analyzed here, average R^{2} between adjacent SNPs was 0.55, and average R^{2} between SNPs two apart was 0.39, which we considered sufficiently high to warrant computation of variances in a 5-SNP window. To study the relative importance of family structure, SNPs, and gene expressions, six sub-models and the complete model (1) were used. These were models that use only pedigree information (PED), only SNP data (SNP), only gene expression data (GEX), SNP + GEX, PED + GEX, PED + SNP, and the complete model PED + SNP + GEX. These models always included sex and batch effects.
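The average R^{2} between SNPs a given number of positions apart can be computed as sketched here. The sketch assumes NumPy, SNP columns ordered by map position, and squared Pearson correlation of genotype codes as the LD measure; for brevity it does not exclude pairs that straddle chromosome boundaries, and the function name `mean_r2_at_lag` is hypothetical.

```python
import numpy as np

def mean_r2_at_lag(W, lag=1):
    """Average squared correlation between SNP columns `lag` positions apart.

    Columns are assumed to be ordered by map position and to be polymorphic
    (a constant column would give an undefined correlation).
    """
    W = np.asarray(W, dtype=float)
    r2 = []
    for j in range(W.shape[1] - lag):
        r = np.corrcoef(W[:, j], W[:, j + lag])[0, 1]
        r2.append(r * r)
    return float(np.mean(r2))

# Toy genotypes: 3 animals x 3 SNPs in complete LD
G = [[0, 0, 2], [1, 1, 1], [2, 2, 0]]
adjacent = mean_r2_at_lag(G, lag=1)
two_apart = mean_r2_at_lag(G, lag=2)
```

Calling it with `lag=1` and `lag=2` corresponds to the adjacent and two-apart averages discussed above.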
The predictive ability of the models was evaluated using 11-fold cross-validation. For body weight, the 440 records were divided randomly into 11 groups of 40 individuals each. Feed intake and feed efficiency, with 337 records in total, were randomly divided into 10 groups of 30 records and one group of 37 records. The complete model, including all variance parameters, was re-estimated on each set of 10 folds and predictions were computed for the phenotypes in the remaining 11^{th} fold. All predictions from the 11-fold cross-validation were collected to compute, over the whole data set, correlations between predicted and actual phenotypes and regressions of predicted phenotypes on actual phenotypes. The slope of these regression lines is expected to be 1 if the model produces unbiased predictions, which would validate the estimates of explained variances. The University of Nebraska Institutional Animal Care and Use Committee approved all procedures and protocols.
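The fold layout and the two summary statistics can be sketched with the standard library alone. This is an illustration under assumptions: the seed, the rule that the last fold absorbs the remainder, and the callback `fit_predict` (standing in for re-estimating the model on the training folds) are all hypothetical. `ols_slope(x, y)` returns the slope of y regressed on x, so either regression orientation can be computed.

```python
import random
from statistics import mean

def make_folds(n_records, n_folds=11, seed=0):
    """Randomly split record indices into folds; the last fold takes the remainder."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    size = n_records // n_folds
    folds = [idx[k * size:(k + 1) * size] for k in range(n_folds - 1)]
    folds.append(idx[(n_folds - 1) * size:])
    return folds

def cross_validate(n_records, folds, fit_predict):
    """Collect out-of-fold predictions; fit_predict(train_idx, test_idx) is user-supplied."""
    pred = [None] * n_records
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_records) if i not in held_out]
        for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
            pred[i] = p
    return pred

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

bw_fold_sizes = [len(f) for f in make_folds(440)]   # body weight
fi_fold_sizes = [len(f) for f in make_folds(337)]   # feed intake / feed efficiency
```

With 440 records this gives 11 folds of 40, and with 337 records it gives 10 folds of 30 plus one of 37, matching the layout described in the text.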
Results and discussion
Table
Explained variances are for residuals (E), polygenic effects (P), SNPs (S), and gene expressions (G). The table shows estimates as the posterior mean with posterior standard deviation in parentheses and the proportion of explained variance as percentage of the total.

| Trait | Explained variances | PED | SNP | GEX | PED + SNP | PED + GEX | SNP + GEX | PED + SNP + GEX |
|---|---|---|---|---|---|---|---|---|
| Body Weight | E | 9.96 (1.93) 58% | 9.82 (0.94) 64% | 3.57 (0.9) 21% | 7.07 (1.77) 41% | 2.43 (1.01) 14% | 3.08 (0.77) 19% | 2.06 (1) 12% |
| | P | 7.26 (3.42) 42% | - | - | 5.04 (3.15) 29% | 2.45 (1.41) 14% | - | 2.08 (1.47) 12% |
| | S | - | 5.63 (0.9) 36% | - | 5.14 (1.08) 30% | - | 2.9 (0.67) 18% | 2.82 (0.73) 17% |
| | G | - | - | 13.45 (1.57) 79% | - | 12.37 (1.56) 72% | 10.29 (1.6) 63% | 9.93 (1.44) 59% |
| | Total | 17.22 | 15.45 | 17.02 | 17.25 | 17.25 | 16.27 | 16.89 |
| Feed Intake | E | 155.59 (42) 47% | 202.89 (22) 72% | 151.89 (27) 51% | 137.63 (40) 42% | 95.48 (36) 30% | 125.91 (24) 43% | 80.41 (34) 25% |
| | P | 174.89 (82) 53% | - | - | 131.88 (79) 40% | 99.74 (57) 31% | - | 89.97 (53) 28% |
| | S | - | 79.53 (22) 28% | - | 56.32 (22) 18% | - | 56.05 (19) 19% | 45.09 (18) 14% |
| | G | - | - | 150.24 (41) 49% | - | 125.33 (35) 39% | 111.84 (33) 38% | 104.9 (33) 33% |
| | Total | 330.48 | 282.42 | 302.13 | 325.83 | 320.55 | 293.8 | 320.37 |
| Feed Efficiency (×10,000) | E | 1.59 (0.44) 42% | 2.40 (0.26) 76% | 2.23 (0.3) 69% | 1.53 (0.44) 42% | 1.09 (0.48) 30% | 1.88 (0.3) 58% | 1.07 (0.46) 29% |
| | P | 2.17 (0.92) 58% | - | - | 1.73 (0.86) 47% | 1.87 (0.78) 51% | - | 1.61 (0.77) 44% |
| | S | - | 0.76 (0.24) 24% | - | 0.39 (0.22) 11% | - | 0.61 (0.23) 19% | 0.33 (0.2) 9% |
| | G | - | - | 1.01 (0.34) 31% | - | 0.71 (0.28) 19% | 0.73 (0.32) 23% | 0.66 (0.27) 18% |
| | Total | 3.76 | 3.16 | 3.24 | 3.65 | 3.67 | 3.22 | 3.67 |
Overall, explained variances increase when gene expression information (GEX; data from liver) is added, i.e., in the most complete model (PED + SNP + GEX) explained variances were 88%, 75%, and 71% for BW, FI, and FE respectively. This confirms the assumption that gene expressions can explain a larger part of phenotypic variance than genetic or genomic information, by capturing environmental effects, and possibly non-additive genetic effects, through the gene expressions.
This model shows that, for these traits, the liver transcriptome contributes a larger portion of explained variance. This is most pronounced for BW, with 18% of explained variance from the genome and 63% from the liver transcriptome. Thus, in this case, the predominant model is that SNPs regulate gene expressions to exert their effect on the phenotype.
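The percentages quoted above follow directly from the posterior means in the variance table. As a small worked check, using only values reported in that table (variable names are illustrative), the explained proportion in the complete model is one minus the residual share, and the SNP and transcriptome shares for BW are taken from the SNP + GEX column:

```python
# (residual variance E, total variance) in the PED + SNP + GEX column
full_model = {"BW": (2.06, 16.89), "FI": (80.41, 320.37), "FE": (1.07, 3.67)}

# Explained percentage = 100 * (1 - E / Total)
explained_pct = {trait: round(100 * (1 - e / total))
                 for trait, (e, total) in full_model.items()}

# Body Weight in the SNP + GEX column: S = 2.9, G = 10.29, Total = 16.27
snp_share = round(100 * 2.9 / 16.27)
gex_share = round(100 * 10.29 / 16.27)

print(explained_pct, snp_share, gex_share)
```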
Figure
Figure S1. Decomposition of the proportion of variance explained by SNPs at the level of chromosomes and individual SNPs in two models: the independent model SNP and the conditional model SNP+GEX for Feed Intake. (a) explained variances from SNPs in SNP model (black) and SNP+GEX model (white) in each chromosome. (b) explained variance by individual SNPs in SNP model and (c) SNP+GEX model.
Figure S2. Decomposition of the proportion of variance explained by SNPs at the level of chromosomes and individual SNPs in two models: the independent model SNP and the conditional model SNP+GEX for Feed Efficiency. (a) explained variances from SNPs in SNP model (black) and SNP+GEX model (white) in each chromosome. (b) explained variance by individual SNPs in SNP model and (c) SNP+GEX model.
Decomposition of the proportion of variance explained by SNPs at the level of chromosomes and individual SNPs in two models: the independent model SNP and the conditional model SNP + GEX for Body Weight. (a) Explained variances from SNPs in SNP model (black) and SNP + GEX model (white) in each chromosome. (b) Explained variance by individual SNPs in SNP model and (c) SNP + GEX model.
This approach is suitable for gene-level resolution. However, gene-level resolution is highly data dependent, i.e. it requires high marker density and a study population with LD blocks that span small genomic regions. In this work we used an F_{2} cross from outbred lines, which has large LD blocks, and this kind of data has limited resolution for fine-mapping of QTL.
One may argue that the most complete model is more interesting for investigating genetic architecture and chromosomal or sub-chromosomal variance. However, as we have shown, SNPs and pedigree are largely confounded and explain about the same variance. This confounding becomes worse when both pedigree and SNPs are in one model (the PED + SNP model), as shown by the wider confidence intervals for the variance explained by pedigree. The model with only omics information (SNP + GEX) is therefore simpler, more accurate, and as effective as the model that also uses pedigree information. This is interesting for future applications of omics technologies, because we expect that pedigree information will often be absent.
Figures
Map of chromosome 9 for Body Weight, which follows pattern 1, in which the SNP variance disappears when gene expression is added to the model (left). Distribution of the genetic values in the population based on chr. 9 in the SNP and SNP + GEX models (right).
Map of chromosome 11 for Body Weight, which follows pattern 2, in which the SNP variance remains unchanged when gene expression is added to the model (left). Distribution of the genetic values in the population based on chr. 11 in the SNP and SNP + GEX models (right).
| Trait | PED & SNP | SNP & GEX | PED & GEX |
|---|---|---|---|
| BW | 0.94 | 0.87 | 0.87 |
| FI | 0.93 | 0.87 | 0.88 |
| FE | 0.89 | 0.68 | 0.68 |
The prediction of phenotypes from these models, using cross-validation, is shown in Table
Figure S4. Comparison of predicted breeding values versus phenotypes in the models using pedigree information only (PED), SNPs information only (SNP) and gene expression information only (GEX) for three traits Body Weight, Feed Intake and Feed Efficiency according to correlation shown in Table
ρ, correlation between true phenotype and predicted value; β, regression of predicted values on true phenotypes.

| Trait | Parameter | PED | SNP | GEX | SNP + PED | GEX + PED | SNP + GEX | SNP + GEX + PED |
|---|---|---|---|---|---|---|---|---|
| Body Weight | ρ | 0.76 | 0.8 | 0.87 | 0.80 | 0.87 | 0.88 | 0.88 |
| | β | 0.99 | 0.99 | 1.01 | 0.99 | 1.01 | 1.02 | 1.02 |
| Feed Intake | ρ | 0.63 | 0.64 | 0.67 | 0.64 | 0.66 | 0.69 | 0.68 |
| | β | 0.98 | 0.99 | 0.99 | 0.96 | 0.95 | 0.98 | 0.96 |
| Feed Efficiency | ρ | 0.46 | 0.45 | 0.51 | 0.46 | 0.54 | 0.51 | 0.55 |
| | β | 0.94 | 0.96 | 0.86 | 0.92 | 0.98 | 1 | 0.96 |
Conclusions
With increased availability of various omics data, integrative approaches are promising tools for understanding the genetic architecture of complex traits. We have developed a complementary approach to the univariate “eQTL” mapping, by considering Bayesian models that fit all genome-wide SNPs and transcript abundances in one model, and that estimate and partition explained variances by chromosome and genome segments. Our results show that, using gene expressions, more of the phenotypic variance can be explained and phenotypes can be better predicted. Predictions were also shown to be unbiased, which validates the assessed explained variances. The improvement of phenotype predictions using gene expression data will be useful for several applications in agriculture and medicine, although it should be assessed on a case-by-case basis whether a suitable tissue can be sampled for the gene expression measurements. Partitioning of the explained genomic variance at the level of chromosomes and genome segments showed clear examples of eQTL locations as regions where genomic variance disappears when gene expressions are added to the model. Our study used only gene expressions from the liver, and an obvious further extension is to include expressions from other tissues. The QTLs that did not disappear when transcripts were added to the model may be eQTLs that affect gene expression in a tissue other than liver. The Bayesian model is quite efficient for handling large sets of covariates, and extensions to include multiple sets of expressions will be feasible. We have not provided formal statistical tests in this model, but the Bayesian approach lends itself naturally to obtaining confidence intervals for (differences between) parameter estimates. The estimates of total explained variances from the Bayesian mixed model can also be obtained by a residual maximum likelihood (REML) approach. We verified this, and the Bayesian and REML estimates generally agree.
However, using REML it is not feasible to utilize mixture priors to better discriminate between SNPs that contribute more or less variance, or to partition the variances at the sub-chromosome level, both of which are straightforward in a Bayesian approach.
Our approach can easily be scaled up to higher-density arrays, and even to whole-genome sequence data, with the variance components analysis carried out as it was for the gene expression probes in this study.
Abbreviations
BW: Body Weight; FI: Feed Intake; FE: Feed Efficiency; SNPs: Single Nucleotide Polymorphisms; REML: Residual Maximum Likelihood; QTL: Quantitative Trait Loci; eQTL: Expression Quantitative Trait Loci.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AE developed the data analysis pipeline, performed statistical analyses, interpreted the results and wrote the manuscript. PS and LJ were involved in project design, statistical analyses, interpretation of results and manuscript editing. DP and MA prepared the data for the analysis. All authors have read and approved the final manuscript.
Acknowledgement
This research is supported in part by the Quantomics research project, which has been co-financed by the European Commission within the 7th Framework Programme, contract No. 222664. This work is part of a PhD project supported by a scholarship from the Ministry of Science, Research and Technology of Iran.