Department of Molecular Biology and Genetics, Centre for Quantitative Genetics and Genomics, Aarhus University, Research Centre Foulum, DK-8830, Tjele, Denmark

Abstract

Background

Low-cost genotyping of individuals using high-density genomic markers was recently introduced as genomic selection in genetic improvement programs in dairy cattle. Most implementations of genomic selection use only marker information in the models used for prediction of genetic merit. However, in other species it has been shown that only a fraction of the total genetic variance can be explained by markers. Using 5217 bulls in the Nordic Holstein population that were genotyped and had genetic evaluations based on progeny, we partitioned the total additive genetic variance into a genomic component explained by markers and a remaining component explained by familial relationships. The traits analyzed were production and fitness related traits in dairy cattle. Furthermore, we estimated the genomic variance that can be attributed to individual chromosomes, and we illustrate methods that can predict the amount of additive genetic variance that can be explained by sets of markers with different density.

Results

The amount of additive genetic variance that can be explained by markers was estimated by an analysis of the matrix of genomic relationships. For the traits in the analysis, most of the additive genetic variance can be explained by 44 K informative SNP markers. The same amount of variance can be attributed to individual chromosomes, but surprisingly the relation between chromosomal variance and chromosome length was weak. In models including both genomic (marker) and familial (pedigree) effects, most (on average 77.2%) of the total additive genetic variance was explained by genomic effects, while the remainder was explained by familial relationships.

Conclusions

Most of the additive genetic variance for the traits in the Nordic Holstein population can be explained using 44 K informative SNP markers. By analyzing the genomic relationship matrix it is possible to predict the amount of additive genetic variance that can be explained by a reduced (or increased) set of markers. For the population analyzed the improvement of genomic prediction by increasing marker density beyond 44 K is limited.

Background

Low-cost genotyping of individuals or families using genomic markers with constantly increasing density is currently being introduced in genetic improvement programs for agricultural animal and crop species. Use of dense genomic markers can increase the accuracy of predicting additive genetic merit, especially for selection of candidates that do not yet have their own or progeny records.

The current industry standard in dairy cattle breeding is use of 50 K chips such as the Illumina Bovine SNP50 BeadChip

In human genetics very high density chips have been used in large scale studies

As mentioned above the total variance explained by previously identified causal loci is usually only a small fraction of total genetic variance in the populations investigated. In GWAS very stringent significance thresholds are necessary due to the very large number of statistical tests that are conducted when searching the whole genome using high density SNP marker panels. This will only allow loci with large effects to become statistically significant. However,

If not all genetic variance can be explained by markers then, in order to ensure optimal predictions, the remaining genetic variance should be accounted for in other ways. A simple approach is to combine predicted breeding values based on genomic information with traditional breeding values based on pedigree using selection index theory

Recently,

The purpose of this study was to evaluate the amount of additive genetic variation in production and fitness related traits in dairy cattle, to quantify the amount of additive genetic variation that can be explained using genomic markers with different density, and to quantify the amount of genomic variance that can be ascribed to individual chromosomes. The value of increasing the density of marker information for predicting genetic merit was also assessed using subsets of available markers.

Methods

Data

Deregressed proofs (**DRP**), derived from the official spring 2011 run of the Nordic Holstein genetic evaluations, were used as the response variable in the present study. Production traits included milk production (**milk**), fat production (

The deregressed proofs were merged with marker records of individual bulls that were typed using the Illumina Bovine SNP50 BeadChip (Illumina, San Diego, California, US). Only bulls with both genotype records and a deregressed proof for at least one trait in the database were included in the analysis. A summary of the data used is shown in Table

| **Abbreviation** | **Trait** | **n** | **mean** | **Sd** |
|---|---|---|---|---|
| **Milk** | Milk yield | 4398 | 97.41 | 13.21 |
| **Fat** | Fat yield | 4398 | 96.99 | 12.23 |
| **Protein** | Protein yield | 4398 | 95.44 | 14.55 |
| **Fertility** | Female fertility | 4415 | 99.44 | 16.90 |
| **Health** | Health index | 4240 | 96.69 | 19.22 |
| **Mastitis** | Mastitis resistance | 4398 | 95.98 | 11.70 |

A total of 47152 SNP markers were available in the raw marker data, and after removing markers with minor allele frequency (MAF) < 0.01 and non-informative markers that were a simple linear function of another marker, a total of 44012 markers remained for analysis. A sire-dam pedigree of all bulls included in the analysis was constructed from the official pedigree file from NAV (
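The marker filtering described in this section can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function name `filter_markers`, the 0/1/2 genotype coding, and the way "a simple linear function of another marker" is detected (identical standardized genotype columns up to sign) are all assumptions made for the example.

```python
import numpy as np

def filter_markers(genotypes, maf_threshold=0.01):
    """Return indices of markers kept after two filters (hypothetical helper).

    genotypes: (animals x markers) array of allele counts coded 0/1/2 (assumed coding).
    Filter 1: drop markers with minor allele frequency below maf_threshold.
    Filter 2: drop markers that are an exact linear function of an earlier kept marker.
    """
    freq = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    kept, seen = [], set()
    for j in np.flatnonzero(maf >= maf_threshold):
        # Standardize the column: two columns related by y = a + b*x
        # have the same standardized form up to sign.
        z = genotypes[:, j] - genotypes[:, j].mean()
        z = z / np.linalg.norm(z)
        first_nonzero = z[np.flatnonzero(np.abs(z) > 1e-12)[0]]
        if first_nonzero < 0:
            z = -z
        key = (np.round(z, 8) + 0.0).tobytes()  # "+ 0.0" turns -0.0 into 0.0
        if key not in seen:
            seen.add(key)
            kept.append(int(j))
    return kept

# Toy data: 6 animals, 6 markers (columns).
a = np.array([0, 1, 2, 1, 0, 2])
c = np.array([0, 0, 1, 1, 2, 2])
e = np.array([0, 1, 0, 1, 0, 1])
toy = np.column_stack([a, 2 - a, c, np.zeros(6, dtype=int), e, 2 * e])
kept = filter_markers(toy)
print(kept)
```

In the toy data the second column mirrors the first, the fourth is monomorphic, and the sixth is twice the fifth, so only three columns carry distinct information.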

Models

The data were analyzed using the following models:

where **μ** is the general mean, **a**_{x} is the vector of additive genetic effects not accounted for by genetic markers (for model 1 this reduces to the classical individual animal model since no markers are included in the model), **g**_{x} is the vector of additive genetic effects accounted for by markers, **g**_{c} is the vector of additive genetic effects accounted for by markers on a specific autosomal chromosome and **g**_{o} is the vector of additive genetic effects accounted for by markers on all remaining chromosomes. **1** is a vector of ones, and **Z**_{a} and **Z**_{g} are incidence matrices relating observations in **y** to additive genetic effects in **a** and in **g**_{x}, **g**_{c} or **g**_{o}, respectively. Subscript x on vectors in different models indicates that definitions vary with model. Due to computational constraints, specific effects of markers on single chromosomes were estimated for only one chromosome at a time, and each analysis included effects of markers on the specific chromosome and the combined effects of markers on all other chromosomes. Therefore model (4) was run once for each trait and for each chromosome.
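For orientation, the symbol definitions above are consistent with four linear mixed models of the following shape. This is a reading inferred from the definitions, not a quotation of the original equations: the residual vectors **e**_{x} are an assumed notation, and the presence of the polygenic term in (4) is an assumption suggested by the later remark that polygenic familial effects were included when estimating chromosomal variances.

```latex
\begin{align}
\mathbf{y} &= \mathbf{1}\mu + \mathbf{Z}_{a}\mathbf{a}_{1} + \mathbf{e}_{1} \tag{1}\\
\mathbf{y} &= \mathbf{1}\mu + \mathbf{Z}_{g}\mathbf{g}_{2} + \mathbf{e}_{2} \tag{2}\\
\mathbf{y} &= \mathbf{1}\mu + \mathbf{Z}_{a}\mathbf{a}_{3} + \mathbf{Z}_{g}\mathbf{g}_{3} + \mathbf{e}_{3} \tag{3}\\
\mathbf{y} &= \mathbf{1}\mu + \mathbf{Z}_{a}\mathbf{a}_{4} + \mathbf{Z}_{g}\mathbf{g}_{c} + \mathbf{Z}_{g}\mathbf{g}_{o} + \mathbf{e}_{4} \tag{4}
\end{align}
```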

The parameter **μ** was treated as a fixed effect in all models, and all other effects were assumed to be random and normally distributed, with covariance structures given by the matrices defined below. **A** is the additive genetic relationship matrix computed from the full pedigree, and **G** is the genomic relationship matrix computed from marker data as in (5).

In (5) **M** is an allele sharing matrix containing the number of copies of the second allele, and **P** is a matrix containing twice the population frequency of the second allele. The division in (5) makes **A** and **G** comparable but will otherwise not influence the predictions from the model, only the scale of model parameters. Finally, **D** is a diagonal matrix containing weights proportional to the effective number of records in each **DRP**. The linear models (1)-(4) are based on different ways of describing the relationship among animals. The relationship matrix **A**, based on pedigree information, uses probability of identity by descent, whereas the genomic relationship matrix **G**, based on marker information, uses probability of identity by state.
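A form of the random-effect assumptions and of (5) that agrees with these definitions is sketched below. It assumes the widely used scaling by twice the sum of p(1 - p) over markers, where p_{j} is the frequency of the second allele at marker j. The variance symbols and the use of the inverse of **D** for the residual covariance are assumed notation, not taken from the source.

```latex
\begin{gather}
\mathbf{a}_{x} \sim N\!\left(\mathbf{0},\, \mathbf{A}\sigma^{2}_{a}\right), \qquad
\mathbf{g}_{x} \sim N\!\left(\mathbf{0},\, \mathbf{G}\sigma^{2}_{g}\right), \qquad
\mathbf{e}_{x} \sim N\!\left(\mathbf{0},\, \mathbf{D}^{-1}\sigma^{2}_{e}\right) \notag\\
\mathbf{G} = \frac{(\mathbf{M}-\mathbf{P})(\mathbf{M}-\mathbf{P})'}{2\sum_{j} p_{j}\,(1-p_{j})} \tag{5}
\end{gather}
```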

The number of markers included in the computation of **G** in models (2) and (3) was 44012. The same markers were used in model (4), but they were split into the markers on one chromosome at a time and the pooled markers on all other chromosomes, such that two genomic relationship matrices of the same size were computed. Model (4) was used in this way for all 29 autosomal chromosomes.
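The marker split used for model (4) can be illustrated with a small sketch. The function names, the toy data, and the choice to scale each matrix by its own markers are assumptions for illustration; the source does not state how the two matrices were scaled.

```python
import numpy as np

def genomic_relationship(genotypes):
    """Genomic relationship matrix from (animals x markers) allele counts coded 0/1/2."""
    p = genotypes.mean(axis=0) / 2.0            # frequency of the counted allele
    centered = genotypes - 2.0 * p              # allele counts minus twice the frequency
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))

def split_by_chromosome(genotypes, chromosome_of_marker, chromosome):
    """Return (G_c, G_o): one matrix from markers on `chromosome`, one from all others."""
    on_target = chromosome_of_marker == chromosome
    g_c = genomic_relationship(genotypes[:, on_target])
    g_o = genomic_relationship(genotypes[:, ~on_target])
    return g_c, g_o

# Toy data: 8 animals, 290 markers spread evenly over 29 autosomes (labels 1..29).
rng = np.random.default_rng(0)
toy_genotypes = rng.binomial(2, 0.3, size=(8, 290)).astype(float)
toy_chromosome = np.arange(290) % 29 + 1
g_c, g_o = split_by_chromosome(toy_genotypes, toy_chromosome, chromosome=1)
print(g_c.shape, g_o.shape)
```

Both matrices have one row and one column per animal, which is why they are "of the same size" even though they are built from very different numbers of markers.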

All analyses, including estimation of variance components using Restricted Maximum Likelihood, were conducted using the DMU software

Analysis of genomic relationships

Additive genetic differences between individuals are due to, generally unknown, causative genes. If the genotypes at all causative loci were known, the true genomic relationship matrix (**G**_{t}) with regard to the trait of interest could be computed based on all the causative loci. In practice this is not possible and instead we compute **G** based on the marker data only, and here we name this **G**_{m}. The accuracy of **G**_{m} in describing the genetic covariance among individuals sharing the same causative genes (**G**_{t}) depends on the linkage disequilibrium between the markers and the causative genes. The accuracy of the genomic relationship matrix can be assessed using a procedure that evaluates **G**_{m} as an estimate of **G**_{t}. The procedure includes the following steps:

1. Randomly sample 2N SNPs across the genome and divide them into two groups of equal size.

2. Calculate **G**_{m} using the N SNPs in group one and calculate **G**_{t} using the SNPs in group two, assuming that the SNPs in group two are the causal variants.

3. Use linear regression to relate the elements of **G**_{t} to those of **G**_{m}. The mean of the diagonal elements of **G**_{t} and **G**_{m} is removed by subtracting 1.0 from the diagonal elements before estimating the intercept α and the slope β.
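The three steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, the genotypes are simulated half-sib families with unlinked markers, the relationship matrices use the common scaling by twice the sum of p(1 - p), and the regression is taken over the lower triangle including the diagonal.

```python
import numpy as np

def genomic_relationship(genotypes):
    """Genomic relationship matrix from (animals x markers) allele counts coded 0/1/2."""
    p = genotypes.mean(axis=0) / 2.0
    centered = genotypes - 2.0 * p
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))

def regress_gt_on_gm(g_t, g_m):
    """Step 3: regress elements of G_t on G_m after subtracting 1.0 from the diagonals."""
    n = g_t.shape[0]
    rows, cols = np.tril_indices(n)             # lower triangle including the diagonal
    y = (g_t - np.eye(n))[rows, cols]
    x = (g_m - np.eye(n))[rows, cols]
    beta, alpha = np.polyfit(x, y, 1)           # polyfit returns slope first, then intercept
    return alpha, beta

def accuracy_for_n(genotypes, n_snps, rng):
    """Steps 1 to 3 for one value of N."""
    chosen = rng.choice(genotypes.shape[1], size=2 * n_snps, replace=False)
    group_one, group_two = chosen[:n_snps], chosen[n_snps:]
    g_m = genomic_relationship(genotypes[:, group_one])   # plays the role of the markers
    g_t = genomic_relationship(genotypes[:, group_two])   # treated as the causal variants
    return regress_gt_on_gm(g_t, g_m)

def simulate_half_sibs(n_sires, n_offspring, n_snps, rng):
    """Toy genotypes: each offspring gets one random sire allele and one random dam allele."""
    freq = rng.uniform(0.1, 0.9, size=n_snps)
    sire_haps = rng.binomial(1, freq, size=(n_sires, 2, n_snps))
    sire_of = np.repeat(np.arange(n_sires), n_offspring)
    pick = rng.integers(0, 2, size=(sire_of.size, n_snps))
    paternal = sire_haps[sire_of[:, None], pick, np.arange(n_snps)]
    maternal = rng.binomial(1, freq, size=paternal.shape)
    return paternal + maternal

rng = np.random.default_rng(1)
toy = simulate_half_sibs(n_sires=10, n_offspring=10, n_snps=2000, rng=rng)
for n_snps in (100, 500, 1000):
    alpha, beta = accuracy_for_n(toy, n_snps, rng)
    print(n_snps, round(alpha, 3), round(beta, 3))
```

Repeating the call for several values of N traces how the slope β changes with the number of markers, which is the quantity the procedure is designed to deliver.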

This procedure is repeated for different N and the relation between

To obtain an unbiased estimate of **G**_{t} we want E

The genomic covariance matrices are diagonally dominant with diagonal elements close to unity. If all diagonal elements in **G**_{m} are unity, the adjustment in (7) corresponds to adjusting estimated variance components by β. In other words, the estimate of genetic variance obtained from model (2) is biased downwards by an amount proportional to β.
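To make the last statement concrete, here is a short derivation under assumptions that are ours rather than the source's: that the adjustment in (7) shrinks the deviations of **G**_{m} from the identity matrix by β, that the intercept α is close to zero, and that σ²_{t} denotes the additive genetic variance associated with **G**_{t}. The symbol **G**^{*} for the adjusted matrix is likewise introduced only for this sketch.

```latex
\mathbf{G}^{*}\sigma^{2}_{t}
  = \left[\mathbf{I} + \beta\,(\mathbf{G}_{m} - \mathbf{I})\right]\sigma^{2}_{t}
  = \mathbf{G}_{m}\,\beta\,\sigma^{2}_{t} + \mathbf{I}\,(1-\beta)\,\sigma^{2}_{t}
```

Under these assumptions, fitting **G**_{m} directly as in model (2) gives a variance component of roughly βσ²_{t}, while the share (1 - β)σ²_{t} is not tied to the structure of **G**_{m}.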

Results

Variance components

A summary of the estimated variance components from all models (1) to (4) is shown in Table **milk** to 0.97 for

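| **Trait** | **Model 1**: additive genetic variance | **Model 1**: ratio 1) | **Model 2**: genomic variance | **Model 2**: ratio 1) | **Model 3**: genomic variance | **Model 3**: polygenic variance | **Model 3**: ratio 1) | **Model 4**: genomic variance summed over chromosomes |
|---|---|---|---|---|---|---|---|---|
| **Milk** | 138.24 | 0.92 | 134.18 | 0.88 | 119.49 | 20.87 | 0.93 | 115.3 |
| **Fat** | 113.10 | 0.91 | 109.33 | 0.87 | 93.61 | 22.36 | 0.94 | 90.5 |
| **Protein** | 143.16 | 0.97 | 132.99 | 0.88 | 106.67 | 34.26 | 0.96 | 103.0 |
| **Fertility** | 151.74 | 0.78 | 142.42 | 0.74 | 110.38 | 40.10 | 0.78 | 106.5 |
| **Health** | 141.57 | 0.65 | 136.70 | 0.63 | 101.84 | 42.60 | 0.66 | 98.4 |
| **Mastitis** | 99.19 | 0.82 | 97.30 | 0.79 | 81.77 | 23.67 | 0.85 | 79.0 |

1) Ratio of genetic variance in model over total variance.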
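The comparisons between models discussed in this section are simple ratios of the tabulated variance components, and they can be checked with a few lines of code. The assignment of the table columns to model components (in particular, which of the two model (3) values is the genomic one) is our reading of the table, not something the source states explicitly.

```python
# Variance components per trait, read from the table of variance components.
# Order: model 1 additive, model 2 genomic, model 3 genomic, model 3 polygenic.
# The column assignment is an assumption made for this check.
components = {
    "milk":      (138.24, 134.18, 119.49, 20.87),
    "fat":       (113.10, 109.33,  93.61, 22.36),
    "protein":   (143.16, 132.99, 106.67, 34.26),
    "fertility": (151.74, 142.42, 110.38, 40.10),
    "health":    (141.57, 136.70, 101.84, 42.60),
    "mastitis":  ( 99.19,  97.30,  81.77, 23.67),
}

# Share of the animal-model variance captured by markers alone (model 2 / model 1).
marker_share = {t: m2 / m1 for t, (m1, m2, _, _) in components.items()}

# Share of total genetic variance in model 3 that is genomic.
genomic_share = {t: g / (g + a) for t, (_, _, g, a) in components.items()}

# Total genetic variance in model 3 relative to model 1.
total_ratio = {t: (g + a) / m1 for t, (m1, _, g, a) in components.items()}

mean_genomic_share = sum(genomic_share.values()) / len(genomic_share)
mean_total_ratio = sum(total_ratio.values()) / len(total_ratio)

for trait in components:
    print(trait, round(marker_share[trait], 3),
          round(genomic_share[trait], 3), round(total_ratio[trait], 3))
print("mean genomic share:", round(mean_genomic_share, 3))
print("mean total ratio:", round(mean_total_ratio, 3))
```

With this reading of the columns, the averages agree with the figures quoted in the text (77.2% genomic on average in model (3), and a model (3) total of 101.7% of the animal-model variance).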

For all traits analyzed, the genetic (total genomic) variance estimated in model (2) was lower than the additive genetic variance estimated using the classical individual animal model (1). The reason for this is that the genomic relationship matrix (**G**) does not trace all relationships due to sharing of causative alleles. However, the difference is small and generally the genomic information accounts for between 92% and 98% of the total additive genetic variance depending on the trait in question. These results are well in line with results of

In order to separate effects of polygenic/familial genetic relationships from genomic relationships, model (3) was run. In this model the covariances among animals due to additive genetic relationships and due to genomic relationships (markers) were both included. Averaged across all traits, total genetic variance estimated in model (3) was 101.7% of total genetic variance estimated in the animal model (1), with a range from 98.4% for **protein** to 106.3% for

Additive genetic variance due to individual chromosomes

Results from the analysis using model (4), where genomic variance due to individual chromosomes was estimated, are also summarized in Table

Estimates of variance components due to individual chromosomes are shown in Figure **milk** and

**Estimates of genomic variance (y axis) due to individual chromosomes in relation to chromosome length (x axis) in Mb**


Amount of variance explained depending on number of SNPs

The procedure of

**Expected proportion of total additive genetic variance traced by increasing number of markers**


The additive genetic variance explained by different numbers of markers was investigated using model (2) by varying the number of markers used to compute the genomic relationship matrix. Results averaged over all traits are shown in Table

| **No of Markers** | **β** | **Estimated proportion of genetic variance explained by markers** |
|---|---|---|
| 44012 | 0.960 | 0.936 |
| 22006 | 0.930 | 0.918 |
| 11003 | 0.909 | 0.880 |

Discussion and conclusions

The records analyzed in this paper were **DRP**s which were derived from the routine genetic evaluations of dairy cattle in Denmark, Sweden and Finland. Such **DRP**s are similar to progeny group means adjusted for non-genetic effects. Therefore, a very large proportion of phenotypic variance in analyzed DRP was due to additive genetic effects. For all traits analyzed more than 92% of all additive genetic variance could be explained using 44 K SNP markers. For models including both polygenic additive genetic (pedigree) effects and genomic (marker) effects, the latter accounted for between 71% and 85% of all additive genetic variance. Estimation of genomic variance of each individual chromosome showed that 96%–97% of all genomic (marker) variance could be attributed to individual chromosomes. Inclusion of polygenic familial effects in the models ensured that potential linkage disequilibrium across chromosomes was already taken into account. Most of the additive genetic variance in the population analyzed could be explained by genetic markers. The effect of reducing (increasing) the number of genetic markers on genomic prediction could be predicted by estimating the accuracy of the genomic relationship matrix.

Variance components

Most of the total phenotypic variance in the traits analyzed was additive genetic due to the use of deregressed proofs, which average out any dominance deviations across multiple daughters. Such proofs, of course, are functions of the procedures and definitions used in the recording system and methods used in the genetic evaluation. For production traits (**milk**,

The amount of additive genetic variance that can be explained by markers obviously cannot exceed the total additive genetic variance in the population. The classical individual animal model (1) yields unbiased estimates of population additive genetic variance. Comparing results from model (2) (genomic model) with results from model (1) (animal model) clearly illustrates this. For the traits milk and fat 97% of the additive genetic variance can be explained by markers. The estimates of additive genetic variance due to individual chromosomes clearly show that these traits are influenced by a major gene that contributes to the large proportion of additive genetic variance. Statistical models fitting effects of individual marker genes might be a better alternative in these cases.

For **protein** and

Additive genetic variance due to individual chromosomes

When summing over chromosomes the estimates of genomic variance due to individual chromosomes yielded total genomic variances that were similar to the total genomic variance in model (2) where this quantity was estimated directly. The method therefore seems able to yield estimates of genomic variance due to individual chromosomes. Surprisingly the estimates of variance due to individual chromosomes only showed a weak relationship with chromosome length (Figure

Amount of variance explained by genomic markers

Obviously, genetic markers cannot explain more than the total (additive) genetic variation present in the population. The analysis of the genomic relationship matrices revealed that a large proportion of the total additive genetic variance in the Nordic Holstein population was expected to be explained by a set of 44 K markers. Analysis of both production and fitness related traits showed that the amount of variance accounted for by markers in the Nordic Holstein population was close to the expectations from regression based analysis of the genomic relationship matrix. Estimates of genomic variance closely followed expectation when the number of markers included in the computation of the genomic relationship matrix was varied. The amount of additive genetic variance that can be explained by genomic markers depends on several factors: the number of markers on causative sites, markers in linkage disequilibrium with causative genes due to close “historical” linkage at population level, and finally linkage disequilibrium among markers and genes at family level, due to the family structure in the population. With 44 K markers spread over the genome, the number of markers within causative sites is probably limited. The linkage disequilibrium between markers and causative genes is very dependent on effective population size

The analysis of genomic relationship matrices showed that a high proportion of additive genetic variance can be expected to be explained using 44 K genomic markers in this population of dairy cattle. This leaves limited room for further improvements of the predictive ability of genomic models by including more markers. One of the current trends in the use of genomic markers is to move from 50 K marker chips to 800 K marker chips or even complete sequencing of whole genomes for individual animals. Our results indicate that the advantages of this route may be limited. In fact, including several orders of magnitude more markers than used in this study may turn out to be counterproductive. Extremely dense markers will include more markers on most causative sites, and given knowledge of variation in the causative genes there is no extra information in the remaining markers. Alternative models that can better distinguish between causative genes and non-informative markers might be of great value in future

Analysis of the structure of the genomic relationship matrices might be of considerable value in deciding on avenues for future development of typing strategies when using genomic markers. Such analysis could also give extra insight into the effects of population structure and population history on the effectiveness of future selection programs using genomic selection in other breeds or in other species.

In summary, we estimated the amount of additive genetic variance that can be explained using dense SNP marker panels. In the Holstein population analyzed, almost all the additive genetic variance could be explained using 44 K SNP markers. The amount of additive genetic variance that is expected to be explained by markers could be predicted from analysis of the genomic relationship matrix. Further increases in marker density will have limited effects on predictive accuracy unless better methods for distinguishing between markers with real effects and markers with no effect are used. Results presented in this study can be used to determine the weight given to marker relationships and to familial relationships in one step prediction methods where these sources of relationships are combined, and in two step methods where information based on genomic relationships must be combined with information from polygenic relationships.

Abbreviations

GS: Genomic selection; DRP: Deregressed proof;

Competing interests

The authors declare no competing interests.

Authors’ contributions

JUJ conceived the study and conducted all analyses, GS developed and implemented algorithms for computing genomic relationship matrices and PM maintained the DMU software package used in the statistical analysis. JUJ edited the manuscript based on extensive input from all authors, who have read and approved the final manuscript.

Acknowledgments

The Danish Cattle Federation, FaBa Co-Op, Swedish Dairy Association and Nordic Genetic Evaluation are thanked for providing the data. External funding for this study, including the extensive genotyping, was provided by the Danish Ministry for Food, Agriculture and Fisheries project “Genomic Selection – from function to efficient utilization in cattle breeding” Grant no 3405-10-0137 and the Milk Levy Fund, Viking Genetics and Nordic Genetic Evaluation. The first author was funded through a grant from Aarhus University.