Abstract
Background
Linkage disequilibrium (LD) between genes at linked or independent loci can occur at gametic and zygotic levels known asgametic LD and zygotic LD, respectively. Gametic LD is well known for its roles in finescale mapping of quantitative trait loci, genomic selection and evolutionary inference. The lesswell studied is the zygotic LD and its components that can be also estimated directly from the unphased SNPs.
Results
This study was set up to investigate the genomewide extent and patterns of zygotic LD and its components in a crossbred cattle population using the genomic data from the Illumina BovineSNP50 beadchip. The animal population arose from repeated crossbreeding of multiple breeds and selection for growth and cow reproduction. The study showed that similar genomic structures in gametic and zygotic LD were observed, with zygotic LD decaying faster than gametic LD over marker distance. The trigenic and quadrigenic disequilibria were generally two to threefold smaller than the usual digenic disequilibria (gametic or composite LD). There was less power of testing for these highorder genic disequilibria than for the digenic disequilibria. The power estimates decreased with the marker distance between markers though the decay trend is more obvious for the digenic disequilibria than for highorder disequilibria.
Conclusions
This study is the first major genomewide survey of all nonallelic associations between pairs of SNPs in a cattle population. Such analysis allows us to assess the relative importance of gametic LD vs. all other nonallelic genic LDs regardless of whether or not the population is in HWE. The observed predominance of digenic LD (gametic or composite LD) coupled with insignificant highorder trigenic and quadrigenic disequilibria supports the current intensive focus on the use of highdensity SNP markers for genomewide association studies and genomic selection activities in the cattle population.
Keywords:
Crossbred cattle; Gametic linkage disequilibrium; Genomewide multilocus structure; Zygotic linkage disequilibriumBackground
The recent advancement in molecular biology has enabled the rapid development of single nucleotide polymorphism (SNP) genotyping technology, thereby making the genotyping of cheap and abundant SNP markers possible in many livestock species [1]. Commercial SNP chips (e.g., Illumina BovineSNP50 beadchip for cattle) are now available for highthroughput genotyping of large numbers of SNPs in cattle, which allows animal geneticists to search for the quantitative trait nucleotides (QTNs) underlying variation in complex traits from genomewide association studies (GWAS) or to predict animal’s performance through genomic selection [2,3]. The success of GWAS and genomic selection depends crucially on the extent of linkage disequilibrium (LD) between SNPs and QTNs on chromosomes. It is shown (e.g., [3]) that strong gametic LD is found at long distances (> 1 cM) in domestic animals (e.g., cattle and dog) but not in human.
With the availability of highdensity SNPs, many studies have also conducted population genetic analysis of gametic LD in cattle and other animal species (e.g., [48]). However, the gametic LD cannot be calculated directly for unphased SNP markers because the gametic phase of animals that are heterozygous at two or more loci cannot be directly observed or specified. Thus, these studies have often followed the classic approach of Hill [9] to estimating gametic LD for unphased data, but such estimation was carried out under the assumption of HardyWeinberg equilibrium (HWE). For pure breed populations as in the above studies, the HWE assumption may be reasonable.
In this study, we investigate the genomewide extent and patterns of LD in a crossbred cattle population using the genomic data typed with the Illumina BovineSNP50 beadchip. The animals in this study are the progenies of three synthetic lines that were maintained separately at the University of Alberta Kinsella Research Ranch during 1960 to 1989 [10] and were subsequently pooled. It will hereafter be called as the Kinsella composite beef population. Since this population arose from repeated mixing of multiple breeds and selection for growth and cow reproduction, it may not be in HWE. Furthermore, in general diploid nonequilibrium populations such as the Kinsella composite beef population, Yang [11,12] suggested the assessment of LD at both gametic and zygotic levels. Higherorder trigenic and quadrigenic disequilibria are the components of the zygotic LD. To the best of our knowledge, only one study by Liu et al. [13] examined the trigenic and quadrigenic disequilibria in a canine population but it was based on a small number of dogs and a limited number of markers. Additionally, all LD measures (zygotic LD and its components) are known to depend on gene frequencies at different loci, but such dependence is rarely examined. For example, gametic LD is generally smaller if gene frequencies are near fixation than if gene frequencies are intermediate. Thus there is a need to examine the dependence of zygotic LD and its components on gene frequencies.
The Kinsella beef composite population has been the subject of numerous breeding and genomic studies (e.g., [1417]). However, the extent and patterns of gametic LD or other LD measures in this population has never been examined. It would be desirable to fully investigate the extent of LD and their distribution patterns of the whole genome for this population to provide the baseline information about multilocus structures for helping future genomic research. Therefore, the objectives of this study are (i) to determine if individual genic disequilibria are significant; (ii) to investigate the relationship between individual components of zygotic LD and physical distance; and (iii) to examine effects of changing gene frequencies on different LD measures.
Results
Zygotic LD and its components
The estimates of power of chisquare tests for zygotic LD and individual genic disequilibria were plotted against the marker distances of ≤ 50 Mb for all SNP pairs on 29 autosomes (Figure 1). This plot showed that the power decreased with the increasing marker distance. The pace of the power decay varied with individual genic disequilibria with the composite LD being the slowest but the two trigenic disequilibria being the fastest. After this initial sharp decline, the power values were stabilized around 0.4 for the composite LD, 0.2 for the two trigenic disequilibria and 0.05 for the quadrigenic disequilibria over the range of ≤ 50 Mb and beyond. Since the most useful LDs occurred for SNP pairs spanning at close vicinity (≤ 5 Mb) as shown in Figure 1, we calculated gametic, composite and zygotic LDs in terms of squared correlation at nine distance intervals within 5 Mb for all SNP pairs within individual distance intervals and the percentages of LD values being ≥0.25 (Table 1). There was a clear trend of LD decay with the marker distance. The trigenic and quadrigenic disequilibria were not shown because there was little trend of decay for these disequilibria even within 5 Mb.
Figure 1. The relationship between the estimated powers of chisquare tests for zygotic LD and its individual genic components and marker distance for SNP markers that were apart within 50 Mb on 29 autosomes in the Kinsella beef composite population. Note that the powers of the tests for the two trigenic components were very similar over the whole range of marker distance as indicated by the lines for the two disequilibria being overlapped to each other.
Table 1. Means and standard deviations (SD) of gametic LD (), composite LD () and zygotic LD () between syntenic SNP pairs for different ranges of distance (≤ 5 Mb) in the Kinsella composite beef population
Similar trend of LD decay was observed on individual chromosomes. However, given the limited numbers of SNP pairs over the nine intervals within the short distance of ≤ 5 Mb on individual chromosomes, the estimates of power of chisquare test statistic for gametic, composite and zygotic LDs are presented for two groups of SNP pairs, those with a distance of ≤ 50 Mb (Linked Group) and those with a distance of > 50 Mb (Unlinked Group) (Table 2). As in the usual convention, any marker pair spanning > 50 Mb would be considered freely recombined or unlinked, assuming the onetoone relationship between the genetic and physical distances (i.e., 1cM = 1 Mb). The physical distances for individual chromosomes are listed in Table A 1. Within each group, the chisquare tests for gametic LD and composite LD had similar power, but they both had higher values than the chisquare tests for zygotic LD. Genomewide, the chisquare tests for gametic and composite LD were ~15% more powerful in the Linked Group than in the Unlinked Group, but the chisquare tests for zygotic LD were only ~7% more powerful in the Linked Group than in the Unlinked Group. For the Linked Group, the range of mean estimates of power for gametic LD was from 50.6% on BTA 19 to 53.9% on BTA 25; the range of mean estimates of power for composite LD was from 50.6% on BTA 19 to 54.0% on BTA 25; and the range of mean estimates of power for zygotic LD was from 16.9% on BTA 24 to 19.7% on BTA 11. The corresponding ranges of mean estimates of power for the Unlinked Group were 0.348 (BTA12) – 0.401 (BTA 29) for gametic LD, 0.344 (BTA 12) – 0.392 (BTA 29) for composite LD and 0.105 (BTA 23) – 0.126 (BTA 11) for zygotic LD. It should be noted that chromosomes 25, 27 and 28 are shorter than 50 Mb, and chromosome 26 is 52 Mb long but the chisquare statistics for all marker pairs with a distance >50 Mb did not exceed 3.84.
Additional file 1. Table A1. Number of Single Nucleotide Polymorphism (SNP) markers (m) and chromosome length (mega base pairs, Mb) for 29 bovine autosomes (BTA 1 to BTA 29) in the Kinsella composite beef population. Mean, standard deviation (SD), minimum and maximum distances (in base pairs) between all pairs of adjacent markers are also presented. Table A2. The proportion of syntenic SNP pairs with outofbound estimates of generalized measures of squared correlation for trigenic and quadrigenic disequilibria in the Kinsella composite population. Table A3. Summary statistics on singlelocus heterozygosity, fixation index and HardyWeinberg disequilibrium (HWD) averaged over all SNP markers on 29 bovine autosomes (BTA 1 to BTA 29) in the Kinsella composite beef population. Appendix: Sampling variances of individual genic disequilibria in zygotic LD.
Format: PDF Size: 951KB Download file
This file can be viewed with: Adobe Acrobat Reader
Table 2. The power values^{*}of test statistics for gametic LD, composite LD and zygotic LD for marker pairs with a distance ≤50 Mb and >50 Mb on 29 autosomes in the Kinsella composite beef population
The estimates of power of the chisquare tests for trigenic and quadrigenic components of the zygotic LD are given for the Linked (≤ 50 Mb) and Unlinked (> 50 Mb) Groups (Table 3). For a given trigenic or quadrigenic disequilibrium, the power estimates of chisquare tests were similar regardless of the distance between marker pairs. The ranges of the power for each of the two trigenic disequilibria were 0.1030.145 in the Linked Group and 0.0810.184 in the Unlinked Group. Such ranges for the quadrigenic disequilibrium were 0.0720.092 in the Linked Group and 0.0520.066 in the Unlinked Group. It should be noted that the estimates of power were based on the number of marker pairs left after removing those with the generalized squared correlations () being outside the acceptable range of 0 to 1. We recorded separately the frequencies of the two outofbound situations (< 0 and > 1) in Table 1. First, for < 0, the sampling variances of estimated trigenic disequilibria were negative for about 69% of the genomewide syntenic marker pairs (36,131,636); in contrast, the sampling variances of estimated quadrigenic disequilibrium were positive for all the syntenic marker pairs. Second, for > 1, there was 0.02% of the genomewide syntenic marker pairs for both trigenic disequilibria, but only 0.001% for quadrigenic disequilibrium.
Table 3. The power estimates* of test statistics for the trigenic and quadrigenic disequilibria for marker pairs with a distance ≤50 Mb and >50 Mb on 29 autosomes in the Kinsella composite beef population
Presented in Table 4 are the generalized squared correlations () of gametic, composite, trigenic and quadrigenic disequilibria averaged over all syntenic marker pairs. The genomewide values for digenic disequilibria (gametic and composite LD) were about three times those of trigenic and quadrigenic disequilibria. There was variation in individual genic disequilibria among chromosomes. For example, the gametic LD averaged over all pairs on chromosomes ranged from 0.0082 on BTA 1 to 0.0126 on BTA 25 whereas the quadrigenic LD ranged from 0.0012 on BTA1 to 0.0016 on BTA 25. When looking at the percentages of SNP pairs with ≥ 0.2 (Table 5), the values for digenic disequilibria were also two to three times those for trigenic and quadrigenic disequilibria.
Table 4. The estimated digenic (gametic and composite), trigenic and quadrigenic disequilibria averaged over all syntenic SNP pairs on 29 autosomes in the Kinsella composite beef population
Table 5. The proportion of syntenic SNP pairs with the generalized measures of square correlation estimated for gametic, composite, trigenic and quadrigenic disequilibria that exceeded 0.2 on 29 autosomes in the Kinsella composite beef population
The mean values of LDs and power estimates for chisquare tests for gametic, composite, trigenic, quadrigenic and zygotic LD were summarized for all syntenic marker pairs (intrachromosome pairs) and all nonsyntenic pairs (interchromosome pairs) (Table 6). The mean values of individual genic disequilibria and test power were greater for syntenic marker pairs than for nonsyntenic marker pairs though such difference between syntenic vs. nonsyntenic pairs were more pronounced for the digenic disequilibria than for trigenic and quadrigenic disequilibria. In particular, the two trigenic disequilibria and their test power were almost the same for syntenic and nonsyntenic pairs. The magnitudes of LD values and test powers decreased with the number of alleles in the LD measures for both syntenic and nonsyntenic marker pairs with the order of digenic LD > trigenic LD > quadrigenic LD. It should be noted that despite the same number of possible intra or interchromosome marker pairs for all individual genic disequilibria, only those pairs whose generalized squared correlations fell within the acceptable range of 0 ≤ ≤ 1 were retained for calculating the mean LD values and estimating the test powers.
Table 6. Means and ranges of generalized measures of squared correlations for digenic (gametic and composite), trigenic, quadrigenic and zygotic disequilibria averaged over all syntenic (intrachromosome) SNP pairs and over all nonsyntenic (interchromosome) SNP pairs across the composite beef genome
Dependence of LD on gene frequency
The power of chisquare tests for individual genic disequilibria was estimated for syntenic marker pairs belonging to nine classes of minor allele frequency (MAF) with three MAF intervals (< 0.1, 0.10.3 and 0.30.5) at each of the two loci on each of 29 chromosomes as well as for nonsyntenic marker pairs on two different chromosomes. The power estimates for nonsyntenic marker pairs were calculated on all 406 [29(291)/2] chromosome pairs. In Table 7 are given the mean, minimum, and maximum values of power estimates for the syntenic marker pairs over 29 chromosomes and for nonsyntenic pairs over 406 chromosome pairs. In all nine MAF classes, the digenic disequilibria (composite LD) and zygotic LD were greater for intrachromosome pairs than for interchromosome pairs but trigenic and quadrigenic disequilibria were similar for both intra and interchromosome pairs.
Table 7. The mean, minimum and maximum of power estimates^{*}of the test statistics for digenic, trigenic and quadrigenic disequilibria obtained for nine combinations of minor allele frequency (MAF) categories at each of the two loci (MAF_{A}and MAF_{B}) for all syntenic (intrachromosome) SNP pairs over 29 chromosomes and for all nonsyntenic (interchromosome) SNP pairs over 406 [29(291)/2] chromosome pairs in the Kinsella composite beef population
The power estimates of chisquare tests for digenic, trigenic, and quadrigenic disequilibria increased with the increasing MAF at both loci, whereas those for zygotic LD decreased with the increasing MAF (Table 7). For example, for intrachromosome pairs, the estimates of power for composite LD increased from 0.389 when MAF at both loci were less than 0.1 to 0.512 when MAF at both loci were above 0.3; in contrast, the power estimates for zygotic LD decreased from around 0.3 when MAF at both loci were below 0.30 to about 0.1 when MAF at either locus was above 0.3. In all nine MAF classes, the power estimates for trigenic and quadrigenic disequilibria were much smaller than those for digenic disequilibria. In particular, the power estimates for trigenic and quadrigenic disequilibria in most MAF classes were below 0.05, confirming the hypothesis of zero trigenic and quadrigenic disequilibria. Patterns of changes in the estimates of power for individual genic disequilibria with gene frequency were similar for intra and interchromosome pairs but the power estimates were generally smaller for interchromosome pairs than for intrachromosome pairs throughout all MAF classes.
Discussion
This study represents the first major genomewide survey of highorder genic disequilibria between three or four genes at pairs of loci in a farmed animal species. We chose the Kinsella composite beef population for such a survey because continued crossbreeding and selection for growth and cow reproduction would have made the population to stay in a HWD condition, thereby providing an excellent opportunity for uncovering the highorder genic disequilibria. The survey showed that the trigenic and quadrigenic disequilibria were generally two to three times smaller than the usual digenic disequilibria (gametic or composite LD) (Table 5). Correspondingly, there was less power of testing for these highorder genic disequilibria than for the digenic disequilibria (Tables 2 and 3). The magnitude and power of LD decreased with the distance between markers in close proximity (≤ 5 Mb) though the decay was much more obvious for the digenic disequilibria than for highorder disequilibria (Table 1; Figure 1).
The power estimates for zygotic LD and its components (Tables 2 and 3) were much higher than the expected value of 0.05 for unlinked marker pairs with a distance of >50 Mb. This expectation would be true if physical linkage between SNPs is the only cause of LD. It is well known that LD may arise from many evolutionary and demographic factors such as selection, random drift, inbreeding, mutation and population admixture and such LD can occur between linked as well as independent loci [3,11]. Goddard and Hayes [3] cited the small effective population size (N_{e}) as a major cause of LD in cattle populations. The effective population size was large (>50,000) before domestication, but was drastically declined to 1,0002,000 after domestication. In many cattle breeds, this number was further declined to approximately 100 after recent breed formation. This sharp decline in N_{e} causes some LD to exist at long distances. The strong effects of selection and population admixture expected in our crossbred cattle population would also contribute further to greater LD. It should be recognized that the effect of selection on LD involves markers or genes that are localized at certain parts of the genome and thus selectioninduced LD would have no or little relation with physical distance.
Most studies have focused on gametic LD for individual purebreed cattle populations. Our estimates of gametic LD for the Kinsella composite beef population are similar to those reported for Holstein and other purebreed cattle in the recent literature [48]. In fact our estimates are generally slightly lower from different comparisons. Khatkar et al. [4] observed the genomewide average of 0.016, in comparison to our estimate of 0.0105. Khatkar et al. [4] used a smaller set of markers (15,036 SNPs) covering all 29 autosomes but a larger number of animals (1,546). Sargolzaei et al. [5] presented values between adjacent markers with a genomewide mean of 0.31, comparing to our mean of 0.195 (detailed data not shown). However, an updated study by the same group [6] with more markers (38,590 SNPs) had a genomewide average of 0.20 which is very close to our value. VillaAngulo et al. [7] calculated values using 101 targeted highdensity regions (nonoverlapping genomic windows of 100 kb containing 10 or more markers and a maximum gap between markers of 20 kb) on QTLrich chromosomes 6, 14 and 25 to calculate values for 19 beef and dairy breeds; the mean values of ranged from 0.204 for Nelore to 0.397 for Hereford with an overall average of 0.294. Qanbari et al. [8] obtained a genomewide average of 0.30 for SNP pairs with a distance of < 25 kb for German Holstein, which is very comparable to our estimate of 0.285 for the same distance range (<25 kb) (data not shown). Since our and estimates are very similar, the above comparisons of gametic LD estimates with other studies would be applicable to composite LD estimates as well. However, no comparison on zygotic LD () was possible because other studies have not provided the estimates of zygotic LD. If such estimates of zygotic LD were available, they would have had a similar pattern as in our study. Overall, we conclude that LD useful for genomic selection most likely occurs between those markers that are adjacent or tightly linked.
It is somewhat surprising to see similar values of gametic LD in our crossbred population and in those purebred populations as described above. In the purebreed studies, the focus has been on gametic LD that is estimable often under the HWE assumption [9]. We suggest that the similarity may be due to lack of strong HWD in our beef population. In our study, significant HWD was found at a total of 4,024 (8.2%) SNPs over the genome (Table A 1). This is only slightly higher than 5%, the probability of significant HWD by chance alone. Higher HWD would be expected in our beef population because of continued crossbreeding of animals with diverse breed (genetic) backgrounds every generation. While the data quality control measures such as the removal of ~17% of SNP loci especially with MAF <2.0% and HWD chisquare values of >600 might have contributed to the reduced HWD, it could also be that after several generations of crossbreeding, certain level of genetic homogeneity might have been achieved. In other words, genetic integrity of distinct breeds is no longer clear. Thus, while our gametic LD was estimated without the HWE assumption, the Kinsella composite beef population may be closer to the HWE condition than what we have thought in the past.
To the best of our knowledge, the only other survey of highorder genic disequilibria was made by Liu et al. [13]. Using the essentially same statistical analysis as in our study, Liu et al. [13] observed that the genomewide percentages of significant composite LD, two trigenic disequilibria and quadrigenic disequilibrium were 61%, 23%, 19%, and 22%, respectively. These power estimates are clearly higher than those observed in our study (Tables 2 and 3). Several factors may contribute to the difference in genic disequilibria and their test power between the two studies. First, In order to fit the simple biallelic model for detecting individual genic disequilibria, Liu et al. [13] used the most frequent allele and a new synthetic allele consisting of all other alleles for the microsatellite markers with more than two alleles observed in [18]. The use of the synthetic allele certainly reduces the likelihood of detecting low MAF. In contrast, our less stringent threshold of MAF ≥ 2% at diallelic SNPs would allow for the presence of low MAF. Second, 148 dogs sampled from the pedigree by Liu et al. [13] were closely inbred relatives and the pedigree was established from a limited number of founders (seven greyhounds and six Labrador retrievers). Thus, strong founder effect coupled with high level of inbreeding would have caused large LD in the dog population whereas our composite beef population with a large number of founders and repeated crossbreeding should have only had a limited founder effect or inbreeding effect on LD. Third, Liu et al. [13] observed a much greater chromosometochromosome variation in individual genic disequilibria than we did in our study. Such interchromosome variation is due in part to the difference in marker density between the two studies. The limited number of markers sampled from individual chromosomes across the canine genome would make the study by Liu et al. [13] more likely to suffer from biased sampling of the genome. Fourth, with a small sample size (148 dogs) in Liu et al. [13], the tests for individual genic disequilibria must be based on a twoway contingency table with many empty cells. Due to unpredictable distributions of these empty cells in the twoway table, the genic disequilibria might have been under or overemphasized, thereby resulting in a much wider range of the power values.
Our study showed that digenic disequilibria (gametic or composite LD) were much more important than trigenic and quadrigenic disequilibria in terms of magnitudes and test power. This observation coupled with the similarity of the two digenic disequilibria (gametic and composite LD) is consistent with the low level of HWD in our beef population (only 8.2% of SNPs had a significant HWD as shown in Table A3). Under HWE, all nongametic associations would be zero and thus composite LD would equal to gametic LD and three and fourgene disequilibrium would be negligible [19,20]. Moreover, in developing the theory underlying the statistical analysis used here, Weir and Cockerham [19] assumed the random union of gametes that would have taken from an infinite large founder population so that all initial disequilibrium would be digenic (gametic LD). It is evident from Table 6.4 of Weir and Cockerham [19] that individual genic disequilibria in subsequent inbred generations decay by a rate gauged in terms of twolocus descent measures. Further numerical results (Table 6.5 of Weir and Cockerham [19]) showed that relative to the digenic disequilibria, the trigenic and quadrigenic disequilibria would be always small in a given inbred generation and they could take a long time to reach the equilibrium values. Thus, it is expected that the digenic LDs overpower the highorder disequilibria.
Our study is the first empirical evaluation of the dependence of LD on allele frequency. The power estimates of chisquare tests for digenic, trigenic and quadrigenic disequilibria increased with the increasing gene frequency at both loci, but those for zygotic LD decreased with the increasing gene frequency. Weir and Cockerham [19] used extensive computer simulations to show the similar trend for individual genic disequilibria but these authors did not consider zygotic LD. The simulation results also confirmed that the allowance for digenic disequilibria not only led to nonzero gametic or composite LD as expected, but also to nonzero quadrigenic disequilibrium as implied by the power of the chisquare test being more than 5%. It is difficult to explain exactly the trend for zygotic LD. Since zygotic LD is a complex function of individual genic disequilibria weighted by gene frequency [11,19], the combinations of gene frequencies and individual disequilibria are too numerous to identify the exact combinations of genic disequilibria at different gene frequencies for the observed trend of zygotic LD. This will certainly be an area for more research in the future.
In our study, we proposed an ad hoc measure of nonallelic associations () based on chisquare statistics for individual genic disequilibria. It is equal to the squared correlation (r^{2}) only when the chisquare statistics are calculated from a 2 × 2 contingency table [21]. We used this ad hoc measure for estimating the degree of association. More importantly the measure was also used for detecting outlier chisquare statistics by setting the range of the observed values as 0 ≤ ≤ 1 for individual genic disequilibria. Before the removal of outliers, we observed extremely large standard deviations of chisquare values over marker pairs in many frequency classes for the trigenic and quadrigenic disequilibria. Weir and Cockerham [19] noted in their simulation study a similar problem of unusually large standard deviations of chisquare values but offered no explanation about it. After the removal of outliers, we noted that all standard deviations of chisquare values fell within the normal range. Of course, we used a pragmatic and conservative approach to trim off the outlier chisquare statistics. When the value is calculated from a general I × J contingency table, its allowable range is given by 0 ≤ ≤ min[(I1), (J1)] [16]. It is already known (e.g., [12], [20]) that the gametic and zygotic LD are calculated from a 2 × 2 contingency table and thus their acceptable range should be 0 ≤ ≤ 1. However, it remains unclear of the size of the contingency tables for other genic disequilibria. Thus, more research is needed to determine appropriate contingency tables for digenic, trigenic, and quadrigenic disequilibria.
Our study has practical implications. The main result from our study is the predominance of digenic disequilibria coupled with insignificant highorder disequilibria. This result supports the current intensive effort of using gametic LD for GWAS and genomic selection in cattle and other animal species. It has been demonstrated (e.g., [3]) that gametic LD in domestic animals may occur at a long distance (> 1 cM) due to the rapid and sudden decline of population size (bottleneck effect) during domestication and at breed formation, in comparison to the situation in human where there is no gametic LD at long distances. However, it has also been suspected that such LD may be in part due to false positive associations resulted from a mixture of multiple breeds or other causes. If the breeds can be identified through pedigree information, Goddard and Hayes [3] suggested the use of breeds as a covariate in the statistical model to minimize such false positive associations. However, animals in our composite beef population had been produced through multisire breeding group natural service on pasture over 45 generations. Clearly, their breed identity is no longer available despite the inference of parent identity based on the BovineSNP50 Beadchip. So the suggestion by Goddard and Hayes [3] may not be feasible for this composite beef population. Nevertheless, since lack of highorder trigenic and quadrigenic disequilibria coupled with the evidence of insignificant HWD would indicate little nongametic multilocus correlation [19], our analysis may provide a quick, practical means of assessing the importance of the false positive associations in a multibreed population.
Conclusions
This study is the first major genomewide survey of highorder genic disequilibria between three or four genes at pairs of 50K SNP markers in a farmed animal species. The survey showed that the trigenic and quadrigenic disequilibria were generally insignificant and two to threefold smaller than the usual digenic disequilibria (gametic or composite LD). The powers of tests for these highorder genic disequilibria dropped rapidly even at a very short distance between SNPs. These results support the current intensive focus on the use of gametic LD for GWAS and genomic selection activities in the Kinsella composite beef population.
Methods
Description of animals and genotyping data
The Kinsella beef composite population has been described in our recent studies on QTL mapping, candidate gene identification and genomic selection [e.g., 1417]. Here we recapitulate the essential details of this population. It was produced by crossing between Angus, Charolais, or University of Alberta hybrid bulls and a hybrid dam line. The hybrid dam line was obtained by crossing among three composite cattle lines, namely beef synthetic 1 (SY1), beef synthetic 2 (SY2) and dairy × beef synthetic (SD) for more than 10 years after 30 years (19601990) singlesire crossbreeding. SY1 was composed of approximately 33% each of Angus and Charolais, 20% Galloway, 5% Brown Swiss, and small amounts of other breeds. SY2 was composed of approximately 60% Hereford and 40% other beef breeds mainly including Augus, Charolais and Galloway. SD was composed of approximately 60% dairy cattle (Holstein, Brown Swiss, or Simmental) and approximately 40% of other breeds, mainly including Angus and Charolais [22]. The blood samples of 1023 beef steers were collected and genotyped using the Illumina Infinium genotyping system with the BovineSNP50 Beadchip. All steers were produced from multisire breeding group natural service on pasture. The sire genotype of each calf was determined in a parentage test by using the BovineSNP50 Beadchip, but the parentage of about 100 animals remained unknown because these animals were either sires at initial crossing or sires without progeny. There were 116 sire families with varying family sizes ranging from one to 54 progeny per family. It is estimated that there have been about 45 generations since initial crossing.
A total of 51,828 SNP markers were originally obtained in the genotyped data. These markers were distributed across 29 autosomes and one sex chromosome in the entire bovine genome. For our analyses, we only used 43,124 SNPs after removing those markers (i) with monomorphism, (ii) with unknown genomic position and (iii) on the sex chromosome, (iv) with minor allele frequency (MAF) of ≤ 2% [1], and (v) with a Chisquare value >600 for the HWD test.
Components of zygotic linkage disequilibrium
For two loci, each with two alleles, A and a at locus A and B and b at locus B, there are nine possible genotypes (ten if the coupling and repulsion double heterozygotes are distinguishable). Following Yang [23], we wrote frequencies of these genotypes as, , which result from union of gametes uy and vz with uv = A or a, and yz = B or b. The genotypic frequencies at individual loci are the marginal totals of the appropriate twolocus genotypic frequencies. For example, the frequency of genotype AA is,
With the genotypic frequencies at locus A, the frequency of allele A is, and that of allele a is p_{a} = 1 p_{A}.
Departures from HWE at locus A are, and those at locus B are, .
In a random mating population, HWD disappears (i.e., D_{A} = D_{B} = 0). In a nonrandom mating population, nonzero HWD is measured by the fixation index which can be either positive when there is inbreeding or negative when inbreeding is avoided. For example, the HWD at locus A can be written as D_{A} = f_{A}p_{A}p_{a}, where , is the fixation index at locus A, with 1 ≤ f_{A} ≤ +1.
It was established [11,12] that the total zygotic LD between loci A and B could be defined in terms of zygotic LDs for individual genotypes with each zygotic LD being a complex function of digenic, trigenic and quadrigenic disequilibria. For example, the zygotic LD for double homozygote AABB () would simply be the deviation of the frequency of double homozygote from the product of the corresponding homozygotes at loci A and B
where each genic disequilibrium (D) is the deviation of a frequency from that based on random association of genes and accounting for any lower order disequilibria. The usual gametic LD () would be the deviation of frequency of gamete AB from the product of frequencies of allele A at locus A and allele B at locus B with .
When zygotes arise from random union of gametes as often assumed in most LD studies, all nongametic disequilibria including HWD would disappear (e.g., ). In this case, the zygotic LD for genotype AABB () would reduce to, .
This formula is the basis for possible use of double homozygosity to measure gametic LD in a random mating population [24,25].
Since the two types of double heterozygote (AB/ab and Ab/aB) in our unphased SNP data could not be distinguished, we used the composite LD (Δ_{AB}) and a composite quadrigenic component (Δ_{AABB}) in place of gametic and quadrigenic disequilibria. Thus, the zygotic LD for genotype AABB () in equation (1) was rewritten as
where
and
It should be noted from equations (1) and (2) that the two trigenic disequilibria in (2) were rewritten without superscripts for notational simplicity.
Maximum likelihood estimation
Following Weir and Cockerham [19] and Weir [20], we used the procedure of statistical inference based on the assumption of multinomial sampling of individual diploids from a population. The observed frequencies and disequilibria with tildes (~) were maximum likelihood (ML) estimates of corresponding parametric values. Since the additive models described earlier allowed for defining the same number of parameters as there would be degrees of freedom, the ML estimates were simply replacing all parametric values of frequencies and disequilibria with corresponding observed values. For example, the ML estimates of composite LD were simply given by,
However, the ML estimates might be biased because they would involve quadratic terms of multinomial variables. For example, the expectation of the squared gene frequency of allele A over replicate samples of size n would be,
where D_{A} is the HWD measure at locus A[20]. With the sufficiently large sample (n = 1023 animals) in our data set, we invoked largesample theory for statistical inference about genic disequilibria. Thus, we ignored the possible biases of order 1/n.
Hypothesis testing and power
With a ML estimate ( or ) of a given genic disequilibrium D or Δ, along with its sampling variance, [Var() or Var()] being given in Appendix in the Additional file 1 section, we constructed a test statistic, or
to test the hypothesis of zero disequilibrium (i.e., H_{0}: D = 0 or H_{0}: Δ = 0). Assuming the asymptotic normality of the ML estimate, X^{2} under the hypothesis of zero disequilibrium would be distributed as chisquare with one degree of freedom.
As usual, each chisquare test would commit two kinds of error: a true hypothesis may be rejected (Type I error) or a false hypothesis may not be rejected (Type II error). The probability of Type I error is measured by the significance level whereas the probability of Type II error is often related to the power of the test. Generally, as the power is the probability of rejecting a false hypothesis, it equals to one minus the probability of Type II error. In the present study, however, we adopted a different use of the power as proposed by Weir and Cockerham ([19], p. 100) and Weir ([20], p. 110): we calculated the power when the hypothesis being tested is true. In this particular case, a power value equals to the significance level.
Chisquare statistic and correlation
In the past, the squared correlation (r^{2}) has been routinely used as a measure of gametic LD (), composite LD (), or zygotic LD (). We used a chisquare statistic (i = GLDCLD or ZLG) to test for the significance of the LD estimate. It is known from the literature [21] that the relationship of would hold exactly only for a 2 × 2 contingency table. This was the case for GLD and ZLD, but not for CLD. When dropping out three and fourgene disequilibria in testing for zero composite LD (), we would obtain an approximate chisquare statistic, .
Which would be equal to as given in Weir [26]. Similar approximations or restrictions would be needed if the relationship of were desired for three and fourgene disequilibria. Thus, to avoid such approximations or restrictions, we used a generalized measure of square correlation [21] in place of r^{2} as a standardized measure of genic disequilibria. As pointed out above, the relationship of would hold only for a 2 × 2 contingency table.
Data analysis
All data analysis and required computation were done using SAS 9.3 [27]. The calculation of gametic and composite LD was carried out using PROC ALLELE of SAS/Genetics 9.3. In this calculation, the SNP marker data was read in as columns of genotypes using the GENOCOL and DELIMITER= options in the PROC ALLELE statement; gametic LD was calculated if the HAPLO= EST option in the PROC ALLELE statement was invoked, whereas composite LD was calculated if the HAPLO= NONEHWD option was specified. Zygotic LD and its components as well as hypothesis testing were calculated using SAS Macro language and SAS/IML procedure.
Abbreviations
bp, base pairs; cM, Centimorgan; CLD, Composite linkage disequilibrium; GLD, Gametic linkage disequilibrium; GWAS, Genomewide association studies; HWD, HardyWeinberg disequilibrium; HWE, HardyWeinberg equilibrium; kp, Kilo base pairs; LD, Linkage disequilibrium; MAF, Minor allele frequency; Mb, Mega base pairs; QTN, Quantitative trait nucleotide; SNP, Single nucleotide polymorphism; ZLD, Zygotic linkage disequilibrium.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
This research represents a portion of a thesis submitted by QJ to the University of Alberta in partial fulfilment of the requirements for the Master of Science degree. QJ, ZW and RCY designed the study. QJ and RCY analyzed the data and wrote the manuscript. ZW edited the manuscript. SSM provided intellectual inputs to the study. All authors read and approved the final manuscript.
Acknowledgements
We thank two anonymous reviewers for valuable comments. This research has been supported by grants from the Alberta Livestock and Meat Agency (ALMA 2008 F175R, ALMA 2007 F142R), Natural Science and Engineering Research Council of Canada (NSERC RGPIN34131207) to ZW, and the Natural Sciences and Engineering Research Council of Canada (NSERC OGP0183983) to RCY.
References

Hayes BJ, Chamberlain AJ, McPartlan H, MacLeod I, Sethuraman L, Goddard ME: Accuracy of markerassisted selection with single markers and marker haplotypes in cattle.

Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps.
Genetics 2001, 157:18191829. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Goddard ME, Hayes BJ: Mapping genes for complex traits in domestic animals and their use in breeding programs.
Nat Rev Genet 2009, 10:381391. PubMed Abstract  Publisher Full Text

Khatkar MS, Nicholas FW, Collins AR, Zenger KR, Cavanagh JA, Barris W, Schnabel RD, Taylor JF, Raadsma HW: Extent of genomewide linkage disequilibrium in Australian HolsteinFriesian cattle based on a highdensity SNP panel.
BMC Genomics 2008, 9:187. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sargolzaei M, Schenkel FS, Jansen GB, Schaeffer LR: Extent of linkage disequilibrium in Holstein cattle in North America.
J Dairy Sci 2008, 91:21062117. PubMed Abstract  Publisher Full Text

Bohmanova J, Sargolzaei M, Schenkel FS: Characteristics of linkage disequilibrium in North American Holsteins.
BMC Genomics 2010, 11:421. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

VillaAngulo R, Matukumalli LK, Gill CA, Choi J, Van Tassell CP, Grefenstette JJ: Highresolution haplotype block structure in the cattle genome.
BMC Genet 2009, 10:19. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Qanbari S, Pimentel ECG, Tetens J, Thaller G, Lichtner P, Sharifi AR, Simianer H: The pattern of linkage disequilibrium in German Holstein cattle.
Anim Genet 2010, 41:346356. PubMed Abstract  Publisher Full Text

Hill WG: Estimation of linkage disequilibrium in randomly mating populations.
Heredity 1974, 33:229239. PubMed Abstract  Publisher Full Text

Berg RT, Makarechian M, Arthur PF: The University of Alberta beef breeding project after 30 years—A review.

Yang RC: Zygotic associations and multilocus statistics in a nonequilibrium diploid population.
Genetics 2000, 155:14491458. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yang RC: Analysis of multilocus zygotic associations.
Genetics 2002, 161:435445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu T, Todhunter RJ, Lu Q, Schoettinger L, Li H, Littell RC, BurtonWurster N, Acland GM, Lust G, Wu R: Modelling extent and distribution of zygotic disequilibrium: implications for a multigenerational canine pedigree.
Genetics 2006, 174:439453. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang Z, Goonewardene LA, Yang RC, Price MA, Makarechian M, Knapp J, Okine EK, Berg TR: Estimation of genetic parameters and trends in preweaning traits of beef lines subject to phenotypic selection.

Durunna ON, Mujibi FDN, Goonewardene L, Okine EK, Basarab JA, Wang Z, Moore SS: Feed efficiency differences and reranking in beef steers fed grower and finisher diets.
J Anim Sci 2011, 89:158167. PubMed Abstract  Publisher Full Text

Mujibi FDN, Nkrumah JD, Durunna ON, Stothard P, Mah J, Wang Z, Basarab J, Plastow G, Crews DH, Moore SS: Accuracy of genomic breeding values for residual feed intake in crossbred beef cattle.
J Anim Sci 2011, 89:33533361. PubMed Abstract  Publisher Full Text

Nalaila SM, Stothard P, Moore SS, Li C, Wang Z: Whole genome QTL fine mapping for ultrasound and carcass merit traits in beef cattle using Bayesian shrinkage method.
J Anim Breed Genet 2012, 129:107119. PubMed Abstract  Publisher Full Text

Todhunter RJ, Bliss SP, Casella G, Wu R, Lust G, BurtonWurster NI, Williams AJ, Gilbert RO, Acland GM: Genetic structure of susceptibility traits for hip dysplasia and microsatellite informativeness of an outcrossed canine pedigree.
J Hered 2003, 94:3948. PubMed Abstract  Publisher Full Text

Weir BS, Cockerham CC: Complete characterization of disequilibrium at two loci. In Mathematical Evolutionary Theory. Edited by Feldman MW. Princeton University, Princeton New Jersey; 1989:86110.

Weir BS: Genetic Data Analysis II. Sinauer Associates, MA Sunderland; 1996.

Bishop Y, Fienberg SE, Holland PW: Discrete Multivariate Analysis: Theory and Practice. MIT, Cambridge, MA; 1975.

Goonewardene LA, Wang Z, Price MA, Yang RC, Berg RT, Makarechian M: Effect of udder type and calving assistance on weaning traits of beef and dairy × beef calves.
Livest Prod Sci 2003, 81:4756. Publisher Full Text

Yang RC: Epistasis of quantitative trait loci under different gene action models.
Genetics 2004, 167:14931505. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sabatti C, Risch N: Homozygosity and linkage disequilibrium.
Genetics 2002, 160:17071719. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yang RC: Gametic and zygotic associations.
Genetics 2003, 165:447450. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Weir BS: Inferences about linkage disequilibrium.
Biometrics 1979, 35:235254. PubMed Abstract  Publisher Full Text

SAS Institute Inc: SAS User’s Guide Version 9.3. Cary. SAS Institute Inc, NC, USA; 2010. PubMed Abstract  Publisher Full Text  PubMed Central Full Text