Effectiveness of genomic selection and fine mapping is determined by the level of linkage disequilibrium (LD) across the genome. Knowledge of the range of genome-wide LD, defined as a non-random association of alleles at different loci, can provide an insight into the optimal density and location of single-nucleotide polymorphisms (SNPs) for genome-wide association studies and can be a keystone for interpretation of results from QTL mapping.
Linkage disequilibrium was measured by |D'| and r2 between 38,590 SNPs (spaced across 29 bovine autosomes and the X chromosome) using genotypes of 887 Holstein bulls. The average level of |D'| and r2 for markers 40-60 kb apart was 0.72 and 0.20, respectively in Holstein cattle. However, a high degree of heterogeneity of LD was observed across the genome. The sample size and minor allele frequency had an effect on |D'| estimates, however, r2 was not noticeably affected by these two factors. Syntenic LD was shown to be useful for verifying the physical location of SNPs. No differences in the extent of LD and decline of LD with distance were found between the intragenic and intergenic regions.
A minimal sample size of 444 and 55 animals is required for an accurate estimation of LD by |D'| and r2, respectively. The use of only maternally inherited haplotypes is recommended for analyses of LD in populations consisting of large paternal half-sib families. Large heterogeneity in the pattern and the extent of LD in Holstein cattle was observed on both autosomes and the X chromosome. The extent of LD was higher on the X chromosome compared to the autosomes.
The discovery of new SNPs and the continuous declining cost of genotyping the bovine genome are making it possible to investigate genome-wide linkage disequilibrium (LD) in detail. Linkage disequilibrium refers to a non-random association in the occurrence of alleles at two loci, which results in a higher frequency of certain haplotypes at the loci than expected by chance.
The knowledge of the extent and the pattern of LD throughout the bovine genome plays an important role in gene mapping and genome-wide association studies and is a fundamental tool for: exploring the degree of diversity among breeds of cattle, inferring the distribution of crossing-over, and identifying regions of genome that have been subject to selective sweep. A high resolution LD map can also provide beneficial information for designing SNP panels for genome-wide selection.
The two most commonly used measures of LD for bi-allelic markers are r2 and |D'|. The range of both measures is between 0 and 1. |D'| < 1 indicates that historical recombination has occurred between the loci and |D'| = 1 indicates absence of recombination between the two loci since the occurrence of one of the polymorphism. Therefore |D'| is rather an indicator of missing haplotypes than a reliable measure of LD . Moreover |D'| tends to be strongly inflated in small samples and in the presence of a rare allele . r2 represents the correlations between the two loci and r2 = 1 when only two haplotypes are present, which is usually a consequence of population bottlenecks or genetic drift . r2 is preferred for association studies, because there is a simple inverse relationship between r2 and the sample size that is required to detect the association between a QTL and the SNP .
Diploid cells of Bos taurus contain 29 homologous autosomal pairs and one pair of sex chromosomes. All autosomes are acrocentric, however, both gonosomes (X and Y) are submetacentric . The X chromosome (Chr X) is the second largest chromosome of the bovine karyotype  and is composed of two distinct regions: the pseudo-autosomal region (PAR), which is homologous to the Y chromosome and the X-specific region. In females, the cross-over can occur at any segment of the Chr X. On the other hand, in males, the cross-over is limited only to the PAR region. There is also variability in recombination rates on autosomes , which, apart from other factors, leads to considerable diversity of the LD pattern in different genomic regions. Generally, LD is higher on the Chr X than on the autosomes . The Y chromosome is one of the smallest chromosomes , its Y-specific part is gene poor and consists of repetitive sequences, which makes a physical mapping of this region difficult and consequently there is no Y chromosome map available so far for bovine .
Previous studies on LD in dairy cattle were based on microsatellites and reported high levels of LD for long distances [10-12]. Recent studies used SNPs for evaluating the extent of LD and revealed that strong LD extends for shorter distances than previously reported. Marques et al.  reported (analyzing 505 SNPs on chromosome 14 in Holsteins) moderate levels of LD (r2≥0.2) extending up to 100 kb, similar results were presented by McKay et al. (2,670 SNPs) . In the study by Khatkar et al. (9,195 SNPs) , r2≥0.2 was observed in Australian Holsteins between SNPs less than 60 kb apart. Sargolzaei et al. reported useful LD (r2≥0.2) between markers with inter-marker distance ≤ 100 kb in North American Holsteins (5,564 SNPs). There is variation in the published extent of LD because the estimates of LD strongly depends on various factors such as: history and structure of the studied population (evolutionary forces that affected the population), sample size, marker type (microsatellites or SNPs), density and distribution of markers, type of method used for haplotype reconstruction, strictness of SNP filtering (threshold of minor allele frequencies and Hardy-Weinberg equilibrium), use of maternal haplotypes only or both maternal and paternal haplotypes. The objective of this study was to investigate the extent and the pattern of LD on the autosomes and on the chromosome X of North American Holsteins.
Results and discussion
Out of 38,590 SNPs used in this study, 37,899 SNPs were spread out across the 29 bovine autosomes (BTA) and 691 SNPs were located on Chr X. The approximate boundary between the X-specific and PAR regions was located at 143.83 Mb, which is in agreement with Sandor et al. . The X-specific and the PAR regions harboured 629 and 62 SNPs, respectively. The distribution of SNP minor allele frequencies (MAF) of these SNPs was nearly uniform on both autosomes and Chr X (Additional file 1, Figure S1). The average MAF of all SNPs before editing was 0.26 and after the 4,446 SNPs with MAF < 0.05 were removed the averaged MAF increased to 0.29. The average MAF was slightly higher than the average MAF of SNPs reported for the SNPs on the Illumina BovineSNP50K BeadChip for Holstein by the Bovine Hapmap Consortium (0.25)  and by Matukumalli et al. (0.26) . The SNPs were almost uniformly distributed across the autosomes with an average inter-marker distance of 66 kb (Additional file 2, Figure S2). On the other hand, SNPs were not evenly spaced on Chr X (Additional file 3, Figure S3). There were certain regions of Chr X where adjacent SNPs were separated by more than 1 Mb (Figure 1). The average distance between the adjacent SNPs on Chr X was 215 kb.
Figure 1. Distribution of SNPs across Chr X. The dashed line depicts the boundary between the X-specific and the pseudo-autosomal regions.
Format: PNG Size: 13KB Download file
Format: PNG Size: 16KB Download file
Format: PNG Size: 16KB Download file
Analysis of decay of linkage disequilibrium
In order to examine the decay of LD with physical distance, SNP pairs on the autosomes were sorted into bins based on their inter-marker distance and average values of r2 and |D'| were calculated for each bin (Table 1 and 2). Moderate levels of r2 (r2 between 0.2 and 0.24) were observed at distances shorter than 60 kb. When moving from 60 to 500 kb, average r2 declined from 0.16 to 0.11. High variability of r2 was observed in bins with a marker distance < 1 Mb. Markers with r2≥0.3 were on average spaced by 60 kb. But not all markers with the 60 kb inter-marker distance had r2≥0.3. In the 0 to 40 kb and 40 kb to 60 kb bins, only 30 and 23% of markers had r2 ≥0.3, respectively. Figure 2 shows a scatter plot of |D'| values against distance. In contrast to decay of r2, where LD values clearly declined with distance, the changes of |D'| followed a much less steeper declining pattern. Average |D'| of markers 40-60 kb apart was 0.72 and 19% of SNP pairs with 0.5-1 Mb inter-marker distance had |D'| > 0.8 (Table 2). |D'| = 1 was observed at distances exceeding 100 MB.
Table 1. Pairwise linkage disequilibrium (r2) for syntetic SNPs at various distanced pooled over all autosomes
Table 2. Pairwise linkage disequilibrium (|D'|) of syntenic SNPs at various distanced pooled over all autosomes
Figure 2. Decay of |D'| as function of physical distance on BTA 1. The white line represents the median decline of |D'|.
On Chr X, LD was calculated separately for SNPs located on the X-specific and the PAR regions of Chr X (Table 3). Higher levels of LD were found on the X-specific region of Chr X than on autosomes. Very low levels of LD, even at very short distances (<40 kb), were reported on the PAR region. This was expected because in males, the cross-over is limited to the PAR region and consequently the rate of recombination was higher in this segment. On the other hand, the X-specific region is recombined only in gametes produced by females and consequently only two thirds of the X chromosomes recombine in each generation . Overall, LD on Chr X was greater compared to LD on autosomes. It can be explained by not only a lower recombination rate but also by a lower mutation rate and a faster genetic drift due to a smaller effective population size .
Table 3. Pairwise linkage disequilibrium (r2) for syntetic SNPs at various distanced pooled over X-specific and pseudo-autosomal (PAR) regions of the X chromosome
Table 4 shows mean r2 and |D'| of the 30 individual Bos taurus chromosomes and mean r2 and |D'| of adjacent SNPs pooled over all 29 autosomes and Chr X. The average recombination distance in bovine genome is 1.25 cM per Mb but the recombination distances decrease with the length of the chromosome and therefore the rate of recombination increases with the length of the chromosome . This suggests that in general, LD of longer chromosomes will extend for shorter distances and consequently such chromosomes will have lower overall LD than shorter chromosomes. In this study, certain chromosomes had higher LD than others but there was no relationship between the length of the chromosome or the average spacing of SNPs and the extent of the LD. The average r2 ranged from 0.17 (BTA27, BTA28 and BTA 29) through 0.25 (BTA 7 and BTA14) and |D'| ranged from 0.59 (Chr X) through 0.73 (BTA1, BTA 3 and BTA 8). The average MAF per chromosome ranged from 0.28 to 0.30.
Table 4. Adjacent linkage disequilibrium (r2 and |D'|) on 29 autosomes (BTA)
Linkage disequilibrium of intragenic (within genes) and intergenic pairs of SNP was investigated. Kim et al.  reported a higher LD and a higher variation in intragenic regions than in intergenic regions. In this study, there were no differences in the extent of LD and the decline of LD with the distance between intragenic and intergenic regions (Table 5).
Table 5. Linkage disequilibrium (r2) of intragenic and intergenic pairs of SNP
When the decay of r2 with a distance was plotted separately for each chromosome, several SNP pairs on BTA1, BTA6, BTA 26 showed higher r2 than expected (Figure 3). This was a possible indication of misplaced SNPs on those chromosomes. In order to identify those likely misplaced SNPs, a simple algorithm based on investigation of the decay of LD with distance was developed. The distance between each of the 38,590 SNPs and a SNP with which it had the largest r2 was recorded. If this distance was larger than 10 MB, the particular SNP was flagged as a possibly misplaced SNP and it was further investigated. This algorithm detected 223 misplaced SNPs. The majority of the misplaced SNPs were on BTA 1 (27), BTA 6 (67), BTA 26 (18) and Chr X (55). The correct location of these SNPs was approximated by assuming that the SNP was located between two markers with the highest pairwise r2 with the SNP. Additional file 4, Figure S4 shows the decay of LD with a distance of using their original location and the corrected position of the misplaced SNPs. The smoother pattern of the declining trend of r2 with a distance in plots where the position of the SNP was corrected suggests that our simplified approximation of the SNP location has a reasonable accuracy. Also the pattern of the decay of LD on BTA1, BTA6, BTA 26 acquired the expected shape after the position of misplaced SNPs was corrected (Figure 3). Our simple algorithm identified 0.57% of SNPs as misplaced. It is very likely that there could be other SNP with incorrect location, especially on BTA 14 and Chr X, where the pattern of LD was slightly erratic in certain regions even after the corrections. This is because the algorithm was not able to identify SNPs that were misplaced by only a few MB, due to the high heterogeneity of LD at short distances.
Figure 3. Decay of r2 as function of physical distance. The plots represent pairwaise LD on chromosomes 1, 6, 26 and X before and after (corr) positions of the misplaced SNPs were corrected.
Additional file 4. Decay of linkage disequilibrium (r2) with distance of misplaced SNPs (before and after correcting SNP location). The titles indicate chromosome number, name and physical location of the misplaced SNP. The even figures represent a decline of LD when location of the misplaced SNP was corrected.
Format: PDF Size: 4.1MB Download file
This file can be viewed with: Adobe Acrobat Reader
Paternal versus maternal haplotypes
Using only maternally inherited haplotypes is a common practice in studies investigating LD with the sample population consisting of large sire half-sib families. The reasoning is that pedigree structure leads to overrepresentation of the paternal haplotypes in the sample, because sires have multiple progeny in the data set, which inflates frequencies of certain haplotypes and consequently leads to overestimation of LD. In order to test this hypothesis and to quantify the size of the bias of LD, |D'| and r2 were calculated using maternal, paternal and both haplotypes. First of all, differences in haplotype frequencies between maternal and paternal haplotypes were tested by a χ2 test. The frequencies of maternal haplotypes were assumed to be the true expected values and they were contrasted to the frequencies of paternal haplotypes. Using the χ2 test, 82% of SNP pairs were identified to have significantly different (P < 0.01) haplotype frequency in maternal than in paternal haplotype. This could be due to the previously mentioned pedigree structure of our sample or it could be a result of different selection pressure in the sire of sires pathway compared to the dam of sires pathway.
Figure 4 compares the decay of linkage disequilibrium with distance on BTA 1 when r2 was computed from maternal, paternal or both haplotypes. It is quite apparent from these three diagrams that LD decayed slower in paternal haplotypes. Moreover, 49 SNP pairs with 50-100 MB inter-marker distance had noticeably inflated values of paternal r2 (r2 > 0.25). Considering both haplotypes, the negative effect of paternal haplotype was reduced, but a certain bias was still present. Therefore to avoid a possible bias in the estimates of LD due to the pedigree structure, use of only maternal haplotypes is recommended.
Figure 4. Comparison of decay of linkage disequilibrium with distance (r2) on BTA1 using only maternal, paternal or both haplotypes. The white line represents the average decline of linkage disequilibrium.
Non-syntenic linkage disequilibrium
Linkage disequilibrium was calculated for all possible combination of pairs of SNPs located on different chromosomes. None of the non-syntenic SNP pairs had r2 > 0.25, which could have occured due to selection or if one of the SNPs was located on a wrong chromosome. 1,444,163 non-syntenic SNP pairs had |D'|≥0.8 but because 94% of these SNPs had MAF < 0.10, it suggests that low MAF largely affected the estimates of |D'|. Low MAF increases chances of a missing haplotype and if at least one haplotype is missing |D'| = 1. On the other hand, SNPs with rare alleles did not have inflated values of r2. This is consistent with previous studies [15,16]
Effect of sample size and MAF on distribution of r2 and |D'|
As illustrated in Figure 5, the sample size did not have a large effect on r2 values, however, an upward bias was observed in |D'|. This bias increased as the size of the subsample decreased and as the distance between SNPs increased. With the 55 sample size, |D'| was higher by 7% in the 0.5-1 MB distance than with the full data set. Samples with more than 444 bulls had a minimal bias of |D'|. Estimates of r2 were not influenced by the sample size when at least 55 animals were used for the calculation. Samples with only 22 animals resulted in an overestimation of r2. These results indicate that a sample size of at least 444 and 55 animals is necessary for an accurate estimation of |D'| and r2, respectively. This is in agreement with Khatkar et al.  who suggested a minimal sample size of 400 and 75 for estimation of |D'| and r2, respectively.
Figure 5. Distribution of average pairwise r2 and |D'| by sample size and distance. The number in parenthesis in the legend indicates the sample size of each subset.
Figure 6 shows residual |D'| as a function of average MAF. Residual |D'| was the highest at a low MAF, suggesting that |D'| is overestimated at low MAF. Bias of |D'| was small (0-0.05) at intermediate MAF (0.2-0.4). Residual |D'| was negative at a high MAF, indicating that |D'| is underestimated at high MAF. This is not a surprising finding, because the denominator in the formula of |D'| is equal to a minimal product of SNP allele frequencies. In SNP pairs with low allele frequency, D will be in the formula divided by a small number, resulting in a large |D'|. The opposite is true for pairs with a high allele frequency. Residual r2 was on the other hand independent of the average MAF.
Figure 6. Average residual |D'| as a function of average MAF of a SNP pair.
Patterns of linkage disequilibrium
Non-monotonic erratic trend of average r2 was observed in each of the studied 30 chromosomes using the sliding window approach (Figure 7). The erratically high r2 in some sliding windows of autosomes were not caused by a low number of SNPs per window or due to high number of SNPs in the window with MAF < 0.1 probably created by a variation in recombination rates within chromosomes. On the other hand, there were several windows on Chr X that did not even have the required 20 pairs of SNPs.
Figure 7. Average r2 per sliding window (scatter plot), number of SNPs per window (bottom bar plot) and frequency of SNPs with MAF < 0.1 per window (top bar plot).
Chromosomes contain local hot spots and cold spots that undergo quite different rates of meiotic recombination and these spikes in r2 could indicate cold spots. Bovine autosomes are acrocentric and both X and Y chromosomes are subemetacentric . Centromeres are special structures of eukaryotic chromosomes that hold sister chromatids together and ensure proper chromosome segregation during cell division. Centromere suppresses meiotic recombination within itself and also in the proximal chromosomal region. On the other hand, regions next to the telomeres (terminal ends of chromosomes) increase recombination rates . Values of r2 were not noticeably higher at the very beginning or end of any autosomes compared to the middle of the chromosome, as was expected since centromeres are located in Bos taurus on telomeres. Therefore this finding suggests that the effect of the centromere and telomeres on recombination rate cancelled one another out. Chr X has centromere in the middle of the chromosome but because there were not enough informative windows in this region, it was not possible to investigate the effect of centromers on the extent of LD of this chromosome. The telomeric parts of the chromosome did not have a noticeable higher r2 than the rest of the chromosome.