Abstract
Background
Linkage disequilibrium (LD) plays a fundamental role in population genetics and in the current surge of studies to screen for subtle genetic variants affecting complex traits. Methods widely implemented in LD analyses require samples to be randomly collected, which, however, are usually ignored and thus raise the general question to the LD community of how the nonrandom sampling affects statistical inference of genetic association. Here we propose a new approach for inferring LD using a sample unrandomly collected from the population of interest.
Results
Simulation study was conducted to mimic generation of samples with various degrees of nonrandomness from the simulated populations of interest. The method developed in the paper outperformed its rivals in adequately estimating the disequilibrium parameters in such sampling schemes. In analyzing a 'case and control' sample with βthalassemia, the current method presented robustness to nonrandom sampling in contrast to two commonly used methods.
Conclusions
Through an intensive simulation study and analysis of a real dataset, we demonstrate the robustness of the proposed method to nonrandomness in sampling schemes and the significant improvement of the method to provide accurate estimates of the disequilibrium parameter. This method provides a route to improve statistical reliability in association studies.
Background
Linkage disequilibrium (LD) has long been one of the central topics in evolutionary and population genetics. Linkage disequilibrium refers to nonrandom association of alleles at different linked or unlinked loci in a population. Inference about LD provides useful information for distinguishing between alternative evolutionary models of genetic polymorphisms within or divergence between populations [1]. The current surge of population based association studies has reported identification of causal genetic variants of disease susceptibilities in humans [2] and complex genetic variation in plants and animals [3,4]. The kernel of these studies is inference of LD between the genetic variants and functional loci that are closely genetically linked. Thus, adequate prediction of LD is obviously crucial for reliability and accuracy of these studies.
The coefficient of LD between two biallelic loci is defined as D = f_{AB } f_{A }f_{B }in a randomly mating population, where f_{AB}, f_{A }and f_{B }are frequencies of gametes AB, alleles A and B in the population. The genetic parameter has been reparameterized into different forms for various purposes of LD analysis [5]. Hill proposed the well known "chromosome counting" method to estimate the parameter by using data of genotypes at the two loci from a random sample [6]. It has been widely used in population genetic analyses and studies on linkage disequilibrium based mapping [7,8]. The principle of the analysis has also been employed to develop widely used methods for predicting haplotypes of DNA markers in natural populations [9,10]. In practice, however, it is very rare that the samples for LD analyses are truly randomly collected from the population under study. For example, the samples used in many association studies or population genomics analyses were so collected that the frequencies of some genotypes are artificially inflated to ensure that genotypes involving a rare allele are well represented [11,12]. Weir and Cockerham explored the consequences of implementing the method to estimate LD by using the samples in which some genotypes are missing and stressed that the method should not be used to estimate the parameter from nonrandom samples [13], but neither appropriate theory nor method has been developed for inferring population disequilibrium from nonrandom samples.
In this paper, we develop a new method to calculate the maximum likelihood estimate of the coefficient of linkage disequilibrium between genes at any pair of polymorphic loci in any randomly matting population by making use of samples of genotype data, which are nonrandomly collected from the population of interest. On the basis of simulation studies, we demonstrate that bias in estimates of the disequilibrium parameter from Hill's method arising from the use of nonrandom samples can be substantially reduced by implementing the new method. We compared analyses of a 'case and control' dataset of βthalassemia using three different methods: Hill's, haplotype prediction by the computer software PHASE2.1.1 and the method developed in the present study.
Results
We developed a likelihoodbased statistical approach to estimate the coefficient of linkage disequilibrium between a pair of polymorphic loci in a natural population and to test for significance of the disequilibrium by making use of the samples that are not randomly collected from the population. The method uses information from the conditional distribution of genotypes at one locus given genotypes at the other in formulating the statistical analysis with nonrandom samples. This is in contrast to the approach proposed by Hill [6], which relies on information of a joint distribution of genotypes at the polymorphic loci and estimates the disequilibrium parameter from using the samples randomly collected from the population under question. Hill's method has been extended or converted into various forms/approaches that are widely used in the current surge of genetic association studies and population genetic analyses [11,12,1418]. Here, our analysis is focused on Hill's method (H) and the method (L) developed in the present study.
To explore the adequacy of the methods H and L in estimating LD and their statistical power in detecting LD, we first conducted a simulation study to generate samples collected from the simulated population by sampling with various degrees of nonrandomness. We then implemented the methods to analyze both the simulated datasets and the real data of 20.693 kb DNA sequence surrounding the βglobin gene from a 'case and control' study with βthalassemia [19].
Simulation Study
The study considered three schemes of sampling individuals from a simulated population in which the distribution of genotypes at the marker and disease loci was as described by Table 1 for any given set of simulation parameters p, q and D. Sampling Scheme I involved n individuals being sampled completely randomly from the simulated populations. Sampling Scheme II generated the samples in which either individuals with a specific marker genotype or individuals with a specific markerdisease genotype were missing. Finally, Sampling Scheme III mimicked the generation of the casecontrol samples used in most association studies, in which the cases and controls were present in the sample in equal proportions. In addition, we explored influence of various 'case and control' proportions on statistical power of the 'case and control' design for detecting LD. The computer programs designed for simulating linkage disequilibrium between two biallelic loci can be found elsewhere [20], and were modified to generate the samples in the present study.
Table 1. Conditional probability distribution of disease genotypes for a given marker genotype
Table 2 lists the means and standard deviations of 1,000 estimates of the linkage disequilibrium coefficient from Hill's method (Method H) and the method proposed in the present study (Method L) by making use of random samples of 200 individuals from 12 different simulated populations (Sampling Scheme I). Allele frequencies at the marker and disease loci were directly calculated from the samples. Both methods adequately estimate the disequilibrium parameters although the estimates from Method L may have slightly smaller standard deviations than those from Method H.
Table 2. Prediction of Sampling Scheme I
In Sampling scheme II, individuals were randomly sampled from 6 simulated populations but those with either a specific marker genotype (n_{i• }= 0, i = 1, 2, 3 corresponding to MM, Mm or mm) or a markerdisease genotype (n_{ii }= 0, i = 1, 2, 3 corresponding to MMAA, MmAa or mmaa) were excluded in the sampling process. The sampling continued until the required number of individuals were obtained (n = 200). From the samples so obtained, we calculated the allele frequencies, p and q, and then estimated D by use of both methods (H and L). Means and standard deviations of the D estimates, based on 1,000 simulations, are summarized in Table 3. In most cases, the two methods accurately estimate the disequilibrium parameters. However, when individuals with heterozygous genotypes at the marker locus were absent (n_{2• }= 0) in the sample, Method H underestimated the parameters in the simulated populations 35. In contrast, Method L estimated the parameters adequately in these cases. In the cases where individuals with a specific markerdisease genotype were not present in the sample, Method H severely underestimated the disequilibrium, for instance, when n_{11 }= 0 in simulated population 4 and n_{33 }= 0 in simulated population 3. The biased estimates were substantially improved by making use of Method L. One may argue that different performance of the two methods could be due to the limited sample size (n = 200). We compared the methods with using much larger sample sizes (n = 400, 800) and observed that the same pattern in mean estimates of the disequilibrium parameter as demonstrated in Table 3 with n = 200 (Additional file 1). Another question that may naturally rise from this sampling scheme is how much genotypic distribution at the simulated marker and disease loci deviates from that expected under HardyWeinberg equilibrium. The significance test shows that these genotype distributions in the samples generated from the sampling schemes highly significantly deviate from the HardyWeinberg expectations (Additional file 2), revealing remarkable nonrandomness of the samples with respect to the populations from which these samples were collected.
Additional file 1. Table S1 Simulation of Sampling Scheme II using larger sample sizes.
Format: DOC Size: 101KB Download file
This file can be viewed with: Microsoft Word Viewer
Additional file 2. Table S2 HardyWeinberg equilibrium test.
Format: DOC Size: 50KB Download file
This file can be viewed with: Microsoft Word Viewer
Table 3. Prediction of Sampling Scheme II
Sampling scheme III considered the scenario that the disease allele had a low frequency but generated an equal number of the case individuals (i.e. either a homozygote or a heterozygote of the disease allele) and the controls (i.e. a homozygote of the wild type allele). To mimic the sampling scheme, we randomly generated a given number of 'case' or 'control' individuals for each of seven simulated populations. From these individuals, 100 'case' and another 100 'control' individuals were randomly collected, making a constant sample size of 200. The samples so generated make a severely nonrandom collection of the cases but they present a typical example of the samples widely used in many current genetic association studies with a casecontrol design in which roughly an equal number of sporadic case individuals and control individuals were collected from the population under question [11,1419]. Use of the samples so collected raised a question of how the allele frequency parameters, p and q, can be calculated and used to estimate the disequilibrium parameter, D. We proposed to calculate p, the marker allele frequency, from the control subsamples, and explored two alternative ways to obtain the value of q, the disease allele frequency. Firstly, q was obtained from an independent population survey such as a prior epidemiological study or population survey. To assimilate this scenario, we used the stimulated value of q in estimation of D. Secondly, we explored the use of q values directly estimated from the casecontrol sample in estimation of D.
Table 4 compares the means and standard deviations of the LD estimates, D, using Methods H and L on the data from the 1,000 casecontrol simulations. Although the disease allele frequencies (q) were obtained from true value of simulated parameter (i.e. from a population survey in practice), estimates of the disequilibrium coefficient using Method H () were seriously biased from their corresponding true values, sometimes even outside the theoretical boundaries for the parameters. In contrast, the Method L estimated the disequilibrium parameters (D_{L}) adequately in all seven simulated situations. When the disease allele frequencies were estimated directly from the casecontrol samples and incorporated into the algorithms to estimate the parameter D, both the methods yielded biased estimates of the simulated parameters. However, the estimates from Method L deviate less from the corresponding true values than those from Method H. It should be noted that Method H produced biased estimates from the casecontrol samples and the bias was more severe when the population frequency of the disease allele was used than when the frequency was directly estimated from the casecontrol samples. An detailed examination found that this was because, in the former, Method H always produced the estimates of the haplotype frequency, , which were outside their theoretical boundaries.
Table 4. Prediction of Sampling Scheme III
To explore influence of population size of the casecontrol samples on estimation and detection of the disequilibrium, we calculated estimates of the disequilibrium coefficient and the corresponding LOD scores using Method L in the casecontrol samples with an equal proportion of 'case and control' individuals but different total sample sizes. Table 5 summarizes the means and standard deviations of 1,000 repeated estimates of the disequilibrium coefficients and the corresponding LOD scores. It can be seen that the disequilibrium parameters are estimated adequately from the method by use of the casecontrol sample with a size as small as only 100 individuals. The LOD scores and the test statistic for significance of the disequilibrium, increase as the sample size increases.
Table 5. Prediction from casecontrol samples with various sample sizes
Table 6 lists the means and standard deviations of 1,000 repeated estimates of the disequilibrium coefficient and the corresponding LOD score values from the casecontrol samples of 200 individuals but with varying proportions of the cases and controls. Variation in the proportion of case individuals in the sample does not show observable influence on adequacy of estimates of the disequilibrium parameter but the LOD score values constantly increase with the increased proportion of controls.
Table 6. LD estimation from case and control samples with varying proportions
Analysis of βthalassemia dataset
βthalassemia is an autosomal recessive hemoglobinopathy caused by mutations in the βglobin (HBB) gene (OMIM 141900). The disorder is one of the most common inherited hemoglobinopathies in the world, with estimates of carrier frequencies ranging from 3 to 10% in some areas of the tropics and subtropics including southern China [19,21,22]. A frame shift mutation in codons 41 and 42, a 4bp deletion (CTTT), of the human βglobin gene represents the most common βthalassemia mutations in East and Southeast Asia. The population frequency of the deletion is as high as 3% in South China [19]. To survey the distribution of linkage disequilibrium among the polymorphic sites surrounding the βglobin gene, Zhang et al collected a sample of 40 Chinese individuals, including 16 β^{CD41/42 }thalassemia heterozygotes and 24 normal individuals [19]. They directly sequenced a 15.933kb DNA region spanning 20.693 kb of the βglobin cluster surrounding the deletion and detected 50 biallelic sites in the sequenced region. All individuals in the sample were genotyped at the polymorphic markers. This dataset represents a typical example of the selected sample in which disease carriers are deliberately enriched and no homozygote of the disease allele (i.e. at the deletion locus) is present in the sample.
To evaluate linkage disequilibria across the sequenced region, the haplotypes in the sample were first predicted by use of the computer software PHASE2.1.1 developed by Stephens et al [10], and the predicted haplotypes were then used to calculate the coefficients of linkage disequilibrium. We implemented the haplotypebased method and the two methods (H and L) to analyze the sequence data and calculated the coefficient of linkage disequilibrium between the disease mutation, which was set as position 0, and each of the other polymorphic sites. We used the population frequency of 3% for the disease causing allele in the analysis and illustrate the distribution of estimates of the disequilibrium between each of the polymorphic sites and the disease causing mutation in Figure 1(a) together with the lowest and highest bounds for the disequilibrium parameter which were inferred from the maker allele frequency estimates and the population frequency of the disease causing allele. It can be seen that the LD estimates from the present method could differ substantially from the other two methods. Importantly, the LD estimates by the former lie properly between the corresponding theoretical bounds whilst the estimates by the latter two may be severely out of the bounds, suggesting serious bias in the estimates to the true parameters. Figure 1(b) shows the LOD score values for the LD estimates from the three methods and demonstrates that the present method provides much more likely estimates of linkage disequilibria in the vicinity of the disease causing site than the other methods under comparison. The data analysis clearly indicates that the method developed here is robust to nonrandomness of the samples by which linkage disequilibrium is statistically inferred.
Figure 1. Linkage disequilibrium analysis of the βthalassemia sample. Distribution of linkage disequilibrium between each of polymorphic sites and the βthalassemia causing mutation in a 20.693 kb region surrounding the human β globin gene. (a) Estimates of the coefficients of LD from three different methods. The dot lines represent the lowest and highest theoretical bounds of the disequilibrium parameter, which are defined by allele frequencies at the polymorphic marker and disease causing sites. (b) The LOD score values calculated for the LD estimates from the three methods.
Discussion
The past decade has witnessed great progress in highthroughput detection and genotyping of single nucleotide polymorphisms (SNPs) in the genomes of plants, animals and humans. This has stimulated tremendous interest in mapping and identifying subtle genetic variants contributing to phenotypic variation of complex traits in natural populations through detecting linkage disequilibrium maintained in the populations by close linkage between alleles at genetic polymorphic sites and at trait loci. However, a common concern of association studies is the high proportion of false positive or false negative tests of association of trait phenotype with causal genetic polymorphisms as well as the limited statistical power in detecting genuine associations [23,24]. A rich pool of literature has been focused on exploring factors that cause these problems and seeking for solutions to them. The most prominent among these is population stratification that could cause both false positive and false negative inferences of association [25].
Although it has been well established that skewed sampling from a population with a linkage equilibrium distribution of multilocus genotypes may result in spurious linkage disequilibrium [26], there has not been a comprehensive investigation of the consequences of nonrandom samples on statistical inference of LD from populations under disequilibrium. We demonstrate in the present study that the use of nonrandom samples could result in severely biased estimates of LD from the method proposed for random samples. The simulation study showed that the estimates could be so biased as to be outside the theoretical limits of the corresponding simulated values and the biases can be either up or downwards. These results indicate that the nonrandomness of the sample may result in considerable false positive or false negative inference of the disequilibrium parameter and, in turn, false positives or false negatives in association analyses in which significant degree of markerdisease association implies significant linkage disequilibrium. Instead of considering the joint probability distribution of genotypes at the marker and disease loci in estimating LD from random samples [6,13], we propose the use of the conditional probability distribution of disease genotypes given any marker genotype in developing a new method to estimate the parameter. The method avoids or effectively alleviates the influence of nonrandom presentation of any markerdisease genotype in the samples on the parameter estimation. On the basis of simulation studies, we show that the method yields equally adequate estimates of LD to the method previously proposed [6] and currently widely cited in the literature when individual genotypes are randomly sampled from the populations. However, the method confers significant improvement over the current method when the parameter estimation is made from using artificially selected samples. Methodologically, the improvement in the parameter estimation of the method developed in the present study over the current methods [6,13] can be explained by their difference in extracting information of linkage disequilibrium between the two loci of interest from the samples under study. The present method uses the conditional probability distribution of genotypes at one locus on any given genotype at the other locus as illustrated in Table 1. When the sample of study is collected in such a way that genotype(s) at one of the two loci undergo selection, for example, the casecontrol samples where genotypes at the disease locus are strongly selected, the joint distribution of genotypes at the two loci in the sample deviates greatly from the joint genotypic distribution in the population from which the sample is collected. However, the conditional distribution of genotypes at one locus given any genotype at the other selected or unselected locus in the sample remains approximately the same as that in the population. Hence the analysis based on the conditional genotypic distribution is more robust to nonrandomness of the samples to infer linkage disequilibrium in populations of interest than that based on the joint genotypic distribution.
The 'case and control' design used in many association studies to screen for genetic polymorphisms in significant association with phenotypic variation probably illustrates the most popular example of LD analysis with nonrandom samples [11,12,1418]. Although the base populations under investigation may be randomly mating with respect to genotypes at markerdisease loci, the samples used in the studies are collected so that case individuals are well represented and thus create severely nonrandom presentation of the populations. Given that the analysis of the samples is to infer the situation in the base populations, accurate inference of LD from the samples is obviously crucial for reliability of these analyses. The βthalassemia data analysis in the present study represents a typical example of such studies. The analysis with the dataset, although very limited in sample size, highlights the importance of adequately tackling the nonrandomness in the 'case and control' studies in order to achieve reliable assessment of LD. It clearly demonstrates the robustness of the method developed in the present study to the nonrandomness. Previous methods, which are widely implemented in the current literature of 'case and control' studies are only suitable for analysis with random samples and so can result in seriously biased inference of the disequilibrium parameter.
A practical problem that is raised to accurately infer LD from 'case and control' samples is the need for accurate estimates of allelic frequencies at both marker and putative disease loci. We propose the use of control samples to calculate marker allelic frequency because there is usually not any prior selective criterion imposed on the marker genotypes of those individuals to be included in the control samples. It is shown that accurate estimates of the allele frequency are the basis for accurate estimation of the disequilibrium parameter (Table 4), but neither the 'cases' or 'controls' separately or together are likely appropriate to estimate the allele frequency at the putative disease locus. In the scenario where epidemiological information is available for disease candidates such as the βthalassemia data considered in the present study, the allele frequency estimate from a prior population survey may be used to meet this need. Alternatively, one can search the LOD score profile (equation 1) for the MLE of the disequilibrium parameter over various q values and choose the MLE of D calculated from the q value at which the likelihood profile reaches the maximum. Figure 2 illustrates such an example with the βthalassemia data analysis and shows that the MLE of D was obtained when q took a value of ~3%, which is near to the estimate from the previous population survey [19].
Figure 2. LOD score profile. LOD score profiles for the MLE of linkage disequilibrium between the disaese causing mutant and each of two linked polymorphic markers M516 and M2046 over different values of disease allele frequency. The two markers are of 516 and 2046 bp to the disease causing mutant site.
Methodologically, the present study has been focused on the most prominent linkage disequilibrium measure, D as defined above. There are several other measures, such as D' or r^{2 }etc, frequently used in the literature of genetic association study or population genetic analysis. These latter measures are either a scaled or standardized form of the basic disequilibrium parameter D. Bias in estimate of D will be inherent to that of transformed versions of the parameter. For example, we reanalyzed the βthalassemia data by using r^{2}, which is defined as D^{2}/p(1p)q(1q) in the present notation, as the disequilibrium measure and showed the analysis in Additional file 3. The disequilibrium distribution shows an almost identical pattern to that demonstrated in Figure 1, in which D was the disequilibrium measure.
Additional file 3. Figure S1 Linkage disequilibrium measured in r^{2}.
Format: DOC Size: 104KB Download file
This file can be viewed with: Microsoft Word Viewer
One of the key properties of 'case and control' designs is its statistical power in detecting association between a genetic polymorphic marker to the phenotype of a disease trait. It has been widely accepted that level of LD is a critical factor in determining the power of 'case and control' designs to detect significance of genetic association [27,28]. Several studies have focused on investigating factors affecting statistical power of the 'case and control' design, including marker allele frequency and errors in marker genotype and trait phenotype [2931]. An equal proportion of cases and controls were proposed in these studies and implemented in many real 'case and control' experimental analyses [11,31]. The present study investigated the impact of using varying proportions of the cases and controls on the power of the 'case and control' design in detecting linkage disequilibrium and shows that, for a given size of the 'case and control' sample, increasing the proportion of the 'cases' decreases rather than increases the statistical power (Table 6). In fact, increasing the proportion of control individuals in the 'case and control' samples alleviates the effects of nonrandomness of the samples. This result indicates a need of reconsideration of the commonly accepted sampling strategy for 'case and control' study designs. We reexamined the question by implementing the chisquare based test proposed to test for significance of difference in marker allele frequency between cases and controls. The analysis was summarized as an Additional file 4 and shows that use of an equal proportion of cases and controls in the casecontrol design is favoured for a higher statistical power to detect the genetic associations. However, it has been well established that the chisquare based association test is highly vulnerable to deviation of genotypic distribution from the HardyWeinberg equilibrium [32]. Any violation to the equilibrium may result in severe type I error. Moreover, it is clear from Additional file 2 that the equilibrium does not usually hold in these samples.
Additional file 4. Table S3 Chisquare test of case and control samples with varying proportions.
Format: DOC Size: 60KB Download file
This file can be viewed with: Microsoft Word Viewer
It is well established that LD based association analysis is effective only for the genes underlying Mendelian traits or the genes with major genetic effects on polygenic traits [23,33]. In the present study we assumed that genotypes at the putative disease locus are observable as are those at the marker locus. The assumption holds for Mendelian traits but may be questionable for quantitative traits. In fact, the theoretical model and analysis developed in the present study can be incorporated into the statistical framework we previously developed for detecting and estimating linkage disequilibrium between a genetic marker and a locus affecting a quantitative trait showing continuous or dichotomous phenotypic variation [20,34,35]. The quantitative genetic model allows allelic frequencies at the marker and trait loci, the coefficient of linkage disequilibrium between the two loci, genetic effects at the trait locus and the residual variance component to be modelled and inferred. Integration of the two models enables the nonrandomness of samples to be properly accounted for in the statistical inference on the genetic parameters.
Conclusions
We demonstrated that the nonrandomness may cause seriously biased assessment of LD when using the current methods originally developed for random samples. We have developed a new approach for inferring LD from samples with various degrees of nonrandomness, and showed the significantly improved robustness of the present approach over the current methods when nonrandom samples were to evaluate LD through intensive simulation studies and analysis of a case and control sample of βthalasemia. As accurate estimation of the disequilibrium parameter is crucial for any association study, in which the case/control design represents a typical example of nonrandom samples, the present paper highlights the importance of tackling the problem of using nonrandom samples to the community of LD analysis, and in addition, provides a route to improve statistical reliability in association studies.
Methods
Inferring LD from Using Nonrandom Samples: we consider two biallelic loci M (marker locus) and A (disease locus). There are two alleles at the marker locus, M and m, with population frequencies p and 1p respectively and two alleles at the disease locus, A and a, with population frequencies q and 1q. For simplicity, but without loss of generality, we denote by D, the coefficient of linkage disequilibrium between genes at the two loci. The conditional probability (f_{ij}) of genotypes at one of two loci given a genotype at the other locus can be expressed in term of these genetic parameters when assuming random mating in the population. For example, the disease genotype distribution given any marker genotype is presented in Table 1.
Among a random sample collected from such a random mating population, let n_{ij }be the number of individuals with the jth (j = 1, 2, 3) genotype at the disease locus and the ith (i = 1, 2, 3) genotype at the marker locus. The logarithm of the likelihood function of the genetic parameters, p, q and D, given the observations n_{ij }has the form:
Although there are three parameters in the loglikelihood function, our focus is on D, the coefficient of linkage disequilibrium. We will investigate and discuss impacts of the other two parameters, p and q, on statistical inference on D. The maximum likelihood estimation (MLE) of D can be calculated from the equation
which can be solved explicitly by using Mathematica [36] and from a polynomial equation with power of 5
Mathematical forms for a_{i }(i = 0, 1, 2, ..., 5) are given in Additional file 5 (on the website of BMC Genomics). Although an analytical solution does not exist, equation (2) can be solved numerically [37]. The equation has at least one real root because a_{5 }= 4n > 0 where n is the sample size. If multiple real roots were found, we determined , the maximum likelihood estimate of D, by comparing their corresponding likelihood values. It should be noticed that a meaningful estimate of D must not be beyond its theoretical bounds {max(pq,(1  p)(1  q),min(p(1  q),(1  p)q)}. Thus, the MLEs of the parameter were chosen from possible roots of the equation (2) that lies in the theoretical bounds and corresponds to the highest likelihood. With the MLE and the likelihood function, one can test for significance of the disequilibrium parameter using the likelihood ratio test statistic
which asymptotically follows a chisquare distribution with 1 degree of freedom. It should be noticed that the above analysis is based on the conditional probability distribution of genotypes at one locus given any genotype at the other of the two loci.
Additional file 5. Mathematical forms for the coefficients of the polynomial equation (2).
Format: DOC Size: 52KB Download file
This file can be viewed with: Microsoft Word Viewer
Inferring LD from Using Random Samples: By assuming use of samples randomly collected from a randomly mating population, Hill (1974) proposed estimating the MLE of frequency of haplotype MA, say g_{11}, through an iterative procedure, which is well known as the "chromosome counting" algorithm in the literature [6]. Mano et al highlighted that the iterative algorithm may lead to local maxima and proposed conditions for their existence [38]. Weir and Cockerham [13] showed that a more robust approach to calculate the MLE of g_{11 }is to solve it from the cubic equation given below
where b_{0 }= (2n_{11 }+ n_{12 }+ n_{21})pq, b_{1 }= 2npq  (2n_{11 }+ n_{12 }+ n_{21})(1  2p  2q)  n_{22}(1  p  q), b_{2 }= 2n(1  2p  2q)  2(2n_{11 }+ n_{12 }+ n_{21})  n_{22 }and b_{3 }= 2n. With , and the MLE of g_{11}, p and q respectively, the MLE of the coefficient of disequilibrium coefficient, D, can be calculated as . Weir and Cockerham stressed that calculation of by solving the cubic equation may avoid the scenario of the iterative procedure may be diverging or converging but to the wrong roots [13]. Because b_{3 }= 2n < 0, the cubic must have at least one real root. The theoretical bounds for g_{11 }are given by {max(0, p + q  1), min(p,q)} [13]. Significance of the MLE of D can also be tested through a likelihood ratio test statistic with a form similar to equation (3). However, to evaluate the likelihood function, one needs to use the joint genotype probability distribution given by Hill [6]. It should be noticed that the above analysis is based on the joint probability distribution of genotypes at the two loci of interest.
The above analyses show that the disequilibrium parameter can be estimated and detected from the two methods. In the Results section, we explored the performance of these methods at inferring the linkage disequilibrium using a simulation study and by analysis of real data.
Authors' contributions
ZL conceived of the study and developed the theoretical analyses. MW conducted the simulation study and analyzed the βthalassemia dataset, with assistance from TJ, NJ and LW. MW and ZL wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
We thank Dr. XiangMin Xu for allowing us to share the thalassemia data analysed in the present study. We are grateful to three anonymous reviewers for their constructive comments which have helped improve presentation and clarity of an earlier version of the paper. This study was supported by a research grant from Biology and Biotechnology Research council (BBSRC, UK) and by research grants from China's National Natural Science Foundation to ZWL.
^{† }We would like to dedicate this paper to Professor Michael J. Kearsey, our mentor, for his 70^{th }birthday.
References

Lewontin RC, Kojima KI: The evolutionary dynamics of complex polymorphisms.
Evolution 1960, 4:458472. Publisher Full Text

Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Thompson D, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R, Wareham N, Ahmed S, Healey CS, Bowman R, Meyer KB, Haiman CA, Kolonel LK, Henderson BE, Le Marchand L, Brennan P, Sangrajrang S, Gaborieau V, Odefrey F, Shen CY, Wu PE, Wang HC, Eccles D, Evans DG, Peto J, Fletcher O, et al.: Genomewide association study identifies novel breast cancer susceptibility loci.
Nature 2007, 447:108793. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, Doeblay J, Kresovich S, Goodman MM, Buckler ES: Structure of linkage disequilibrium and phenotypic associations in the maize genome.
Proceedings of the National Academy of Sciences of the United States of America 2001, 98:1147911484. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, Cookson Wo, Taylor MS, Rawlins JNP, Mott R, Flint J: Genomewide genetic association of complex traits in heterogeneous stock mice.
Nature Genetics 2006, 38:879887. PubMed Abstract  Publisher Full Text

Devlin B, Risch N: A comparison of linkage disequilibrium measures for finescale mapping.
Genomics 1995, 29:311322. PubMed Abstract  Publisher Full Text

Hill WG: Estimation of linkage disequilibrium in randomly mating populations.
Heredity 1974, 33:229239. PubMed Abstract  Publisher Full Text

Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA: Additional SNPs and linkagedisequilibrium analyses are necessary for wholegenome association studies in humans.
Nature Genetics 2003, 33:518521. PubMed Abstract  Publisher Full Text

Slatkin M: Linkage disequilibrium  understanding the evolutionary past and mapping the medical future.
Nature Reviews Genetics 2008, 9:477485. PubMed Abstract  Publisher Full Text

Long JC, Williams RC, Urbanek M: An EM algorithm and testing strategy for multiplelocus haplotypes.
American Journal of Human Genetics 1995, 56:799810. PubMed Abstract  PubMed Central Full Text

Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data.
American Journal of Human Genetics 2001, 68:978989. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fung HC, Scholz S, Matarin M, SimonSanchez J, Hernandez D, Britton A, Gibbs JR, Langefeld C, Stiegert ML, Schymick J, Okun MS, Mandel RJ, Fernandez HH, Foote KD, Rodriguez RL, Peckham E, De Vrieze FW, GwinnHardy K, Hardy JA, Singleton A: Genomewide genotyping in Parkinson's disease and neurologically normal controls first stage analysis and public release of data.
Lancet Neurobiology 2006, 5:91116. Publisher Full Text

Dawson E, Abecasis GR, Bumpstead S, Chen Y, Hunt S, Beare DM, Pabial J, Dibling T, Tinsley E, Kirby S, Carter D, Papaspyridonos M, Livingstone S, Ganske R, Lohmussaar E, Zernant J, Tonisson N, Remm M, Magi R, Puurand T, Vilo J, Kurg A, Rice K, Deloukas P, Mott R, Metspalu A, Bentley DR, Cardon LR, Dunham I: A firstgeneration linkage disequilibrium map of human chromosome 22.
Nature 2002, 418:54448. PubMed Abstract  Publisher Full Text

Weir BS, Cockerham CC: Estimation of linkage disequilibrium in randomly mating populations.
Heredity 1979, 42:105111. Publisher Full Text

Scott LJ, Bonnycastle LL, Willer CJ, Sprau AG, Jackson AU, Narisu N, Duren WL, Chines PS, Stringham HM, Erdos MR, Valle TT, Tuomilehto J, Bergman RN, Mohlke KL, Collins FS, Boehnke M: Association of transcription factor 7like 2 (TCF7L2) variants with type 2 diabetes in a Finnish sample.
Diabetes 2006, 55:264953. PubMed Abstract  Publisher Full Text

The Welcome Trust Case Control Consortium: Genomewide association studye of 14,000 cases of seven common diseases and 3,000 share controls.
Nature 2007, 447:66178. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs.
Nature 2007, 449:851861. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Maris JM, Mosse YP, Bradfield JP, Hou CP, Monni S, Scott RH, Asgharzadeh S, Attiyeh EF, Diskin SJ, Laudenslager M, Winter C, Cole KA, Glessner JT, Kim C, Frackelton EC, Casalunovo T, Eckert AW, Capasso M, Rappaport EF, McConville C, London WB, Seeger RC, Rahman N, Devoto M, Grant SFA, Li HZ, Hakonarson H: Chromosome 6p22 locus associated with clinically aggressive neuroblastoma.
New England Journal of Medicine 2008, 358:258593. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene): Genomewide association study identifies new multiple sclerosis susceptibility loci on chromosomes 12 and 20.
Nature Genetics 2009, 41:82428. PubMed Abstract  Publisher Full Text

Zhang W, Cai WW, Zhou WP, Li HP, Li L, Yan W, Deng QK, Zhang YP, Fu YX, Xu XM: Evidence of gene conversion in the evolutionary process of the codon 41/42 (CTTT) mutation causing betathalassemia in southern China.
Journal of Molecular Evolution 2008, 66:436445. PubMed Abstract  Publisher Full Text

Luo ZW: Detecting linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations.
Heredity 1998, 80:198208. PubMed Abstract  Publisher Full Text

Weatherall DJ, Clegg JB: Inherited haemoglobin disorders an increasing global health problem.
Bulletin of the World Health Organization 2001, 79:704712. PubMed Abstract  PubMed Central Full Text

Xu XM, Zhou YQ, Luo GX, Liao C, Zhou M, Chen PY, Lu JP, Jia SQ, Xiao GF, Shen X, Li J, Chen HP, Xia YY, Wen YX, Mo QH, Li WD, Li YY, Zhuo LW, Wang ZQ, Chen YJ, Qin CH, Zhong M: The prevalence and spectrum of alpha and beta thalassaemia in Guangdong Province implications for the future health burden and population screening.
Journal of Clinical Pathology 2004, 57:517522. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cardon LR, Palmer LJ: Population stratification and spurious allelic association.
Lancet 2003, 361:598604. PubMed Abstract  Publisher Full Text

Balding DJ: A tutorial on statistical methods for population association studies.
Nature Reviews Genetics 2006, 7:781791. PubMed Abstract  Publisher Full Text

Pritchard JK, Stephens M, Rosenberg NA, Donnelly P: Inference of population structure using multilocus genotype data.
Genetics 2000, 155:94559. PubMed Abstract  PubMed Central Full Text

Avery PJ, Hill WG: Distribution of linkage disequilibrium with selection and finite population size.
Genetical Research 1979, 33:2948. Publisher Full Text

Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, LiuCordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome.
Science 2002, 296:22252229. PubMed Abstract  Publisher Full Text

Weir BS: Linkage disequilibrium and association mapping.
Annual Review of Genomics and Human Genetics 2008, 9:129142. PubMed Abstract  Publisher Full Text

Edwardes MD: Sample size requirements for casecontrol study designs.
BMC Med Res Methodol 2001, 1:11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Gordon D, Finch SJ: Factors affecting statistical power in the detection of genetic association.
Journal of Clinical Investigation 2005, 115:14081418. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barral S, Haynes C, Stone M, Gordon D: LRTae: improving statistical power for genetic association with case/control data when phenotype and/or genotype misclassification errors are present.
BMC Genetics 2006, 7:5. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sasieni PD: From genotypes to genes doubling the sample size.
Biometrics 1977, 53:12531261. Publisher Full Text

Yu JM, Buckler ES: Genetic association mapping and genome organization of maize.
Current Opinion in Biotechnology 2006, 17:155160. PubMed Abstract  Publisher Full Text

Luo ZW, Tao SH, Zeng ZB: Inferring linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations.

Luo ZW, Wu CI: Modeling linkage disequilibrium between a polymorphic marker locus and a locus affecting complex dichotomous traits in natural populations.
Genetics 2001, 158:17851800. PubMed Abstract  PubMed Central Full Text

Wolfram S: Mathematica. 2nd edition. AddisonWesley Publishing Co, Inc; 1991.

Riley KF: Mathematical Methods for the Physical Sciences. Cambridge Univ. Press; 1978.

Mano S, Yasuda N, Katoh T, Tounai K, Inoko H, Imanishi T, Tamiya G, Gojobori T: Notes on the maximum likelihood estimation of haplotype frequencies.
Ann Hum Genet 2004, 68:257264. PubMed Abstract  Publisher Full Text