Skip to main content
  • Methodology article
  • Open access
  • Published:

Generalized linear mixed model for segregation distortion analysis

Abstract

Background

Segregation distortion is a phenomenon that the observed genotypic frequencies of a locus fall outside the expected Mendelian segregation ratio. The main cause of segregation distortion is viability selection on linked marker loci. These viability selection loci can be mapped using genome-wide marker information.

Results

We developed a generalized linear mixed model (GLMM) under the liability model to jointly map all viability selection loci of the genome. Using a hierarchical generalized linear mixed model, we can handle the number of loci several times larger than the sample size. We used a dataset from an F2 mouse family derived from the cross of two inbred lines to test the model and detected a major segregation distortion locus contributing 75% of the variance of the underlying liability. Replicated simulation experiments confirm that the power of viability locus detection is high and the false positive rate is low.

Conclusions

Not only can the method be used to detect segregation distortion loci, but also used for mapping quantitative trait loci of disease traits using case only data in humans and selected populations in plants and animals.

Background

Segregation distortion refers to a phenomenon that the observed genotypic frequencies deviate significantly from the expected Mendelian frequencies [1]. Different populations have different Mendelian ratios, e.g., the typical Mendelian ratio for an F2 population is 1:2:1 for the three genotypes A1A1: A1A2: A2A2. Many reasons can explain the observed distortion [2–7]. The most promising explanation is viability selection on the distorted markers or loci linked to the markers [8]. In genetic mapping for quantitative traits, the basic assumption is Mendelian segregation [9]. Therefore, distorted markers are usually discarded prior to QTL mapping because people usually fear unexpected consequences of distorted markers on the results. In a recent study [10], we found that segregation distortion is not necessarily harmful to QTL mapping; rather, it can help in some circumstances. Consequently, we can incorporate segregation distortion into existing QTL mapping programs [11].

It appears that segregation distortion is common rather than rare. If segregation distortion is indeed caused by viability selection loci, these loci themselves are of interest because they may help to understand the mechanism of natural selection and evolution. Chi-square tests are commonly used to test segregation distortion. Fu and Ritland [12] and Lorieux et al. [13] developed maximum likelihood methods to map segregation distortion loci. The methods are interval mapping approaches in which one distortion locus is tested at a time. Vogl and Xu [14] used an MCMC implemented Bayesian algorithm to detect multiple segregation loci simultaneously. These methods are quite different from the usual QTL mapping procedures in quantitative trait genetic mapping. Luo and Xu [15] first developed an expectation and maximization (EM) algorithm for mapping viability selection loci. This method takes advantage of the well known EM algorithm in interval mapping. Recently, Luo et al. [16] developed a quantitative genetic model to map viability loci. The authors postulated a hidden underlying liability for each individual. The liability is an unobserved quantitative trait and natural selection acts on the liability. The method of Luo et al. [16] actually maps loci controlling the hidden liability (a quantitative trait). Therefore, methods of QTL mapping and viability locus mapping have been unified into the same framework of interval mapping. Both methods are called QTL mapping, but the traits mapped are different, the former maps observed quantitative traits and the latter maps unobserved liability.

The quantitative genetic model of Luo et al. [16] is an interval mapping approach. The state-of-the-art QTL mapping procedure is the Bayesian shrinkage method [17–19] because it simultaneously evaluates the entire genome. It is natural to extend the Bayeisan shrinkage method to map multiple viability loci. The Markov chain Monte Carlo (MCMC) algorithm is commonly used to implement the Bayesian method. Such a sampling based method is time consuming. A fast version of the Bayesian method is the empirical Bayesian method [20] where the variance components in the prior distributions of QTL effects are first estimated from the data and then used as the priors to estimate the QTL effects under the general Bayesian framework. This method is essentially the linear mixed model approach. When applied to discrete traits, the method is called the generalized linear mixed model [21, 22].

Numerous algorithms have been developed to implement the generalized linear mixed model. The pseudo likelihood algorithm [23–25] appears to be the most popular one. The method requires a normal transformation of the original data point using the first step Newton-Raphson update. Once the data points are normally transformed, they are treated as normal quantitative phenotypes. The usual linear mixed model applies to the transformed data points. The difference between the Newton-Raphson transformation and the data transformation commonly seen in data analysis is that the Newton-Raphson transformation is a function of the data point and parameters while the usual data transformation is a function of the data point only. Therefore, the Newton-Raphson transformation is required for each cycle of the iteration process.

It is not clear how to use the pseudo likelihood approach to mapping viability loci because there is no phenotypic data point to transform. However, the method of McGilchrist [26] for generalized linear mixed model can be applied here. This method only requires a linear predictor, a likelihood and a prior distribution for each effect in the linear predictor. In this study, we used the McGilchrist's [26] method to perform parameter estimation.

Method

Liability model and viability selection

Let us define a continuous variable y j as the liability for individual j,

y j = X j β+ ∑ k = 1 p Z j k γ k + ε j
(1)

where ε j ~ N(0,1) is a residual error with a standardized normal distribution. Other model effects are defined as follows. There may be some effects not related to genetics, such as age, location and other systematic effects, and these effects are captured by β and the design matrix X. There are p genetic loci each with an effect γ k for k = 1, ..., p. The value of Z jk is determined by the genotype of individual j at locus k. For example, an F2 individual derived from the cross of two inbred lines can take one of three genotypes, A1A1, A1A2 and A2A2. Under the additive genetic model, Z jk is defined as

Z j k = + 1 0 - 1 for for for A 1 A 1 A 1 A 2 A 2 A 2
(2)

and γ k = a k is the additive genetic effect for locus k. Under the dominance effect model, the genetic effect for locus k is a 2 × 1 vector γ k = [a k d k ]T, where d k is called the dominance effect. The corresponding Z variable is also a vector and defined as

Z j k = H 1 H 2 H 3 for for for A 1 A 1 A 1 A 2 A 2 A 2
(3)

where H i is the i-th row of matrix H, as shown below,

H= + 1 - 1 0 + 1 - 1 - 1
(4)

The liability y j is not observed but it determines the viability of individual j. It is assumed that individual j will survive if y j > 0 and die otherwise. Since we can only observe the surviving individuals, all individuals in the sample have liabilities greater than zero. This will cause the selected population to deviate from the expected Mendelian segregation ratio for loci responsible for viability selection and all loci linked to the viability loci. Although all individuals have survived, some may have a high liability and some may have a low liability, but all have a liability greater than zero. We now use the concept of penetrance to describe the survivability of an individual. Let

E ( y j ) = η j = X j β+ ∑ k = 1 p Z j k γ k
(5)

be the expectation of the unobserved liability (a linear predictor). We use the normal or the logistic function to model the probability of survival for individual j, i.e., Φ(η j ) or logistic(η j ) = exp(η j )/[1 + exp(η j )]. Conditional on the genotypes of all other loci, the penetrances for the three genotypes of locus k are defined as

Φ ( H 1 γ k + η j ( - k ) ) Φ ( H 2 γ k + η j ( - k ) ) Φ ( H 3 γ k + η j ( - k ) ) for for for G j k = A 1 A 1 G j k = A 1 A 2 G j k = A 2 A 2
(6)

where

η j ( - k ) = X j β+ ∑ k ′ ≠ k p Z j k ′ γ k ′
(7)

is the linear predictor excluding locus k. This model was first introduced by Luo et al. (2005) for single locus analysis, which does not include η j(-k) in equation (6). The data that allow us to estimate γ k is the genotype array for all individuals at locus k. Define

w j = w j ( 1 1 ) w j ( 1 2 ) w j ( 2 2 )
(8)

as a multivariate Bernoulli variable with three categories (i.e., a multinomial variable with sample size one). If individual j has a genotype A1A1, then wj(11)= 1 and wj(12)= wj(22)= 0. The probabilities of individual j taking the three genotypes are derived from the Bayes' theorem,

π j ( 1 1 ) = 1 π ̄ j ϕ 1 1 Φ ( H 1 γ k + η j ( - k ) ) π j ( 1 2 ) = 1 π ̄ j ϕ 1 2 Φ ( H 2 γ k + η j ( - k ) ) π j ( 2 2 ) = 1 π ̄ j ϕ 2 2 Φ ( H 3 γ k + η j ( - k ) )
(9)

where

π ̄ j = ϕ 1 1 Φ ( H 1 γ k + η j ( - k ) ) (1) + ϕ 1 2 Φ ( H 2 γ k + η j ( - k ) ) (2) + ϕ 2 2 Φ ( H 3 γ k + η j ( - k ) ) (3) (4)
(10)

is the mean of the three penetrances and

Ï•= Ï• 1 1 Ï• 1 2 Ï• 2 2
(11)

is the expected Mendelian ratio. In an F2 population, the expected Mendelian ratio is ϕ= 1 4 2 4 1 4 . Note that if γ k = 0, vector π j = [πj(11)πj(12)πj(22)] will be equivalent to the expected Mendelian ratio for every individual at the locus.

If there is no factor to be considered other than the markers, the term X j β should disappear here. This is different from the usual linear regression analysis where an intercept should always appear in the model. With the liability selection model, there is no intercept. We now assume only one co-factor to consider. The X j variable can be discrete or continuous, but the distribution in the unselected population must be known. In this study, we first assume that X j is discrete, say gender, a variable indicating the gender of individual j with X j = 1 representing male and X j = -1 representing female. In the unselected population, the sex ratio should be 1:1. If the population evaluated has a biased sex ratio, this means that the gender has an effect on the liability. We can estimate the gender effect β on the liability. Let φ= φ 1 φ 2 = 1 2 1 2 be the expected sex ratio (prior to the selection). Define ξj(1)or ξj(2)as the posterior probability that individual j is male or female, respectively. These posteriors are calculated using

ξ j ( 1 ) = 1 ξ ̄ j φ 1 Φ ( η j ( - β ) + β ) ξ j ( 2 ) = 1 ξ ̄ j φ 2 Φ ( η j ( - β ) - β )
(12)

where

ξ ̄ j = φ 1 Φ ( η j ( - β ) + β ) + φ 2 Φ ( η j ( - β ) - β )
(13)

is the mean penetrance of the two genders and

η j ( - β ) = ∑ k = 1 p Z j k γ k
(14)

is the linear predictor excluding the gender effect.

We now assume that X j is a continuous non-genetic effect, e.g., age. Let us assume that X j follows a normal distribution in the unselected population, i.e., p(X j ) = N(X j |μ, σ2), where μ and σ2 are known. Let β be the effect of X j on the liability. Define Φ(X j β + ηj(-β)) as the probability that individual j has survived the selection. The posterior probability is defined as

ξ j = 1 ξ ̄ j N ( X j | μ , σ 2 ) Φ ( X j β + η j ( - β ) )
(15)

where

ξ ̄ j = ∫ - ∞ ∞ N ( X j | μ , σ 2 ) Φ ( X j β + η j ( - β ) ) d X j (1) = Φ ( μ β + η j ( - β ) ) ∕ ( σ 2 β 2 + 1 ) 1 ∕ 2 (2) (3)
(16)

Proof of this equation (16) is straightforward and thus given in the next paragraph.

Let f(X j ) = N(X j |μ, σ2) be the normal density for variable X j with known μ and σ2. The following Lemma [27] is used to derive equation (16).

∫ - ∞ + ∞ f ( X j ) Φ X j - ξ λ d X j =Φ μ - ξ σ 2 + λ 2
(17)

Let us rewrite equation (16) as

ξ ̄ j = ∫ - ∞ + ∞ f ( X j ) Φ ( X j β + η j ( - β ) ) d X j (1) = ∫ - ∞ + ∞ f ( X j ) Φ X j - ( - η j ( - β ) ∕ β ) 1 ∕ β d X j (2) (3)
(18)

Comparing equation (18) with equation (17), we can see that ξ = -ηj(-β)/β and λ2 = 1/β2. Substituting these into equation (17), we get

ξ ̄ j = ∫ - ∞ + ∞ f ( X j ) Φ X j - ( - η j ( - β ) ∕ β ) 1 ∕ β d X j = Φ μ - ( - η j ( - β ) ∕ β ) σ 2 + 1 ∕ β 2 = Φ μ β + η j ( - β ) σ 2 β 2 + 1
(19)

This concludes the derivation of equation (16) presented in the previous paragraph.

Likelihood, prior and posterior

It is difficult (if not impossible) to construct the joint likelihood for all loci, but conditional on the effects and the genotypes of other loci, the likelihood for locus k can be derived based on the multivariate Bernoulli distribution, that is

L ( γ k ) = ∑ j = 1 n w j ( 1 1 ) ln ( π j ( 1 1 ) ) + w j ( 1 2 ) ln ( π j ( 1 2 ) ) + w j ( 2 2 ) ln ( π j ( 2 2 ) )
(20)

The exact notation for this log likelihood should be L(γ k |η(-k)) because it is conditioned on the gender effect and effects of other loci. We use the simplified notation to improve the readability. Let us assign a normal prior to γ k , i.e.,

p ( γ k ) =N ( γ k | 0 , Σ k )
(21)

Furthermore, we assign a hierarchical prior to ∑ k ,

p ( Σ k ) = Inv - Wishart ( Σ k | Ï„ , ω )
(22)

where τ is the prior degree of freedom and ω is the prior scale matrix with the same dimension as ∑ k . The reason for assigning these prior distributions is to handle a possible large number of loci involved in the model. Uniform prior for the gender effect is assumed. The log posterior (denoted by LogPost) is

LogPost ( γ k = L ( γ k ) + ln N ( γ k | 0 , Σ k ) + ln [ Inv - Wishart ( Σ k | Ï„ , ω ) ]
(23)

where a constant has been ignored.

For the sex effect (discrete co-factor), the likelihood for β conditional on ηj(-β)is

L ( β ) = ∑ j = 1 n 1 2 ( X j + 1 ) ln ( ξ j ( 1 ) ) + 1 2 ( 1 - X j ) ln ( ξ j ( 2 ) )
(24)

For the continuous co-factor, the log likelihood for parameter β can be written as

L ( β ) = ∑ j = 1 n ln ( ξ j ) = ∑ j = 1 n ln Φ ( X j β + η j ( - k ) ) - ln Φ ( μ β + η j ( - β ) ) ∕ ( σ 2 β 2 + 1 ) 1 ∕ 2
(25)

Prior distribution for the non-genetic effect is assumed to be uniform (uninformative prior) and thus only the likelihood is needed to find the posterior mode estimate of β.

Posterior mode estimation

Due to the possible large number of parameters, we take a sequential approach to estimating the posterior mode parameters with one locus at a time. This approach is also called the coordinate descent algorithm. Once the parameters of all loci are updated, the sequence is repeated until a certain criterion of convergence is reached.

Let us define the first step of the Newton-Raphson iteration as

γ k ( t + 1 ) = γ k ( t ) - ∂ 2 LogPost ( γ k ) ∂ γ k ∂ γ k T - 1 ∂ LogPost ( γ k ) ∂ γ k
(26)

and denote the variance of this updated parameter by

V k =- ∂ 2 LogPost ( γ k ) ∂ γ k ∂ γ k T - 1
(27)

where the first and second partial derivatives are evaluated at γ k = γ k ( t ) . The posterior mean and posterior variance matrix for γ k at iteration t are denoted by E ( γ k ) = γ k ( t + 1 ) and var(γ k ) = V k , respectively. Since the posterior distribution of γ k is approximately multivariate normal (asymptotical theory), the posterior mean is identical to the posterior mode. The posterior of ∑ k remains scaled inverse Wishart due to the conjugate property of the prior. Therefore, the posterior mode of ∑ k is

Σ k ( t + 1 ) = E ( γ k γ k T ) + ω ( τ + 1 ) + 2 + 1 = E ( γ k ) E ( γ k T ) + var ( γ k ) + ω ( τ + 1 ) + 2 + 1
(28)

where τ + 1 is the degree of freedom for the inverse Wishart posterior and the number 2 represents the dimension of vector γ k .

The posterior mode estimation of β conditional on the effects of all loci is

β ( t + 1 ) = β ( t ) - ∂ 2 L ( β ) ∂ β ∂ β T - 1 ∂ L ( β ) ∂ β
(29)

with an estimation error variance approximated by

var ( β ) =- ∂ 2 L ( β ) ∂ β ∂ β T - 1
(30)

The iteration process of the posterior mode estimation is summarized as follows.

Step 0: Initialize all parameters.

Step 1: Update the non-genetic effect using equation (29).

Step 2: Update effect of marker k for k = 1, ⋯, p using equation (26).

Step 3: Update ∑ k for k = 1, ⋯, p using equation (28).

Step 4: Repeat step 1 to step 3 until the iteration process converges.

Genetic contribution from an individual locus

An obvious advantage of the liability model is that we are able to calculate the proportion of the liability variance contributed by each SDL, similar to the proportion of quantitative trait variance contributed by each QTL. Suppose that we have detected one SDL with both additive and dominance effects. The theoretical variances of the Z variables in an F2 population are 0.5 for the additive part and 1.0 for the dominance part. The reason is that the three genotypes are coded as +1, 0 and -1 for the additive Z and -1, 1 and -1 for the dominance Z [28]. Let a k and d k be the additive and dominance effects of this SDL. The genetic variance explained by this locus is

V G = 1 2 a k 2 + d k 2
(31)

The residual variance of the liability is set at unity and thus the variance of the liability is

V P = V G +1= 1 2 a k 2 + d k 2 +1
(32)

The broad sense heritability is defined as

H= V G V P = 1 2 a k 2 + d k 2 1 2 a k 2 + d k 2 + 1
(33)

This is the proportion of the liability variance contributed by the k th SDL. Assuming that the multiple SDL are not closely linked, the overall contribution from all SDL is approximated by

H= V G V P = ∑ k = 1 p ( 1 2 a k 2 + d k 2 ) ∑ k = 1 p ( 1 2 a k 2 + d k 2 ) + 1
(34)

The liability model has unified QTL mapping and SDL mapping in the same framework of quantitative genetics.

Results

Mouse experiment

We used a published dataset of an F2 mouse experiment to demonstrate the application of the method. The dataset was published by Lan et al. [29] and is freely available from the internet. The mouse genome has 19 chromosomes (excluding the sex chromosome). The data contains 110 F2 ob/ob mice derived from the cross of two inbred lines (BT×BTBR) and 193 markers covering 1,800 cM of the entire mouse genome. The average marker distance was 9.35 cM per marker interval. We inserted one or more pseudo markers in intervals larger than 5 cM to make sure that the entire genome is evenly covered by (pseudo or true) markers with no intervals larger than 5 cM. The number of pseudo markers inserted was 273, resulting in a total of 466 markers (193 true and 273 pseudo markers). For the pseudo markers, the genotype indicator variable, w j = [wj(11)Wj(12)wj(22)], is missing for every individual. In the data analysis, the missing variable was replaced by the conditional probability calculated using the multipoint method [30].

The top panel of Figure 1 shows the frequencies of the three genotypes, A1A1, A1A2 and A2A2, plotted against the mouse genome. It is obvious that there is a severe distortion in the beginning of chromosome 6 where the population contains almost exclusively the A2A2 genotypes with A1A1 and A1A2 almost eliminated from the population. Chromosomes 14 and 18 also show mild segregation distortion. Interval mapping for segregation distortion using the QTL procedure in SAS [31] showed that the LOD score for chromosome 6 is 43.25 (see the bottom panel of Figure 1 for the LOD score profile obtained from the interval mapping analysis). The interval mapping procedure [31] is a separate analysis for each marker. With the interval mapping, the position with the highest LOD score (43.25) occurred at a pseudo marker (at position 15.69 cM) between the first true marker (D6Mit86, 0 cM) and the second true marker (D6Mit224, 30.4 cM) on chromosome 6. The estimated frequencies of this pseudo marker are 0.0000, 0.0001 and 0.9999 for the three genotypes (A1A1, A1A2 and A2A2), respectively.

Figure 1
figure 1

Frequencies of the three genotypes and LOD score profiles of the mouse genome. This is an F2 population derived from the cross of two inbred lines (BT×BTBR). (a) The top panel shows the frequencies of three genotypes with the blue, red and green patterns representing the A1A1, A1A2 and A2A2 genotypes, respectively. (b) The bottom panel shows the LOD score profiles for the mouse genome obtained from the interval mapping of segregation distortion. The profile in red (LOD SDL) represents the LOD score for segregation distortion. The curves in blue and black are the LOD scores for QTL of the 10 week body weight and joint testing of the QTL and segregation distortion.

We used the generalized linear mixed model to analyze all the 466 markers (193 true and 273 pseudo) jointly. In the mouse data, among the 110 mice, 52 were male and 58 were female. Apparently, the sex ratio is not biased and thus sex appears to have no effect on the survivorship. However, we included the sex factor as a fixed effect in the model to test the robustness of our model. We expected that our model would detect no sex effect on the survivorship. The generalized linear mixed model had 466 × 2 + 1 = 933 model effects, including 466 additive effects, 466 dominance effects and one sex effect. This GLMM with 110 individuals was indeed able to handle such a large model (933 model effects). The hyper parameters used in the analysis was (τ, ω) = (0,0), equivalent to the Jeffrey's prior for the variance components. The estimated additive and dominance effects along with the corresponding LOD scores are depicted in Figure 2. One segregation distortion locus was detected on chromosome 6 (same as that of the interval mapping). The location of this distortion locus is right at the first marker of chromosome 6 (D6Mit86, 0 cM). The interval mapping approach described in the previous paragraph also detected a segregation distortion locus. However, the SDL detected by interval mapping was located halfway (15.69 cM) between markers D6Mit86 (0 cM) and D6Mit224 (30.4 cM) (see Figure 1 for the result of interval mapping). The GLMM analysis also showed some distortion for the second marker (D6Mit224, 30.4 cM), but the LOD score is only 3, barely significant. Therefore, we can safely ignore this locus due to linkage with the first marker. Let us go back to the first marker D6Mit86, the major SDL detected by the GLMM method. This segregation distortion locus is caused by both the additive and dominance effects. The estimated additive effect (± standard error) is a ^ = 4.6230 ± 0.4248 while the estimated dominance effect (± standard error) is d ^ =-1.6656±0.1833. The LOD scores are 25.69 and 17.92, respectively, for the additive and dominance effects. Simulation experiment under the null hypothesis (Mendelian segregation) showed that the 95% value of the null distribution of the LOD scores is 3.8, much smaller than the actual LOD score of 25.69. Therefore, we are very confident for this detected segregation distortion locus. As expected, the estimated sex effect is β ^ =0.1969±0.3002 with a LOD score of 0.0934, smaller than 1.0255, the 95% value of the LOD score generated under the null model. Therefore, we can safely claim that the gender effect is insignificant.

Figure 2
figure 2

QTL effects and LOD scores of the mouse genome estimated by GLMM. The additive effect and the dominance effect are shown in the top panel and LOD score of additive effect and dominance effect are in the bottom panel. Additive effect and the LOD score profiles are colour coded in blue and the dominance effect and LOD score profiles are coded in red. Positions of the 193 true markers are indicated by the barcode like ticks on the horizontal axis. The critical value (95% of the LOD score generated under the null model) is 2.99, which is smaller than the observed LOD score of 25.0.

In the GLMM analysis, the QTL effect has been interpreted as an effect on a hypothetical liability. The total variance of the liability is (see the Method section)

σ Liability 2 = 0.5 × a ^ 2 + d ^ 2 + 1 = 0.5 × 4.6230 2 + ( − 1.6556 ) 2 + 1 = 14.4606
(35)

Therefore, the proportion of the liability variance explained by this segregation distortion locus is

H = 0.5 × a ^ 2 + d ^ 2 0.5 × a ^ 2 + d ^ 2 + 1 = 13.4606 14.4606 = 0.9308
(36)

which is also called the broad sense heritability. This single locus contributes approximately 93% of the liability variance. We can also calculate the expected frequencies of the three genotypic based on the estimated QTL effect. Let

π ¯ = 0.25 × Φ ( − a ^ − d ^ ) + 0.5 × Φ ( d ^ ) + 0.25 × Φ ( a ^ − d ^ ) = 0.0003878 + 0.0239483 + 0.25 = 0.2743361
(37)

The expected frequencies for the three genotypes are

π 11 = 1 π ¯ × 0.25 × Φ ( − a ^ − d ^ ) = 0.0014 π 12 = 1 π ¯ × 0.50 × Φ ( d ^ ) = 0.0873 π 22 = 1 π ¯ × 0.25 × Φ ( a ^ − d ^ ) = 0.9113
(38)

respectively, for A1A1, A1A2 and A2A2.

Simulation experiment

We simulated a single chromosome with 2400 cM in length covered by 481 markers evenly placed on the genome with 5 cM per marker interval. The additive QTL effects of six markers were simulated with the true positions and true effects as presented in Figure 3 (bottom panel). Dominance effects were not simulated (zero values) although they were included in the data analysis. Frequencies of the three genotypes of a simulated F2 family with 500 individuals are also presented in Figure 3 (top panel). We also simulated two co-factors that influence the liability. The first co-factor was the sex effect coded as 1 for male and -1 for female with an effect value of β1 = 1.0. The second co-factor was a continuous variable with μ = 0 and σ2 = 0.025. The effect of this co-factor on the liability was β2 = 1.0. The liability of each individual was generated using the linear model containing the two cofactors and the six QTL. An individual with a liability greater than 0 survived the selection, otherwise, it was eliminated. All the 500 individuals in the sample survived the selection. The simulated data were analyzed using the generalized linear mixed model with (τ, ω) = (0,0) as the hyper-parameter values.

Figure 3
figure 3

Genotype frequencies and the true QTL effects for segregation distortion in the simulation experiment. In the top panel, blue and green areas represent the frequencies of the two types of homozygote while the red area represents the frequency of the heterozygote. The true QTL effects are shown in the bottom panel.

The estimated additive effects and the LOD scores are given in Figure 4. The estimated dominance effects and LOD scores were all near zero and thus not presented in the figure. Critical value of the LOD score generated from the null model was 2.99, which is smaller than the LOD score of each identified QTL. Therefore, all the six QTL have been identified by the method with no false positive identification. Figure 5 gives the estimated QTL effects and LOD scores for a dataset simulated under the null model. We can see that both the effects and the LOD scores are extremely small. The estimated QTL effects from simulation experiment (not the null model) are also presented in Table 1 along with the true values. Except QTL 5, all other QTL have been identified at the positions where they were simulated. QTL 5 was missed at the simulated position (1750 cM) but the effect was picked up at position 1735 cM, 15 cM away from the true position. The six QTL plus the two co-factors contributed 84.55% of the total variation of the liability and the estimated proportion was 82.74%, very close to the true proportion. The simulated data analysis demonstrates that the generalized linear mixed model successfully identified the simulated QTL and the two co-factors.

Figure 4
figure 4

Estimated additive QTL effects and the LOD scores for segregation distortion in the simulation experiment. The additive QTL effects (top panel) and LOD scores for the additive effects (bottom panel) are estimated by GLMM. The true effect is colour coded in red and the estimated effect is coded in blue.

Figure 5
figure 5

The estimated QTL effects (top panel) and LOD scores (bottom panel) under the null model. The data was simulated with no segregation distortion.

Table 1 Estimated parameters of the QTL identified by GLMM compared to true values in the simulation.

This paragraph describes the result of 100 repeated simulations generated from the same set of parameters. This experiment allowed us to evaluate the power and false positive rate of QTL identification. The critical value for the LOD score was 2.99, which was generated empirically from multiple simulations under the null model (see the Method section). For each of the true QTL, if any marker with 15 cM away from the true QTL had a LOD score greater than 2.99, this QTL was declared as being detected. Since each marker interval was 5 cM, the 30 cM (15 cM left and right) coverage contained five markers (including the one with the true effect). If any marker more than 15 cM away from a simulated true QTL had a LOD score greater than 2.99, that marker was declared as a false positive. Results of the replicated simulation experiments are given in Table 2. The average estimated effects (QTL and co-factors) are consistently smaller than the true values due to the shrinkage nature of the estimation. The biases, however, are not too strong to affect the powers because all effects have been detected with very high powers (ranging from 71% to 100%). For the entire 100 replications, we only detected five false positives (positive markers that are 15 cM away from a true effect). The overall false positive rate is 5/[100 ×(481-5 × 6) = 0.0001111, extremely low. The number 481 in the denominator is the total number of markers, the number 6 is the number of markers with true effects and the number 5 is the number of markers in the window covering a true QTL.

Table 2 Average estimates of effects and powers of simulated QTL and co-factors from 100 replicated simulations.

Discussion and conclusions

Genome-wide segregation distortion is a common phenomenon in genetic mapping, but it is usually ignored. The main reason is the difficulty in joint estimation and tests of the segregation distortion loci. We formulated the problem as a typical quantitative genetics problem using a hypothetical liability to describe the fitness of each individual. Using a generalized linear mixed model, we were able to estimate and test genome-wide quantitative trait loci controlling the hidden liability. We used a mouse dataset to demonstrate the method and detected a major QTL for the liability that explains 93% of the liability variance. The simulated data experiment showed that the method can detect a QTL (e.g., the second QTL simulated) explaining 7.71% of the liability variation with 71% power. The method was implemented in a SAS/IML program. The code is posted on our website (http://www.statgen.ucr.edu) for general application. With this method and the program, genome-wide segregation distortion can be investigated routinely in future genetic data analysis.

As a Bayesian method, there are a rich array of prior distributions can be explored. In this study, we used the inverse Wishart as the prior distribution for the prior variance matrix of QTL effects. For the additive genetic model (one effect per locus), the inverse Wishart distribution becomes a scaled inverse Chi-square distribution. It is possible to use the exponential distribution (the Lasso prior) as an alternative prior [32]. Because the method uses multiple levels of prior choice, the model can also be called hierarchical generalized linear mixed model [24, 33]. This study opens a new area in statistical genetics and further studies are expected to arise. For example, how to use the adaptive Lasso [34] to address this problem is entirely unknown and can be explored in the future.

A caveat of this method is the requirement of Mendelian segregation ratio (before the selection). For populations generated through line crossing experiments, Mendelian ratios are known. However, for uncontrolled populations, the theoretical Mendelian frequencies are not available. In this case, one needs to survey the unselected population to obtain the genotypic frequencies as the controlled "Mendelian segregation". If one can genotype both the selected and unselected individuals, one may simply use the case-control study and there is little reason to use this case-only study approach. In reality, genotyping individuals is much more costly than pooling the DNA of a sample of individuals. The cost effective approach is to genotype each individual in the surviving sample and genotype the pooled DNA sample for the unselected population because we only need the frequencies of genotypes (not the genotypes of individuals) in the unselected population. For the co-factors, we also need the expected frequencies of the co-factors in the unselected population. We examined the sex effect (discrete co-factor) and a normally distributed co-factor. The expected 1:1 sex ratio was used as the expected frequency. For the normal co-factor, we used the mean and variance of the co-factor used in the simulation (the true values) to construct the expected distribution. In reality, one needs to survey the entire population to obtain the expected distribution. For continuous variables deviating from normality, one may discretize a variable to a few groups. For example, age is a quantitative variable but one can arbitrarily divide individuals into a few age groups. This discretization will eliminate the restriction of normal distribution.

The method developed here can be applied to more broad situations beyond genetics without much modification. For example, if we know the joint distribution of k variables in a base (unselected) population and the joint distribution of the variables in a selected sample. We can simply test the difference between the two distributions to see which variables influence more on the selection. However, the pair-wise covariance may not allow us to make a precise decision on the importance of each variable. If two variables both influence the selection and they are highly correlated, the influence of one variable may be simply caused by the high correlation with the true causal variable. The proposed method here can help separate the true causality from the influence due to correlation.

QTL mapping is usually conducted in unselected populations. Individuals with undesired phenotypes must also be evaluated to obtain unbiased estimates of QTL effects. This is not a cost effective approach in breeding companies. Breeders wish to use only selected individuals to breed and keep no records for the unselected individuals. If we only evaluate the selected individuals, markers associated with the traits of interest will show distorted segregation. If the selection criterion is not well defined, for example, drought resistance, it is hard to map QTL. The segregation distortion loci are actually the QTL for drought resistance if one knows that there is no segregation distortion in the unselected population. The method developed here can be directly applied to mapping drought resistance QTL. Because we can perform QTL mapping using selected population, this approach may be called "mapping while selecting". For example, breeders may want to evaluate drought resistance of a family of recombinant inbred lines (RIL) by planting all seeds in a harsh drought environment. Eventually all plants die except the ones with strong resistance of drought. Breeders may have no records of the plants eliminated, but they can still perform QTL mapping for this trait (drought resistance) using all plants that have survived the selection. Other stress related traits can also be mapped using this approach, e.g., pest and salinity resistances.

In human genetics, case-control study is a common approach for mapping disease loci. In situations where there are no records for the control but the case, this case-only study may benefit from the new method. For example, one may easily get patient data from hospitals but hardly has individual records for the entire population. QTL mapping for the disease trait is still possible if we have the population records (frequencies) of genotypes in the entire population.

In summary, we developed a hierarchical generalized linear mixed model to map QTL for liability. This is a new approach to genetic mapping. It incorporates a seemingly different problem (segregation distortion) into the same QTL mapping framework for quantitative traits. Statistically, it shows that the generalized linear mixed model can be applied to situations where there are no phenotypic records; one only needs a likelihood function, a linear predictor and a prior distribution to infer the posterior mode estimation of the model effects.

References

  1. Sandler L, Hiraizumi Y, Sandler I: Meiotic Drive in Natural Populations of Drosophila Melanogaster. I. the Cytogenetic Basis of Segregation-Distortion. Genetics. 1959, 44 (2): 233-250.

    PubMed Central  CAS  PubMed  Google Scholar 

  2. Faris JD, Laddomada B, Gill BS: Molecular mapping of segregation distortion loci in Aegilops tauschii. Genetics. 1998, 149 (1): 319-327.

    PubMed Central  CAS  PubMed  Google Scholar 

  3. Hackett CA, Broadfoot LB: Effects of genotyping errors, missing values and segregation distortion in molecular marker data on the construction of linkage maps. Heredity. 2003, 90 (1): 33-38. 10.1038/sj.hdy.6800173.

    Article  CAS  PubMed  Google Scholar 

  4. Hartl DL, Hiraizumi Y, Crow JF: Evidence for sperm dysfunction as the mechanism of segregation distortion in Drosophila melanogaster. Proceedings of the National Academy of Sciences of the United States of America. 1967, 58 (6): 2240-2245. 10.1073/pnas.58.6.2240.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Lu R, Bernardo S: Chromosomal regions associated with segregation distortion in maize. TAG Theoretical and Applied Genetics. 2002, 105 (4): 622-628. 10.1007/s00122-002-0970-9.

    Article  CAS  PubMed  Google Scholar 

  6. Taylor DR, Ingvarsson PK: Common Features of Segregation Distortion in Plants and Animals. Genetica. 2003, 117 (1): 27-35. 10.1023/A:1022308414864.

    Article  CAS  PubMed  Google Scholar 

  7. Xu Y, Zhu L, Xiao J, Huang N, McCouch SR: Chromosomal regions associated with segregation distortion of molecular markers in F2, backcross, doubled haploid, and recombinant inbred populations in rice (Oryza sativa L.). Molecular and General Genetics MGG. 1997, 253 (5): 535-545. 10.1007/s004380050355.

    Article  CAS  PubMed  Google Scholar 

  8. Charlesworth B, Charlesworth D: Some evolutionary consequences of deleterious mutations. Genetica. 1998, 102/103: 3-19.

    Article  Google Scholar 

  9. Lander ES, Botstein D: Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989, 121 (1): 185-199.

    PubMed Central  CAS  PubMed  Google Scholar 

  10. Xu S: Quantitative trait locus mapping can benefit from segregation distortion. Genetics. 2008, 180 (4): 2201-2208. 10.1534/genetics.108.090688.

    Article  PubMed Central  PubMed  Google Scholar 

  11. Xu S, Hu Z: Mapping quantitative trait loci using distorted markers. International Journal of Plant Genomics. 2010, 2009: 1-11.

    Google Scholar 

  12. Fu YB, Ritland K: Evidence for the partial dominance of viability genes contributing to inbreeding depression in Mimulus guttatus. Genetics. 1994, 136 (1): 323-331.

    PubMed Central  CAS  PubMed  Google Scholar 

  13. Lorieux M, Perrier X, Goffinet B, Lanaud C, León DG: Maximum-likelihood models for mapping genetic markers showing segregation distortion. 2. F<sub>2</sub> populations. TAG Theoretical and Applied Genetics. 1995, 90 (1): 81-89.

    Article  CAS  PubMed  Google Scholar 

  14. Vogl C, Xu S: Multipoint mapping of viability and segregation distorting loci using molecular markers. Genetics. 2000, 155: 1439-1447.

    PubMed Central  CAS  PubMed  Google Scholar 

  15. Luo L, Xu S: Mapping viability loci using molecular markers. Heredity. 2003, 90: 459-467. 10.1038/sj.hdy.6800264.

    Article  CAS  PubMed  Google Scholar 

  16. Luo L, Zhang Y-M, Xu S: A quantitative genetics model for viability selection. Heredity. 2005, 94: 347-355. 10.1038/sj.hdy.6800615.

    Article  CAS  PubMed  Google Scholar 

  17. Sillanpaa MJ, Arjas E: Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics. 1998, 148 (3): 1373-1388.

    PubMed Central  CAS  PubMed  Google Scholar 

  18. Wang H, Zhang Y-M, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S: Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics. 2005, 170: 465-480. 10.1534/genetics.104.039354.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Xu S: Estimating polygenic effects using markers of the entire genome. Genetics. 2003, 163: 789-801.

    PubMed Central  CAS  PubMed  Google Scholar 

  20. Xu S: An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics. 2007, 63: 513-521. 10.1111/j.1541-0420.2006.00711.x.

    Article  CAS  PubMed  Google Scholar 

  21. Gilmour AR, Anderson RD, Rae AL: The analysis of binomial data by a generalized linear mixed model. Biometrika. 1985, 72 (3): 593-599. 10.1093/biomet/72.3.593.

    Article  Google Scholar 

  22. Harville DA, Mee RW: A mixed-model procedure for analysing ordered categorical data. Biometrics. 1984, 40: 393-408. 10.2307/2531393.

    Article  Google Scholar 

  23. Gelman A, Carlin J, Stern H, Rubin D: Bayesian Data Analysis. 2003, London: Chapman & Hall

    Google Scholar 

  24. Gelman A, Jakulin A, Pittau MG, Su Y-S: A weakly informative defualt prior distribution for logistic and other regression models. The Annals of Applied Statistics. 2008, 2 (4): 1360-1383. 10.1214/08-AOAS191.

    Article  Google Scholar 

  25. Wolfinger R, O'Connell M: Generalized linear mixed models: A pseudo-likelihood approach. The Journal of Statistical Computation and Simulation. 1993, 48: 233-243. 10.1080/00949659308811554.

    Article  Google Scholar 

  26. McGilchrist CA: Estimation in generalized mixed model. Journal of the Royal Statistical Society, Series B. 1994, 56 (1): 61-69.

    Google Scholar 

  27. Cavalli-Sforza LL, Bodmer WF: The Genetics of Human Population. 1971, San Francisco: W. H. Freeman and Company

    Google Scholar 

  28. Yang R, Tian Q, Xu S: Mapping quantitative trait loci for longitudinal traits in line crosses. Genetics. 2006, 173 (4): 2339-2356. 10.1534/genetics.105.054775.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  29. Lan H, Chen M, Flowers JB, Yandell BS, Stapleton DS, Mata CM, Mui ET, Flowers MT, Schueler KL, Manly KF: Combined expression trait correlations and expression quantitative trait locus mapping. PLoS Genetics. 2006, 2 (1): e6-10.1371/journal.pgen.0020006.

    Article  PubMed Central  PubMed  Google Scholar 

  30. Jiang C, Zeng ZB: Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica. 1997, 101 (1): 47-58. 10.1023/A:1018394410659.

    Article  CAS  PubMed  Google Scholar 

  31. Hu Z, Xu S: PROC QTL - A SAS procedure for mapping quantitative trait loci. International Journal of Plant Genomics. 2009, 2009: 1-3.

    Article  Google Scholar 

  32. Tibshirani R: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996, 58 (1): 267-288.

    Google Scholar 

  33. Yi N, Banerjee S: Hierarchical generalized linear models for multiple quantitative trait locus mapping. Genetics. 2009, 181 (3): 1101-1113. 10.1534/genetics.108.099556.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  34. Zhou H: The adaptive Lasso and its orcle properties. Journal of the American Statistical Association. 2006, 101 (476): 1418-1429. 10.1198/016214506000000735.

    Article  Google Scholar 

Download references

Acknowledgements

We greatly appreciate two anonymous reviewers and the associated editor for their comments on an early version of the manuscript and their suggestions in revision of the manuscript. The project was supported by the USDA National Institute of Food and Agriculture Grant 2007-02784 to SX.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shizhong Xu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HZ conducted the actual work in terms of programming and data analysis. SX proposed the idea, oversaw the project and wrote the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Zhan, H., Xu, S. Generalized linear mixed model for segregation distortion analysis. BMC Genet 12, 97 (2011). https://doi.org/10.1186/1471-2156-12-97

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2156-12-97

Keywords