Department of Botany and Plant Sciences, University of California, Riverside, CA 92521

Abstract

Background

Segregation distortion is a phenomenon in which the observed genotypic frequencies of a locus deviate from the expected Mendelian segregation ratio. The main cause of segregation distortion is viability selection on linked marker loci. These viability selection loci can be mapped using genome-wide marker information.

Results

We developed a generalized linear mixed model (GLMM) under the liability model to jointly map all viability selection loci of the genome. Using a hierarchical generalized linear mixed model, we can handle a number of loci several times larger than the sample size. We used a dataset from an F2 mouse family derived from the cross of two inbred lines to test the model and detected a major segregation distortion locus contributing 75% of the variance of the underlying liability. Replicated simulation experiments confirm that the power of viability locus detection is high and the false positive rate is low.

Conclusions

The method can be used not only to detect segregation distortion loci but also to map quantitative trait loci of disease traits using case-only data in humans and selected populations in plants and animals.

Background

Segregation distortion refers to a phenomenon in which the observed genotypic frequencies deviate significantly from the expected Mendelian frequencies. For example, the expected Mendelian ratio in an F2 population is 1:2:1 for the three genotypes A1A1 : A1A2 : A2A2. Many reasons can explain the observed distortion.
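As an illustration of how a deviation from the 1:2:1 ratio is usually checked at a single marker, the following minimal sketch runs a chi-square goodness-of-fit test on F2 genotype counts. The function name and the example counts are hypothetical and are not taken from the paper; the p-value uses the fact that the chi-square survival function with 2 degrees of freedom is exactly exp(-x/2).

```python
import math

def chi_square_f2(n11, n12, n22):
    """Goodness-of-fit test of F2 genotype counts against the 1:2:1 ratio."""
    n = n11 + n12 + n22
    expected = (0.25 * n, 0.50 * n, 0.25 * n)
    observed = (n11, n12, n22)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Three genotype classes give 2 degrees of freedom; for 2 df the
    # chi-square survival function is exactly exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

balanced = chi_square_f2(25, 50, 25)    # counts that match 1:2:1 exactly
distorted = chi_square_f2(5, 20, 75)    # hypothetical counts favouring one homozygote
```

A marker-by-marker test of this kind treats each locus separately, which is the limitation that a joint model of all loci is meant to address.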

It appears that segregation distortion is common rather than rare. If segregation distortion is indeed caused by viability selection loci, these loci themselves are of interest because they may help to understand the mechanism of natural selection and evolution. Chi-square tests are commonly used to test segregation distortion. Fu and Ritland

The quantitative genetic model of Luo et al.

Numerous algorithms have been developed to implement the generalized linear mixed model. The pseudo likelihood algorithm

It is not clear how to use the pseudo likelihood approach to mapping viability loci because there is no phenotypic data point to transform. However, the method of McGilchrist

Method

Liability model and viability selection

Let us define a continuous variable _{j }

where _{j }_{k }_{jk }_{2 }individual derived from the cross of two inbred lines can take one of three genotypes, _{1}_{1}, _{1}_{2 }and _{2}_{2}. Under the additive genetic model, _{jk }

and _{k }_{k }_{k }_{k }d_{k}^{T}_{k }

where _{i }

The liability _{j }_{j }

be the expectation of the unobserved liability (a linear predictor). We use the normal or the logistic function to model the probability of survival for individual _{j}_{j}_{j}_{j}

where

is the linear predictor excluding locus _{j(-k) }_{k }

as a multivariate Bernoulli variable with three categories (i.e., a multinomial variable with sample size one). If individual _{1}_{1}, then _{j(11) }= 1 and _{j(12) }= _{j(22) }= 0. The probabilities of individual

where

is the mean of the three penetrances and

is the expected Mendelian ratio. In an F_{2 }population, the expected Mendelian ratio is _{k }_{j }_{j(11) }_{j(12) }_{j(22)}] will be equivalent to the expected Mendelian ratio for every individual at the locus.
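The relation described above (penetrance times Mendelian ratio, divided by the mean penetrance) can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' code: it uses the normal (probit) function for the penetrance, an additive effect only with genotypes coded +1, 0 and -1, and hypothetical names (`survivor_genotype_probs`, `a`, `eta_rest`) for the effect of the locus and for the linear predictor excluding that locus.

```python
import math

MENDEL = (0.25, 0.50, 0.25)   # expected F2 ratio for the three genotypes
ADD = (1.0, 0.0, -1.0)        # additive coding of the three genotypes

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def survivor_genotype_probs(a, eta_rest=0.0):
    """Genotype probabilities at one locus among surviving individuals."""
    # Penetrance: probability of survival given each genotype.
    penetrance = [normal_cdf(eta_rest + z * a) for z in ADD]
    # Mean penetrance, weighted by the expected Mendelian ratio.
    mean_pen = sum(p * m for p, m in zip(penetrance, MENDEL))
    return [p * m / mean_pen for p, m in zip(penetrance, MENDEL)]

no_effect = survivor_genotype_probs(0.0)     # reduces to the Mendelian ratio
with_effect = survivor_genotype_probs(1.0)   # first homozygote is favoured
```

With a zero effect the three penetrances are equal and the probabilities reduce to the Mendelian ratio, which matches the statement in the preceding paragraph.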

If there is no factor to be considered other than the markers, the term _{j}β _{j }_{j }_{j }_{j }_{j(1) }or _{j(2) }as the posterior probability that individual

where

is the mean penetrance of the two genders and

is the linear predictor excluding the gender effect.

We now assume that _{j }_{j }_{j}_{j}^{2}), where ^{2 }are known. Let _{j }_{j}β _{j(-β)}) as the probability that individual

where

The proof of equation (16) is straightforward and is given in the next paragraph.

Let _{j}_{j}^{2}) be the normal density for variable _{j }^{2}. The following Lemma

Let us rewrite equation (16) as

Comparing equation (18) with equation (17), we can see that _{j(-β)}/^{2 }= 1/^{2}. Substituting these into equation (17), we get

This concludes the derivation of equation (16) presented in the previous paragraph.

Likelihood, prior and posterior

It is difficult (if not impossible) to construct the joint likelihood for all loci, but conditional on the effects and the genotypes of other loci, the likelihood for locus

The exact notation for this log likelihood should be _{k}_{(-k)}) because it is conditioned on the gender effect and effects of other loci. We use the simplified notation to improve the readability. Let us assign a normal prior to _{k}

Furthermore, we assign a hierarchical prior to Σ_k

where _{k}

where a constant has been ignored.

For the sex effect (discrete co-factor), the likelihood for _{j(-β) }is

For the continuous co-factor, the log likelihood for parameter

Prior distribution for the non-genetic effect is assumed to be uniform (uninformative prior) and thus only the likelihood is needed to find the posterior mode estimate of

Posterior mode estimation

Because the number of parameters can be large, we estimate the posterior mode sequentially, updating the parameters of one locus at a time. This approach is also called the coordinate descent algorithm. Once the parameters of all loci are updated, the sequence is repeated until a convergence criterion is reached.

Let us define the first step of the Newton-Raphson iteration as

and denote the variance of this updated parameter by

where the first and second partial derivatives are evaluated at _{k }_{k}_{k}_{k }_{k }_{k }

where τ + 1 is the degrees of freedom for the inverse Wishart posterior and the number 2 represents the dimension of vector _{k}

The posterior mode estimation of

with an estimation error variance approximated by

The iteration process of the posterior mode estimation is summarized as follows.

Step 0: Initialize all parameters.

Step 1: Update the non-genetic effect using equation (29).

Step 2: Update effect of marker

Step 3: Update Σ_k

Step 4: Repeat step 1 to step 3 until the iteration process converges.
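The locus-by-locus iteration can be sketched in a simplified form. The code below is an assumption-laden illustration rather than the authors' SAS/IML implementation: it fits additive effects only, has no intercept or co-factors, keeps the prior variance fixed (`prior_var`) instead of updating a hierarchical prior, uses the probit penetrance, and replaces the analytic derivatives with finite differences plus step halving. All function and parameter names are hypothetical.

```python
import math

MENDEL = {1: 0.25, 0: 0.50, -1: 0.25}  # F2 ratio for additive codes +1, 0, -1

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_posterior(gamma, z, eta_rest, prior_var):
    """Conditional log posterior of one additive effect (survivors only)."""
    total = -0.5 * gamma * gamma / prior_var           # normal prior, mean 0
    for zj, ej in zip(z, eta_rest):
        num = MENDEL[zj] * normal_cdf(ej + zj * gamma)
        den = sum(p * normal_cdf(ej + g * gamma) for g, p in MENDEL.items())
        total += math.log(num / den)
    return total

def newton_update(gamma, z, eta_rest, prior_var, h=1e-4):
    """One Newton-Raphson step using finite-difference derivatives."""
    f0 = log_posterior(gamma, z, eta_rest, prior_var)
    fp = log_posterior(gamma + h, z, eta_rest, prior_var)
    fm = log_posterior(gamma - h, z, eta_rest, prior_var)
    grad = (fp - fm) / (2.0 * h)
    hess = (fp - 2.0 * f0 + fm) / (h * h)
    step = -grad / hess if hess < 0.0 else grad        # fall back to ascent
    # Halve the step until the log posterior does not decrease.
    while (abs(step) > 1e-10 and
           log_posterior(gamma + step, z, eta_rest, prior_var) < f0):
        step *= 0.5
    return gamma + step

def fit_additive_effects(Z, prior_var=10.0, max_sweeps=100, tol=1e-6):
    """Cycle over loci, updating one effect at a time, until convergence."""
    n, m = len(Z), len(Z[0])
    gamma = [0.0] * m
    for _ in range(max_sweeps):
        largest_change = 0.0
        for k in range(m):
            # Linear predictor excluding locus k.
            eta_rest = [sum(Z[j][l] * gamma[l] for l in range(m) if l != k)
                        for j in range(n)]
            z_k = [Z[j][k] for j in range(n)]
            new_value = newton_update(gamma[k], z_k, eta_rest, prior_var)
            largest_change = max(largest_change, abs(new_value - gamma[k]))
            gamma[k] = new_value
        if largest_change < tol:
            break
    return gamma

# Hypothetical survivors at one locus: 40 coded +1, 50 coded 0, 10 coded -1.
Z_example = [[1]] * 40 + [[0]] * 50 + [[-1]] * 10
gamma_hat = fit_additive_effects(Z_example)
```

Updating one effect while holding the others fixed keeps each step a one-dimensional problem, which is what allows the number of loci to exceed the sample size.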

Genetic contribution from an individual locus

An obvious advantage of the liability model is that we are able to calculate the proportion of the liability variance contributed by each SDL, similar to the proportion of quantitative trait variance contributed by each QTL. Suppose that we have detected one SDL with both additive and dominance effects. The theoretical variances of the additive and dominance genotype indicator variables in an F2 population are 0.5 for the additive part and 1.0 for the dominance part. The reason is that the three genotypes are coded as +1, 0 and -1 for the additive _{k }_{k }

The residual variance of the liability is set at unity and thus the variance of the liability is

The broad sense heritability is defined as

This is the proportion of the liability variance contributed by the

The liability model has unified QTL mapping and SDL mapping in the same framework of quantitative genetics.
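The variance decomposition described in this section can be written out as a short calculation. The sketch below assumes the symbols `a` and `d` for the additive and dominance effects of a single SDL (names chosen here for illustration), uses the F2 variances of 0.5 and 1.0 stated above, and fixes the residual variance at one, so that the proportion is (0.5a² + d²) / (0.5a² + d² + 1).

```python
def liability_variance_proportion(a, d=0.0, residual_var=1.0):
    """Proportion of the liability variance explained by one SDL in an F2."""
    # F2 variances of the genotype indicators: 0.5 (additive), 1.0 (dominance).
    genetic_var = 0.5 * a ** 2 + 1.0 * d ** 2
    return genetic_var / (genetic_var + residual_var)

additive_only = liability_variance_proportion(1.0)        # 0.5 / 1.5
with_dominance = liability_variance_proportion(1.0, 0.5)  # 0.75 / 1.75
```

For example, an additive effect of 1.0 with no dominance gives 0.5 / 1.5, so the locus explains one third of the liability variance.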

Results

Mouse experiment

We used a published dataset of an F_{2 }mouse experiment to demonstrate the application of the method. The dataset was published by Lan et al. _{2 }_{j }_{j(11) }_{j(12) }_{j(22)}], is missing for every individual. In the data analysis, the missing variable was replaced by the conditional probability calculated using the multipoint method

The top panel of Figure 1 shows the frequencies of the three genotypes, A1A1, A1A2 and A2A2, plotted against the mouse genome. There is clearly a severe distortion at the beginning of chromosome 6, where the population contains almost exclusively the A2A2 genotype, with A1A1 and A1A2 almost eliminated from the population. Chromosomes 14 and 18 also show mild segregation distortion. Interval mapping for segregation distortion using the QTL procedure in SAS A1A1, A1A2 and A2A2), respectively.

**Frequencies of the three genotypes and LOD score profiles of the mouse genome**. This is an F2 population derived from the cross of two inbred lines (BT×BTBR). (a) The top panel shows the frequencies of the three genotypes, with the blue, red and green patterns representing the A1A1, A1A2 and A2A2 genotypes, respectively. (b) The bottom panel shows the LOD score profiles for the mouse genome obtained from the interval mapping of segregation distortion. The profile in red (LOD SDL) represents the LOD score for segregation distortion. The curves in blue and black are the LOD scores for QTL of the 10 week body weight and for the joint test of the QTL and segregation distortion.

We used the generalized linear mixed model to analyze all 466 markers (193 true and 273 pseudo) jointly. Among the 110 mice in the data, 52 were male and 58 were female. The sex ratio is apparently not biased, and thus sex appears to have no effect on survivorship. However, we included the sex factor as a fixed effect in the model to test the robustness of our model. We expected that our model would detect no sex effect on survivorship. The generalized linear mixed model had 466 × 2 + 1 = 933 model effects, including 466 additive effects, 466 dominance effects and one sex effect. This GLMM with 110 individuals was indeed able to handle such a large model (933 model effects). The hyper parameters used in the analysis were (

**QTL effects and LOD scores of the mouse genome estimated by GLMM**. The additive and dominance effects are shown in the top panel and their LOD scores are shown in the bottom panel. The additive effect and its LOD score profile are colour coded in blue; the dominance effect and its LOD score profile are coded in red. Positions of the 193 true markers are indicated by the barcode-like ticks on the horizontal axis. The critical value (the 95th percentile of the LOD scores generated under the null model) is 2.99, which is smaller than the observed LOD score of 25.0.

In the GLMM analysis, the QTL effect has been interpreted as an effect on a hypothetical liability. The total variance of the liability is (see the Method section)

Therefore, the proportion of the liability variance explained by this segregation distortion locus is

which is also called the broad sense heritability. This single locus contributes approximately 93% of the liability variance. We can also calculate the expected frequencies of the three genotypes based on the estimated QTL effect. Let

The expected frequencies for the three genotypes are

respectively, for A1A1, A1A2 and A2A2.

Simulation experiment

We simulated a single chromosome with 2400 cM in length covered by 481 markers evenly placed on the genome with 5 cM per marker interval. The additive QTL effects of six markers were simulated with the true positions and true effects as presented in Figure _{2 }family with 500 individuals are also presented in Figure _{1 }= 1.0. The second co-factor was a continuous variable with ^{2 }= 0.025. The effect of this co-factor on the liability was _{2 }= 1.0. The liability of each individual was generated using the linear model containing the two cofactors and the six QTL. An individual with a liability greater than 0 survived the selection, otherwise, it was eliminated. All the 500 individuals in the sample survived the selection. The simulated data were analyzed using the generalized linear mixed model with (

**Genotype frequencies and the true QTL effects for segregation distortion in the simulation experiment**. In the top panel, blue and green areas represent the frequencies of the two types of homozygote while the red area represents the frequency of the heterozygote. The true QTL effects are shown in the bottom panel.
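The selection step of the simulation (an individual survives only when its liability exceeds 0) can be illustrated with a reduced version of the experiment. This sketch is not the simulation used in the paper: it assumes a single additive locus, no co-factors, a standard normal residual, and hypothetical names; individuals are generated until the requested number of survivors is reached.

```python
import random

def simulate_survivors(a, n_survivors, seed=1):
    """Genotype frequencies at one locus among survivors of viability selection."""
    rng = random.Random(seed)
    counts = {1: 0, 0: 0, -1: 0}
    kept = 0
    while kept < n_survivors:
        # Draw an F2 genotype (additive code) with the 1:2:1 Mendelian ratio.
        z = rng.choices((1, 0, -1), weights=(1, 2, 1))[0]
        liability = a * z + rng.gauss(0.0, 1.0)
        if liability > 0.0:          # survives the selection
            counts[z] += 1
            kept += 1
    return {g: c / n_survivors for g, c in counts.items()}

freqs = simulate_survivors(a=1.0, n_survivors=2000)
```

With a positive additive effect, the homozygote coded +1 becomes over-represented among the survivors and the homozygote coded -1 becomes rare, which is the distortion the mapping method exploits.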

The estimated additive effects and the LOD scores are given in Figure

**Estimated additive QTL effects and the LOD scores for segregation distortion in the simulation experiment**. The additive QTL effects (top panel) and LOD scores for the additive effects (bottom panel) are estimated by GLMM. The true effect is colour coded in red and the estimated effect is coded in blue.

**The estimated QTL effects (top panel) and LOD scores (bottom panel) under the null model**. The data was simulated with no segregation distortion.

Estimated parameters of the QTL identified by GLMM compared to true values in the simulation.

| | True effect | True proportion | Estimate | StdErr | Position (cM) | LOD | Proportion |
|---|---|---|---|---|---|---|---|
| QTL 1 | 1.4135 | 0.1543 | 1.1905 | 0.1357 | 50 | 16.6828 | 0.1224 |
| QTL 2 | -0.9993 | 0.0771 | -0.8296 | 0.1252 | 125 | 9.5271 | 0.0594 |
| QTL 3 | 0.9993 | 0.0771 | 0.9605 | 0.1328 | 360 | 11.3536 | 0.0796 |
| QTL 4 | -1.2048 | 0.1121 | -1.1991 | 0.1353 | 905 | 17.0304 | 0.1241 |
| QTL 5 | 1.0000 | 0.0772 | 0.8593 | 0.1310 | 1735^{a} | 9.3347 | 0.0637 |
| QTL 6 | -1.41354 | 0.1543 | -1.2959 | 0.1380 | 2115 | 19.1230 | 0.1450 |
| Co-factor 1 | 1.0000 | 0.1545 | 1.0217 | 0.1020 | -- | 21.7673 | 0.1803 |
| Co-factor 2 | 1.0000 | 0.0386 | 1.1007 | 0.1809 | -- | 8.0412 | 0.0523 |
| Total | | 0.8455^{b} | | | | | 0.8272^{c} |

^{a} The true location is 1750 cM and the estimated location is 15 cM away from the true location.

^{b} This is the true (total) proportion of the liability variance contributed by the six QTL and the two co-factors.

^{c} This is the estimated (total) proportion of the liability variance contributed by the six QTL and the two co-factors.

The simulation was then replicated 100 times using the same set of parameters. This experiment allowed us to evaluate the power and false positive rate of QTL identification. The critical value for the LOD score was 2.99, which was generated empirically from multiple simulations under the null model (see the Method section). For each true QTL, if any marker within 15 cM of the true QTL had a LOD score greater than 2.99, this QTL was declared detected. Since each marker interval was 5 cM, the 30 cM (15 cM left and right) coverage contained five markers (including the one with the true effect). If any marker more than 15 cM away from a simulated true QTL had a LOD score greater than 2.99, that marker was declared a false positive. Results of the replicated simulation experiments are given in Table 2.

Average estimates of effects and powers of simulated QTL and co-factors from 100 replicated simulations.

| | True | Estimate | StdEv | Power (%) |
|---|---|---|---|---|
| QTL 1 | 1.4135 | 1.1028 | 0.1329 | 99 |
| QTL 2 | -0.9993 | -0.5964 | 0.1270 | 71 |
| QTL 3 | 0.9993 | 0.7663 | 0.1474 | 91 |
| QTL 4 | -1.2048 | -0.9858 | 0.1310 | 98 |
| QTL 5 | 1.0000 | 0.7166 | 0.1375 | 87 |
| QTL 6 | -1.41354 | -1.1977 | 0.1488 | 100 |
| Co-factor 1 | 1.0000 | 0.9192 | 0.1299 | 100 |
| Co-factor 2 | 1.0000 | 0.8894 | 0.1895 | 95 |

True: the true effects used to simulate the data.

Estimate: the average estimated effects obtained from 100 replicated simulation experiments.

StdEv: the standard deviation of the estimated effects across the 100 replicates.

Power: the number of replicates, out of 100, in which the effect was detected.
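The detection rule used in the replicated simulations (a QTL counts as detected when a nearby marker exceeds the LOD threshold, and a significant marker far from every true QTL counts as a false positive) can be sketched as follows. The function name and the example LOD values are hypothetical, and the sketch assumes that a marker exactly 15 cM from a true QTL falls inside the window.

```python
def classify_markers(lod, positions, qtl_positions, threshold=2.99, window=15.0):
    """Return (detected flag per true QTL, positions of false positive markers)."""
    significant = [pos for pos, score in zip(positions, lod) if score > threshold]
    # A QTL is detected if any significant marker lies within the window.
    detected = [any(abs(pos - q) <= window for pos in significant)
                for q in qtl_positions]
    # A significant marker outside the window of every true QTL is a false positive.
    false_positives = [pos for pos in significant
                       if all(abs(pos - q) > window for q in qtl_positions)]
    return detected, false_positives

positions = list(range(0, 2401, 5))          # 481 markers, 5 cM apart
lod = [0.0] * len(positions)
lod[positions.index(1735)] = 9.33            # hypothetical peak near a QTL at 1750 cM
lod[positions.index(300)] = 4.0              # hypothetical peak far from any QTL
detected, false_positives = classify_markers(lod, positions, [360, 1750])
```

Averaging the `detected` flags over replicates gives the empirical power of each QTL, and counting the false positives over replicates gives the false positive rate.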

Discussion and conclusions

Genome-wide segregation distortion is a common phenomenon in genetic mapping, but it is usually ignored. The main reason is the difficulty in joint estimation and tests of the segregation distortion loci. We formulated the problem as a typical quantitative genetics problem using a hypothetical liability to describe the fitness of each individual. Using a generalized linear mixed model, we were able to estimate and test genome-wide quantitative trait loci controlling the hidden liability. We used a mouse dataset to demonstrate the method and detected a major QTL for the liability that explains 93% of the liability variance. The simulated data experiment showed that the method can detect a QTL (e.g., the second QTL simulated) explaining 7.71% of the liability variation with 71% power. The method was implemented in a SAS/IML program. The code is posted on our website (

Because the method is Bayesian, a rich array of prior distributions can be explored. In this study, we used the inverse Wishart distribution as the prior for the variance matrix of QTL effects. For the additive genetic model (one effect per locus), the inverse Wishart distribution becomes a scaled inverse Chi-square distribution. It is possible to use the exponential distribution (the Lasso prior) as an alternative prior.

A caveat of this method is that it requires a known Mendelian segregation ratio (before the selection). For populations generated through line crossing experiments, Mendelian ratios are known. However, for uncontrolled populations, the theoretical Mendelian frequencies are not available. In this case, one needs to survey the unselected population to obtain the genotypic frequencies that serve as the control "Mendelian segregation". If one can genotype both the selected and unselected individuals, one may simply use a case-control study and there is little reason to use the case-only approach. In reality, genotyping individuals is much more costly than genotyping pooled DNA from a sample of individuals. A cost-effective approach is to genotype each individual in the surviving sample and genotype the pooled DNA sample for the unselected population, because we only need the frequencies of genotypes (not the genotypes of individuals) in the unselected population. For the co-factors, we also need the expected frequencies of the co-factors in the unselected population. We examined the sex effect (discrete co-factor) and a normally distributed co-factor. The expected 1:1 sex ratio was used as the expected frequency. For the normal co-factor, we used the mean and variance of the co-factor used in the simulation (the true values) to construct the expected distribution. In reality, one needs to survey the entire population to obtain the expected distribution. For continuous variables deviating from normality, one may discretize the variable into a few groups. For example, age is a quantitative variable but one can arbitrarily divide individuals into a few age groups. This discretization removes the restriction to the normal distribution.

The method developed here can be applied to broader situations beyond genetics without much modification. For example, if we know the joint distribution of

QTL mapping is usually conducted in unselected populations. Individuals with undesired phenotypes must also be evaluated to obtain unbiased estimates of QTL effects. This is not a cost-effective approach for breeding companies. Breeders wish to use only selected individuals to breed and keep no records for the unselected individuals. If we only evaluate the selected individuals, markers associated with the traits of interest will show distorted segregation. If the selection criterion is not well defined, for example, drought resistance, it is hard to map QTL. The segregation distortion loci are actually the QTL for drought resistance if one knows that there is no segregation distortion in the unselected population. The method developed here can be directly applied to mapping drought resistance QTL. Because we can perform QTL mapping using a selected population, this approach may be called "mapping while selecting". For example, breeders may want to evaluate drought resistance of a family of recombinant inbred lines (RIL) by planting all seeds in a harsh drought environment. Eventually all plants die except the ones with strong resistance to drought. Breeders may have no records of the plants eliminated, but they can still perform QTL mapping for this trait (drought resistance) using all plants that have survived the selection. Other stress-related traits can also be mapped using this approach, e.g., pest and salinity resistance.

In human genetics, the case-control study is a common approach for mapping disease loci. In situations where records exist for the cases but not for the controls, a case-only study may benefit from the new method. For example, one may easily get patient data from hospitals but rarely has individual records for the entire population. QTL mapping for the disease trait is still possible if we have the population records (frequencies) of genotypes in the entire population.

In summary, we developed a hierarchical generalized linear mixed model to map QTL for liability. This is a new approach to genetic mapping. It incorporates a seemingly different problem (segregation distortion) into the same QTL mapping framework used for quantitative traits. Statistically, it shows that the generalized linear mixed model can be applied to situations where there are no phenotypic records; one only needs a likelihood function, a linear predictor and a prior distribution to obtain the posterior mode estimates of the model effects.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HZ conducted the actual work in terms of programming and data analysis. SX proposed the idea, oversaw the project and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank two anonymous reviewers and the associate editor for their comments on an early version of the manuscript and their suggestions for its revision. The project was supported by the USDA National Institute of Food and Agriculture Grant 2007-02784 to SX.