Email updates

Keep up to date with the latest news and content from BMC Genetics and BioMed Central.

Open Access Research article

Genetic mapping of complex traits by minimizing integrated square errors

Song Wu12, Guifang Fu3, Yunmei Chen4, Zhong Wang23 and Rongling Wu23*

Author Affiliations

1 Department of Applied Mathematics and Statistics, the State University of New York at Stony Brook, Stony Brook, NY 11790, USA

2 Center for Computational Biology, Beijing Forestry University, Beijing 100083, China

3 Center for Statistical Genetics, Pennsylvania State University, Hershey, PA 17033, USA

4 Department of Mathematics, University of Florida, Gainesville, FL 32611, USA

For all author emails, please log on.

BMC Genetics 2012, 13:20  doi:10.1186/1471-2156-13-20


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/13/20


Received:2 June 2011
Accepted:23 March 2012
Published:23 March 2012

© 2012 Wu et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Genetic mapping has been used as a tool to study the genetic architecture of complex traits by localizing their underlying quantitative trait loci (QTLs). Statistical methods for genetic mapping rely on a key assumption, that is, traits obey a parametric distribution. However, in practice real data may not perfectly follow the specified distribution.

Results

Here, we derive a robust statistical approach for QTL mapping that accommodates a certain degree of misspecification of the true model by incorporating integrated square errors into the genetic mapping framework. A hypothesis testing is formulated by defining a new test statistics - energy difference.

Conclusions

Simulation studies were performed to investigate the statistical properties of this approach and compare these properties with those from traditional maximum likelihood and non-parametric QTL mapping approaches. Lastly, analyses of real examples were conducted to demonstrate the usefulness and utilization of the new approach in a practical genetic setting.

Background

Genetic mapping of quantitative trait loci, or QTLs, plays prominent roles in understanding the genetic basis of many phenotypic variations [1-4]. Depending on the biological nature of the organism and trait studied, several types of mapping populations generated from different experimental crosses can be constructed to map the QTL of interest. Among those, backcross and F2 intercross are probably two of the most widely used techniques and have been applied in many areas, such as maize and mice studies [5-7]. These experimental crosses separate individual gene components, including QTLs, in a controlled manner, which serves as a foundation for QTL mapping. The basic question is how to efficiently and effectively associate a quantitative trait with its corresponding QTLs and subsequently determine their locations and genetic effects through QTL-linked genetic markers. The past two decades have seen tremendous statistical methodological development in this area [8-16]. Usually, one significant assumption required to derive these statistical methods is that the phenotypic values of a trait can be modelled by a known parametric distribution, such as a normal distribution. By estimating the parameters that define the phenotypic distribution of each genotype at a putative QTL and testing their differences among different QTL genotypes within a mixture model framework, the existence of a QTL and its genetic effects can be inferred on the genome. Statistical approaches for parameter estimation with the mixture model are typically derived within the maximum likelihood (ML) context because of many good properties of a ML estimator, such as asymptotical unbiasedness and asymptotical efficiency. Recently, a surge of interest has also been exploded in solving the mixture models by Bayesian approaches [17-19].

Parametric modelling has the advantage of easy interpretation of results. However, in practice it is often hard or unrealistic to guarantee the assumed model for analysis truly reflects the phenotypic distribution of a trait. For example, significant measurement errors or outliers occurring as a usual case in data collection may lead the observed trait distribution to deviate from the underlying distribution of data. Figure 1 illustrates the empirical density for the growth rate of body mass from ages 5 weeks to 10 weeks in an F2 mapping population of 500 mice derived from two inbred lines [20]. It is obvious that the density function is not a perfect normal distribution, as it contains a small bump on the upper tail of the density function. Since the main body of the density curve resembles a normal distribution, the true distribution of observed body mass data (Figure 1) can be viewed as a distorted normal. In these cases, if the traditional methods, such as maximum likelihood, were applied, significant bias would be introduced by the potential outliers. Therefore, there is a pressing need to develop a more robust statistical approach for mapping those complex traits that display such a distribution. Non-parametric rank-based method has been introduced for mapping traits with outliers [11]; however, as it is nonparametric, the interpretability of the mapping results, especially on the genetic effects such as percentage of variance explained by the significant QTL, is usually poor.

thumbnailFigure 1. Mice data density plot. The empirical density for the growth rate of body mass from ages 5 weeks to 10 weeks in an F2 mapping population of 500 mice.

In this article, we derive a new mapping approach that is not only robust for genetic mapping of complex traits with the distorted normal distributions, as shown in Figure 1, but also maintains the easy interpretability of a parametric model by minimizing the integrated square errors. The integrated square error has been typically used as the goodness-of-fit criterion in nonparametric density estimation [21]. Some previous studies have also shown that this criterion can be applied in parametric settings and the parameter estimator from this method, or the L2 estimator (L2E), is robust to the model specification [22-25]. In this sense, this method allows moderate deviation of a proposed density function from the true underlying density. Here, we incorporate the principal of the integrated square errors into genetic mapping framework in a parametric way, and call it the L2E mapping method. The main advantage of this new mapping method is that it automatically manipulates data points that are apparently outliers by giving them less weight in parameter estimation, and therefore yields more accurate estimation of QTL locations and effects. In the case where the data cleaning is not possible or very hard to do so, L2E method would be a very beneficial choice.

Methods

Mapping population

Suppose there is an F2 population of N progenies, initiated with two different inbred lines, in which there are three groups of genotypes at each gene. A genetic linkage map is constructed for this population, aimed to identify trait-controlling QTLs on the genome. Let yi denote a phenotypic trait of interest for F2 progeny i. It is assumed that a QTL with allele Q and q exists to affect this trait. The QTL is bracketed by two flanking markers M1 (with alleles M1 and m1) and M2 (with alleles M2 and m2). Let r1, r2 and r be the recombination fractions between M1 and the QTL, between the QTL and M2, and between the two markers, respectively. Although QTL genotypes are not known, the probability with which a progeny i carries a specific QTL genotype can be inferred from the marker genotypes of this progeny. The conditional probability of QTL genotype j (j = 2 for QQ, 1 for Qq, and 0 for qq), conditional upon one of the nine genotypes of the flanking markers for progeny i in the F2 population, can be derived and expressed as a function of the recombination fractions (r1, r2 and r) [26].

Suppose each QTL genotype j has a genotypic mean gj. The comparisons of these means among three different genotypes can determine whether and how this putative QTL affects the trait. The trait phenotype of progeny i due to the QTL can be expressed by the following linear statistical model:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M1">View MathML</a>

(1)

Where ξij is an indicator variable for individual i that is defined as 1 for QTL genotype j considered and 0 otherwise, and ei ~ f (e) is the residual effect of progeny i, including the aggregate effect of polygenes and error effect.

We assume that f (e) is the true density of ei, which is unknown but has zero mean. Then, the density of yi would be a mixture of f with mean gj. Within the maximum likelihood context, the EM algorithm can be employed to estimate the genetic parameters and test the existence of the QTL [26].

L2E approach

Our proposed L2E method is to minimize a data-based estimation of the L2 divergence between the assumed model density φ and the true objective density (f) underlying the data. An energy function (E) can be defined to measure the divergence between φ and f:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M2">View MathML</a>

where u is a random variable with density of f. Since the goal is to minimize E, ∫ f2du can be dropped off because it is a constant of unknown parameters. Hence, the energy function to be minimized can be redefined as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M3">View MathML</a>

Although it is impossible to give the explicit form of E(φ), by applying the law of large numbers (LLN), it can be approximated by the observed data and then a new energy function can be formulated as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M4">View MathML</a>

(2)

Since the LLN has been employed in the formula derivation, the L2E method is not suitable for dataset with a small sample size. Let Θ denote all the parameters in E, then the L2E of Θ <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M5">View MathML</a> is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M6">View MathML</a>

The asymptotic properties of the parameter estimators by L2E can be shown by the following proposition.

Proposition 1

Consider a single trait y. Let φ(y | Θ) be the parametric model used in (2). Under mild conditions, the L2E parameters are consistent and asymptotically normal, i.e.,

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M7">View MathML</a>

where

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M8">View MathML</a>

Proof

The estimation functions for (2) are

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M9">View MathML</a>

Define <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M10">View MathML</a>. Then, from the theory for standard M-estimators, we have

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M11">View MathML</a>

where

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M12">View MathML</a>

and

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M13">View MathML</a>

Then, the results follow.

In the setting of genetic mapping, where the density of a mixture of normal applies (model 1), two approaches can be used to implement the principle of minimum integrated squared errors. The most straightforward implementation is to directly model the true density of the error term (ei), and the second one is based on modelling the true density of the observed phenotype data (yi). The obvious difference between these two methods is that density for ei is f with mean zero, but the density for yi is a mixture of f with mean gj. A more subtle difference is that the error term based L2E method (eL2E) involves one additional approximation step in genetic positions between markers. Although simulation studies shown in later sections demonstrate that eL2E is inferior to the phenotype data based L2E method (pL2E), we would still like to present the eL2E procedure, because its formulation at marker positions can help derive the pL2E method, as will be seen below. Both eL2E and pL2E employ the energy function E defined in Equation (2), with u being the error term e or the observed data y, correspondingly.

Error term based L2E method (eL2E)

In model (1), the randomness is derived from the underlying error term. Thus, it is natural to directly model the density of the error term f(e). In a continuous case, a normal density function φ(e|0,σ2) is used to approximate the true error density f(e). Thus, using (2), the energy function for error (Ee) becomes

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M14">View MathML</a>

where

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M15">View MathML</a>

(3)

Notice that

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M16">View MathML</a>

Then, the estimators of the unknown parameter set in (Θ = (g0, g1, g2, σ2)) can be represented as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M17">View MathML</a>

where φ(ei) can be approximated by its expectation E[φ(ei)]. Based on the error (3), we have

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M18">View MathML</a>

where ωij is the conditional probability of QTL genotype j given the marker genotype of progeny i.

Thus, the estimator of the parameters is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M19">View MathML</a>

(4)

In practice, the genomic location of a QTL is estimated by scanning positions across the genome. When the QTL is assumed to exist between the two markers, the Ee is approximated twice, one by the LLN and the other by the calculation of φ(ei). However, if the QTL is scanned at a marker position, only the approximation by the LLN is needed because no mixture density is used in this situation. The energy function at the marker position Eem is expressed as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M20">View MathML</a>

(5)

where Nk is the number of progeny in the marker genotype group k and <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M21">View MathML</a>.

Phenotype data based L2E method (pL2E)

Unlike the error density, the phenotype data density contains a mixture of density functions each corresponding to a different QTL genotype. Also, because each marker genotype group k (k = 1,...,9 for two markers) has a different probability of linking with the QTL genotypes, the phenotype density is marker-dependent. The density for marker genotype k is expressed as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M22">View MathML</a>

(6)

where ωij is the conditional probability of QTL genotype j given the marker genotype k. From Eq. (2), the energy function for marker genotype k is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M23">View MathML</a>

Notice that

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M24">View MathML</a>

Thus, we have

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M25">View MathML</a>

When a QTL is assumed to be at a marker position, φk(yi) = φk(yi | mj), which is not in a mixture form. The <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M26">View MathML</a> at a marker position can be simplified as:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M27">View MathML</a>

To combine the information from all nine marker genotypes, we take a weighted sum of the marker energy functions to calculate an overall energy function for phenotype data (Ed) as

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M28">View MathML</a>

or

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M29">View MathML</a>

Here h(x) is a monotone increasing function with respect to x. The reason for choosing such an h function is that the more progenies in one marker group, the better approximation accuracy achieved by the LLN, and the more weights should be put on this group. To determine the exact form of h, we need to use the formulation for eL2E. Because at marker positions, the eL2E and pL2E approaches use exactly the same information for the derivation of energy function, they should agree at those positions. Therefore, a comparison between the energy functions in (5) and (7) for eL2E and pL2E at marker positions suggests that h(Nk) = Nk/N where N is the total number of progeny. By using this form of h(Nk), the estimators for the unknown parameters in Ed is

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M30">View MathML</a>

Hypothesis testing

The existence of a significant QTL can be tested by the following hypotheses:

H0: g0 = g1 = g2H1: Not all equalities in H0 hold.

For these hypotheses, we can find their corresponding L2E estimates, <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M31">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M32">View MathML</a>, and energies, <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M33">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M34">View MathML</a>, respectively. Analogous to the likelihood ratio (LR) test statistics, we define an energy difference (ED) test statistics for our hypothesis testing:

<a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M35">View MathML</a>

Because the mixture of density functions is a larger family than its composite density functions, the <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M34">View MathML</a> is minimized over a larger space than the <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M33">View MathML</a>. Thus, <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M34">View MathML</a> should always be smaller than the <a onClick="popup('http://www.biomedcentral.com/1471-2156/13/20/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2156/13/20/mathml/M33">View MathML</a>, i.e., the test statistics ED should always be positive. As typically done in genome-wide QTL mapping, a permutation test [27] is performed to determine the critical threshold value for ED.

Results

Monte Carlo simulation

We performed Monte Carlo simulation studies to examine the statistical properties of the L2E-based mapping model. Consider a sample size N from an F2 population, with which one chromosome segment was simulated with a length of 200 cM covered by 11 evenly spaced markers. Suppose there is a QTL responsible for a quantitative trait that is placed at 86 cM from the first marker on the left-hand side. Both the QTL and markers are assumed to be codominant. Three QTL genotypes are assumed to have different mean values, with a common variance (which is scaled according to a given heritability).

By scanning the simulated chromosome with a step size of 2 cM from the left end to the right end, the ED values were calculated and smoothed. Figure 2 shows two typical ED profiles obtained by modelling the error density (Figure 2A) or the phenotype density (Figure 2B). The peak value from modelling the error density always occurs at a marker position, although the true QTL location is placed between the fifth and sixth markers, whereas the method by using the phenotype density can find a peak ED value close to the true QTL location, suggesting that the pL2E approach performs better than the eL2E. Therefore, for the simulation studies and real-data analysis, the pL2E method will be used. This is reasonable because the derivation of eL2E involves two approximations but pL2E involves only one by the LLN. For the ease of notation, hereinafter L2E means pL2E.

thumbnailFigure 2. Comparison between the two implementations of the L2E methods by simulation. (a) Using the true density of the error term. (b) Using the true density of the observed data. The arrows to the x-axe indicate the peak of the ED profile. The true position of the QTL is at 86 cM from the left end of the simulated chromosome.

Additional simulations were performed to examine the statistical properties of the L2E method, under different sample sizes (N = 100, 200, 400) and heritabilities (H2 = 0.1, 0.2, 0.4). In each case, 100 replicates were run to evaluate the consistency and efficiency of the mapping methods. First, we consider simulation scenarios where error distributions in model (1) are normally distributed without any outlier data; i.e., the normal distribution is the true model. Because the simulation results from different sample sizes display similar patterns, here we only show the result for N = 400, which is tabulated in Table 1. Both L2E and traditional ML methods obtained consistent estimators and similar standard errors for the genetic effect parameters. However, the ML estimators have better efficiency as evidenced by smaller MSEs. This is expected because under the true model the MLE is asymptotically efficient.

Table 1. Simulation scenario 1.

Second, we simulated scenarios where error distributions in model (1) are non-normal, using a t-distributions as error terms. In addition to different combinations of sample size and heritability, we also changed the degrees of freedom (df) of the t-distributed errors. When df is high (e.g., df = 4), where the t-distribution approximate a normal distribution, the two methods perform similarly (Table 2). However, when df is low (i.e., df = 2), where the t-distribution has much heavier tails than the normal distribution, the MLE method failed to give correct parameter estimates and yielded much larger standard errors. In the contrast, the L2E maintained the correct estimates with smaller standard errors. This demonstrates the robustness of the L2E method against model misspecification.

Table 2. Simulation scenario 2.

Third, we simulated experiments where data contains outlier data points. Because NP mapping is popular for traits with outliers [11], we compared the L2E model with both ML and NP approaches. The outliers were generated from another normal density on the upper tail of a mixture density. Different percentages of noise points (0, 5%, 10%, and 20%) were considered. The main results are shown in Table 3 with 10% outliers with noise mean at 45 and in Table 4 with 10% outliers with noise mean at 55. Our findings are summarized as follows: (1) With the existence of noise points, the L2E estimators are consistent but the ML estimators become biased towards the direction of the outliers. As the outliers move further away from the true density (from 45 to 55), the ML estimators perform significantly worse, but the L2E estimators stay consistent with very little impact. (2) As the heritability becomes smaller and smaller, the difference between the two methods becomes less. This is because the variation of the mixture density increases with decreasing heritability and, thus, the relative positions of the outliers become closer. This is consistent with the point (1) (3) L2E and NP methods show similar robustness to the outliers. However, the L2E method maintains the interpretability of a parametric model and gives accurate estimates of genetic effects. Overall, the simulation results demonstrate that the L2E method is preferred to the MLE and NP methods when the true model is misspecified or non-ignorable outliers exist.

Table 3. Simulation scenario 3.

Table 4. Simulation scenario 4.

A worked example

Vaughn et al. [20] constructed a linkage map with 96 microsatellite markers for 502 F2 mice (259 males and 243 females) derived from two inbreed strains, the Large (LG/J) and Small (SM/J). This map has a total map distance of 1780 cM (in 19 linkage groups) and an average interval length of 23 cM. The F2 progeny was measured for body mass at 10 weekly intervals starting at age 7 days. The raw weights were corrected for the effects of each covariant due to dam, litter size at birth, parity, and sex [20].

Our analysis here focuses on identifying QTLs that may affect the body mass growth rate from ages 5 weeks to 10 weeks, which is defined as body mass ratio between week 10 and week 5. On the right side of the empirical density of this trait (Figure 1), there is an obvious bump, suggesting the existence of some outliers. Both L2E and ML methods were applied to map this trait. The profiles of the two test statistics, energy difference (ED) and likelihood ratio statistic (LRS) across the whole mice genome is shown in Figure 3. The empirical distribution of test statistics were calculated on the basis of 1000 permutations and the 5% significance level was chosen.

thumbnailFigure 3. L2E and MLE mapping of the mice data. Genomic scanning profiles for mapping QTLs controlling the growth rate of body mass from weeks 5 to 10 by L2E (a) and ML approaches (b). The y-axes are the ED and LR test statistics, respectively. The dash dot line and the dash line are the chromosome-wide and genome-wide 0.05 cutoffs at the significant level of 0.05 based on the 1000 permutations, respectively. The x-axis ticks indicates the marker positions, the arrows to the × axes shows the genomic positions of the significant QTL at chromosome level, and the asterisk at chromosome 8 in the L2E profile marks a genome-wide significant QTL.

Although the overall profiles of ED and LRS look similar, they did detect different significant QTLs. The ML method cannot identify any significant QTL at the genome level; however, the L2E method successfully detects one genome-wide significant QTL at 2 cM to the leftmost proximal marker on the chromosome 8. Coincidently, in 2005, Rance et al. [28] reported a significant QTL for the mature mice body mass located at 7 cM to the leftmost proximal marker on the chromosome 8, almost at the same location for the significant QTL identified here. Our finding hence further validates the existence of a significant QTL for mice body mass at the beginning of the chromosome 8. The genetic effects of the significant QTL identify by the L2E method are summarized in Table 5. This example shows the power of the L2E method to detect significant QTLs in practice.

Table 5. L2E mapping results of the mice data.

Discussion

Current mapping technologies allow us to dissect the variation of quantitative traits into individual genetic components (QTLs). Through this dissection the genetic architecture behind the quantitative traits can be elucidated, which provides a sound basis for future trait improvement. To better utilize the genomic data, considerable attention has been paid to develop powerful analytic methodologies that can increase the power, precision, and resolution of QTL mapping (8-16). Currently, almost all the QTL mapping methods proposed so far assume a parametric (mostly normal) distribution density of a trait. However, there is an increasing recognition of the limitation for the parametric assumption, given that in practice the true distribution of a trait is never known.

In this article, we propose a QTL mapping methodology based on the principle of L2E, which may allow the fitted model to be different from the true model. We derived two different implementation of the L2E method into the mapping framework and show how they are connected. The simulation studies suggest that the pL2E method works better than eL2E method and were used for our further analyses. Additional simulation studies were performed to test the statistical behaviour of the L2E-based mapping approach. The L2E method is more robust in the model choice at a cost of lower efficiency. For a "perfect" data, the ML performs better than the L2E. However, when the data contains noises, the L2E outperforms the ML. The relative efficiency of the L2E increases with increasing percentage of noises. In practice, it would be unrealistic for us to know the true model underlying the data, but it can be almost assured that no data is perfect. Thus, a better strategy is that the L2E method can first used to explore the data, with results compared with the MLE method.

This work is our first attempt to incorporate the principle of the integrated square errors into the genetic mapping framework. There are many areas that can be explored in the future, such as how to apply this principle to examine the gene-gene interaction or gene-environment interactions. The L2E method would be an excellent addition to the current toolbox of the QTL mapping.

Conclusions

In this article, we derive a robust approach for genetic mapping of complex traits by incorporating the principal of the integrated square error into the general mapping framework. This approached, called the L2E mapping, automatically manipulates data points that are apparently outliers by giving them less weight in parameter estimation, and therefore yields more accurate estimation of QTL locations and effects. In the case where the data cleaning is not possible or very hard to do so, our new method could be a very beneficial choice. Simulation studies showed that in the presence of outliers, L2E method outperforms the traditional MLE and non-parametric methods in terms of both accuracy and efficacy of the parameter estimations. A real data analysis of the mice body mass data also demonstrates the usefulness and utilization of the new approach in a practical genetic setting. We strongly encourage researchers to explore both the L2E and MLE mapping procedures in practice.

Authors' contributions

SW carried out the analysis, prepared and drafted the manuscript. GF participated in the design of the study. YC initiated the project design. ZW participated in the design of the study. RW initiated and established the overall project design, prepared and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank three anonymous reviewers for valuable comments that significantly improved the manuscript. This work is supported by NSF/IOS-0923975 and NIH/UL1RR0330184 We thank Dr. J. Cheverud at Washington University for providing his mouse data to validate our new model.

References

  1. Horvat S, Bunger L, Falconer VM, Mackay P, Law A, Bulfield G, Keightley PD: Mapping of obesity QTLs in a cross between mouse lines divergently selected on fat content.

    Mamm Genome 2000, 11:2-7. PubMed Abstract | Publisher Full Text OpenURL

  2. Haston CK, Zhou X, Gumbiner-Russo L, Irani R, Dejournett R, Gu X, Weil M, Amos CI, Travis EL: Universal and radiation-specific loci influence murine susceptibility to radiation-induced pulmonary fibrosis.

    Cancer Res 2002, 62:3782-3788. PubMed Abstract | Publisher Full Text OpenURL

  3. Wang X, Le Roy I, Nicodeme E, Li R, Wagner R, Petros C, Churchill GA, Harris S, Darvasi A, Kirilovsky J, Roubertoux PL, Paige B: Using advanced intercross lines for high-resolution mapping of HDL cholesterol quantitative trait loci.

    Genome Res 2003, 13:1654-1664. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  4. Hu KM, Qiu DY, Shen XL, Li XH, Wang SP: Isolation and manipulation of quantitative trait loci for disease resistance in rice using a candidate gene approach.

    Mol Plant 2008, 5:786-793. OpenURL

  5. Lynch M, Walsh JB: Genetics and analysis of quantitative traits. In Sinauer Associations. Sinauer Associations, Sunderland; 1998. OpenURL

  6. Wu RL, Ma CX, Hou W, Corva P, Medrano JP: Functional Mapping of Quantitative Trait Loci That Interact With the hg Mutation to Regulate Growth Trajectories in Mice.

    Genetics 2005, 17:239-249. OpenURL

  7. Coelho CM, Wu S, Li YC, Hunter BG, Dante RA, Cui YH, Wu RL, Larkins BA: Identification of quantitative trait loci that affect endoreduplication in maize endosperm.

    Theor Appl Genet 2007, 115:1137-1145. PubMed Abstract | Publisher Full Text OpenURL

  8. Lander ES, Botstein D: Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps.

    Genetics 1989, 121:185-199. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Zeng ZB: Precision mapping of quantitative trait loci.

    Genetics 1994, 136:1457-1468. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Xu S, Atchley WR: A random model approach to interval mapping of quantitative trait loci.

    Genetics 1995, 141:1189-1197. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  11. Kruglyak L, Lander ES: A nonparametric approach for mapping quantitative trait loci.

    Genetics 1995, 139:1421-1428. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Kao CH, Zeng ZB, Teasdale RD: Multiple interval mapping for quantitative trait loci.

    Genetics 1999, 152:1203-1216. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Wu RL, Ma CX, Casella G: Joint linkage and linkage disequilibrium mapping of quantitative trait loci in natural populations.

    Genetics 2002, 160:779-792. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Jourjon MF, Jasson S, Marcel J, Ngom B, Mangin B: MCQTL: multi-allelic QTL mapping in multi-cross design.

    Bioinformatics 2004, 21:128-130. PubMed Abstract | Publisher Full Text OpenURL

  15. Jin C, Fine J, Yandell BA: A unified semiparametric framework for QTL analyses, with application to spike phenotypes.

    J Am Stat Assoc 2006, 102:56-67. OpenURL

  16. Siegmund D, Yakir B: The Statistics of Gene Mapping. New York: Springer; 2007. OpenURL

  17. Satagopan JM, Yandell BS, Newton MA, Osborn TC: A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo.

    Genetics 1996, 144:805-816. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Yi NJ, Xu SZ, Allison DB: Bayesian model choice and search strategies for mapping interacting quantitative trait loci.

    Genetics 2003, 165:867-883. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  19. Liu T, Wu RL: A Bayesian algorithm for functional mapping of dynamic complex traits.

    Algorithms 2009, 2:667-691. Publisher Full Text OpenURL

  20. Vaughn TT, Pletscher LS, Peripato A, King-Ellison K, Adams E, Erikson C, Cheverud JM: Mapping quantitative trait loci for murine growth: a closer look at genetic architecture.

    Genet Res 1999, 74:313-322. PubMed Abstract | Publisher Full Text OpenURL

  21. Scott DW: Multivariate Density Estimation: Theory, Practice and Visualization. New York: John Wiley; 1992. OpenURL

  22. Beran R: Robust location estimates.

    Ann Stat 1977, 5:431-444. Publisher Full Text OpenURL

  23. Hjort NL: Minimum L2 and robust Kullback-Leibler estimation.

    Proceedings of the 12th Prague Conference 1994, 102-105. OpenURL

  24. Basu A, Harris IR, Hjort NL, Jones MC: Robust and efficient estimation by minimising a density power divergence.

    Biometrika 1998, 85:49-559. OpenURL

  25. Scott DW: Parametric statistical modeling by minimum integrated square error.

    Technometrics 2001, 43:273-285. OpenURL

  26. Wu RL, Ma CX, Casella G: Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. New York: Springer; 2007. OpenURL

  27. Doerge RW, Churchill GA: Permutation test for multiple loci affecting a quantitative character.

    Genetics 1996, 142:285-294. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  28. Rance KA, Fustin JM, Dalgleish G, Hambly C, Bünger L, Speakman JR: A paternally imprinted QTL for mature body mass on mouse chromosome 8.

    Mamm Genome 2005, 16:567-577. PubMed Abstract | Publisher Full Text OpenURL