Department of Applied Mathematics and Statistics, the State University of New York at Stony Brook, Stony Brook, NY 11790, USA
Center for Computational Biology, Beijing Forestry University, Beijing 100083, China
Center for Statistical Genetics, Pennsylvania State University, Hershey, PA 17033, USA
Department of Mathematics, University of Florida, Gainesville, FL 32611, USA
Abstract
Background
Genetic mapping has been used as a tool to study the genetic architecture of complex traits by localizing their underlying quantitative trait loci (QTLs). Statistical methods for genetic mapping rely on a key assumption, that is, traits obey a parametric distribution. However, in practice real data may not perfectly follow the specified distribution.
Results
Here, we derive a robust statistical approach for QTL mapping that accommodates a certain degree of misspecification of the true model by incorporating integrated square errors into the genetic mapping framework. A hypothesis test is formulated by defining a new test statistic, the energy difference.
Conclusions
Simulation studies were performed to investigate the statistical properties of this approach and compare these properties with those from traditional maximum likelihood and nonparametric QTL mapping approaches. Lastly, analyses of real examples were conducted to demonstrate the usefulness and utilization of the new approach in a practical genetic setting.
Background
Genetic mapping of quantitative trait loci, or QTLs, plays prominent roles in understanding the genetic basis of many phenotypic variations
Parametric modelling has the advantage of easy interpretation of results. However, in practice it is often hard or unrealistic to guarantee the assumed model for analysis truly reflects the phenotypic distribution of a trait. For example, significant measurement errors or outliers occurring as a usual case in data collection may lead the observed trait distribution to deviate from the underlying distribution of data. Figure
Mice data density plot. The empirical density for the growth rate of body mass from ages 5 weeks to 10 weeks in an F_{2} mapping population of 500 mice.
In this article, we derive a new mapping approach that is not only robust for genetic mapping of complex traits with the distorted normal distributions, as shown in Figure
Methods
Mapping population
Suppose there is an F_{2 }population of
Suppose each QTL genotype
Where
We assume that
L_{2}E approach
Our proposed L_{2}E method minimizes a data-based estimate of the L_{2} divergence between the assumed model density
where
Although it is impossible to give the explicit form of
Because the law of large numbers (LLN) is used in deriving the formula, the L_{2}E method is not suitable for data sets with a small sample size. Let Θ denote all the parameters in
The asymptotic properties of the parameter estimators by L_{2}E can be shown by the following proposition.
Proposition 1
where
The estimation functions for (2) are
Define
where
and
Then, the results follow.
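As a concrete illustration of the L_{2}E principle described in this section, the following minimal Python sketch fits a single normal density by minimizing the standard L_{2}E criterion, the integral of f(y|θ)^2 minus (2/n) times the sum of f(y_i|θ). For a normal density the integral term has the closed form 1/(2σ√π). The function names, the crude grid search, and the simulated sample (90% of points from N(30, 2^2) and 10% noise points centred at 55) are assumptions made for illustration only; this is not the authors' implementation.

```python
import math
import random
import statistics


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def l2e_normal(data, mu, sigma):
    """L2E criterion for a single normal density.

    The integral of f^2 equals 1 / (2 * sigma * sqrt(pi)); the cross term
    is estimated by the sample average of f(y_i), as the LLN suggests.
    """
    integral_f2 = 1.0 / (2.0 * sigma * math.sqrt(math.pi))
    mean_density = sum(normal_pdf(y, mu, sigma) for y in data) / len(data)
    return integral_f2 - 2.0 * mean_density


def fit_l2e_normal(data, mu_grid, sigma_grid):
    """Crude grid search for the (mu, sigma) pair minimizing the criterion."""
    return min(
        ((mu, sigma) for mu in mu_grid for sigma in sigma_grid),
        key=lambda p: l2e_normal(data, p[0], p[1]),
    )


# Hypothetical contaminated sample: 180 clean points plus 20 noise points.
rng = random.Random(1)
data = [rng.gauss(30.0, 2.0) for _ in range(180)]
data += [rng.gauss(55.0, 2.0) for _ in range(20)]

mu_grid = [25.0 + 0.25 * i for i in range(61)]     # 25.0 ... 40.0
sigma_grid = [1.0 + 0.25 * i for i in range(37)]   # 1.0 ... 10.0

mu_l2e, sigma_l2e = fit_l2e_normal(data, mu_grid, sigma_grid)
mu_ml = statistics.mean(data)        # ML estimate of the mean
sigma_ml = statistics.pstdev(data)   # ML estimate of the standard deviation

print("L2E:", mu_l2e, sigma_l2e)
print("ML: ", round(mu_ml, 2), round(sigma_ml, 2))
```

In this toy sample the noise points contribute almost nothing to the average density term, so the L_{2}E fit stays close to the main cluster, whereas the ML mean and standard deviation are pulled towards the noise points.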
In the setting of genetic mapping, where a mixture of normal densities applies (model 1), two approaches can be used to implement the principle of minimum integrated squared error. The most straightforward is to model the true density of the error term (e_{i}) directly; the second is to model the true density of the observed phenotype data (y_{i}). The obvious difference between these two methods is that density for e_{i} is
Error term based L_{2}E method (eL_{2}E)
In model (1), the randomness is derived from the underlying error term. Thus, it is natural to directly model the density of the error term
where
Notice that
Then, the estimators of the unknown parameter set in (Θ = (
where
where
Thus, the estimator of the parameters is
In practice, the genomic location of a QTL is estimated by scanning positions across the genome. When the QTL is assumed to exist between the two markers, the
where N_{k }is the number of progeny in the marker genotype group
Phenotype data based L_{2}E method (pL_{2}E)
Unlike the error density, the phenotype data density contains a mixture of density functions each corresponding to a different QTL genotype. Also, because each marker genotype group
where
Notice that
Thus, we have
When a QTL is assumed to be at a marker position,
To combine the information from all nine marker genotypes, we take a weighted sum of the marker energy functions to calculate an overall energy function for phenotype data (
or
Here
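To make the phenotype-based idea concrete, the sketch below evaluates an L_{2}E-type energy for a normal mixture with known mixing weights and a common standard deviation. It relies on the identity that the integral of the product of two normal densities with means a and b and common variance σ^2 equals a normal density with variance 2σ^2 evaluated at a − b, so the squared-density integral has a closed form. The function names, the example weights (standing in for the conditional QTL genotype probabilities within one marker genotype group), and the phenotype values are assumptions made for illustration; the article's exact energy function and its weighting across the nine marker genotype groups are not reproduced here.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, weights, means, sigma):
    """Mixture of normals with a common standard deviation."""
    return sum(w * normal_pdf(y, m, sigma) for w, m in zip(weights, means))


def mixture_sq_integral(weights, means, sigma):
    """Closed form of the integral of the squared mixture density."""
    s = sigma * math.sqrt(2.0)  # standard deviation of the convolved normal
    return sum(
        wj * wk * normal_pdf(mj - mk, 0.0, s)
        for wj, mj in zip(weights, means)
        for wk, mk in zip(weights, means)
    )


def mixture_energy(data, weights, means, sigma):
    """Integral of f^2 minus twice the sample average of f(y_i)."""
    cross = sum(mixture_pdf(y, weights, means, sigma) for y in data) / len(data)
    return mixture_sq_integral(weights, means, sigma) - 2.0 * cross


# Hypothetical values: three QTL genotypes and one apparent outlier (55.0).
weights = [0.25, 0.5, 0.25]
means = [35.0, 30.0, 25.0]
sigma = 4.3
phenotypes = [24.1, 29.5, 30.8, 33.0, 36.2, 27.4, 31.1, 55.0]

energy = mixture_energy(phenotypes, weights, means, sigma)
closed_form = mixture_sq_integral(weights, means, sigma)

# Numerical check of the closed form with a Riemann sum over [-20, 80].
step = 0.01
numeric = step * sum(
    mixture_pdf(-20.0 + step * i, weights, means, sigma) ** 2
    for i in range(10001)
)
outlier_density = mixture_pdf(55.0, weights, means, sigma)

print(energy, closed_form, numeric, outlier_density)
```

The density at the outlying value is close to zero, which is how a point far from all genotype means ends up with little influence on the energy and hence on the parameter estimates.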
Hypothesis testing
The existence of a significant QTL can be tested by the following hypotheses:
H_{0}: g_{0} = g_{1} = g_{2}
H_{1}: Not all equalities in H_{0} hold.
For these hypotheses, we can find their corresponding L_{2}E estimates,
Because the mixture of density functions is a larger family than its composite density functions, the
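Significance cutoffs for a scan statistic such as the ED are reported in this article on the basis of 1000 permutations. A minimal sketch of one common way to obtain such a cutoff is given below: phenotypes are shuffled relative to fixed genotypes, the maximum statistic over all scanned positions is recorded for each shuffle, and an upper quantile of these maxima is taken as the threshold. The function names, the toy genotype data, and the stand-in statistic (an absolute difference in homozygote means, not the ED itself) are hypothetical and serve only to make the example runnable.

```python
import random


def permutation_threshold(phenotypes, scan_statistic, n_perm=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the maximum scan statistic.

    scan_statistic(phenotypes) must return one value per scanned position;
    genotypes stay fixed inside that function while phenotypes are shuffled.
    """
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(scan_statistic(shuffled)))
    maxima.sort()
    index = min(n_perm - 1, int((1.0 - alpha) * n_perm))
    return maxima[index]


# Hypothetical toy data: genotype codes (0, 1, 2) at three positions
# for 12 individuals, and their phenotypes.
genotypes = [
    [0, 0, 1, 1, 1, 2, 2, 1, 0, 2, 1, 1],
    [0, 1, 1, 1, 2, 2, 2, 1, 0, 1, 1, 0],
    [1, 1, 0, 2, 1, 2, 1, 1, 0, 2, 0, 1],
]
phenotypes = [25.3, 24.1, 30.2, 29.7, 31.0, 35.4, 34.8, 30.5, 24.9, 36.1, 29.0, 30.3]


def toy_scan(y):
    """Stand-in statistic: |mean(genotype 2) - mean(genotype 0)| per position."""
    stats = []
    for g in genotypes:
        high = [v for v, code in zip(y, g) if code == 2]
        low = [v for v, code in zip(y, g) if code == 0]
        stats.append(abs(sum(high) / len(high) - sum(low) / len(low)))
    return stats


observed = toy_scan(phenotypes)
threshold = permutation_threshold(phenotypes, toy_scan)
print("observed:", [round(s, 2) for s in observed])
print("0.05 threshold:", round(threshold, 2))
```

Taking the maximum over positions within each permutation is what makes the cutoff apply to the whole scanned region rather than to a single position; restricting the maximum to one chromosome or extending it to the whole genome would give chromosome-wide or genome-wide cutoffs, respectively.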
Results
Monte Carlo simulation
We performed Monte Carlo simulation studies to examine the statistical properties of the L_{2}E-based mapping model. Consider a sample size
By scanning the simulated chromosome with a step size of 2 cM from the left end to the right end, the ED values were calculated and smoothed. Figure
Comparison between the two implementations of the L_{2}E methods by simulation. (a) Using the true density of the error term. (b) Using the true density of the observed data. The arrows on the x-axis indicate the peak of the ED profile. The true position of the QTL is 86 cM from the left end of the simulated chromosome.
Additional simulations were performed to examine the statistical properties of the L_{2}E method, under different sample sizes (
Simulation scenario 1.

| Parameter | True Value | L_{2}E | ML | L_{2}E | ML | L_{2}E | ML |
|---|---|---|---|---|---|---|---|
| | 35 | 35(0.0685) | 35(0.0514) | 35.1(0.1061) | 35(0.0833) | 35.2(0.1653) | 35.2(0.1396) |
| | 30 | 30(0.0332) | 30(0.0286) | 30.1(0.0706) | 30.1(0.0522) | 30(0.1087) | 30(0.0909) |
| | 25 | 25(0.0724) | 25.1(0.057) | 25.1(0.0881) | 25.1(0.0768) | 24.9(0.1489) | 24.8(0.1118) |
| sigma | 4.3 | 4.3(0.0228) | 4.3(0.0165) | | | | |
| sigma | 7.1 | | | 7.0(0.0344) | 7.1(0.0282) | | |
| sigma | 10.6 | | | | | 10.4(0.0548) | 10.6(0.0375) |
| Position | 86 | 85.7(0.1386) | 85.8(0.101) | 85.9(0.2335) | 86.4(0.1433) | 86.0(0.5101) | 85.8(0.2537) |

The L_{2} and ML estimates of QTL parameters from an F_{2} population of 400 individuals for the phenotypic data simulated from normal distributions. Numbers in parentheses are the mean square errors (MSE) of the estimates.
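The residual standard deviations listed as true values in the simulation tables (4.3, 7.1 and 10.6) are consistent with heritabilities of roughly 0.4, 0.2 and 0.1, if one assumes genotypic values of 35, 30 and 25 at the 1:2:1 genotype frequencies expected in an F_{2} population. The following sketch reproduces that arithmetic and draws one simulated data set. Only the heritability of 0.4 is stated explicitly in the article (for the t-distribution scenario); the other two levels, the function names, the seed, and the direct sampling of QTL genotypes without markers are assumptions made for illustration.

```python
import math
import random

GENOTYPIC_VALUES = [35.0, 30.0, 25.0]   # assumed values for the three QTL genotypes
FREQUENCIES = [0.25, 0.5, 0.25]         # 1:2:1 segregation in an F2 population


def genetic_variance(values, freqs):
    """Variance of the genotypic values under the given genotype frequencies."""
    mean = sum(f * v for f, v in zip(freqs, values))
    return sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))


def residual_sd(heritability, values=GENOTYPIC_VALUES, freqs=FREQUENCIES):
    """Solve H2 = Vg / (Vg + sigma^2) for sigma."""
    vg = genetic_variance(values, freqs)
    return math.sqrt(vg * (1.0 - heritability) / heritability)


def simulate_f2(n, heritability, seed=0):
    """Draw QTL genotypes and normally distributed phenotypes for n individuals."""
    rng = random.Random(seed)
    sigma = residual_sd(heritability)
    genotypes = rng.choices(range(3), weights=FREQUENCIES, k=n)
    phenotypes = [GENOTYPIC_VALUES[g] + rng.gauss(0.0, sigma) for g in genotypes]
    return genotypes, phenotypes


for h2 in (0.4, 0.2, 0.1):
    print(h2, round(residual_sd(h2), 1))   # prints 4.3, 7.1 and 10.6 in turn

genotypes, phenotypes = simulate_f2(400, 0.4)
```

Here the genetic variance is 12.5, so that, for example, a heritability of 0.4 gives a residual variance of 12.5 × 0.6 / 0.4 = 18.75 and a standard deviation of about 4.3.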
Second, we simulated scenarios where error distributions in model (1) are nonnormal, using a
Simulation scenario 2.

| Parameter | True Value | L_{2}E (df = 2) | ML (df = 2) | L_{2}E (df = 3) | ML (df = 3) | L_{2}E (df = 4) | ML (df = 4) |
|---|---|---|---|---|---|---|---|
| | 35 | 35.0(0.0168) | 39.3(4.0988) | 35.0(0.0139) | 35.1(0.0185) | 35.0(0.0117) | 35.1(0.0133) |
| | 30 | 30.0(0.0105) | 30.0(0.0349) | 30.0(0.0104) | 30.0(0.0102) | 30.0(0.0102) | 30.0(0.0093) |
| | 25 | 25.0(0.0163) | 19.0(4.4676) | 25.0(0.0158) | 24.9(0.0192) | 25.0(0.0131) | 25.0(0.0133) |
| sigma | | 1.2(0.0083) | 2.6(0.0971) | 1.1(0.0077) | 1.5(0.0337) | 1.1(0.0056) | 1.3(0.0099) |
| Position | 86 | 86.4(0.0649) | 85.6(0.0971) | 86.1(0.053) | 86.2(0.0591) | 86.1(0.0609) | 86.4(0.0498) |

The L_{2} and ML estimates of QTL parameters from an F_{2} population of 400 individuals with heritability of 0.4 for the phenotypic data simulated from t distributions. Numbers in parentheses are the mean square errors (MSE) of the estimates.
Third, we simulated experiments where data contains outlier data points. Because NP mapping is popular for traits with outliers
Simulation scenario 3.

| Parameter | True Value | L_{2}E | ML | NP | L_{2}E | ML | NP | L_{2}E | ML | NP |
|---|---|---|---|---|---|---|---|---|---|---|
| | 35 | 35.3(0.0709) | 35.9(0.0606) | | 35.7(0.1028) | 35.9(0.0905) | | 36(0.1646) | 36(0.1404) | |
| | 30 | 30.1(0.0335) | 31.4(0.0389) | | 30.7(0.074) | 31.5(0.0573) | | 31(0.1108) | 31.4(0.0916) | |
| | 25 | 25(0.0696) | 26.7(0.0774) | | 25.4(0.0911) | 26.8(0.0881) | | 25.8(0.1628) | 26.6(0.1244) | |
| sigma | 4.3 | 4.7(0.0238) | 6.2(0.022) | | | | | | | |
| sigma | 7.1 | | | | 7.6(0.0386) | 8.3(0.0312) | | | | |
| sigma | 10.6 | | | | | | | 11.1(0.0567) | 11.5(0.0376) | |
| Position | 86 | 85.5(0.1466) | 85.2(0.1712) | 86.7(0.1387) | 86(0.2272) | 85.1(0.2528) | 85.9(0.2562) | 85.7(0.4935) | 86.6(0.362) | 85.4(0.3452) |

The L_{2} and ML estimates of QTL parameters from an F_{2} population of 400 individuals for the phenotypic data simulated from normal distributions containing 10% noise points with mean
Simulation scenario 4.

| Parameter | True Value | L_{2}E | ML | NP | L_{2}E | ML | NP | L_{2}E | ML | NP |
|---|---|---|---|---|---|---|---|---|---|---|
| | 35 | 35(0.0664) | 36.8(0.0789) | | 35.1(0.1061) | 35(0.0833) | | 36.1(0.1731) | 36.8(0.1494) | |
| | 30 | 30(0.0325) | 32.3(0.0514) | | 30.1(0.0706) | 30.1(0.0522) | | 30.8(0.1156) | 32.5(0.1054) | |
| | 25 | 25(0.0699) | 27.7(0.0872) | | 25.1(0.0881) | 25.1(0.0768) | | 25.4(0.1531) | 27.4(0.1412) | |
| sigma | 4.3 | 4.6(0.0231) | 8.4(0.0253) | | | | | | | |
| sigma | 7.1 | | | | 7.0(0.0344) | 7.1(0.0282) | | | | |
| sigma | 10.6 | | | | | | | 11.5(0.0588) | 12.8(0.0421) | |
| Position | 86 | 85.6(0.1419) | 84.8(0.2242) | 86.7(0.1426) | 85.9(0.2335) | 86.4(0.1433) | 86.6(0.1737) | 85.5(0.5162) | 85(0.6221) | 85.8(0.4071) |

The L_{2} and ML estimates of QTL parameters from an F_{2} population of 400 individuals for the phenotypic data simulated from normal distributions containing 10% noise points with mean g = 55. Numbers in parentheses are the mean square errors (MSE) of the estimates.
A worked example
Vaughn et al.
Our analysis here focuses on identifying QTLs that may affect the body mass growth rate from ages 5 weeks to 10 weeks, which is defined as body mass ratio between week 10 and week 5. On the right side of the empirical density of this trait (Figure
L_{2}E and MLE mapping of the mice data. Genomic scanning profiles for mapping QTLs controlling the growth rate of body mass from weeks 5 to 10 by the L_{2}E (a) and ML (b) approaches. The y-axes are the ED and LR test statistics, respectively. The dash-dot line and the dashed line are the chromosome-wide and genome-wide cutoffs, respectively, at the 0.05 significance level based on 1000 permutations. The x-axis ticks indicate the marker positions, the arrows on the x-axes show the genomic positions of QTLs significant at the chromosome level, and the asterisk on chromosome 8 in the L_{2}E profile marks a genome-wide significant QTL.
Although the overall profiles of the ED and LR statistics look similar, the two methods detect different significant QTLs. The ML method does not identify any significant QTL at the genome level; however, the L_{2}E method detects one genome-wide significant QTL located 2 cM from the leftmost proximal marker on chromosome 8. Coincidentally, in 2005, Rance et al.
L_{2}E mapping results of the mice data.

| Chromosome | Map position^{a} | Flanking Marker 1 | Flanking Marker 2 | Additive^{b} | Dominance^{b} | %var^{c} |
|---|---|---|---|---|---|---|
| 8 | 2 | D8Mit293 | D8Mit25 | 0.012 | 0.044 | 8.68 |

Significant QTL for body mass ratio between week 10 and week 5 in an F_{2} mouse population detected from the genome-wide interval mapping scan by the L_{2}E and ML methods at the 0.05 significance level.
^{a}Map position = population-estimated position in cM from the leftmost proximal marker.
^{b}Additive and dominance effects of the QTL (the "QTL associated effects" columns).
^{c}%Var = percentage variance explained by the QTL.
Discussion
Current mapping technologies allow us to dissect the variation of quantitative traits into individual genetic components (QTLs). Through this dissection the genetic architecture behind the quantitative traits can be elucidated, which provides a sound basis for future trait improvement. To better utilize genomic data, considerable attention has been paid to developing powerful analytic methodologies that can increase the power, precision, and resolution of QTL mapping (8-16). Almost all the QTL mapping methods proposed so far assume a parametric (mostly normal) distribution for a trait. However, there is increasing recognition of the limitations of the parametric assumption, given that in practice the true distribution of a trait is never known.
In this article, we propose a QTL mapping methodology based on the principle of L_{2}E, which allows the fitted model to differ from the true model. We derived two different implementations of the L_{2}E method within the mapping framework and showed how they are connected. The simulation studies suggest that the pL_{2}E method works better than the eL_{2}E method, and it was therefore used for our further analyses. Additional simulation studies were performed to test the statistical behaviour of the L_{2}E-based mapping approach. The L_{2}E method is more robust to the choice of model, at the cost of lower efficiency. For "perfect" data, ML performs better than L_{2}E. However, when the data contain noise, L_{2}E outperforms ML, and the relative efficiency of L_{2}E increases with the percentage of noise. In practice, it is unrealistic to expect to know the true model underlying the data, and it is almost certain that no data set is perfect. Thus, a better strategy is to use the L_{2}E method first to explore the data and then to compare its results with those of the MLE method.
This work is our first attempt to incorporate the principle of integrated square errors into the genetic mapping framework. Many areas can be explored in the future, such as how to apply this principle to the study of gene-gene or gene-environment interactions. The L_{2}E method would be an excellent addition to the current toolbox for QTL mapping.
Conclusions
In this article, we derive a robust approach for genetic mapping of complex traits by incorporating the principle of the integrated square error into the general mapping framework. This approach, called L_{2}E mapping, automatically handles data points that are apparently outliers by giving them less weight in parameter estimation, and therefore yields more accurate estimates of QTL locations and effects. Where data cleaning is impossible or very difficult, our new method could be a very beneficial choice. Simulation studies showed that in the presence of outliers, the L_{2}E method outperforms the traditional MLE and nonparametric methods in terms of both accuracy and efficacy of parameter estimation. A real data analysis of the mice body mass data also demonstrates the usefulness and utilization of the new approach in a practical genetic setting. We strongly encourage researchers to explore both the L_{2}E and MLE mapping procedures in practice.
Authors' contributions
SW carried out the analysis, prepared and drafted the manuscript. GF participated in the design of the study. YC initiated the project design. ZW participated in the design of the study. RW initiated and established the overall project design, prepared and drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank three anonymous reviewers for valuable comments that significantly improved the manuscript. This work is supported by NSF/IOS0923975 and NIH/UL1RR0330184. We thank Dr. J. Cheverud at Washington University for providing his mouse data to validate our new model.