Division of Biostatistics, Institute for Health and Society, Medical College of Wisconsin, Milwaukee, WI 53226, USA

Abstract

Background

In genetic association studies of quantitative traits using F_{∞ }models, how to code the marker genotypes and interpret the model parameters appropriately is important for constructing hypothesis tests and making statistical inferences. Currently, the coding of marker genotypes in building F_{∞ }models has mainly focused on the biallelic case. A thorough treatment of the coding of marker genotypes and the interpretation of model parameters for F_{∞ }models is needed, especially for genetic markers with multiple alleles.

Results

In this study, we formulate F_{∞ }genetic models under various regression model frameworks and introduce three genotype coding schemes for genetic markers with multiple alleles. Starting from an allele-based modeling strategy, we first describe a regression framework to model the expected genotypic values at given markers. Then, as an extension of the biallelic case, we introduce three coding schemes for constructing fully parameterized one-locus F_{∞ }models and discuss the relationships between the model parameters and the expected genotypic values. Next, under a simplified modeling framework for the expected genotypic values, we consider several reduced one-locus F_{∞ }models from the three coding schemes and examine the estimability and interpretation of their model parameters. Finally, we explore some extensions of the one-locus F_{∞ }models to two loci. Several fully parameterized as well as reduced two-locus F_{∞ }models are addressed.

Conclusions

The genotype coding schemes provide different ways to construct F_{∞ }models for association testing of multi-allele genetic markers with quantitative traits. Which coding scheme to apply depends on how conveniently it supports statistical inference on the parameters of interest. Based on these F_{∞ }models, standard regression model fitting tools can be used to estimate and test for various genetic effects through statistical contrasts, with adjustment for environmental factors.

Background

Genetic markers with multiple alleles are common in genetic studies. It is well known that the ABO blood types in humans are determined by three alleles at a genetic locus on chromosome 9. Molecular markers such as microsatellites often have multiple alleles. The major histocompatibility complex (MHC), a highly polymorphic genome region that resides on human chromosome 6, encompasses multiple genes that encode many human leukocyte antigens (HLA) and play an important role in the regulation of immune responses. Depending on the resolution level of allele typing, each of the HLA-A, B, C, DR, DQ and DP gene loci could contain tens to hundreds of allele types. In addition, in the haplotype analysis of single-nucleotide polymorphisms (SNPs), the various haplotypes from a set of SNPs can also be treated as different alleles of a 'super' marker locus that consists of the set of SNPs.

Presently, three types of genetic models are commonly used in the genetic analysis of quantitative traits. One is Fisher's analysis of variance (ANOVA) models, which focus on a decomposition of the genotypic variance into genetic variance components contributed by various genetic effects at quantitative trait loci (QTL). Another is the F_{∞ }models, which concentrate on direct statistical modeling of the expected genotypic values at target genetic markers or QTL and on association testing of various genetic effects. The third is the so-called functional genetic models, which emphasize modeling the functional effects of genes. Both Fisher's ANOVA models and the F_{∞ }models can be referred to as statistical models, while the functional genetic models have fundamentally different objectives and estimation methods from the statistical models. The distinction between these different types of genetic models has been discussed extensively.

The F_{∞ }models have been widely used in genetic association studies of quantitative traits. In building F_{∞ }models, how to code genotypes at a marker (or QTL) and interpret the model parameters are fundamental issues for constructing appropriate testing hypotheses and making correct statistical inferences. While Fisher's ANOVA models are directly applicable to genetic markers with multiple alleles, the F_{∞ }models have by contrast been discussed mainly in the biallelic case. Previous work has extended the F_{∞ }models to multi-allele models with a focus on the definition of various genetic effects and their relationships with the average genetic effects defined in Fisher's models. However, a thorough treatment of the coding of marker genotypes and the interpretation of model parameters for F_{∞ }models has not been given, especially for genetic markers with multiple alleles.

In general, there are two different strategies for coding the marker or QTL genotypes. One is to treat each marker or QTL as a potential risk factor with its genotypes as the risk units. Then, similar to the handling of categorical covariates in classical regression models, at each locus we can create one dummy variable per genotype and include all but one (the reference) of these dummy variables in a model. However, this genotype coding is often limited by the available sample size, especially when the number of alleles at the marker locus is large. Alternatively, as alleles are often regarded as the basic genetic risk units that may contribute to disease phenotypes, we may want to treat the alleles at each marker or QTL as the risk units and examine the effects of alleles. However, genetic data have some special features that need to be taken into account in building allele-based models. In the genome of a diploid species such as humans, alleles normally appear in pairs to form a genotype at each marker locus or QTL, with one allele from the father and one from the mother, except for the sex chromosomes in males. That is, at each locus we have two within-locus risk factors that reside on a homologous pair of chromosomes. Unlike the classical two-way ANOVA model, in which the two risk factors have different risk units, the paternal and maternal risk factors at a locus often share the same set of alleles. Besides, the parental origins (i.e., the phase) of the two alleles at each locus are quite often unknown. These features can complicate the allele-based coding of marker genotypes and cause confusion in the interpretation of the model parameters.
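To make the sample-size concern with genotype-based dummy coding concrete, the short sketch below counts the unordered genotypes at a locus with m alleles and the number of dummy variables that coding would need. It is an illustration only; the function names are hypothetical, and it assumes unphased genotypes, so that A_{j}A_{k} and A_{k}A_{j} are the same genotype.

```python
from itertools import combinations_with_replacement


def unordered_genotypes(m):
    """All unordered genotypes (j, k) with j <= k for alleles 1..m (phase unknown)."""
    return list(combinations_with_replacement(range(1, m + 1), 2))


def n_genotype_dummies(m):
    """Dummy variables for genotype-based coding: one per genotype, minus the reference."""
    return m * (m + 1) // 2 - 1


if __name__ == "__main__":
    # The number of genotype dummies grows quadratically with the number of alleles.
    for m in (2, 3, 10, 50):
        print(m, len(unordered_genotypes(m)), n_genotype_dummies(m))
```

With ten alleles the genotype-based coding already needs 54 dummy variables, which is why the number of alleles quickly outpaces the sample size available per genotype.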

In this study, we introduce three allele-based coding schemes for building F_{∞ }models, namely the allele, F_{∞ }and allele-count codings. First, we formulate F_{∞ }models under a general regression framework to model the expected genotypic values at given markers or QTL. Then, under a standard ANOVA model setting, we present several fully parameterized one-locus models using the three allele-based coding schemes. Some potential collinearity relationships among the coding variables of the marker genotypes are clarified, and strategies to avoid redundant model parameters are proposed. After that, we examine the definition of model parameters under a reduced one-locus model framework. The impact of a linear relationship among the coding variables of marker genotypes on the estimability of the model parameters is fully explored based on linear model theory. Finally, we consider the extension of the one-locus models to the two-locus situation. Several fully parameterized as well as reduced two-locus models are addressed. A focus of this study is to establish the relationships between the model parameters and the expected genotypic values at given marker loci or QTL for the F_{∞ }models built from these three coding schemes under different model frameworks, and to explain how to estimate and test for various genetic effects through statistical contrasts. Relationships among different coding schemes and models are also illustrated through simulation.

Results

Fully parameterized one-locus models

In genetic studies, a quantitative trait _{i }
_{i }
_{i }
_{i}
_{i }

where _{i}
_{i}
_{i}
_{i }
_{i}
_{i }
_{i}
_{i}
_{i}
_{i}
_{i}
_{i}

Now, consider one target marker locus with multiple alleles _{1}, ..., _{m}
_{j}A_{j}
_{j}A_{k}
_{jk }
_{j}A_{k}
_{j}A_{k }
_{jk }
_{kj }
_{jk}
_{jk }

where _{j }
_{j }
_{k}
_{1}, ..., _{m}
_{j }
_{j }
_{j }
_{k }
_{k }
_{j}
_{jk }
_{jk }

In order to avoid the inestimability issue, one way is to add constraints on the model parameters. However, those constraints, together with the symmetry property of

and

for each allele type _{j}

for _{j}
_{1j
}, _{2j
}are not observable because we do not know exactly which allele is inherited from paternal or maternal gamete for the sampled individuals without their parental information. But this unknown phase problem does not affect the definitions of _{j}
_{jk }
_{j }
_{j }
_{jk }
_{j}A_{k }

for _{k}
_{jk}
_{jk}
_{k}
_{lk}
_{m }
_{km}

for _{jk }
_{j }
_{jk }
_{m}
_{mm}
_{j }
_{jm }
_{mm }
_{jk }
_{jk }
_{km}
_{jm }
_{mm}
_{j }
_{m }
_{j }
_{m }
_{jk }
_{m }
_{j }
_{k}
_{k }
_{j}
_{m}
_{jk }
_{m }
_{j }
_{k}
_{k }
_{j}
_{m}
_{j }
_{kj}
_{m }
_{km}
_{j }
_{j }
_{0 }: _{j }
_{jk }
_{0 }: _{j }
_{jk }
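The F_{∞ }model has been widely used in genetic association studies. In the simple biallelic case with two alleles A and a, the F_{∞ }model gives

G_{AA} = μ + a, G_{Aa} = μ + d, G_{aa} = μ - a,

where G_{AA}, G_{Aa} and G_{aa} are the expected genotypic values of the three genotypes, μ is the mid-point of the two homozygous genotypic values, a = (G_{AA} - G_{aa})/2 is the additive effect, and d = G_{Aa} - (G_{AA} + G_{aa})/2 is the dominance effect. The F_{∞ }model can also be written in a linear model form as

G = μ + a x + d z,

where x = 1, 0, -1 and z = 0, 1, 0 for the genotypes AA, Aa and aa, respectively.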

We refer to the above coding of the marker genotypes as the F_{∞ }coding. As a straightforward extension of the F_{∞ }coding scheme to multiple alleles, we can define the following coding variables

for each _{j}
_{j }
_{j}
_{jk}
_{j}
_{j}
_{j}
_{j}
_{jj}
_{jk}
_{j}
_{k}
_{∞ }coding as

for

Therefore, _{j }
_{jj }
_{mm}
_{jj }
_{jm }
_{jj }
_{mm}
_{jk}
_{jk }
_{m }
_{j }
_{k }
_{m}
_{jj }
_{j }
_{m }
_{j }
_{m}
_{j }
_{0 }: _{j }
_{jk }
_{0 }: _{j }
_{jk }

In addition to the allele and F_{∞ }codings, another way of coding the marker genotypes which occasionally appears in practice is to count the number of alleles in marker genotypes for each specific allele _{j}
_{j}
_{j }
_{j}

for each _{1j
}(_{j}
_{j}
_{jj}
_{2j
}(_{jj}

for

Therefore, _{j }
_{m }
_{j }
_{m}
_{j}A_{m }
_{j }
_{m}A_{m }
_{jj }
_{jj }
_{j}A_{j }
_{j }
_{mm }
_{m}A_{m}
_{jk }
_{jk }
_{jk}
_{j }
_{0 }: _{j }
_{jk }
_{0 }: _{j }
_{jk }

Each of the three models (4), (5) and (6) provides a full re-parameterization of the _{j }
_{jj }
_{j }
_{jj }
_{j }
_{jj }
_{jj }
_{jm }
_{mm}
_{j }
_{jj }
_{j }
_{jj }
_{j }
_{jj }

Parameterization of fully parameterized one-locus models (4), (5), (6).

**Codings**

**Relationships**

Allele

F_{∞}

Allele-count

For a biallelic locus with alleles _{1}) and _{2}), we have _{AA }
_{Aa }
_{aa }
_{2}(_{1}(_{12}(_{1}(_{11}(_{22}(_{1}(_{11}(_{∞ }coding, we have _{2}(_{1}(_{2}(_{1}(_{2 }in model (5). For the allele-count coding, we have _{12}(_{11}(_{22}(_{11}(_{21}(

Parameterization of one-locus models (4), (5), (6) when

**Codings**

**Models**

**Relationships**

Allele

F_{∞}

Allele-count

For a locus with three alleles A_{1}, A_{2} and A_{3} (i.e., m = 3), there are six expected genotypic values, one for each of the genotypes A_{1}A_{1}, A_{2}A_{2}, A_{3}A_{3}, A_{1}A_{2}, A_{1}A_{3} and A_{2}A_{3}. Each of the three fully parameterized models (4), (5) and (6) can provide a full re-parameterization of the six expected genotypic values. In a matrix form, from the allele coding model (4), we have

From the F_{∞ }coding model (5), we have

And the allele-count coding model (6) gives

By multiplying the design matrices on the left side of the equations, we can show that the model parameters and the expected genotypic values have the relationships as summarized in Table
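The three-allele re-parameterization can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's own code: the coding variables in `design_row` are reconstructed so that, with A_{3} as the reference allele, they reproduce the "True" parameter rows reported for the simulation example (genotypic values 10, 50, 42, 30, 36 and 46); the function names and the coding labels are hypothetical.

```python
import numpy as np

# Genotypes of a three-allele locus and the genotypic values of the simulation example.
GENOTYPES = [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3), (2, 3)]
G = np.array([10.0, 50.0, 42.0, 30.0, 36.0, 46.0])


def copies(genotype, allele):
    """Number of copies (0, 1 or 2) of `allele` carried by `genotype`."""
    return sum(1 for a in genotype if a == allele)


def design_row(genotype, coding):
    """Design-matrix row (intercept, two main effects, three interactions).

    Assumed coding variables, with A3 as the reference allele:
      allele       : allele counts, homozygote indicators, A1A2 indicator
      f_inf        : allele counts minus one, heterozygote indicators, A1A2 indicator
      allele_count : one-copy indicators, two-copy indicators, A1A2 indicator
    """
    c1, c2 = copies(genotype, 1), copies(genotype, 2)
    a1a2 = float(c1 == 1 and c2 == 1)
    if coding == "allele":
        return [1.0, c1, c2, float(c1 == 2), float(c2 == 2), a1a2]
    if coding == "f_inf":
        return [1.0, c1 - 1, c2 - 1, float(c1 == 1), float(c2 == 1), a1a2]
    if coding == "allele_count":
        return [1.0, float(c1 == 1), float(c2 == 1), float(c1 == 2), float(c2 == 2), a1a2]
    raise ValueError(f"unknown coding: {coding}")


def parameters(coding):
    """Solve X @ theta = G for the six model parameters of the given coding."""
    X = np.array([design_row(g, coding) for g in GENOTYPES], dtype=float)
    return np.linalg.solve(X, G)


if __name__ == "__main__":
    for coding in ("allele", "f_inf", "allele_count"):
        print(coding, np.round(parameters(coding), 6))
```

Each design matrix is square and non-singular, so every coding carries exactly the same information as the six genotypic values; only the meaning of the individual parameters changes.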

Parameterization of one-locus models (4), (5), (6) when

**Codings**

**Relationships**

Allele

F_{∞}

Allele-count

Reduced one-locus models

Due to limited available sample sizes in practice, it may not always be feasible to use the fully parameterized models. Quite often, one may want to check the main effects of alleles first before including all possible allelic interactions. Here we consider the case of including possible interactions between _{j }
_{j }
_{j}
_{j }
_{k }

for

for _{j}
_{jj}
_{jj}

Model (8) contains only one redundant parameter in the _{m }

for _{j}
_{j}
_{jk}
_{m}
_{m }
_{j}

By definition, a reduced model can be derived from its original model by adding certain restrictions on the model parameters. Typically, the parameters of a reduced model can be interpreted in the same way as those of the original model when these restrictions are simple enough (e.g., setting a subset of the parameters to zero). When the restrictions on the original model parameters are complicated, however, the interpretation of the reduced model parameters may differ from that in the original model. For model (9), we can establish the relationship between its model parameters and the expected genotypic values using a classical matrix approach, as shown in Appendix B. An alternative way of building this relationship is to simply treat model (9) as a reduced form of model (8) by adding a restriction

Compared with the parameters in model (4), the interpretation of the parameters in model (9) has changed slightly. The intercept _{mm}
_{j }
_{m }
_{j }
_{k }
_{m}
_{j }
_{k }
_{j }
_{j }
_{l }
_{j }
_{j }
_{jj }
_{jm }
_{jk }
_{km }

Under the same model framework (8), the F_{∞ }coding leads to the following model

for _{j}
_{j}
_{j}
_{j}
_{j}

In other words, model (10) leads to a restriction

Now _{j }
_{k }
_{j }
_{j }
_{l }

With the allele-count coding, we can actually construct two equivalent models in this case

and

for

On the other hand, model (12) can be treated as a reduced model by adding the restriction

While the effect _{jj }
_{jj }
_{mm}
_{j }
_{m }
_{j }
_{j }
_{k }
_{jj }
_{mm }
_{jm }
_{mm }
_{jk }
_{km }
_{0 }: _{j }
_{j }
_{jj }
_{jm }
_{jk }
_{km }
_{mm}

Under the same model framework (8), each of the above four models (9), (10), (11) and (12) contains 2_{11}, _{22}, _{33}, _{12}, _{13 }and _{23}. The relationships between the four model parameters and the expected genotypic values are summarized in Table

Parameterization of one-locus models (9), (10), (11), (12) when

**Codings**

**Restrictions**

**Relationships**

Allele

F_{∞}

Allele-count

Allele-count

Comparing Table _{0 }: _{j }
_{j }
_{j }
_{j }
_{jk }
_{km }
_{0 }: _{j }
_{j }
_{jj }
_{mm }
_{jj }
_{jm }
_{jk }
_{km }
_{j }
_{j }
_{j }
_{j }
_{1 }= _{1 }= 0 is equivalent to _{1 }= _{1 }= 0 which implies _{12 }= _{23 }and _{11 }= _{13}; while _{1 }= _{1 }= 0 is equivalent to _{11 }= _{33 }and _{12 }+ _{13 }= _{11 }+ _{23}. So, depending on the underlying true setting of the expected genotypic values, the null hypotheses of _{1 }= _{1 }= 0 in model (9) could be different from that of _{1 }= _{1 }= 0 in model (10).

Parameterization of one-locus models (9), (10), (11), (12) when

**Codings**

**Restrictions**

**Relationships**

Allele

F_{∞}

Allele-count

Allele-count

Extension to two-locus models

In this section, we further explore some extensions of the previous one-locus models to two-locus models. Consider two marker loci with alleles _{1}
_{2}(_{1 }+ 1)(_{2 }+ 1)/4 possible distinctive expected genotypic values: _{jkrs }
_{1j
}
_{1k
}
_{2r
}
_{2s
}) for _{1}, _{2},

_{1}, for marker genotypes at locus 1 and

_{2}, for marker genotypes at locus 2, where _{1j
}(or _{2r
}) at locus 1 (or 2). A fully parameterized two-locus model for _{jkrs }

for

For the F_{∞ }coding, we can define the following coding variables for the genotypes at the two marker loci separately.

for _{1}, and

for _{2}. A fully parameterized two-locus model using this F_{∞ }coding is then

for _{1j
}= 1 + _{1j
}, _{2r
}= 1 + _{2r
}, _{1jj
}= (1 + _{1j
}- _{1j
}), _{2rr
}= (1 + _{2r
}- _{2r
}), _{1jk
}= _{1j
}
_{1k
}for _{2rs
}= _{2r
}
_{2s
}for _{∞ }coding variables and the allele coding variables, we can establish the relationships between the model parameters and the expected genotypic values as shown in (C.2) of Appendix C. We can easily verify that the biallelic two-locus effects _{1 }= _{2 }= 2. It is also interesting to see that the interpretation of model parameters in terms of the expected genotypic values becomes much more complicated than that in the previous allele coding model. When _{1}, _{2 }> 2, the low-order within-locus main effect _{1j
}is a weighted combination of the differences _{2 }refer to various homozygous genotypes _{2r
}
_{2r
}at locus 2. The within-locus effect _{1jj
}is a weighted combination of the allelic interactions _{2}, at locus 1 with reference _{2r
}
_{2r
}at locus 2. Even the intercept τ of the model becomes a complex function of various homozygous genotypic values.

Applying the allele-count coding, we can define

for _{1}, and

for _{2}. Another fully parameterized two-locus model for _{jkrs }

for _{1jj
}
_{2rr
}), (_{1jk
}
_{2rs
}) and (_{1jk
}
_{2rr
}) have simpler relationships than the corresponding ones in the allele coding model (13).

Finally, let us consider some reduced cases of the two-locus models. By ignoring locus-by-locus interactions (i.e., epistases), we have the following simplified two-locus model framework

for _{1 }and _{2}. If we further ignore the within-locus allelic interactions between different alleles, then another reduced two-locus model framework is

Similar to the one-locus models, under each of the two reduced model frameworks we can construct the two-locus models from the three coding schemes. The relationships between the model parameters and the expected genotypic values under framework (14) are summarized in Table _{1 }and _{2 }in (15) will lead to an additive model framework, which has its model parameters interpretable similar to that in Table _{∞ }coding models have the definition of their lower-order main effects vary depending on whether there are epistases involved in the models.

Parameterization of two-locus models under model framework (16).

**Codings**

**Relationships**

Allele

F_{∞}

Allele-count

Parameterization of two-locus models under model framework (17) when _{1}, _{2 }≥ 3.

**Codings**

**Restrictions**

**Relationships**

Allele

F_{∞}

Allele-count

Allele-count

As pointed out in _{1 }= _{AB }
_{aB }
_{Ab }
_{ab }
_{2 }= _{AB }
_{Ab }
_{aB }
_{ab }
_{AB }
_{A}p_{B }

Simulation Examples

We use some numerical examples to illustrate properties of the models we have discussed. First, we consider the same example discussed in previous work: a marker locus with three alleles with allele frequencies p_{1} = 0.2 for A_{1}, p_{2} = 0.3 for A_{2}, and p_{3} = 0.5 for A_{3}. The six genotypic values are 10 for A_{1}A_{1}, 30 for A_{1}A_{2}, 50 for A_{2}A_{2}, 36 for A_{1}A_{3}, 46 for A_{2}A_{3} and 42 for A_{3}A_{3}. We adopt a similar strategy to specify the genotype frequencies as: _{jk }
_{j}pk ^{- }≤ ^{+ }with

and

We consider two cases: i) D = 0 for Hardy-Weinberg equilibrium (HWE), and ii) D = 0.02 for Hardy-Weinberg disequilibrium (HWD). The phenotypic value of an individual is simulated as the sum of its genotypic value and an environmental noise from N(0, σ^{2}), where σ^{2} is chosen to be either 0 or 288, the latter corresponding to a 20% heritability level when D = 0. For each of the four configurations, we simulate 10,000 random samples with 1000 individuals each. For each random sample, we fit the three fully parameterized one-locus models (4), (5) and (6) under model framework (2) using the least squares approach and estimate the model parameters as well as the six genotypic values. The means and standard deviations (SD) of the least squares estimates (LSE) of the model parameters and the six genotypic values from the 10,000 random samples in fitting these three models are summarized in Table
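The HWE configuration can be sketched in a few lines. The code below is a minimal illustration, not the simulation program used in the study: it computes the HWE genotype frequencies from the allele frequencies, checks that a residual variance of 288 gives a heritability of 20%, and draws one random sample of 1000 individuals whose six genotypic values are estimated by least squares with a cell-means design (one indicator column per genotype, which spans the same column space as any of the fully parameterized codings). All function names and the seed are hypothetical choices.

```python
import numpy as np

P = np.array([0.2, 0.3, 0.5])                        # allele frequencies of A1, A2, A3
GENOTYPES = [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3), (2, 3)]
G = np.array([10.0, 50.0, 42.0, 30.0, 36.0, 46.0])   # genotypic values, same order


def hwe_frequencies(p):
    """Genotype frequencies under HWE: p_j^2 for homozygotes, 2 p_j p_k otherwise."""
    return np.array([p[j - 1] ** 2 if j == k else 2 * p[j - 1] * p[k - 1]
                     for j, k in GENOTYPES])


def heritability(freq, g, sigma2):
    """Genotypic variance divided by total phenotypic variance."""
    var_g = freq @ (g - freq @ g) ** 2
    return var_g / (var_g + sigma2)


def simulate_and_estimate(freq, sigma2, n=1000, seed=0):
    """Draw one sample and return the least squares estimates of the genotypic values."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(G), size=n, p=freq)          # genotype index of each individual
    y = G[idx] + rng.normal(0.0, np.sqrt(sigma2), size=n)
    X = np.eye(len(G))[idx]                           # cell-means design matrix
    est, *_ = np.linalg.lstsq(X, y, rcond=None)
    return est


if __name__ == "__main__":
    freq = hwe_frequencies(P)
    print("heritability:", heritability(freq, G, 288.0))
    print("estimates (sigma^2 = 288):", np.round(simulate_and_estimate(freq, 288.0), 2))
```

Without environmental noise the estimates coincide with the genotypic values whenever every genotype is present in the sample.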

Means (SD) of LSE for three one-locus models (4), (5) and (6) when

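**Allele**

| | **μ** | **α_{1}** | **α_{2}** | **δ_{11}** | **δ_{22}** | **δ_{12}** | **σ^{2}** |
|---|---|---|---|---|---|---|---|
| True | 42 | -6 | 4 | -20 | 0 | -10 | |
| HWE, σ^{2} = 0 | 42.0(0.00) | -6.00(0.00) | 4.00(0.00) | -20.00(0.00) | 0.00(0.00) | -10.00(0.00) | 0.00(0.00) |
| HWE, σ^{2} = 288 | 41.99(1.07) | -5.98(1.61) | 3.99(1.44) | -20.06(3.80) | 0.02(2.85) | -9.98(2.42) | 287.84(12.91) |
| HWD, σ^{2} = 0 | 42.00(0.00) | -6.00(0.00) | 4.00(0.00) | -20.00(0.00) | 0.00(0.00) | -10.00(0.00) | 0.00(0.00) |
| HWD, σ^{2} = 288 | 41.98(1.14) | -5.97(1.60) | 4.01(1.46) | -20.07(6.21) | 0.03(3.09) | -10.04(2.31) | 287.81(12.91) |

| Genotypic value of | A_{1}A_{1} | A_{2}A_{2} | A_{3}A_{3} | A_{1}A_{2} | A_{1}A_{3} | A_{2}A_{3} |
|---|---|---|---|---|---|---|
| True | 10 | 50 | 42 | 30 | 36 | 46 |
| HWE, σ^{2} = 0 | 10.00(0.00) | 50.00(0.00) | 42.00(0.00) | 30.00(0.00) | 36.00(0.00) | 46.00(0.00) |
| HWE, σ^{2} = 288 | 9.96(2.73) | 49.99(1.79) | 41.99(1.07) | 30.02(1.55) | 36.01(1.20) | 45.98(0.98) |
| HWD, σ^{2} = 0 | 10.00(0.00) | 50.00(0.00) | 42.00(0.00) | 30.00(0.00) | 36.00(0.00) | 46.00(0.00) |
| HWD, σ^{2} = 288 | 9.96(5.66) | 50.03(2.21) | 41.98(1.14) | 29.97(1.38) | 36.01(1.12) | 45.99(0.93) |

**F_{∞}**

| | intercept | effect _{1} | effect _{2} | effect _{11} | effect _{22} | effect _{12} | σ^{2} |
|---|---|---|---|---|---|---|---|
| True | 30 | -16 | 4 | 10 | 0 | -10 | |
| HWE, σ^{2} = 0 | 30.00(0.00) | -16.00(0.00) | 4.00(0.00) | 10.00(0.00) | 0.00(0.00) | -10.00(0.00) | 0.00(0.00) |
| HWE, σ^{2} = 288 | 29.98(1.64) | -16.01(1.46) | 4.00(1.05) | 10.03(1.90) | -0.01(1.42) | -9.98(2.42) | 287.84(12.91) |
| HWD, σ^{2} = 0 | 30.00(0.00) | -16.00(0.00) | 4.00(0.00) | 10.00(0.00) | 0.00(0.00) | -10.00(0.00) | 0.00(0.00) |
| HWD, σ^{2} = 288 | 29.99(3.05) | -16.01(2.88) | 4.03(1.25) | 10.04(3.10) | -0.01(1.54) | -10.04(2.31) | 287.81(12.91) |

| Genotypic value of | A_{1}A_{1} | A_{2}A_{2} | A_{3}A_{3} | A_{1}A_{2} | A_{1}A_{3} | A_{2}A_{3} |
|---|---|---|---|---|---|---|
| True | 10 | 50 | 42 | 30 | 36 | 46 |
| HWE, σ^{2} = 0 | 10.00(0.00) | 50.00(0.00) | 42.00(0.00) | 30.00(0.00) | 36.00(0.00) | 46.00(0.00) |
| HWE, σ^{2} = 288 | 9.96(2.73) | 49.99(1.79) | 41.99(1.07) | 30.02(1.55) | 36.01(1.20) | 45.98(0.98) |
| HWD, σ^{2} = 0 | 10.00(0.00) | 50.00(0.00) | 42.00(0.00) | 30.00(0.00) | 36.00(0.00) | 46.00(0.00) |
| HWD, σ^{2} = 288 | 9.96(5.66) | 50.03(2.21) | 41.98(1.14) | 29.97(1.38) | 36.01(1.12) | 45.99(0.93) |

**Allele-count**

| | intercept _{0} | effect _{1} | effect _{2} | effect _{11} | effect _{22} | effect _{12} | σ^{2} |
|---|---|---|---|---|---|---|---|
| True | 42 | -6 | 4 | -32 | 8 | -10 | |
| HWE, σ^{2} = 0 | 42.00(0.00) | -6.00(0.00) | 4.00(0.00) | -32.00(0.00) | 8.00(0.00) | -10.00(0.00) | 0.00(0.00) |
| HWE, σ^{2} = 288 | 41.99(1.07) | -5.98(1.61) | 3.99(1.44) | -32.03(2.92) | 8.00(2.09) | -9.98(2.42) | 287.84(12.91) |
| HWD, σ^{2} = 0 | 42.00(0.00) | -6.00(0.00) | 4.00(0.00) | -32.00(0.00) | 8.00(0.00) | -10.00(0.00) | 0.00(0.00) |
| HWD, σ^{2} = 288 | 41.98(1.14) | -5.97(1.60) | 4.01(1.46) | -32.02(5.76) | 8.05(2.51) | -10.04(2.31) | 287.81(12.91) |

| Genotypic value of | A_{1}A_{1} | A_{2}A_{2} | A_{3}A_{3} | A_{1}A_{2} | A_{1}A_{3} | A_{2}A_{3} |
|---|---|---|---|---|---|---|
| True | 10 | 50 | 42 | 30 | 36 | 46 |
| HWE, σ^{2} = 0 | 10.00(0.00) | 50.00(0.00) | 42.00(0.00) | 30.00(0.00) | 36.00(0.00) | 46.00(0.00) |
| HWE, σ^{2} = 288 | 9.96(2.73) | 49.99(1.79) | 41.99(1.07) | 30.02(1.55) | 36.01(1.20) | 45.98(0.98) |
| HWD, σ^{2} = 0 | 10.00(0.00) | 50.00(0.00) | 42.00(0.00) | 30.00(0.00) | 36.00(0.00) | 46.00(0.00) |
| HWD, σ^{2} = 288 | 9.96(5.66) | 50.03(2.21) | 41.98(1.14) | 29.97(1.38) | 36.01(1.12) | 45.99(0.93) |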
As each of the three models provides a re-parameterization of the six genotypic values, for each random sample the three models always give exactly the same estimates of the six genotypic values and the residual variance, as expected, even though their model parameters are defined in different ways. As a result, under each configuration, the three models have the same means and SD for the LSE of the six genotypic values and the residual variance. Without environmental variation, each model can accurately estimate its model parameters and the six genotypic values for each random sample regardless of whether there is HWE or HWD. When there is environmental variation in the phenotypes, it is known that the least squares estimators of the model parameters are unbiased under either HWE or HWD. However, HWD may affect the variance of the least squares estimators of the model parameters and the six genotypic values. Note that the genotypic frequencies are 0.04 for A_{1}A_{1}, 0.09 for A_{2}A_{2}, 0.25 for A_{3}A_{3}, 0.12 for A_{1}A_{2}, 0.20 for A_{1}A_{3} and 0.30 for A_{2}A_{3} under HWE, while with D = 0.02 they become 0.02, 0.07, 0.23, 0.14, 0.22 and 0.32, respectively. So, under HWD, we tend to have more individuals carrying the heterozygous genotypes A_{1}A_{2}, A_{1}A_{3} and A_{2}A_{3}, but fewer individuals carrying the homozygous genotypes A_{1}A_{1}, A_{2}A_{2} and A_{3}A_{3}, in the random samples than under HWE. When the genotypic values must be estimated, having more individuals with a given genotype in a random sample yields a better estimate of the corresponding genotypic value. This explains why, under HWD, the estimates of the genotypic values of A_{1}A_{1}, A_{2}A_{2} and A_{3}A_{3} have larger SD (or variances) than under HWE, while the estimates for A_{1}A_{2}, A_{1}A_{3} and A_{2}A_{3} have smaller variances than under HWE.
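The link between genotype frequency and estimation precision can be checked with a back-of-the-envelope calculation. In a fully parameterized model the least squares estimate of a genotypic value is the sample mean of the individuals carrying that genotype, so its SD is roughly sqrt(σ^{2} / (n × frequency)). The sketch below, whose function name is a hypothetical choice, applies this first-order approximation to the HWE frequencies quoted above with σ^{2} = 288 and n = 1000.

```python
import math

# HWE genotype frequencies quoted in the text.
FREQ_HWE = {"A1A1": 0.04, "A2A2": 0.09, "A3A3": 0.25,
            "A1A2": 0.12, "A1A3": 0.20, "A2A3": 0.30}


def approx_sd(freq, sigma2=288.0, n=1000):
    """First-order SD of a genotype-mean estimate, treating the genotype count as n * freq."""
    return math.sqrt(sigma2 / (n * freq))


if __name__ == "__main__":
    for genotype, freq in FREQ_HWE.items():
        print(genotype, round(approx_sd(freq), 2))
```

The approximation ignores the sampling variation of the genotype counts themselves, so it is cruder for rare genotypes than for common ones.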

As another example, let us consider the statistical modeling of two-locus genotypic values _{jkrs}
_{1}, _{2}, _{3 }and the second locus have two alleles _{1}, _{2}. Assume that the alleles at locus 1 have the same allele frequencies as that in the previous example; i.e., _{1 }= 0.2 for _{1}, _{2 }= 0.3 for _{2}, and _{3 }= 0.5 for _{3}, while the two alleles at locus 2 have frequencies _{1 }= 0.2 for _{1 }and _{2 }= 0.8 for _{2}. The two-locus genotypic values _{2 }= (_{jkrs}

which are modified values from the previous one-locus model in a way that the _{
jk11 }= _{jk}
_{
jk12 }= _{jk }
_{1jk
}and _{
jk22 }= _{jk }
_{2jk
}with _{1jk
}and _{2jk
}being some small positive fluctuations according to the genotypes _{1}
_{2 }and _{2}
_{2 }at locus 2. We assume Hardy-Weinberg equilibria at both loci and, writing p_{j} and q_{r} for the frequencies of the j-th allele at locus 1 and the r-th allele at locus 2, specify the haplotype frequencies h_{jr} as: h_{11} = p_{1}q_{1} - Δ_{1}, h_{12} = p_{1}q_{2} + Δ_{1}, h_{21} = p_{2}q_{1} + Δ_{2}, h_{22} = p_{2}q_{2} - Δ_{2}, h_{31} = p_{3}q_{1} + (Δ_{1} - Δ_{2}), h_{32} = p_{3}q_{2} - (Δ_{1} - Δ_{2}), where Δ_{1} (and Δ_{2}) is the linkage disequilibrium (LD) between the first allele at locus 1 and the second allele at locus 2 (and between the second allele at locus 1 and the first allele at locus 2). With these signs the haplotype frequencies add up to the allele frequencies at each locus. We consider two scenarios: i) Δ_{1} = Δ_{2} = 0 for linkage equilibrium (LE); and ii) Δ_{1} = 0, Δ_{2} = 0.03 for LD. The phenotypic value of an individual is still simulated as the sum of its genotypic value and an environmental noise from N(0, σ^{2}), where σ^{2} was chosen to be either 0 or 286, the latter corresponding to a 20% heritability level under LE. For each of the four configurations, we simulate 10,000 random samples with 1000 individuals each. For each random sample, we consider fitting models under three model frameworks: i) one-locus models (4), (5) and (6) at locus 1 under model framework (2); ii) two-locus models without epistases from the three coding schemes under model framework (14); iii) fully parameterized two-locus models (13), (14) and (15) with epistases. Still, for each random sample, the three allele-based coding models under the same model framework give exactly the same estimates of the 18 genotypic values, as expected (results not shown here). As a result, under each model framework, the three models have the same means and SD for the LSE of the 18 genotypic values and the residual variance, although the means and SD for the LSE of their model parameters are different. To compare the LSE of model parameters for models from the same coding under different model frameworks, we summarize in Table

Means (SD) of LSE for three allele-coding models regarding the two-locus genotypic values

**One-locus model**

| | **μ** | **α_{11}** | **α_{12}** | **δ_{111}** | **δ_{122}** | **δ_{112}** | **σ^{2}** |
|---|---|---|---|---|---|---|---|
| True (LE) | 41.68 | -5.81 | 4.03 | -20.03 | 0.29 | -10 | |
| LE, σ^{2} = 0 | 41.68(0.04) | -5.81(0.06) | 4.03(0.06) | -20.03(0.14) | 0.29(0.09) | -10.00(0.08) | 0.37(0.01) |
| LE, σ^{2} = 286 | 41.69(1.07) | -5.83(1.61) | 4.03(1.44) | -20.00(3.79) | 0.27(2.85) | -9.99(2.43) | 286.58(12.82) |
| True (LD) | 41.55 | -5.74 | 4.21 | -20.04 | 0.09 | -10.06 | |
| LD, σ^{2} = 0 | 41.55(0.04) | -5.74(0.06) | 4.21(0.06) | -20.04(0.14) | 0.09(0.09) | -10.06(0.08) | 0.36(0.01) |
| LD, σ^{2} = 286 | 41.54(1.07) | -5.74(1.61) | 4.23(1.45) | -20.09(3.81) | 0.07(2.83) | -10.09(2.43) | 286.27(12.94) |

**Two-locus model - no epistases**

| | μ | α_{11} | α_{12} | δ_{111} | δ_{122} | δ_{112} | α_{21} | δ_{211} | σ^{2} |
|---|---|---|---|---|---|---|---|---|---|
| True (LE) | 41.88 | -5.81 | 4.03 | -20.03 | 0.29 | -10 | 0.64 | -1.92 | |
| LE, σ^{2} = 0 | 41.88(0.02) | -5.81(0.01) | 4.03(0.01) | -20.03(0.01) | 0.29(0.05) | -10.00(0.02) | 0.64(0.02) | -1.92(0.03) | 0.024(0.002) |
| LE, σ^{2} = 286 | 41.91(2.88) | -5.82(1.61) | 4.03(1.44) | -19.99(3.79) | 0.27(2.85) | -9.99(2.43) | 0.63(2.90) | -1.92(3.41) | 285.64(12.79) |
| True (LD) | 41.85 | -5.80 | 4.06 | -20.04 | 0.14 | -10.04 | 0.65 | -1.92 | |
| LD, σ^{2} = 0 | 41.85(0.02) | -5.80(0.01) | 4.06(0.01) | -20.04(0.01) | 0.14(0.05) | -10.04(0.02) | 0.65(0.02) | -1.92(0.03) | 0.02(0.00) |
| LD, σ^{2} = 286 | 41.87(2.94) | -5.80(1.61) | 4.07(1.45) | -20.09(3.81) | 0.12(2.83) | -10.07(2.43) | 0.62(2.88) | -1.88(3.38) | 285.36(12.94) |

**Two-locus model with epistases**

| | μ | α_{11} | α_{12} | δ_{111} | δ_{122} | δ_{112} | α_{21} |
|---|---|---|---|---|---|---|---|
| True | 42 | -6 | 4 | -20 | 0 | -10 | 0.6 |
| LE, σ^{2} = 0 | 42.00(0.00) | -6.00(0.00) | 4.00(0.00) | -20.00(0.00) | 0.00(0.00) | -10.00(0.00) | 0.60(0.00) |
| LE, σ^{2} = 286 | 41.92(5.73) | -5.99(8.65) | 4.04(7.67) | -19.79(19.86) | -0.04(15.62) | -9.82(13.54) | 0.66(6.04) |
| LD, σ^{2} = 0 | 42.00(0.00) | -6.00(0.00) | 4.00(0.00) | -20.00(0.00) | 0.00(0.00) | -10.00(0.00) | 0.60(0.00) |
| LD, σ^{2} = 286 | 42.24(8.60) | -6.11(11.83) | 3.66(10.05) | -20.04(22.95) | 0.51(14.85) | -9.77(14.64) | 0.38(8.86) |

| | δ_{211} | (α_{11}α_{21}) | (α_{12}α_{21}) | (δ_{111}α_{21}) | (δ_{122}α_{21}) | (δ_{112}α_{21}) | (α_{11}δ_{211}) |
|---|---|---|---|---|---|---|---|
| True | -2 | 0.2 | 0.1 | -0.1 | -0.5 | -0.4 | -0.2 |
| LE, σ^{2} = 0 | -2.00(0.00) | 0.20(0.00) | 0.10(0.00) | -0.10(0.00) | -0.50(0.00) | -0.40(0.00) | -0.20(0.00) |
| LE, σ^{2} = 286 | -2.05(6.99) | 0.23(9.12) | 0.07(8.08) | -0.35(20.98) | -0.47(16.35) | -0.65(14.30) | -0.29(10.58) |
| LD, σ^{2} = 0 | -2.00(0.00) | 0.20(0.00) | 0.10(0.00) | -0.10(0.00) | -0.50(0.00) | -0.40(0.00) | -0.20(0.00) |
| LD, σ^{2} = 286 | -1.80(9.71) | 0.24(12.24) | 0.39(10.44) | -0.03(24.03) | -0.94(15.63) | -0.52(15.27) | -0.15(13.52) |

| | (α_{12}δ_{211}) | (δ_{111}δ_{211}) | (δ_{122}δ_{211}) | (δ_{112}δ_{211}) | σ^{2} |
|---|---|---|---|---|---|
| True | -0.2 | 0.2 | 1.7 | 1 | |
| LE, σ^{2} = 0 | -0.20(0.00) | 0.20(0.00) | 1.70(0.00) | 1.00(0.00) | 0.00(0.00) |
| LE, σ^{2} = 286 | -0.18(9.39) | 0.55(24.51) | 1.70(18.94) | 1.35(16.46) | 282.45(12.83) |
| LD, σ^{2} = 0 | -0.20(0.00) | 0.20(0.00) | 1.70(0.00) | 1.00(0.00) | 0.00(0.00) |
| LD, σ^{2} = 286 | -0.44(11.64) | 0.07(27.44) | 2.11(18.14) | 0.98(17.29) | 282.81(12.74) |

As we mentioned before, the one-locus models are actually modeling the expected genotypic values given the genotypes at locus 1. Under LE, we can show that the expected genotypic values at locus 1 are 10.03 for A_{1}A_{1}, 50.03 for A_{2}A_{2}, 41.68 for A_{3}A_{3}, 29.90 for A_{1}A_{2}, 35.87 for A_{1}A_{3} and 45.71 for A_{2}A_{3}, which correspond to α_{11} = -5.81, α_{12} = 4.03, δ_{111} = -20.03, δ_{122} = 0.29 and δ_{112} = -10 as the true parameters in the allele coding one-locus model. Under LD, the expected genotypic values at locus 1 become 10.03 for A_{1}A_{1}, 50.08 for A_{2}A_{2}, 41.55 for A_{3}A_{3}, 29.97 for A_{1}A_{2}, 35.81 for A_{1}A_{3} and 45.77 for A_{2}A_{3}, which correspond to α_{11} = -5.74, α_{12} = 4.21, δ_{111} = -20.04, δ_{122} = 0.09 and δ_{112} = -10.06 as the true parameters in the allele coding one-locus model. In both cases, the least squares estimators of the one-locus model parameters are unbiased estimators of the true parameters. Note that, unlike the one-locus model in the previous example, the LSE of the model parameters are no longer exactly the same as the true values even when no environmental noise is involved. The reason is that the expected genotypic values at locus 1 depend not only on the genotypic values but also on the joint genotype frequencies in the sample, which may change slightly from sample to sample due to sampling variation.
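The correspondence between the expected genotypic values at locus 1 and the allele coding parameters can be verified directly. The sketch below assumes, consistently with the "True" rows of the simulation tables, that A_{3} is the reference allele, so that the intercept is the A_{3}A_{3} value, each main effect is a difference from it, and each interaction is the remaining deviation; the function name is a hypothetical choice.

```python
def allele_parameters(g):
    """Allele-coding parameters (mu, a1, a2, d11, d22, d12) from a dict that maps
    genotype (j, k) to its expected genotypic value; A3 is the reference allele."""
    mu = g[(3, 3)]
    a1 = g[(1, 3)] - mu
    a2 = g[(2, 3)] - mu
    d11 = g[(1, 1)] - 2 * g[(1, 3)] + mu
    d22 = g[(2, 2)] - 2 * g[(2, 3)] + mu
    d12 = g[(1, 2)] - g[(1, 3)] - g[(2, 3)] + mu
    return mu, a1, a2, d11, d22, d12


# Expected genotypic values at locus 1 quoted in the text (rounded to two decimals).
EXPECTED_LE = {(1, 1): 10.03, (2, 2): 50.03, (3, 3): 41.68,
               (1, 2): 29.90, (1, 3): 35.87, (2, 3): 45.71}
EXPECTED_LD = {(1, 1): 10.03, (2, 2): 50.08, (3, 3): 41.55,
               (1, 2): 29.97, (1, 3): 35.81, (2, 3): 45.77}

if __name__ == "__main__":
    print("LE:", [round(v, 2) for v in allele_parameters(EXPECTED_LE)])
    print("LD:", [round(v, 2) for v in allele_parameters(EXPECTED_LD)])
```

Because the inputs are rounded to two decimals, the recovered parameters can differ from the quoted ones by about 0.01.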

The two-locus model without epistases cannot provide unbiased estimators for all the genotypic values because of model mis-specification. However, the LSE of its parameters associated with locus 1 are similar to the ones in the one-locus model at locus 1. In fact, as we know from linear model theory, the true values of its parameters associated with locus 1 are the same as the ones defined in the one-locus model at locus 1 when the two loci are in LE. Under LD, the least squares estimators of its model parameters associated with locus 1 could be biased, and the bias depends on the LD setting.

The two-locus model with epistases gives a full re-parameterization of the 18 genotypic values. Therefore, when no environmental noise is involved, the LSE of its model parameters are exactly the same as their true values for each random sample, regardless of the LD between the two loci. It has to be pointed out that this holds only when the random sample contains all 18 possible genotypes. In our simulation setting, the frequencies of certain genotypes such as _{1}_{1}_{1}_{1}, _{1}_{3}_{1}_{1} and _{2}_{2}_{1}_{1} are quite small. As a result, we occasionally (in about 22-23% of the 1000 random samples) obtain a random sample that has no individuals carrying certain genotypes. In this case, the design matrix in the fully parameterized model becomes singular and the LSE of the model parameters are no longer unique. To keep our illustration of the model properties simple, we excluded those random samples in fitting the two-locus model with epistases (reduced models are less likely to have singular design matrices). Other techniques such as ridge regression could be applied to handle those skewed random samples. In the presence of environmental noise, it is also noted that the LSE of some of its model parameters, such as _{111}, (_{111}_{21}) and (_{111}_{211}), have much larger SD than the LSE of other parameters. This is due to the low frequencies of genotypes _{1}_{1}_{1}_{1}, _{1}_{3}_{1}_{1} and _{2}_{2}_{1}_{1}. Because a random sample has few individuals carrying these genotypes, the corresponding true genotypic values, to which the model parameters _{111}, (_{111}_{21}) and (_{111}_{211}) are related, are estimated with reduced accuracy.
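The effect of an unobserved genotype on the design matrix can be illustrated with a small sketch. It uses one 0/1 indicator column per joint genotype as a simple stand-in for a fully parameterized two-locus model (an assumption; the paper's coding schemes are different re-parameterizations of the same 18 values), a hypothetical sample in which one genotype is absent, and a ridge penalty `lam` chosen arbitrarily for illustration.

```python
import itertools
import numpy as np

# Three alleles at locus 1 give 6 genotypes; two alleles at locus 2 give 3.
locus1 = ["11", "12", "13", "22", "23", "33"]
locus2 = ["11", "12", "22"]
genotypes = list(itertools.product(locus1, locus2))  # 18 joint genotypes


def indicator_design(sample):
    """Build a design matrix with one 0/1 column per joint genotype."""
    X = np.zeros((len(sample), len(genotypes)))
    for row, g in enumerate(sample):
        X[row, genotypes.index(g)] = 1.0
    return X


# Hypothetical sample: every genotype appears twice except ("11", "11"),
# which is never observed.
absent = ("11", "11")
sample = [g for g in genotypes if g != absent] * 2
X = indicator_design(sample)

# With one genotype missing, the 18-column design has rank 17, so the
# ordinary least squares solution is not unique.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Ridge regression adds lam * I to X'X, which makes the system solvable.
rng = np.random.default_rng(0)
y = rng.normal(size=len(sample))      # placeholder trait values
lam = 1e-3                            # arbitrary illustrative penalty
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print("ridge estimate for the absent genotype:",
      beta_ridge[genotypes.index(absent)])
```

In this stand-in design the ridge solution is unique, but the coefficient of the unobserved genotype is simply shrunk to zero, so the penalty restores uniqueness without adding any information about that genotype.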

Discussion

In this study, we introduced three genotype coding schemes to build F_{∞ }models for multi-allele markers. The relationships between the model parameters and the expected genotypic values were established for some fully parameterized as well as reduced one-locus and two-locus F_{∞ }models. Our results showed that the relationships between the model parameters and the expected genotypic values can become more intricate in the multi-allele case than in the biallelic case, even though the extension of the coding schemes from biallelic to multiple alleles appears straightforward. We built the relationships between different model parameters mainly through their coding variables of marker genotypes, which simplified the tedious derivation process compared with the classical matrix approach. The F_{∞ }models we proposed can be used directly for association testing of multi-allele markers and their possible interactions with quantitative traits using random unrelated samples. These F_{∞ }models could also be applied to test for risk haplotypes and their interactions when combined with the likelihood approach (e.g.,

Throughout the paper, we assumed that all the possible genotypes are available from the sampled individuals. If certain genotypes are not observable, then the expected genotypic values on these genotypes will not be estimable by themselves, which could change the interpretation of the model parameters as well. The models we have presented can also be modified to handle the situation when some individuals have missing genotypes at certain marker loci. When the missing genotypes at a marker locus have both alleles missing at the same time, we can simply introduce an indicator variable to code for the missing genotype at the marker. The regression coefficient of this indicator variable for this missing genotype can usually be interpreted as the difference between the expected genotypic value with missing genotype at the marker locus and the intercept of the model, while the other regression coefficients would keep the same interpretation as before.
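The indicator-variable treatment of fully missing genotypes described above can be sketched for a three-allele marker. The coding below assumes allele 3 as the reference allele, with allele counts for alleles 1 and 2 and indicator terms for genotypes 11, 22 and 12. The function `code_row`, the trait mean `missing_mean` for individuals with a missing genotype, and the noise-free sample are all hypothetical choices made for illustration.

```python
import numpy as np


def code_row(genotype):
    """Return [1, z1, z2, w11, w22, w12, m] for a genotype label.

    genotype: a label such as '13', or None when both alleles are missing.
    z1, z2 count copies of alleles 1 and 2; w11, w22, w12 flag genotypes
    11, 22 and 12; m flags a missing genotype.
    """
    if genotype is None:
        return [1, 0, 0, 0, 0, 0, 1]
    return [1,
            genotype.count("1"),
            genotype.count("2"),
            int(genotype == "11"),
            int(genotype == "22"),
            int(genotype == "12"),
            0]


# Expected genotypic values for the observed genotypes (illustrative) and a
# hypothetical expected value for individuals with a missing genotype.
values = {"11": 10.03, "22": 50.03, "33": 41.68,
          "12": 29.90, "13": 35.87, "23": 45.71}
missing_mean = 38.0

# Noise-free sample: each observed genotype three times, plus three
# individuals with a missing genotype.
sample = list(values) * 3 + [None] * 3
X = np.array([code_row(g) for g in sample], dtype=float)
y = np.array([missing_mean if g is None else values[g] for g in sample])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept:", round(beta[0], 2))
print("missing-genotype coefficient:", round(beta[6], 2))
```

In this noise-free setting the coefficient of the missing-genotype indicator equals `missing_mean` minus the intercept, and the remaining coefficients are the same as they would be without the missing individuals.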

It has to be pointed out that the relationships between the model parameters and the expected genotypic values are based on the assumption that the models correctly specify the structure of the expected genotypic values. When a fully parameterized model is applied, the definitions of its model parameters do not depend on the allele frequencies, HWD among alleles within a locus, or the LD structure between alleles at different loci. In fitting a reduced model, however, a simplified model may not be totally correct in modeling all the expected genotypic