Department of Epidemiology, Michigan State University, East Lansing, Michigan 48824, USA

Department of Statistics and Probability, Michigan State University, East Lansing, Michigan 48824, USA

The Perinatology Research Branch, NICHD, NIH, DHHS, Bethesda, MD, and Detroit, MI 48201, USA

Abstract

Background

The genetic etiology of complex diseases in humans is commonly viewed as a process involving both genetic and environmental factors acting in a complicated manner. Interactions among genetic variants often play major roles in determining an individual's susceptibility to a particular disease. Statistical methods for modeling interactions between single genetic variants (e.g., single nucleotide polymorphisms, or SNPs) underlying complex diseases have been extensively studied. Recently, haplotype-based analysis has gained popularity in genetic association studies. When multiple sequence or haplotype interactions determine an individual's susceptibility to a disease, statistical modeling and testing of the interaction effects become daunting, largely because of the complicated higher-order epistatic structure.

Results

In this article, we propose a new strategy for modeling haplotype-haplotype interactions under the penalized logistic regression framework with an adaptive L_{1}-penalty. We consider interactions of sequence variants between haplotype blocks. The adaptive L_{1}-penalty allows simultaneous effect estimation and variable selection in a single model. We propose a new parameter estimation method which estimates and selects parameters by the modified Gauss-Seidel method nested within the EM algorithm. Simulation studies show that it has a low false positive rate and reasonable power in detecting haplotype interactions. The method is applied to test haplotype interactions between the maternal and offspring genomes in a small for gestational age (SGA) neonates data set, and significant interactions between the two genomes are detected.

Conclusions

As demonstrated by the simulation studies and real data analysis, the approach developed provides an efficient tool for the modeling and testing of haplotype interactions. The implementation of the method in R codes can be freely downloaded from

Background

It is commonly recognized that most human diseases are complex, involving the joint action of multiple genes as well as complicated gene-gene and gene-environment interactions.

Mapping genetic interactions has traditionally been pursued in model organisms to identify functional relationships among genes.

Genetic interaction, also termed epistasis, occurs when the effect of one genetic variant is suppressed or enhanced by the presence of other genetic variants.

The high-dimensional SNP data present unprecedented opportunities as well as daunting challenges in statistical modeling and testing in identifying genetic interactions. However, for most complex diseases, it remains largely unknown which combination of genetic variants is causal to the disease. Given that most traits or diseases are multifactorial and genetically complex, it is very unlikely that the function of a single variant can induce an overt disease signal without modeling the gene networks or pathways. Lin and Wu

In Cui ... L_{1}-penalty, commonly termed the adaptive LASSO. The adaptive L_{1}-penalty allows effect estimation and variable selection simultaneously in a single model. Moreover, it preserves the oracle property of variable selection

Methods

We first explain our method for a model involving interactions between haplotypes in two haplotype blocks, each containing two SNPs. The method extends readily to more complex models. Assume we have a study sample of n_{1} cases and n_{2} controls. A number of SNPs are genotyped on either a genome-wide or a candidate-gene scale. Following the notation given in Liu

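The configuration of two SNP combinations

| Observed Genotype | Diplotype Configuration | Diplotype Frequency | Composite Diplotype Relative Freq. |
|---|---|---|---|
| 11/11 | [11][11] | p_{11}^{2} | 1 |
| 11/12 | [11][12] | 2p_{11}p_{12} | 1 |
| 11/22 | [12][12] | p_{12}^{2} | 1 |
| 12/11 | [11][21] | 2p_{11}p_{21} | 1 |
| 12/12 | [11][22] | 2p_{11}p_{22} | φ |
| | [12][21] | 2p_{12}p_{21} | 1 − φ |
| 12/22 | [12][22] | 2p_{12}p_{22} | 1 |
| 22/11 | [21][21] | p_{21}^{2} | 1 |
| 22/12 | [21][22] | 2p_{21}p_{22} | 1 |
| 22/22 | [22][22] | p_{22}^{2} | 1 |

Where φ = p_{11}p_{22} / (p_{11}p_{22} + p_{12}p_{21}).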
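The mapping from observed genotypes to diplotypes can be sketched in a few lines of Python. This is only an illustration under Hardy-Weinberg equilibrium. The function names and the example haplotype frequencies are hypothetical and are not taken from the authors' R implementation.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """Return the unordered haplotype pairs consistent with a genotype.

    genotype holds one allele pair per SNP, e.g. ((1, 2), (1, 2)) for 12/12.
    """
    diplotypes = set()
    # For each SNP, choose which of its two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(g[c] for g, c in zip(genotype, choice))
        hap2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        diplotypes.add(tuple(sorted((hap1, hap2))))
    return diplotypes

def diplotype_frequency(diplotype, hap_freq):
    """Diplotype frequency under Hardy-Weinberg equilibrium."""
    h1, h2 = diplotype
    p = hap_freq[h1] * hap_freq[h2]
    return p if h1 == h2 else 2 * p

def relative_frequencies(genotype, hap_freq):
    """Relative frequency of each diplotype given the observed genotype."""
    freqs = {d: diplotype_frequency(d, hap_freq)
             for d in compatible_diplotypes(genotype)}
    total = sum(freqs.values())
    return {d: f / total for d, f in freqs.items()}

# Hypothetical haplotype frequencies for [11], [12], [21], [22].
example_freq = {(1, 1): 0.4, (1, 2): 0.1, (2, 1): 0.2, (2, 2): 0.3}
ambiguous = relative_frequencies(((1, 2), (1, 2)), example_freq)
```

Only the double heterozygote 12/12 yields two candidate diplotypes. Every other genotype maps to a single diplotype with relative frequency 1.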

The epistasis model

We consider two haplotype blocks _{1}, _{2}, _{1}_{1}_{2}_{2 },

where

a_{s(t)} and d_{s(t)} can be interpreted as the additive and dominance effects for the risk haplotype at block s (t); i_{aa}, i_{ad}, i_{da}, and i_{dd} can be interpreted as the additive×additive, additive×dominance, dominance×additive, and dominance×dominance interaction effects between the two blocks, respectively.
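As a minimal sketch of how the eight genetic effects could enter a logistic model, the Python snippet below builds one design row from the number of "risk" haplotype copies carried at each block. The numeric coding (additive values 1, 0, -1 and dominance values 1/2, -1/2), the ordering of the effects, and all function names are assumptions made for illustration. They are chosen to agree with the ±1/2 interaction coefficients discussed in the case study, not copied from the original equation.

```python
import math

def additive_dominance(n_risk):
    """Map the number of risk-haplotype copies (0, 1 or 2) to (x, z)."""
    x = n_risk - 1                      # 2 copies -> 1, 1 copy -> 0, 0 copies -> -1
    z = 0.5 if n_risk == 1 else -0.5    # heterozygous composite diplotype -> 1/2
    return x, z

def design_row(n_risk_s, n_risk_t):
    """Row for (a_s, a_t, d_s, d_t, i_aa, i_ad, i_da, i_dd), in that assumed order."""
    xs, zs = additive_dominance(n_risk_s)
    xt, zt = additive_dominance(n_risk_t)
    return [xs, xt, zs, zt, xs * xt, xs * zt, zs * xt, zs * zt]

def disease_probability(mu, beta, row):
    """Logistic disease probability for one subject."""
    eta = mu + sum(b * v for b, v in zip(beta, row))
    return 1.0 / (1.0 + math.exp(-eta))

# Two risk copies at block s, one at block t.
row = design_row(2, 1)
```

Because every composite diplotype collapses to one of three classes per block, the row always has eight entries, whatever the number of SNPs in a block.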

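Let y_{i} denote a measured disease trait for subject i, and let X_{g} and X_{e} denote the design matrices for the genetic effects and the other covariates, with x_{ig} and x_{ie} the i^{th} rows of X_{g} and X_{e}, respectively. The linear predictor is X_{g}β + X_{e}γ, where β = [a_{s}, a_{t}, d_{s}, d_{t}, i_{aa}, i_{ad}, i_{da}, i_{dd}]^{T} = [β_{1}, β_{2}, ..., β_{8}]^{T}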

Compared to most non-parametric methods in detecting gene-gene interactions, such as the multifactor dimensionality reduction (MDR) method which only provides an interaction test

Likelihood function

We first introduce notations. Let _{is }_{it }_{is}_{it }_{is }_{it }_{is }_{it}_{1}, _{2}, _{3 }and _{4 }as four distinct genotype groups corresponding to the classification of phase (un)ambiguous haplotype blocks:

To construct likelihood function, all three groups, _{2}, _{3}, _{4}, except group _{1}, involve phase ambiguity genotypes, hence need to be modeled with mixture distributions.

Define

We further define a set of the logistic regression functions for each genotype group as

Assuming independence between individuals, we construct the joint likelihood function as follows:

Because the phase ambiguous state _{si }_{ti }

Variable selection methods such as LASSO

where λ is a tuning parameter that balances the likelihood and the penalty term and is chosen by the minimum Bayesian Information Criterion (BIC); w = (w_{1}, w_{2}, ..., w_{8}) is a weight vector for the genetic effects β_{j}, derived from β_{OLS}
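To make the role of the tuning parameter and the weights concrete, here is a minimal Python sketch of an adaptively weighted L1-penalized logistic log-likelihood. It assumes the common adaptive LASSO form in which each weight is the reciprocal of the absolute value of an initial estimate, and it covers phase-known genotypes only (no mixture over ambiguous phases). The function names and the small `eps` guard are illustrative assumptions.

```python
import math

def log_likelihood(y, X, mu, beta):
    """Logistic log-likelihood for phase-known data."""
    ll = 0.0
    for yi, xi in zip(y, X):
        eta = mu + sum(b * v for b, v in zip(beta, xi))
        # log(1 + exp(eta)) written in a numerically stable way.
        ll += yi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return ll

def adaptive_weights(beta_init, eps=1e-8):
    """Weights w_j = 1 / |initial estimate of beta_j| (assumed form)."""
    return [1.0 / max(abs(b), eps) for b in beta_init]

def penalized_log_likelihood(y, X, mu, beta, lam, weights):
    """Log-likelihood minus lam * sum_j w_j * |beta_j|."""
    penalty = lam * sum(w * abs(b) for w, b in zip(weights, beta))
    return log_likelihood(y, X, mu, beta) - penalty
```

A coefficient with a small initial estimate receives a large weight and is penalized more heavily, which is what lets the adaptive penalty shrink weak effects exactly to zero.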

Missing data and the EM algorithm

The phase-ambiguous genotypes lead to missing data. Existing algorithms for LASSO or adaptive LASSO estimation cannot be applied directly to maximize the penalized likelihood (3). This can be addressed with an EM algorithm, detailed as follows:

1) Initialize β, γ, and calculate

2) **E-step**: Estimate _{si}_{ti }_{ji}

for _{k }(

For i ∈ _{4}, we have

where

3) **M-step**: Update β,γ by maximizing the penalized log likelihood function (3);

4) Repeat steps 2)-3) until convergence.
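The E-step for one phase-ambiguous subject can be sketched as a posterior weighting of the candidate diplotype configurations. The Python snippet below is an illustration under stated assumptions: each configuration contributes its prior (relative) frequency times the Bernoulli probability of the observed disease status under the current parameter values. The function names and inputs are hypothetical.

```python
import math

def bernoulli_probability(y, eta):
    """P(Y = y) under a logistic model with linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p

def e_step_weights(y, etas, priors):
    """Posterior probability of each candidate phase configuration.

    etas   : linear predictor implied by each configuration
    priors : relative frequency of each configuration given the genotype
    """
    joint = [pr * bernoulli_probability(y, eta) for eta, pr in zip(etas, priors)]
    total = sum(joint)
    return [j / total for j in joint]
```

In the M-step these posterior weights would multiply the corresponding log-likelihood contributions, so an ambiguous subject enters the fit once per candidate configuration.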

Computational algorithm for maximizing the penalized log likelihood

In the M-step, the parameters β and γ are updated by computing the LASSO estimate. LASSO regression with a continuous response has been well studied, and very efficient algorithms have been proposed, such as the shooting algorithm and the LARS

We first derive the first order optimality conditions for the penalized likelihood (3). It is noticed that the penalized likelihood _{j }_{j}_{j }

For the phase known genotypes, _{j }

With the phase ambiguous genotypes, _{j }_{si}_{ti}

Based on the above conditions, we define

Therefore, the optimal conditions could be achieved when _{j }_{j}_{z }_{j }_{nz }_{j }

1) Initialize β_{j}

2) While any β_{j} is a violator in I_{z}

  Find the maximum violator β_{k} in I_{z}

  Update β_{k}

  While any β_{j} is a violator in I_{nz}

    Find the maximum violator β_{l} in I_{nz}

    Update β_{l}

  Until no violator exists in I_{nz}

Until no violator exists in I_{z}
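The violator-driven loop can be illustrated with a simplified Python sketch. It checks the first-order optimality conditions of a weighted L1-penalized logistic log-likelihood and repeatedly updates the single coefficient with the largest violation. Several simplifications are assumptions of this sketch and not the procedure described above: the one-dimensional update is a soft-thresholded majorization step (using the bound 1/4 on the logistic curvature), zero and nonzero coefficients are handled in one loop, there is no intercept, and genotypes are phase-known.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient(y, X, beta):
    """Gradient of the logistic log-likelihood with respect to each beta_j."""
    resid = [yi - sigmoid(sum(b * v for b, v in zip(beta, xi)))
             for yi, xi in zip(y, X)]
    return [sum(r * xi[j] for r, xi in zip(resid, X)) for j in range(len(beta))]

def violations(grad, beta, lam, w):
    """Size of the optimality violation for each coefficient."""
    viol = []
    for g, b, wj in zip(grad, beta, w):
        if b == 0.0:
            # A zero coefficient is optimal when |gradient| <= lam * w_j.
            viol.append(max(abs(g) - lam * wj, 0.0))
        else:
            # A nonzero coefficient is optimal when gradient = lam * w_j * sign(b).
            viol.append(abs(g - lam * wj * math.copysign(1.0, b)))
    return viol

def soft_threshold(v, t):
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit(y, X, lam, w, tol=1e-6, max_iter=10000):
    """Update the maximum violator until no violation exceeds tol."""
    p = len(X[0])
    beta = [0.0] * p
    # Curvature bound per coordinate: 0.25 * sum_i x_ij^2 (columns assumed nonzero).
    bound = [0.25 * sum(xi[j] ** 2 for xi in X) for j in range(p)]
    for _ in range(max_iter):
        g = gradient(y, X, beta)
        viol = violations(g, beta, lam, w)
        k = max(range(p), key=viol.__getitem__)
        if viol[k] < tol:
            break
        beta[k] = soft_threshold(beta[k] + g[k] / bound[k], lam * w[k] / bound[k])
    return beta
```

Updating one coefficient at a time keeps each step a one-dimensional concave problem, and the soft-thresholding step returns exact zeros, so variable selection falls out of the optimization itself.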

For computational precision purpose, the condition _{j }_{j }^{-5 }in our computation.

This method is based on the convexity of the likelihood function. The computation procedure updates one _{j }

**Strict convexity of the log likelihood function**. The file contains the proof of strict convexity of the log likelihood function.


Risk haplotype selection

We treat each possible haplotype as a potential "risk" haplotype. The haplotype with the minimum BIC value, defined below, is chosen as the "risk" haplotype.

where

Results

Simulation study

We conducted a series of simulations under various scenarios to evaluate the statistical properties of the proposed method. Within each block, the minor allele frequencies of the two SNPs were assumed to be 0.3 and 0.4 with a linkage disequilibrium

Data were simulated by assuming one haplotype was distinct from the other ones for each block. Haplotypes were simulated assuming Hardy-Weinberg equilibrium. A disease status was simulated from a Bernoulli distribution with given genetic effects under different scenarios (Table
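A simulation of this kind can be sketched for a single block in Python. The snippet draws two haplotypes independently (Hardy-Weinberg equilibrium), counts copies of the distinct "risk" haplotype, and draws the disease status from a Bernoulli distribution. The haplotype frequencies, the intercept `mu`, the additive and dominance coding, and the function name are hypothetical choices for illustration. Linkage disequilibrium and the second block are omitted for brevity.

```python
import math
import random

def simulate_subject(rng, hap_freqs, risk_hap, mu, a, d):
    """Simulate one subject's haplotype pair and disease status for one block."""
    haps = list(hap_freqs)
    weights = [hap_freqs[h] for h in haps]
    # Hardy-Weinberg equilibrium: the two haplotypes are drawn independently.
    pair = rng.choices(haps, weights=weights, k=2)
    n_risk = sum(h == risk_hap for h in pair)
    x = n_risk - 1                      # assumed additive coding
    z = 0.5 if n_risk == 1 else -0.5    # assumed dominance coding
    eta = mu + a * x + d * z
    prob = 1.0 / (1.0 + math.exp(-eta))
    y = 1 if rng.random() < prob else 0  # Bernoulli disease status
    return pair, y

rng = random.Random(2024)
# Hypothetical haplotype frequencies for [11], [12], [21], [22].
freqs = {"11": 0.42, "12": 0.28, "21": 0.18, "22": 0.12}
sample = [simulate_subject(rng, freqs, "11", mu=-0.5, a=0.8, d=0.8)
          for _ in range(200)]
```

Passing an explicit `random.Random` instance keeps each simulated data set reproducible from its seed.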

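List of parameter values under different simulation designs

| Scenario | a_{s} | a_{t} | d_{s} | d_{t} | i_{aa} | i_{ad} | i_{da} | i_{dd} |
|---|---|---|---|---|---|---|---|---|
| S0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S1 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 |
| S2 | 0.8 | 0.8 | 0 | 0 | 0 | 0 | 0 | 0 |
| S3 | 0.8 | 0.8 | 0.8 | 0.8 | 0 | 0 | 0 | 0 |
| S4 | 0.8 | 0 | 0.8 | 0 | 0.8 | 0.8 | 0.8 | 0.8 |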

Figure


**The bar plot of variable selection results under different simulation scenarios**. Parameter values are listed in Table 2. The three sets of colored bars correspond to different sample sizes (Blue:200; Green:500; Red:1000). The horizontal dashed line indicates the nominal level of 0.05.

A case study

We applied our model to a perinatal case-control study of small for gestational age (SGA) neonates, part of a large-scale candidate gene-based genetic association study of pregnancy complications conducted in Chile. A total of 991 mother-offspring pairs (406 SGA cases and 585 controls) were genotyped for 1331 SNPs in 200 genes. Because interaction between the maternal and fetal genomes is considered a primary genetic source of SGA, we focused our analysis on identifying haplotype interactions between the maternal and fetal genomes.

We first excluded SNPs that had a minor allele frequency of less than 5% or that did not satisfy Hardy-Weinberg equilibrium (HWE) in the combined mother and offspring control population, as judged by a chi-square test with a cut-off p-value of 0.001. We further used the computer software Haploview
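The SNP filter described here can be sketched in Python as follows. The snippet assumes a one-degree-of-freedom chi-square goodness-of-fit test on the three genotype counts, which is one common form of the HWE test. The function names and default thresholds mirror the text, but the code itself is illustrative and not the authors' pipeline.

```python
import math

def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(n_aa, n_ab, n_bb, maf_min=0.05, hwe_alpha=0.001):
    """Keep a SNP only if it is common enough and consistent with HWE."""
    if maf(n_aa, n_ab, n_bb) < maf_min:
        return False  # also protects the HWE test from monomorphic SNPs
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha
```

For example, `passes_qc(25, 50, 25)` keeps a SNP whose control genotype counts match Hardy-Weinberg proportions exactly, while a SNP with no heterozygotes among 100 controls is dropped.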

We picked two SNPs within each block and applied our model to study the main effects as well as the haplotype interaction effects between the maternal and offspring genomes. By fitting the model described in the previous section while controlling for other variables, including maternal age and BMI, we identified several SNP haplotypes with interaction effects through the adaptive LASSO logistic regression model. Permutation tests of 1000 runs were then conducted to assess significance. In each permutation, the phenotypes were permuted and the model was refitted to obtain new parameter estimates. An empirical p-value for effect
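The permutation procedure can be sketched in Python as below. The sketch assumes the empirical p-value is the fraction of permuted data sets whose refitted effect is at least as large in absolute value as the observed one, which is consistent with the p* values reported from 1000 runs. The callback `fit_effect` is a hypothetical stand-in for refitting the full model and returning one effect estimate.

```python
import random

def permutation_p_value(y, fit_effect, n_perm=1000, seed=0):
    """Empirical p-value of one effect under phenotype permutation.

    fit_effect(labels) must refit the model on the given phenotype labels
    and return the estimate of the effect of interest.
    """
    rng = random.Random(seed)
    observed = abs(fit_effect(y))
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the genotype-phenotype association
        if abs(fit_effect(y_perm)) >= observed:
            exceed += 1
    return exceed / n_perm
```

With 1000 permutations the smallest nonzero value this estimate can take is 0.001, so a count of zero exceedances is naturally reported as p* < 0.001.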

Results of the real data analysis were summarized in Table

List of selected genes, corresponding "risk" haplotype structure, effect estimates and permutation p-values

| SNP ID (allele) | Gene (region) | "Risk" haplotype | a_{s} | d_{s} | a_{t} | d_{t} | i_{aa} | i_{ad} | i_{da} | i_{dd} |
|---|---|---|---|---|---|---|---|---|---|---|
| 9508994 (C/T) | PON1 (intron 1) | [TC]^{M} | 0 | 0 | 0 | 0 | 0 | -0.45 | 0 | 0 |
| 20209376 (C/T) | PON1 (intron 5) | [CC]^{O} | | | | | | p* = 0.001 | | |
| 659435566 (C/T) | NFKB1 (exon 12) | [CC]^{M} | 0 | 0 | 0 | 0 | -0.33 | 0 | 0 | 0 |
| 659435702 (C/G) | NFKB1 (intron 22) | [TC]^{O} | | | | | p* = 0.001 | | | |
| 22767327 (A/T) | FLT4 (intron 7) | [AT]^{M} | 0 | 0 | 0 | 0 | 0 | -0.30 | 0 | 0 |
| 22175087 (C/T) | FLT4 (intron 8) | [TC]^{O} | | | | | | p* < 0.001 | | |
| 1125300 (G/T) | SPARC (intron 3) | [TT]^{M} | 0 | -0.38 | 0 | 0 | 0 | 0 | 0 | 0.245 |
| 1125290 (G/T) | SPARC (intron 5) | [TT]^{O} | | p* = 0.001 | | | | | | p* < 0.001 |
| 634841108 (A/C) | TIMP2 (intron 2) | [AG]^{M} | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.68 |
| 634841123 (A/G) | TIMP2 (exon 3) | [CG]^{O} | | | | | | | | p* < 0.001 |
| 634018768 (A/G) | HPGD (promoter) | [AG]^{M} | 0 | 0 | 0.44 | 0 | 0 | 0 | 0 | 0 |
| 636105057 (A/G) | HPGD (promoter) | [GA]^{O} | | | p* < 0.001 | | | | | |
| 17252653 (G/T) | MMP9 (intron) | [GC]^{M} | 0 | 0 | 0.53 | 0 | 0 | 0 | 0 | 0 |
| 17254821 (C/G) | MMP9 (exon 10) | [TC]^{O} | | | p* < 0.001 | | | | | |

^{M }mother's "risk" haplotype information; ^{O }offspring's "risk" haplotype information

p* is the permutation p-value.

Our approach conducts the variable selection and effect estimation simultaneously, which allows us to have a direct biological interpretation for the mode of gene action. Here, we use gene

Following Eq. (1), we can see that R_{1 }corresponds to the baseline reference group, R_{2 }corresponds to the risk group with -1/2 interaction coefficient, and R_{3 }corresponds to the risk group with 1/2 interaction coefficient. Correspondingly, the log odds of the disease development in each 'risk' group and the odds ratio (OR) between groups can be estimated by:

Other non-parametric methods, such as multifactor dimensionality reduction (MDR), have been shown to be successful for the identification of interaction effects in many studies. Because MDR can only be applied to studies with balanced case/control design, generalized MDR (GMDR) has been proposed as an extension to MDR

In the example of

Model extension

Our method has been illustrated with two SNPs only. The model can be easily extended to more than two SNPs. When three or more SNPs are involved in each haplotype block, Cui

Another possible solution to the challenges mentioned above is to do a sliding window search with each window covering two SNPs at a time. This is similar to the sliding window haplotype analysis commonly applied in some software such as PLINK.
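A sliding-window scan of this kind amounts to enumerating consecutive SNP pairs within a block. The short Python sketch below illustrates the idea. The function name and the SNP labels are hypothetical, and the sketch does not reproduce the PLINK implementation.

```python
def sliding_windows(snps, size=2):
    """Return consecutive windows of `size` SNPs, moving one SNP at a time."""
    return [tuple(snps[i:i + size]) for i in range(len(snps) - size + 1)]

# Hypothetical ordered SNP labels within one haplotype block.
block = ["snp1", "snp2", "snp3", "snp4"]
windows = sliding_windows(block)
```

Each window would then be analyzed as a two-SNP haplotype block with the model described in the Methods section.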

Discussion and Conclusions

Although it has been reported that gene-gene interaction plays a major role in genetic studies of complex diseases, the detection of gene-gene interaction has been traditionally pursued on a single SNP level, i.e., focusing on single SNP interaction. Intuitively, SNP-SNP interaction can not represent gene-gene interaction because single SNPs cannot capture the total variation of a gene. Thus, extending the idea of single SNP interaction to haplotype interaction could potentially gain much in terms of capturing variations in genes. The proposed method defines gene-gene interaction through haplotype block interactions and offers an alternative strategy in finding potential interactions between two genes. We argue that the definition of haplotype block interaction could provide additional biological insights into a disease etiology, compared to a single SNP-based interaction analysis.

One of the advantages of our method is in grouping, hence reducing data dimension. By mapping genotypes to composite diplotypes, the data dimension is significantly reduced. Then we can use Bayesian information criterion to select potential "risk" haplotypes

Our simulation study showed that our method has reasonable false positive control and selection power for the genetic parameters. As we expected, the interaction effects have lower selection power compared to the main effects. As sample size increases, we are able to achieve an optimal power for the interaction effects. Another novelty of the method is the modeling of the "risk" haplotype, which leads to the partition of composite diplotypes. No matter how many SNPs are involved, it always ends up with three types of composite diplotypes. Thus, the number of genetic parameters is always fixed regardless of the number of SNPs. The only cost is the search for possible "risk" haplotypes through a larger parameter space.

We applied our method to an SGA study data set. Several SNP pairs were selected with either main or interaction effects. The permutation tests confirmed the statistical significance of the selected effects. Our findings are consistent with genes previously reported in the literature. Gene

We also identified potential interaction effects for several additional genes, including

Authors' contributions

ML performed the analysis and wrote the manuscript; RR collected the data; WF participated in the design and manuscript writing; YC conceived the idea, designed the model and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors wish to thank the two anonymous referees for their helpful comments that improved the manuscript, and Dr. Kelian Sun for help with data processing. This work was supported in part by NSF grant DMS-0707031 and by the Perinatology Research Branch, Division of Intramural Research,