Abstract
Background
The genetic etiology of complex diseases in humans is commonly viewed as a process in which genetic and environmental factors act together in a complicated manner. Quite often, interactions among genetic variants play major roles in determining an individual's susceptibility to a particular disease. Statistical methods for modeling interactions between single genetic variants (e.g. single nucleotide polymorphisms, or SNPs) underlying complex diseases have been extensively studied. Recently, haplotype-based analysis has gained popularity in genetic association studies. When multiple sequence or haplotype interactions are involved in determining an individual's susceptibility to a disease, modeling and testing the interaction effects presents daunting statistical challenges, largely because of the complexity of higher-order epistasis.
Results
In this article, we propose a new strategy for modeling haplotype-haplotype interactions under the penalized logistic regression framework with an adaptive L_{1} penalty. We consider interactions of sequence variants between haplotype blocks. The adaptive L_{1} penalty allows simultaneous effect estimation and variable selection in a single model. We propose a new parameter estimation method which estimates and selects parameters by a modified Gauss-Seidel method nested within the EM algorithm. Simulation studies show that it has a low false positive rate and reasonable power in detecting haplotype interactions. The method is applied to test haplotype interactions between mother and offspring genomes in a small for gestational age (SGA) neonates data set, and significant interactions between the two genomes are detected.
Conclusions
As demonstrated by the simulation studies and real data analysis, the approach developed provides an efficient tool for the modeling and testing of haplotype interactions. The R implementation of the method can be freely downloaded from http://www.stt.msu.edu/~cui/software.html.
Background
It is commonly recognized that most human diseases are complex, involving the joint effects of multiple genes as well as complicated gene-gene and gene-environment interactions [1]. The identification of disease risk factors for monogenic diseases has been quite successful in the past. Because many single genetic variants have only a small effect on disease risk, the identification of disease variants for complex multigenic diseases has not been very successful [2]. There are multiple reasons for this. First, most complex diseases involve multiple genetic variants, each conferring a small or moderate effect on disease risk. Second, the complexity arises from complicated interactions among disease variants, on a single-single variant or multiple-multiple variant basis. Third, but not least, gene-environment interaction also plays a pivotal role in determining the underlying complexity of disease etiology. Studies testing gene-gene interactions have been commonly pursued in the past, but little has been achieved, despite the importance of such interactions in determining disease risk (see [3] for a comprehensive review).
Mapping genetic interactions has traditionally been pursued in model organisms to identify functional relationships among genes [4-6]. Following the seminal work on quantitative trait loci (QTL) mapping by Lander and Botstein [7], extensive work has focused on experimental crosses to study the genetic architecture of complex traits. Along this line, methods for mapping QTL interactions have also been developed [8,9]. The recent development of the human HapMap and radical breakthroughs in genotyping technology have enabled us to generate high-throughput single nucleotide polymorphism (SNP) data which are dense enough to cover the whole genome [10]. This advance allows us to characterize, at the sequence level, variants that encode a complex disease phenotype, and opens promising prospects for the identification of disease variants [11,12].
Genetic interaction, also termed epistasis, occurs when the effect of one genetic variant is suppressed or enhanced by the existence of other genetic variants [13]. In line with this definition, Mani et al. [14] recently defined two distinct genetic interactions, namely the synergistic interaction, in which an extreme phenotype is expected whenever double mutations are present, and the alleviating interaction, in which a mutation in one gene masks the effect of another mutation by impairing the function of relevant pathways. As an important component of the genetic architecture of many biological traits, the role of epistasis in shaping an organism's development has been widely recognized [15,16]. An increasing number of empirical studies have also revealed the role of epistasis in the pathogenesis of most common human diseases, such as cancer or cardiovascular disease [17,18].
High-dimensional SNP data present unprecedented opportunities as well as daunting challenges for statistical modeling and testing in identifying genetic interactions. However, for most complex diseases, it remains largely unknown which combination of genetic variants is causal to the disease. Given that most traits or diseases are multifactorial and genetically complex, it is very unlikely that the function of a single variant can induce an overt disease signal without modeling the gene networks or pathways. Lin and Wu [19] proposed a sequence interaction model in a linear regression framework for a quantitative phenotype. Zhang et al. [20] proposed an entropy-based method for searching haplotype-haplotype interactions using unphased genotype data, with applications in type I diabetes. Musani et al. [21] and Cordell [3] recently gave comprehensive reviews of statistical methods developed for detecting gene-gene interactions. Most of these methods, such as the popular multifactor dimensionality reduction (MDR) method [22], are nonparametric in nature and do not provide effect estimates for gene-gene interactions. Thus methods focusing on data reduction ignore the biological interpretation of the interaction. For instance, if two SNPs are identified as interacting, how do they interact genetically? What are the modes of gene action?
In Cui et al. [12], a novel approach was proposed to group haplotypes in order to detect risk haplotypes associated with a disease. As an extension of this work, we propose a new statistical method to model haplotype-haplotype interactions responsible for a binary disease phenotype. We assume a population-based case-control design in which the disease phenotype is dichotomous. To handle high-order interactions, we propose a penalized logistic regression framework with an adaptive L_{1} penalty, commonly termed the adaptive LASSO [23]. The adaptive L_{1} penalty allows effect estimation and variable selection simultaneously in a single model. Moreover, it preserves the oracle property of variable selection [23]. Due to the binary nature of the response, we propose a modified Gauss-Seidel method nested within the EM algorithm to estimate parameters. The model is applied to a real data set in which significant haplotype interactions are detected between mother and offspring genomes that might be responsible for disease risks in pregnancy.
Methods
We first explain our method for a model involving interactions of haplotypes in two different haplotype blocks, each containing two SNPs. The approach extends readily to more complex models. Assume we have a study sample of n unrelated subjects with n_{1} cases and n_{2} controls. A number of SNPs are genotyped either on a genome-wide scale or on a candidate gene-based scale. Following the notation given in Liu et al. [11] and Cui et al. [12], we construct composite diplotypes by defining a distinct haplotype, termed the "risk" haplotype, for each haplotype block. Assuming two SNPs in each block, there are nine possible genotypes, numerically denoted as 11/11, 11/12, 11/22, 12/11, 12/12, 12/22, 22/11, 22/12, 22/22. Without loss of generality, we assume [11] to be the "risk" haplotype. We denote the risk haplotype as H and all other non-risk haplotypes as \bar{H}. In doing so, we can map the observed genotypes to three possible composite diplotypes, i.e., HH, H\bar{H} and \bar{H}\bar{H}. Except for the double heterozygote 12/12, which is phase ambiguous and could arise from two possible composite diplotypes, all other genotypes can be mapped to unique composite diplotypes. A detailed list of the configuration is given in Table 1.
Table 1. The configuration of two SNP combinations
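The genotype-to-composite-diplotype mapping described above can be sketched in Python. This is an illustrative sketch, not the authors' R implementation: the function names are hypothetical, and the ASCII labels `Hh` and `hh` are assumed stand-ins for H\bar{H} and \bar{H}\bar{H}.

```python
def diplotypes(genotype):
    """Enumerate the haplotype pairs compatible with a two-SNP genotype.

    A genotype such as '12/12' lists the two alleles at SNP 1, then the
    two alleles at SNP 2. A haplotype such as '11' carries allele 1 at both SNPs.
    """
    snp1, snp2 = genotype.split("/")
    pairs = set()
    # Pair the SNP 1 alleles with the SNP 2 alleles in both possible phases.
    for a, b in ((snp2[0], snp2[1]), (snp2[1], snp2[0])):
        pairs.add(tuple(sorted((snp1[0] + a, snp1[1] + b))))
    return sorted(pairs)


def composite_diplotypes(genotype, risk="11"):
    """Map a genotype to its possible composite diplotypes for a given risk haplotype."""
    labels = {2: "HH", 1: "Hh", 0: "hh"}  # number of risk haplotype copies -> label
    out = set()
    for h1, h2 in diplotypes(genotype):
        out.add(labels[(h1 == risk) + (h2 == risk)])
    return sorted(out)


GENOTYPES = ["11/11", "11/12", "11/22", "12/11", "12/12",
             "12/22", "22/11", "22/12", "22/22"]
```

Under these assumptions only the double heterozygote 12/12 maps to more than one composite diplotype, which matches the statement that all other genotypes are phase unambiguous.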
The epistasis model
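We consider two haplotype blocks s and t, each with two SNPs. There are a total of 81 possible genotype combinations. In each block, only the double heterozygote has ambiguous linkage phase, thus 64 genotype combinations can be mapped to unique composite diplotypes. Let (H_{1}, \bar{H}_{1}) and (H_{2}, \bar{H}_{2}) be the risk and non-risk haplotypes at blocks s and t, respectively. Expressed in terms of composite diplotypes, the four haplotypes can form nine distinct composite diplotypes: H_{1}H_{1}H_{2}H_{2}, H_{1}H_{1}H_{2}\bar{H}_{2}, H_{1}H_{1}\bar{H}_{2}\bar{H}_{2}, H_{1}\bar{H}_{1}H_{2}H_{2}, H_{1}\bar{H}_{1}H_{2}\bar{H}_{2}, H_{1}\bar{H}_{1}\bar{H}_{2}\bar{H}_{2}, \bar{H}_{1}\bar{H}_{1}H_{2}H_{2}, \bar{H}_{1}\bar{H}_{1}H_{2}\bar{H}_{2} and \bar{H}_{1}\bar{H}_{1}\bar{H}_{2}\bar{H}_{2}. The effects of the nine distinct composite diplotypes can be modeled through the traditional quantitative genetics model. Specifically, we use Cockerham's orthogonal partition method [24], in which the genetic mean of an interaction model between blocks s and t can be expressed as
μ_{st} = μ + a_{s}x_{s} + a_{t}x_{t} + d_{s}z_{s} + d_{t}z_{t} + i_{aa}x_{s}x_{t} + i_{ad}x_{s}z_{t} + i_{da}z_{s}x_{t} + i_{dd}z_{s}z_{t}   (1)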
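where
x_{s} = 1 for H_{1}H_{1}, 0 for H_{1}\bar{H}_{1}, and -1 for \bar{H}_{1}\bar{H}_{1};
z_{s} = 1/2 for H_{1}\bar{H}_{1}, and -1/2 for H_{1}H_{1} or \bar{H}_{1}\bar{H}_{1};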
x_{t} and z_{t} can be defined similarly. With the above definition, a_{s(t)} and d_{s(t)} can be interpreted as the additive and dominance effects for the risk haplotype at block s(t); i_{aa}, i_{ad}, i_{da} and i_{dd} can be interpreted as the additive×additive, additive×dominance, dominance×additive, and dominance×dominance interaction effects between the two blocks, respectively.
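As an illustration of how the eight genetic effects enter the model, the sketch below builds one row of genetic codes from a pair of composite diplotypes. The numerical coding (additive code 1, 0, -1 and dominance code 1/2 for the heterozygote, -1/2 otherwise) is an assumption: it is a common Cockerham-type coding and agrees with the 1/2 interaction coefficients mentioned in the case study, but it should be checked against the original definitions. The names `design_row` and `linear_predictor` and the labels `Hh`/`hh` are hypothetical.

```python
# Assumed coding of composite diplotypes (labels: HH, Hh = H with non-risk, hh = two non-risk).
ADDITIVE = {"HH": 1.0, "Hh": 0.0, "hh": -1.0}
DOMINANCE = {"HH": -0.5, "Hh": 0.5, "hh": -0.5}


def design_row(dip_s, dip_t):
    """Genetic codes ordered to match beta = [a_s, a_t, d_s, d_t, i_aa, i_ad, i_da, i_dd]."""
    x_s, z_s = ADDITIVE[dip_s], DOMINANCE[dip_s]
    x_t, z_t = ADDITIVE[dip_t], DOMINANCE[dip_t]
    return [x_s, x_t, z_s, z_t, x_s * x_t, x_s * z_t, z_s * x_t, z_s * z_t]


def linear_predictor(row, beta, intercept=0.0):
    """Intercept plus the inner product of the genetic codes and the genetic effects."""
    return intercept + sum(b * v for b, v in zip(beta, row))
```

Whatever the number of SNPs per block, a row of this kind always has eight entries, because the codes depend only on the composite diplotypes.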
Let y_{i} denote a measured disease trait for subject i, which is dichotomous, taking value 1 or 0 for an affected or unaffected individual, respectively. Let X_{g} denote a matrix of numerical codes corresponding to the two composite diplotypes as well as their interactions, and let X_{e} denote a matrix of measured covariates, including the intercept as the first column. Let x_{ig} and x_{ie} denote the i^{th} rows of X_{g} and X_{e}. We assume that these factors influence the mean of the trait through a linear predictor η = X_{g}β + X_{e}γ, where β = [a_{s}, a_{t}, d_{s}, d_{t}, i_{aa}, i_{ad}, i_{da}, i_{dd}]^{T} contains the regression parameters for the genetic effects of composite diplotypes on the disease trait, and γ contains the effects of the overall mean and the covariates. To simplify the notation, we also write β = [β_{1}, β_{2}, ..., β_{8}]^{T} for the genetic effects in the equations below. Given a binary disease response, we can apply a conditional logistic model of the form
logit P(y_{i} = 1 | x_{ig}, x_{ie}) = x_{ig}β + x_{ie}γ
Compared to most nonparametric methods for detecting gene-gene interactions, such as the multifactor dimensionality reduction (MDR) method which only provides an interaction test [19], the above interaction model allows one to identify which haplotypes are the risk haplotypes in the two haplotype blocks, and to further quantify the specific structure and effect size of epistatic interactions between the two blocks. We argue that this model-based epistatic test provides biologically more meaningful results than a nonparametric method such as MDR.
Likelihood function
We first introduce some notation. Let g_{is} and g_{it} denote the observed genotypes in haplotype blocks s and t, respectively, for subject i. With the same numerical notation defined previously, we have g_{is}, g_{it} ∈ {11/11, 11/12, 11/22, 12/11, 12/12, 12/22, 22/11, 22/12, 22/22}. Let G_{is} and G_{it} be the underlying composite diplotypes for g_{is} and g_{it}, respectively. We have G_{is} ∈ {H_{1}H_{1}, H_{1}\bar{H}_{1}, \bar{H}_{1}\bar{H}_{1}} and G_{it} ∈ {H_{2}H_{2}, H_{2}\bar{H}_{2}, \bar{H}_{2}\bar{H}_{2}}. We further define M_{1}, M_{2}, M_{3} and M_{4} as four distinct genotype groups corresponding to the classification of phase (un)ambiguous haplotype blocks:
To construct the likelihood function, note that the three groups M_{2}, M_{3} and M_{4}, unlike group M_{1}, involve phase-ambiguous genotypes and hence need to be modeled with mixture distributions.
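The four genotype groups can be illustrated with a small Python sketch. The assignment used here is an inference, not a quotation: M_{1} is taken to hold subjects with no phase ambiguity, M_{2} those ambiguous only in block s, M_{3} those ambiguous only in block t, and M_{4} those ambiguous in both, which is consistent with the E-step pairing M_{2} with block s and M_{3} with block t. The function name is hypothetical.

```python
AMBIGUOUS = "12/12"  # the double heterozygote is the only phase-ambiguous two-SNP genotype


def genotype_group(g_s, g_t):
    """Classify a subject by which of the two blocks is phase ambiguous (assumed grouping)."""
    amb_s = g_s == AMBIGUOUS
    amb_t = g_t == AMBIGUOUS
    if amb_s and amb_t:
        return "M4"
    if amb_s:
        return "M2"
    if amb_t:
        return "M3"
    return "M1"
```

With nine genotypes per block, this grouping places 64 of the 81 genotype combinations in M_{1}, in line with the count of uniquely mappable combinations given earlier.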
Define
We further define a set of the logistic regression functions for each genotype group as
Assuming independence between individuals, we construct the joint likelihood function as follows:
Because the phase-ambiguous states c_{si} and c_{ti} are not observable, we treat them as missing data and use the EM algorithm to estimate them iteratively (see below).
Variable selection methods such as LASSO [25] or adaptive LASSO [23] have been commonly applied when the number of predictors is large. These methods can achieve parameter estimation and variable selection simultaneously and have gained large popularity in genetic and genomic data analysis. Considering the large number of genetic parameters to be estimated in the model, we apply the adaptive LASSO to our model for its oracle property; namely, it performs variable selection and parameter estimation as if the true underlying model is known in advance [23]. Instead of maximizing the above log likelihood, we estimate the parameters by maximizing the log likelihood with the adaptive LASSO penalty.
where λ is a tuning parameter that balances the likelihood and penalty terms and is chosen by minimizing the Bayesian Information Criterion (BIC); ω = (w_{1}, w_{2}, ..., w_{8}) is a weight vector for the genetic effects β. When w_{j} = 1 for every j, this reduces to the general LASSO penalty. Although the general LASSO estimator may not be consistent, a suitable data-dependent weight vector ω guarantees the oracle property for the corresponding adaptive LASSO estimator. Specifically, one choice of ω is ω = 1/β_{OLS}, where β_{OLS} is the ordinary least squares (OLS) estimator. This makes the adaptive LASSO estimate much more attractive than the general LASSO estimate [23].
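The role of the weights can be illustrated with a short Python sketch. It assumes weights of the form 1/|β_init| computed from an initial estimate (the absolute value and the small guard `eps` against division by zero are assumptions added here), and a penalty of the form λ Σ w_{j}|β_{j}|. Function names are hypothetical.

```python
def adaptive_weights(beta_init, eps=1e-8):
    """Weights inversely proportional to the size of an initial estimate."""
    return [1.0 / max(abs(b), eps) for b in beta_init]


def adaptive_penalty(beta, weights, lam):
    """Adaptive L1 penalty: lam * sum_j w_j * |beta_j|."""
    return lam * sum(w * abs(b) for w, b in zip(weights, beta))
```

Effects with a large initial estimate receive a small weight and are penalized lightly, whereas effects with an initial estimate near zero are penalized heavily; setting all weights to 1 recovers the general LASSO penalty.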
Missing data and the EM algorithm
The phase-ambiguous genotypes lead to missing data. Existing algorithms for LASSO or adaptive LASSO estimation therefore cannot be directly applied to maximize the penalized likelihood (3). However, this can be solved by applying an EM algorithm, detailed as follows:
1) Initialize β, γ, and calculate for subject i;
2) E-step: Estimate c_{si}, c_{ti} for subjects with phase-ambiguous genotypes with E(c_{ji}) by
for i ∈ M_{k}, (k, j) ∈ {(2, s), (3, t)}.
For i ∈ M_{4}, we have
3) M-step: Update β, γ by maximizing the penalized log likelihood function (3);
4) Repeat steps 2) and 3) until convergence.
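For a subject whose genotype is compatible with two composite diplotypes, an E-step of this kind typically computes the posterior probability of one phase given the current parameters. The sketch below shows the generic two-component mixture calculation, not necessarily the exact expression used in this paper. The prior diplotype probabilities `prior_a` and `prior_b` (for example, products of haplotype frequencies under Hardy-Weinberg equilibrium) and all function names are assumptions.

```python
import math


def bernoulli_likelihood(y, eta):
    """Likelihood of a binary response y under a logistic model with linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p


def e_step_weight(y, eta_a, eta_b, prior_a, prior_b):
    """Posterior probability that the subject carries diplotype a rather than b."""
    num = prior_a * bernoulli_likelihood(y, eta_a)
    den = num + prior_b * bernoulli_likelihood(y, eta_b)
    return num / den
```

When the two candidate diplotypes give the same linear predictor, the weight reduces to prior_a / (prior_a + prior_b), so the phenotype carries no information about phase.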
Computational algorithm for maximizing the penalized log likelihood
In the M-step, the parameters β, γ are updated by computing the LASSO estimate. LASSO regression with a continuous response has been well studied, and some very efficient algorithms have been proposed, such as the shooting algorithm and LARS [26,27]. Estimation is more challenging for the generalized linear model because of the nonlinearity of the likelihood function, especially with an adaptive penalty term. No exact solution exists for parameter estimation in this setting. Here we propose a computational algorithm using a Gauss-Seidel method [28] to solve an unconstrained optimization problem. More details about this method can be found in Shevade et al. [29]. To simplify the notation, we explain our method without environmental covariates.
We first derive the first-order optimality conditions for the penalized likelihood (3). Note that the penalized likelihood L' is piecewise differentiable. Following the notation in Shevade et al. [29], denote F_{j} = ∂(-2L)/∂β_{j}. The first-order optimality conditions ∂L'/∂β_{j} = 0 can be expressed as follows:
For the phase known genotypes, F_{j }will have an explicit form as:
With the phase-ambiguous genotypes, F_{j} can be calculated accordingly with the mixture proportions E(c_{si}) and E(c_{ti}) that are estimated in the E-step.
Based on the above conditions, we define
Therefore, the optimality conditions are achieved when Viol_{j} = 0 for all j. For a given λ and w_{j}, j = 1, ..., p, we further define I_{z} = {j: β_{j} = 0, j > 0} and I_{nz} = {0} ∪ {j: β_{j} ≠ 0, j > 0}. The detailed estimation procedure is given as:
1) Initialize β_{j} = 0, j = 0, 1, ..., p;
2) While any Viol_{j} > 0 in I_{z},
   Find the maximum violator V_{k},
   Update β_{k} by optimizing L';
   While any Viol_{j} > 0 in I_{nz},
      Find the maximum violator V_{l},
      Update β_{l} by optimizing L',
   Until no violator exists in I_{nz};
Until no violator exists in I_{z}.
For numerical precision, the condition Viol_{j} > 0 is relaxed to Viol_{j} > 10^{-5} in our computation.
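The flavor of this coordinate-wise procedure can be conveyed by a simplified Python sketch for fully observed (phase-known) data. It is not the authors' algorithm: it minimizes the negative log likelihood (rather than -2L) plus λ Σ w_{j}|β_{j}|, it always picks the largest violator over all coordinates instead of alternating between I_{z} and I_{nz}, and it replaces the exact one-dimensional optimization by a soft-thresholded step that uses the bound 1/4 on the logistic curvature. All names are hypothetical.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def violation(beta_j, grad_j, pen_j):
    # Distance from the first-order optimality condition for one coordinate.
    if beta_j > 0:
        return abs(grad_j + pen_j)
    if beta_j < 0:
        return abs(grad_j - pen_j)
    return max(abs(grad_j) - pen_j, 0.0)


def fit_l1_logistic(X, y, lam, w, tol=1e-5, max_iter=100000):
    """Return [intercept, beta_1, ..., beta_p]; the intercept is not penalized."""
    n, p = len(X), len(X[0])
    cols = [[1.0] * n] + [[float(row[j]) for row in X] for j in range(p)]
    pen = [0.0] + [lam * wj for wj in w]
    # Upper bound on the second derivative of the negative log likelihood.
    curv = [sum(v * v for v in c) / 4.0 for c in cols]
    beta = [0.0] * (p + 1)
    eta = [0.0] * n
    for _ in range(max_iter):
        resid = [yi - sigmoid(e) for yi, e in zip(y, eta)]
        grad = [-sum(v * r for v, r in zip(c, resid)) for c in cols]
        viol = [violation(beta[j], grad[j], pen[j]) for j in range(p + 1)]
        k = max(range(p + 1), key=lambda j: viol[j])  # maximum violator
        if viol[k] <= tol:
            break
        # Update one coordinate at a time; no matrix inversion is needed.
        target = beta[k] - grad[k] / curv[k]
        shrink = pen[k] / curv[k]
        new = math.copysign(max(abs(target) - shrink, 0.0), target)
        step = new - beta[k]
        beta[k] = new
        eta = [e + step * v for e, v in zip(eta, cols[k])]
    return beta
```

With a very large λ every penalized coefficient stays at zero and only the intercept moves, whereas a coefficient whose gradient at zero exceeds λw_{j} in absolute value becomes nonzero.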
This method is based on the convexity of the likelihood function. The computation procedure updates one β_{j} at a time until all the optimality conditions are achieved. The algorithm is relatively efficient because it does not involve matrix inversion. The convexity condition guarantees one and only one solution for each update (see Additional file 1). A similar algorithm has been used in the linear regression setting, where it is commonly referred to as 'the shooting algorithm' [26], and in the logistic regression setting for the general LASSO [29]. The asymptotic convergence of this method for nonlinear optimization problems has been proven in [28, Ch. 3, Prop. 4.1].
Additional file 1. Strict convexity of the log likelihood function. The file contains the proof of strict convexity of the log likelihood function.
Risk haplotype selection
We treat each possible haplotype as a potential "risk" haplotype. The one with the minimum BIC, defined below, is chosen as the "risk" haplotype.
BIC = -2 log L + d log(n)
where L is the maximized likelihood, d is the number of nonzero parameters in the model and n is the total sample size.
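Risk haplotype selection by BIC can be sketched as follows, assuming the standard form BIC = -2 log L + d log(n) with d the number of nonzero parameters. The input structure `fits` (a mapping from candidate haplotype label to maximized log likelihood and number of nonzero parameters) and the function names are hypothetical.

```python
import math


def bic(loglik, n_nonzero, n):
    """Bayesian Information Criterion with d taken as the number of nonzero parameters."""
    return -2.0 * loglik + n_nonzero * math.log(n)


def select_risk_haplotype(fits, n):
    """Return the candidate haplotype whose fitted model has the smallest BIC.

    fits maps a haplotype label to a (log likelihood, number of nonzero parameters) pair.
    """
    return min(fits, key=lambda h: bic(fits[h][0], fits[h][1], n))
```

Because d counts only the nonzero parameters, a candidate that needs many selected effects to reach a slightly higher likelihood can still lose to a sparser candidate.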
Results
Simulation study
We conducted a series of simulations under various scenarios to evaluate the statistical properties of the proposed method. Within each block, the minor allele frequencies of the two SNPs were assumed to be 0.3 and 0.4, with a linkage disequilibrium of D = 0.02. The simulation was conducted under different sample sizes (i.e., n = 200, 500, 1000).
Data were simulated by assuming that one haplotype was distinct from the others in each block. Haplotypes were simulated assuming Hardy-Weinberg equilibrium. Disease status was simulated from a Bernoulli distribution with given genetic effects under different scenarios (Table 2). The intercept was adjusted to make the ratio of cases to controls approximately 1. Scenario S0 assumed no genetic effect at all. The other scenarios assumed different structures of genetic effects. Scenario S1 was an extreme case in which all parameters were nonzero; its purpose was to compare the selection power for different genetic parameters. Scenario S2 assumed that only one haplotype block had effects; Scenario S3 assumed that both blocks contributed to the disease phenotype without interaction between them; and Scenario S4 assumed both main and interaction effects between the two blocks. Data simulated with these configurations were analyzed with the proposed method. Results from 200 Monte Carlo repetitions were recorded.
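The data-generating steps can be sketched in Python. The sketch assumes that allele `2` is the minor allele at each SNP and that D is the usual disequilibrium coefficient (haplotype frequency minus the product of allele frequencies); both conventions, and all function names, are assumptions introduced for illustration.

```python
import math
import random


def haplotype_freqs(maf1, maf2, d):
    """Two-SNP haplotype frequencies from minor allele frequencies and LD coefficient D."""
    return {
        "22": maf1 * maf2 + d,
        "21": maf1 * (1 - maf2) - d,
        "12": (1 - maf1) * maf2 - d,
        "11": (1 - maf1) * (1 - maf2) + d,
    }


def simulate_diplotypes(freqs, n, rng):
    """Draw two haplotypes independently per subject (Hardy-Weinberg equilibrium)."""
    haps = list(freqs)
    weights = [freqs[h] for h in haps]
    return [tuple(rng.choices(haps, weights=weights, k=2)) for _ in range(n)]


def simulate_status(eta, rng):
    """Draw a Bernoulli disease status with success probability logistic(eta)."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return 1 if rng.random() < p else 0
```

With minor allele frequencies 0.3 and 0.4 and D = 0.02, the four haplotype frequencies are 0.14, 0.16, 0.26 and 0.44 under this convention.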
Table 2. List of parameter values under different simulation designs
Figure 1 shows the results for variable selection under the different simulation scenarios. For each genetic parameter, the three colored bars correspond to different sample sizes (see figure legend). The top panel corresponds to Scenario S0, in which the proportion of selection is equivalent to the false positive (or false selection) rate. The false selection rates for all parameters were under the nominal level of 0.05, indicating good false positive control. For the other scenarios (S1-S4), the selection power increased as the sample size increased. Compared to S0, the selection rates for true negatives increased, but remained under reasonable control. As expected, the selection power for the main effects was generally larger than that for the interaction effects (S1). Among the four interaction effects, the dominance×dominance effect had the lowest selection power (S1 and S4). The simulation results also indicated that a small sample size (n = 200) generally performed poorly given the large number of genetic parameters to be estimated. Generally, at least 500 samples were required to achieve reasonable power to detect interactions.
Figure 1. Bar plot of variable selection results under different simulation scenarios. Parameter values are listed in Table 2. The three sets of colored bars correspond to different sample sizes (blue: 200; green: 500; red: 1000). The horizontal dashed line indicates the nominal level of 0.05.
A case study
We applied our model to a perinatal case-control study of small for gestational age (SGA) neonates, which is part of a large-scale candidate gene-based genetic association study of pregnancy complications conducted in Chile. A total of 991 mother-offspring pairs (406 SGA cases and 585 controls) were genotyped for 1331 SNPs involving 200 genes. Interaction between the maternal and fetal genomes is a primary genetic source of SGA in neonates, so we focused our analysis on identifying haplotype interactions between the maternal and fetal genomes.
We first excluded SNPs that had a minor allele frequency of less than 5% or that did not satisfy Hardy-Weinberg equilibrium (HWE) in the combined mother and offspring control population, using a chi-square test with a cutoff p-value of 0.001. We then used the software Haploview [30] to identify haplotype blocks for SNPs within each gene. Two tag SNPs were used to represent each block. A sliding window approach was applied to search for interactions between two blocks.
We picked two SNPs within each block and applied our model to study the main effects as well as the haplotype interaction effects between a mother and her offspring genome. By fitting our model as described in the previous section and controlling for other variables, including maternal age and BMI, we identified several SNP haplotypes with interaction effects through the adaptive LASSO logistic regression model. Permutation tests of 1000 runs were further conducted to assess the significance of the selected effects. In each permutation, the phenotypes were permuted and the model was refitted, yielding a new set of parameter estimates. An empirical p-value for effect j was calculated, which is defined by
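One common definition of such an empirical p-value, which may differ in detail from the one used in this study, is the fraction of permutations in which the absolute estimate of effect j is at least as large as the absolute estimate from the observed data. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
def empirical_p_value(observed, permuted):
    """Fraction of permuted estimates at least as extreme as the observed estimate."""
    hits = sum(1 for b in permuted if abs(b) >= abs(observed))
    return hits / len(permuted)
```

Because the adaptive LASSO shrinks many permuted estimates exactly to zero, a nonzero observed estimate is compared against a null distribution with a large point mass at zero.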
Results of the real data analysis are summarized in Table 3. Among the identified pairs, genes HPGD and MMP9 showed only main block effects. The other five all showed significant interaction effects. Permutation p-values confirmed the statistical significance of the detected effects. We used the maternal-fetal pairs to show the utility of our method. The analysis could also be carried out on the fetal genome only; we considered an interaction between the maternal and fetal genomes to be more interesting and therefore used it as an example.
Table 3. List of selected genes, corresponding "risk" haplotype structure, effect estimates and permutation p-values
Our approach conducts variable selection and effect estimation simultaneously, which allows a direct biological interpretation of the mode of gene action. Here, we use gene PON1 as an example to illustrate the application of our model. In gene PON1, the selected risk haplotypes are [TC] for the mother and [CC] for the offspring. We find a significant additive × dominance haplotype interaction effect. The two haplotypes separate all the mother-offspring pairs into three 'risk' groups with respect to the development of SGA:
Following Eq. (1), we can see that R_{1 }corresponds to the baseline reference group, R_{2 }corresponds to the risk group with 1/2 interaction coefficient, and R_{3 }corresponds to the risk group with 1/2 interaction coefficient. Correspondingly, the log odds of the disease development in each 'risk' group and the odds ratio (OR) between groups can be estimated by:
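To make the link between interaction coefficients and odds ratios concrete, the sketch below assumes that only the additive×dominance effect i_{ad} is nonzero, so that the log odds in a risk group equal a common baseline plus the group's interaction coefficient (0, 1/2 or -1/2) times i_{ad}. This is an illustrative simplification; the function names and arguments are hypothetical.

```python
import math


def log_odds(baseline, i_ad, coef):
    """Log odds of disease in a risk group with interaction coefficient coef."""
    return baseline + coef * i_ad


def odds_ratio(i_ad, coef_a, coef_b):
    """Odds ratio of group a versus group b; the common baseline cancels."""
    return math.exp((coef_a - coef_b) * i_ad)
```

Under this assumption the odds ratio between a group with coefficient 1/2 and the baseline group is exp(i_{ad}/2), and the odds ratio between the two non-baseline groups is exp(i_{ad}).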
Other nonparametric methods, such as multifactor dimensionality reduction (MDR), have been shown to be successful in identifying interaction effects in many studies. Because MDR can only be applied to studies with a balanced case/control design, generalized MDR (GMDR) has been proposed as an extension of MDR [31]. GMDR maps phenotypic traits into residual scores through certain link functions under the generalized linear model setting, and then conducts SNP selection and testing based on the residual scores. To compare with our method, we applied GMDR to the data. The mother-offspring paired genotype data were used as input for GMDR, and a logistic link was used to calculate the residual scores.
In the example of PON1, SNP 20209376 (C/T) in the fetal genome was first selected by GMDR (p-value = 0.0107). SNPs were then paired with each other to identify potentially significant pairwise interactions. Only SNP 9508994 (C/T) in the maternal genome was found to interact with SNP 20209376, with marginal significance (p-value = 0.0547). More complex models were found to be non-significant (p-value = 0.1719 and p-value = 0.3770 for the 3-SNP and 4-SNP models, respectively). Even though GMDR indicated a maternal-fetal interaction between these two SNPs, it did not provide an estimate of the genetic effect or of the underlying interaction mechanism between the SNPs.
Model extension
Our method has been illustrated with two SNPs per block only. The model can be easily extended to more than two SNPs. When three or more SNPs are involved in each haplotype block, Cui et al. [12] gave an explicit derivation of the possible "risk" haplotype structure. In fact, no matter how many SNPs are involved, three possible composite diplotypes can be constructed, as illustrated by Cui et al. [12]. The only challenge for this extension is to deal with the number of heterozygous loci. For example, when three SNPs are considered in a block, there are a total of seven possible phase-ambiguous genotypes. In a single-block haplotype analysis, there could be four mixture distributions when constructing the likelihood function. When we consider interactions between two blocks, there are a total of 16 possible mixture distributions in the likelihood function. This increases both the programming challenge and the computing burden. Fortunately, the increase in the number of mixture components does not affect the number of parameters to be estimated. We still have four main effects and four interactions, as these parameters are defined based on the "risk" haplotype structure.
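The count of phase-ambiguous genotypes can be checked by enumeration, using the fact that a multi-SNP genotype is phase ambiguous exactly when it is heterozygous at two or more loci. The short Python sketch below is illustrative and the function name is hypothetical.

```python
from itertools import product


def count_phase_ambiguous(k):
    """Count k-SNP genotypes that are heterozygous at two or more loci."""
    count = 0
    for genotype in product(("11", "12", "22"), repeat=k):
        if sum(g == "12" for g in genotype) >= 2:
            count += 1
    return count
```

For k = 2 this gives the single double heterozygote, and for k = 3 it gives the seven phase-ambiguous genotypes mentioned above.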
Another possible solution to the challenges mentioned above is to do a sliding window search with each window covering two SNPs at a time. This is similar to the sliding window haplotype analysis commonly applied in some software such as PLINK.
Discussion and Conclusions
Although it has been reported that gene-gene interaction plays a major role in genetic studies of complex diseases, the detection of gene-gene interaction has traditionally been pursued at the single-SNP level, i.e., focusing on single SNP interactions. Intuitively, SNP-SNP interaction cannot represent gene-gene interaction because single SNPs cannot capture the total variation of a gene. Thus, extending the idea of single SNP interaction to haplotype interaction could potentially gain much in terms of capturing variation in genes. The proposed method defines gene-gene interaction through haplotype block interactions and offers an alternative strategy for finding potential interactions between two genes. We argue that the definition of haplotype block interaction could provide additional biological insights into disease etiology, compared to a single SNP-based interaction analysis.
One of the advantages of our method lies in grouping, which reduces the data dimension. By mapping genotypes to composite diplotypes, the data dimension is significantly reduced. We can then use the Bayesian information criterion to select potential "risk" haplotypes [12]. The selection of the "risk" haplotype offers another advantage of the method: we can identify significant haplotype structures and further quantify their main and interaction effects. This greatly enhances the interpretability and biological relevance of the model.
Our simulation study showed that our method has reasonable false positive control and selection power for the genetic parameters. As we expected, the interaction effects have lower selection power compared to the main effects. As sample size increases, we are able to achieve an optimal power for the interaction effects. Another novelty of the method is the modeling of the "risk" haplotype, which leads to the partition of composite diplotypes. No matter how many SNPs are involved, it always ends up with three types of composite diplotypes. Thus, the number of genetic parameters is always fixed regardless of the number of SNPs. The only cost is the search for possible "risk" haplotypes through a larger parameter space.
We applied our method to an SGA study data set. Several SNP pairs were selected with either main or interaction effects. The permutation test confirmed the statistical significance of the selected effects. Our findings are consistent with other findings of gene selection in the literature. Gene PON1 was previously reported to be associated with preterm birth, which is one of the potential genetic factors leading to SGA [32]. Gene FLT4 has been found to be associated with the growth of human fetal endothelial cells and early human development [33,34]. Gene HPGD has also been reported to be involved in human intrauterine growth restriction [35]. Gene MMP9 has been suggested to be related to placental function [36]. This evidence supports the biological relevance of the results obtained with our method.
We also identified potential interaction effects for several additional genes, including NFKB1, SPARC and TIMP2. To our knowledge, no experimental evidence has been reported for these genes regarding a biological function related to fetal development or SGA. However, each of these genes has been suggested to be involved in many biological pathways. Studies have indicated that gene NFKB1 is functionally related to stress-impaired neurogenesis and depressive behavior [37], myelin formation [38], and adipose tissue growth [39]. Gene SPARC has been suggested to be associated with angiogenesis and tumor growth [40] and with the progression of crescentic glomerulonephritis [41]. Gene TIMP2 has been reported to be related to myogenesis [42] and the progression of cerebral aneurysms [43]. Further replication studies are needed to confirm the biological relevance of these genes to SGA.
Authors' contributions
ML performed the analysis and wrote the manuscript; RR collected the data; WF participated in the design and manuscript writing; YC conceived the idea, designed the model and wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank the two anonymous referees for their helpful comments that improved the manuscript, and Dr. Kelian Sun for help with data processing. This work was supported in part by NSF grant DMS-0707031 and by the Perinatology Research Branch, Division of Intramural Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, NIH, DHHS.
References

Zhao J, Jin L, Xiong M: Test for interaction between two unlinked loci.
Am J Hum Genet 2006, 79(5):83145. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Drysdale CM, McGraw DW, Stack CB, Stephens JC, Judson RS, Nandabalan K, Arnold K, Ruano G, Liggett SB: Complex promoter and coding region beta 2adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness.
Proc Natl Acad Sci 2000, 97(19):104838. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cordell HJ: Detecting genegene interactions that underlie human diseases. [http://www.nature.com/nrg/journal/v10/n6/abs/nrg2579.htmla1] webcite
Nat Rev Genet 2009, 10:392404. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Phillips PC, Otto SP, Whitelock MC: Beyond the average: The evolutionary importance of epistasis and the variability of epistatic effects. In Epistasis and the Evolutionary Process. Edited by Wold JB, Brodie ED, Wade MJ. Oxford Univ Press, New York; 2000.

Hartman JL, Garvik B, Hartwell L: Principles for the buffering of genetic variation.
Science 2001, 291:1001-1004.

Boone C, Bussey H, Andrews BJ: Exploring genetic interactions and networks with yeast.
Nat Rev Genet 2007, 8:437-449.

Lander ES, Botstein D: Mapping mendelian factors underlying quantitative traits using RFLP linkage maps.
Genetics 1989, 121(1):185-99.

Kao CH, Zeng ZB, Teasdale RD: Multiple interval mapping for quantitative trait loci.
Genetics 1999, 152(3):1203-16.

Cui Y, Wu R: Mapping genome-genome epistasis: a high-dimensional model.
Bioinformatics 2005, 21(10):2447-55.

The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs.
Nature 2007, 449:851-861.

Liu T, Johnson JA, Casella G, Wu R: Sequencing complex diseases with HapMap.
Genetics 2004, 168:503-511.

Cui Y, Fu W, Sun K, Romero R, Wu R: Mapping nucleotide sequences that encode complex binary disease traits with HapMap.
Current Genomics 2007, 5:307-22.

Bateson W: Mendel's Principles of Heredity. Cambridge University Press, Cambridge; 1909.

Mani R, St Onge RP, Hartman JL, Giaever G, Roth FP: Defining genetic interaction.
Proc Natl Acad Sci 2008, 105(9):3461-6.

Wolf JB, Frankino WA, Agrawal AF, Brodie ED, Moore AJ: Developmental interactions and the constituents of quantitative variation.
Evolution 2001, 55(2):232-45.

Segrè D, DeLuna A, Church GM, Kishony R: Modular epistasis in yeast metabolism.
Nat Genet 2005, 37:77-83.

Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human diseases.
Hum Hered 2003, 56:73-82.

Nagel RL: Epistasis and the genetics of human diseases.
C R Biol 2005, 328(7):606-615.

Lin M, Wu RL: Detecting sequence-sequence interactions for complex diseases.
Current Genomics 2006, 7:59-72.

Zhang J, Liang F, Dassen WR, Veldman BA, Doevendans PA, DeGunst M: Search for haplotype interactions that influence susceptibility to type 1 diabetes through use of unphased genotype data.
Am J Hum Genet 2003, 73(6):1385-401.

Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genome-wide association studies of human population data.
Hum Hered 2007, 63(2):67-84.

Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor Dimensionality Reduction Reveals High-Order Interactions among Estrogen Metabolism Genes in Sporadic Breast Cancer.
American Journal of Human Genetics 2001, 69:138-147.

Zou H: The adaptive Lasso and its oracle properties.
Journal of the American Statistical Association 2006, 101:1418-1429.

Cockerham CC: An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present.
Genetics 1954, 39:859-882.

Tibshirani R: Regression shrinkage and selection via the lasso.

Fu W: Penalized regressions: the Bridge versus the Lasso.
J Computational and Graphical Statistics 1998, 7(3):397-416.

Efron B, Hastie T, Johnstone I, Tibshirani R: Least Angle Regression.
Annals of Statistics 2004, 32(2):407-499.

Bertsekas DT, Tsitsiklis JN: Parallel and Distributed Computation: Numerical Methods.

Shevade SK, Keerthi SS: A simple and efficient algorithm for gene selection using sparse logistic regression.
Bioinformatics 2003, 19(17):2246-53.

Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps.
Bioinformatics 2005, 21(2):263-5.

Lou XY, Chen GB, Yan L, Ma J, Zhu J, Elston R, Li MD: A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to Nicotine Dependence.
Am J Hum Genet 2007, 80:1125-1137.

Lawlor DA, Gaunt TR, Hinks LJ, Davey SG, Timpson N, Day IN, Ebrahim S: The association of the PON1 Q192R polymorphism with complications and outcomes of pregnancy: findings from the British Women's Heart and Health cohort study.
Paediatr Perinat Epidemiol 2006, 20(3):244-50.

Kaipainen A, Korhonen J, Pajusola K, Aprelikova O, Persico MG, Terman BI, Alitalo K: The related FLT4, FLT1, and KDR receptor tyrosine kinases show distinct expression patterns in human fetal endothelial cells.
J Exp Med 1993, 178(6):2077-88.

Boutsikou T, Malamitsi-Puchner A, Economou E, Boutsikou M, Puchner KP, Hassiakos D: Soluble vascular endothelial growth factor receptor-1 in intrauterine growth restricted fetuses and neonates.
Early Hum Dev 2006, 82(4):235-9.

Nevo O, Many A, Xu J, Kingdom J, Piccoli E, Zamudio S, Post M, Bocking A, Todros T, Caniggia I: Placental expression of soluble fms-like tyrosine kinase 1 is increased in singletons and twin pregnancies with intrauterine growth restriction.
J Clin Endocrinol Metab 2008, 93(1):285-92.

Kiess W, Chernausek SD, Hokken-Koelega ACS, eds: Small for Gestational Age. Causes and Consequences.

Koo JW, Russo SJ, Ferguson D, Nestler EJ, Duman RS: Nuclear factor-kappaB is a critical mediator of stress-impaired neurogenesis and depressive behavior.
PNAS 2010, 107(6):2669-74.

Limpert AS, Carter BD: Axonal neuregulin 1 type III activates NF-kappaB in Schwann cells during myelin formation.
J Biol Chem 2010, 285(22):16614-22.

Tang T, Zhang J, Yin J, Staszkiewicz J, Gawronska-Kozak B, Jung DY, Ko HJ, Ong H, Kim JK, Mynatt R, Martin RJ, Keenan M, Gao Z, Ye J: Uncoupling of inflammation and insulin resistance by NF-kappaB in transgenic mice through elevated energy expenditure.
J Biol Chem 2010, 285(7):4637-44.

Bhoopathi P, Chetty C, Gujrati M, Dinh DH, Rao JS, Lakka SS: The role of MMP-9 in the anti-angiogenic effect of secreted protein acidic and rich in cysteine.
Br J Cancer 2010, 102(3):530-40.

Sussman AN, Sun T, Krofft RM, Durvasula RV: SPARC accelerates disease progression in experimental crescentic glomerulonephritis.
Am J Pathol 2009, 174(5):1827-36.

Lluri G, Langlois GD, Soloway PD, Jaworski DM: Tissue inhibitor of metalloproteinase-2 (TIMP-2) regulates myogenesis and beta1 integrin expression in vitro.
Exp Cell Res 2008, 314(1):11-24.

Aoki T, Kataoka H, Moriwaki T, Nozaki K, Hashimoto N: Role of TIMP-1 and TIMP-2 in the progression of cerebral aneurysms.
Stroke 2007, 38(8):2337-45.

Dattorro J: Convex Optimization & Euclidean Distance Geometry. Meboo Publishing; 2005.