Abstract
It is believed that almost all common diseases are the consequence of complex interactions between genetic markers and environmental factors. However, few such interactions have been documented to date. Conventional statistical methods for detecting gene-environment interactions are often based on the linear regression model, which assumes a linear interaction effect. In this study, we propose a nonparametric partition-based approach that is able to capture complex interaction patterns. We apply this method to the real hypertension data set provided by Genetic Analysis Workshop 18 (GAW18). Compared with the linear regression model, the proposed approach identifies many additional variants with significant gene-environment interaction effects. We further investigate one single-nucleotide polymorphism identified by our method and show that its gene-environment interaction effect is indeed nonlinear. To adjust for the family dependence of phenotypes, we apply different permutation strategies and investigate their effects on the outcomes.
Background
Genome-wide association studies (GWAS) have successfully discovered many common variants associated with complex diseases, but the single-nucleotide polymorphisms (SNPs) identified so far account for only a small proportion of the total heritability in quantitative traits [1]. Increasing evidence shows that gene-environment (G×E) interactions are widely involved in the etiology of complex diseases, including diabetes, cancer, and psychiatric disorders [2,3]. The investigation of G×E interactions will not only facilitate the identification of novel genes whose marginal effects are undetectable, but also provide insights into disease etiology and hence greatly benefit drug development and personalized therapy.
The commonly applied methods to detect G×E interactions are based on linear or logistic regression models [4]. In particular, for quantitative outcomes, a linear model is considered in the form of
where G is the genotype of a SNP, E is the environmental factor,
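A standard linear model with a G×E product term, which matches the description above, can be written as in the sketch below. The coefficient symbols β_0 to β_3 and the error term ε are assumptions introduced for illustration; they are not taken from the original text.

```latex
y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 \,(G \times E) + \varepsilon
```

Under this form, testing for interaction amounts to testing whether β_3 = 0, so any interaction is constrained to be linear in the product of G and E.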
Methods
Data set
The GAW18 data set consists of GWAS data and whole genome sequence data with longitudinal phenotypes for hypertension and related traits from the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) Project 2. There are 939 individuals in total, and we include in our analysis only the 849 individuals with both phenotype data and imputed sequence information. Each individual has measurements for up to 4 time points. At each visit, systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured; covariates including age, use of antihypertensive medication, and current tobacco smoking status were also recorded. Gender and pedigree are known for each subject. Genotypes of odd-numbered chromosomes are provided. In our study, we focused on chromosome 3, as suggested by the workshop organizer for the sake of comparison. Although we had access to the answers for the simulated data set, we used only the real data set in our analysis.
A general framework: a partition-based association measure
Suppose there are n independent subjects that can be separated by a partition
where
G×E association measure I
Consider a marker G and an environmental factor E. Suppose G has 3 genotypes, AA, Aa, and aa (A refers to the major allele and a to the minor allele), coded as 0, 1, and 2. Suppose E is divided into 3 categories: 0, 1, and 2. G and E together then create 9 partitions of all subjects (Table 1). From the general framework in the last section, an association measure that evaluates the total effect of G and E on the phenotype is:
Table 1. Partitions created by genotypic and environmental factors
where all the terms are similarly defined as before and y denotes the phenotype. The marginal effects of G and E can be obtained in a similar fashion:
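The test statistic that measures the G×E interaction effect is defined as the difference between the total effect and the maximum of the two marginal effects:
$$ I_{G \times E} = I_T - \max(I_G, I_E), $$
where I_T denotes the total score and I_G and I_E denote the marginal scores of G and E, respectively.
The significance of I_{G×E} is evaluated by permutation.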
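As a minimal sketch of how such a statistic could be computed, the Python below assumes an unnormalized partition score of the form used by Chernoff et al. [5], namely the sum over partition cells of n_j^2 (ȳ_j − ȳ)^2, where n_j and ȳ_j are the size and mean phenotype of cell j. This scaling, the function names, and the toy data are assumptions made for illustration and are not taken from this paper.

```python
import random
from collections import defaultdict

def partition_score(labels, y):
    """ASSUMED score: sum over cells of n_j^2 * (cell mean - overall mean)^2."""
    overall = sum(y) / len(y)
    cells = defaultdict(list)
    for label, value in zip(labels, y):
        cells[label].append(value)
    return sum(len(v) ** 2 * (sum(v) / len(v) - overall) ** 2
               for v in cells.values())

def interaction_score(g, e, y):
    """Total score on the joint (G, E) partition minus the larger marginal score."""
    total = partition_score(list(zip(g, e)), y)
    return total - max(partition_score(g, y), partition_score(e, y))

def permutation_p_value(g, e, y, n_perm=1000, seed=0):
    """Global permutation: shuffle phenotypes over all subjects."""
    rng = random.Random(seed)
    observed = interaction_score(g, e, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if interaction_score(g, e, y_perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): a balanced 2 x 2 design with 5 subjects per cell.
g = [0, 0, 1, 1] * 5
e = [0, 1, 0, 1] * 5
y_crossover = [0, 1, 1, 0] * 5   # pure interaction, no marginal effect
y_additive = [0, 1, 1, 2] * 5    # purely additive effects

print(interaction_score(g, e, y_crossover))  # 25.0
print(interaction_score(g, e, y_additive))   # 0.0
print(permutation_p_value(g, e, y_crossover, n_perm=200, seed=1))
```

In the crossover example both marginal scores are zero while the joint partition separates the phenotypes perfectly, so the score is large; in the additive example the total score equals each marginal score and the difference is zero.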
Permutation strategies
We consider 3 permutation strategies in our analysis: global permutation, local permutation, and residual permutation. Let y_{ij} denote the phenotype of the j^{th} individual in the i^{th} pedigree. Global permutation permutes phenotypes over all individuals. For local permutation, the phenotypes are permuted within each pedigree. In residual permutation, we first compute the residuals for each individual
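The three strategies can be sketched in Python as follows. Global and local permutation follow the description above directly. For residual permutation, the sketch assumes that the residual is the phenotype minus its pedigree mean and that shuffled residuals are added back to the pedigree means; this definition, like the function names, is an assumption for illustration rather than the procedure stated in this paper.

```python
import random
from collections import defaultdict

def _members_by_pedigree(pedigree):
    """Map each pedigree label to the list of indices of its members."""
    members = defaultdict(list)
    for idx, ped in enumerate(pedigree):
        members[ped].append(idx)
    return members

def global_permutation(y, pedigree, rng):
    """Shuffle phenotypes across all individuals, ignoring pedigree labels."""
    shuffled = list(y)
    rng.shuffle(shuffled)
    return shuffled

def local_permutation(y, pedigree, rng):
    """Shuffle phenotypes only among members of the same pedigree."""
    out = list(y)
    for idxs in _members_by_pedigree(pedigree).values():
        values = [y[i] for i in idxs]
        rng.shuffle(values)
        for i, v in zip(idxs, values):
            out[i] = v
    return out

def residual_permutation(y, pedigree, rng):
    """ASSUMPTION: residual = phenotype - pedigree mean. Residuals are shuffled
    across all individuals and added back to each individual's pedigree mean."""
    members = _members_by_pedigree(pedigree)
    mean = {ped: sum(y[i] for i in idxs) / len(idxs)
            for ped, idxs in members.items()}
    residuals = [value - mean[ped] for value, ped in zip(y, pedigree)]
    rng.shuffle(residuals)
    return [mean[ped] + r for ped, r in zip(pedigree, residuals)]

# Hypothetical toy data: two pedigrees with clearly different phenotype levels.
y = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
pedigree = ["a", "a", "a", "b", "b", "b"]
rng = random.Random(0)
print(global_permutation(y, pedigree, rng))
print(local_permutation(y, pedigree, rng))
print(residual_permutation(y, pedigree, rng))
```

Local permutation keeps every phenotype inside its own pedigree, whereas the residual variant preserves pedigree-level means while still mixing individual deviations across pedigrees.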
Results
Partitions created by environmental factors
The real data set from GAW18 contains records of 4 environmental factors: age, gender, smoking status, and antihypertensive medication usage (medicine). Because gender is a binary variable, it partitions all individuals into 2 groups. Although this data set provides longitudinal measurements of age, smoking, and medicine, the records have many missing values (only 187 subjects have complete measurements for all 4 visits). Therefore, for each individual, we summarized these covariates across the available time points by either the average (for age) or the sum (for smoking and medicine) and used these summarized quantities in our analysis. Similarly, averaged SBP and averaged DBP were considered as outcomes. We created 3 partitions for each of age, smoking, and medicine (Table 2).
Table 2. Partitions based on the summarized quantities of age, smoking status, or medicine
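The summarization step can be sketched in Python as below. Missing visits are represented as None, and the cut points used to form the 3 partitions are hypothetical placeholders; the actual category boundaries are those given in Table 2 and are not reproduced here.

```python
def summarize_visits(values, how):
    """Average ("mean") or sum ("sum") the non-missing visit records of one subject."""
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # no usable record for this subject
    if how == "mean":
        return sum(observed) / len(observed)
    return sum(observed)

def assign_partition(value, cut_points):
    """Return category 0, 1, or 2 relative to two increasing cut points."""
    low, high = cut_points
    if value < low:
        return 0
    return 1 if value < high else 2

# Hypothetical subject with 4 scheduled visits, two of them missing.
age_by_visit = [30, None, 40, None]
smoking_by_visit = [1, None, 1, 0]

mean_age = summarize_visits(age_by_visit, "mean")      # 35.0
smoking_sum = summarize_visits(smoking_by_visit, "sum")  # 2

# HYPOTHETICAL cut points, chosen only to make the example runnable.
AGE_CUTS = (40, 60)
print(assign_partition(mean_age, AGE_CUTS))  # 0
```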
SNPs with significant G×E interaction effects
In the GWAS data set provided by GAW18, there are 62,915 SNPs on chromosome 3. For each SNP, we evaluated its interaction effect with each of the 4 environmental factors on both SBP and DBP using the linear regression model (LRM) and the proposed partition-based score I (PBI). P values of LRM were derived from the asymptotic distribution of the regression coefficient
Figure 1. G×E interaction effect of SNP rs17206492 and medicine. The marginal effect of the genotype (left), the medication effect when genotype = 1 (middle), and the medication effect when genotype = 0 (right).
Table 3. Number of significant SNPs with p value less than 7.9 × 10^{−7}*
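The threshold of 7.9 × 10^{−7} appears consistent with a Bonferroni correction over the 62,915 SNPs on chromosome 3. The short check below assumes a family-wise error rate of 0.05, which is an assumption and is not stated in the text above.

```python
n_snps = 62915   # SNPs on chromosome 3 in the GWAS data set (from the text)
alpha = 0.05     # ASSUMPTION: family-wise error rate, not stated in the text

threshold = alpha / n_snps
print(f"{threshold:.2e}")  # 7.95e-07
```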
Effect of different permutation strategies
There are 20 pedigrees in the GAW18 data set. Both the analysis of variance (ANOVA) test and the nonparametric Kruskal-Wallis test indicate that DBP differs significantly across pedigrees, whereas SBP does not (Table 4). When evaluating the p values of PBI, we performed 3 types of permutation: global (GP), local (LP), and residual (RP) permutations. Both LP and RP adjust for familial relatedness between individuals. For SBP, except for the environmental factor age, the results from the 3 permutation methods coincide substantially (see Table 3 and Figure 2), which is consistent with the conclusions from the ANOVA and Kruskal-Wallis tests. In contrast, for DBP, the results of GP are quite different from the results of LP or RP, especially when assessing the interaction effect with medicine (see Table 3 and Figure 2). In this situation, the results from LP or RP are more reliable because they take into account the family dependence of the phenotype. In addition, LP tends to select more markers than RP; this may be because the data violate the assumption that
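To make the pedigree-level comparison concrete, the sketch below computes the one-way ANOVA F statistic and the Kruskal-Wallis H statistic for phenotypes grouped by pedigree, using only the Python standard library. The function names and toy data are hypothetical, and the H statistic is computed with average ranks but without the tie-correction factor. In practice, p values could be obtained from `scipy.stats.f_oneway` and `scipy.stats.kruskal`.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (between / (k - 1)) / (within / (n - k))

def kruskal_h(groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

# Hypothetical phenotype values for three pedigrees.
pedigree_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(anova_f(pedigree_groups))
print(kruskal_h(pedigree_groups))
```

Large values of either statistic indicate that the phenotype distribution differs across pedigrees, which is the situation in which the choice between global and pedigree-aware permutation matters most.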
Discussion
In this paper, we have proposed a partition-based approach, PBI, to detect G×E interactions, which is nonparametric and model-free. The test statistic is derived from a partition-based measure I, and the interaction information score I_{G×E} is defined as the difference between the total score I_{T} and the maximum of the marginal scores. Intuitively, if the genetic and the environmental factors have a strong interaction effect, I_{T} will be far greater than both marginal scores; hence I_{G×E} will be positive and large. If not, I_{T} will be no greater than at least one of the marginal scores. Therefore, I_{G×E} evaluates the amount of influence of the G×E interaction on the phenotype.
When applied to the real hypertension data set provided by GAW18, PBI identified many more markers than the traditional linear regression method. Because our approach is model-free, it is able to capture complicated interaction patterns that are difficult to detect with a linear model. The significance of I_{G×E} is evaluated by permutation. LP and RP adjust effectively for the family dependence of the phenotype. Although the proposed procedure selects more SNPs than linear regression, there is very little experimental evidence of G×E interactions for hypertension in the current literature with which to verify our findings. Therefore, biological studies will be required to investigate our results. Modifications of PBI have successfully identified gene-gene interactions and constructed genetic networks for breast cancer [6] and rheumatoid arthritis [7]. Moreover, PBI can be extended to evaluate the interaction effects between rare variants and environmental factors. Because of the low frequencies of rare variants (<1%), we can apply a gene-based approach by collapsing rare variants in a gene [8-11] and creating partitions based on the collapsed information.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SHL and RF designed the study. RF, CHH and SHL performed the study. RF, CHH, IH, HW, TZ and SHL contributed to analysis of the data. RF and SHL drafted the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This research is supported by National Institutes of Health grants R01 GM070789 and GM0707890551 and by the Hong Kong Research Grants Council (642207 and 601312).
The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
References

1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al.: Finding the missing heritability of complex diseases. Nature 2009, 461:747-753.
2. Andreasen CH, Mogensen MS, Borch-Johnsen K, Sandbæk A, Lauritzen T, Sorensen TI, Hansen L, Almind K, Jorgensen T, Pedersen O, et al.: Non-replication of genome-wide based associations between common variants in INSIG2 and PFKP and obesity in studies of 18,014 Danes. PLoS One 2008, 3:e2872.
3. Hamza TH, Chen H, Hill-Burns EM, Rhodes SL, Montimurro J, Kay DM, Tenesa A, Kusel VI, Sheehan P, Eaaswarkhanth M, et al.: Genome-wide gene-environment study identifies glutamate receptor gene GRIN2A as a Parkinson's disease modifier gene via interaction with coffee. PLoS Genetics 2011, 7:e1002237.
4. Kraft P, Yen YC, Stram DO, Morrison J, Gauderman WJ: Exploiting gene-environment interaction to detect genetic associations. Hum Hered 2007, 63:111-119.
5. Chernoff H, Lo SH, Zheng T: Discovering influential variables: a method of partitions.
6. Lo SH, Zheng T: A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. Proc Natl Acad Sci U S A 2004, 101:10386-10391.
7. Huang CH, Cong L, Xie J, Qiao B, Lo SH, Zheng T: Rheumatoid arthritis-associated gene-gene interaction network for rheumatoid arthritis candidate genes. BMC Proc 2009, 3(suppl 7):S75.
8. Dering C, Pugh E, Ziegler A: Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol 2011, 35:S12-S17.
9. Fan R, Huang CH, Lo SH, Zheng T, Ionita-Laza I: Identifying rare disease variants in the Genetic Analysis Workshop 17 simulated data: a comparison of several statistical approaches.
10. Chen G, Wei P, DeStefano AL: Incorporating biological information into association studies of sequencing data. Genet Epidemiol 2011, 35:S29-S34.
11. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X: Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011, 89:82-93.