Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University

Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520, USA

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China

Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China

Abstract

Background

The detection of epistasis among genetic markers is of great interest in genome-wide association studies (GWAS). In recent years, much research has been devoted to finding disease-associated epistasis in GWAS. However, due to the high computational cost involved, most methods focus on specific epistasis models, which may lead to a loss of power when the underlying epistasis model is not among those examined.

Results

In this work, we propose a computationally efficient approach based on complete enumeration of two-locus epistasis models. This approach uses a two-stage (screening and testing) search strategy and guarantees the enumeration of all epistasis patterns. The implementation runs on graphics processing units (GPUs) and can finish the analysis of a GWAS data set (with around 5,000 subjects and around 350,000 markers) within two hours. Source code is available at

Conclusions

This work demonstrates that complete compositional epistasis detection is computationally feasible in GWAS.

Background

The concept of epistasis was first introduced in 1909 by Bateson and Mendel

Researchers generally distinguish three types of epistasis: functional epistasis, statistical epistasis, and compositional epistasis

Estimating statistical epistasis between two loci requires the estimation of their additive main effects, which involves iterations (see details in the Methods Section). Because hundreds of billions of SNP pairs need to be measured for epistasis in a standard GWAS, any extra time spent on analyzing each pair will significantly increase the computational cost. To tackle this computational problem, many earlier methods

It has been argued that compositional epistasis is closer to the biological understanding of gene-gene interactions than statistical epistasis

In this article, we propose a fast approach that enables an exhaustive search for compositional epistasis in GWAS. The proposed approach uses a two-stage (screening and testing) search strategy. In the screening stage, only a limited number of epistatic patterns are evaluated for each pair of SNPs, and the pairs passing a specified threshold are selected. This step filters out the non-significant pairs while keeping every pair that is significant in the test of compositional epistasis. In the testing stage, we evaluate all epistatic patterns for each remaining pair. The implementation runs on graphics processing units (GPUs), where the analysis of one GWAS data set (with around 5,000 subjects and around 350,000 markers) can be finished within a few hours.

Methods

SNPs are mostly bi-allelic genetic markers. In general, we use capital letters (e.g., A, B, ⋯) to denote the major alleles and lowercase letters (e.g., a, b, ⋯) to denote the minor alleles. For each SNP, there are three genotypes: the homozygous reference genotype (AA), the heterozygous genotype (Aa), and the homozygous variant genotype (aa). The popular way of coding the genotype is to use {1, 2, 3} to represent {

Epistasis tests

Statistical epistasis and compositional epistasis are the two major types of epistasis that have been considered in the literature. Statistical epistasis is defined as the statistical deviation from the additive effects of two loci on the phenotype. To test it for two SNPs $S_p$ and $S_q$, there are three steps in the procedure:

• Fit the logistic regression model for only individual effect terms and obtain the MLE

• Fit the logistic regression model for both individual effect terms and interaction terms and obtain the MLE

• Conduct the $\chi^2$ test on

We call this test the interaction test. However, estimating the MLE

• Remove all significant SNPs based on the single-locus test with a given threshold.

• For every pair ($S_p$, $S_q$) in the remaining SNPs,

– Compute the log-likelihood $L_{\emptyset}$ of the null logistic regression model, defined as

– Compute the log-likelihood $L_F$ of the full logistic regression model in Eq.(2).

– Conduct the $\chi^2$ test on $2\cdot(L_F - L_{\emptyset})$ with 8 degrees of freedom.

We call the test with 8 degrees of freedom the full association test. In the full association test, a threshold is required to filter out SNPs that are significant on their own. Otherwise, the test will report many false epistatic pairs, each combining one marginally significant SNP with an irrelevant one.
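The likelihood-ratio statistic above can be sketched without iterative fitting under one assumption that is not stated in the surviving text: that the full model in Eq.(2) is saturated (one free case rate per joint genotype, which is what 8 degrees of freedom against an intercept-only null suggests). In that case both maximized log-likelihoods have closed forms in the cell counts. The function names and the layout `counts[i][j] = (controls, cases)` are hypothetical.

```python
import math

def null_loglik(counts):
    """Intercept-only model: one case rate shared by all nine genotype cells."""
    n0 = sum(c[0] for row in counts for c in row)
    n1 = sum(c[1] for row in counts for c in row)
    n = n0 + n1
    return sum(m * math.log(m / n) for m in (n0, n1) if m > 0)

def full_loglik(counts):
    """Saturated model (assumed): each genotype cell has its own case rate."""
    ll = 0.0
    for row in counts:
        for c0, c1 in row:
            tot = c0 + c1
            for m in (c0, c1):
                if m > 0:
                    ll += m * math.log(m / tot)
    return ll

def chi2_sf_even_df(x, df):
    """Upper tail of the chi-square distribution for even df (closed form)."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def full_association_test(counts):
    """Return (2 * (L_F - L_null), p-value with 8 degrees of freedom)."""
    stat = 2.0 * (full_loglik(counts) - null_loglik(counts))
    return stat, chi2_sf_even_df(stat, 8)

# Hypothetical counts: rows are genotypes of S_p, columns genotypes of S_q.
counts = [[(50, 40), (40, 35), (10, 8)],
          [(45, 38), (30, 60), (8, 20)],
          [(9, 7), (7, 18), (1, 5)]]
stat, p_value = full_association_test(counts)
```

If Eq.(2) is not saturated (for example, if genotypes enter as numeric covariates), the full log-likelihood would instead need an iterative logistic regression fit.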

The full association test is quite different from the interaction test. It measures the sum of individual effects and interaction effects, so it has 8 degrees of freedom, while the interaction test measures only the interaction effect, with 4 degrees of freedom. Both tests have their pros and cons. In the full association test, it is very difficult to choose the threshold for filtering out significant SNPs. With a stringent threshold, many SNPs below the threshold may produce strong associations in the full model with little interaction effect. With a loose threshold, some SNPs involved in true epistasis may be filtered out. The interaction test, in turn, ignores epistasis involving SNPs that have both medium individual effects and a medium interaction effect. Most importantly, both tests suffer when the underlying degrees of freedom are lower than those assumed in the statistical test, which is caused by low MAF. A relatively robust way to tackle this issue is to use the test of compositional epistasis.

The definition of two-locus compositional epistasis

A two-locus compositional epistasis can be defined by a 3-by-3 penetrance table (shown below). The element $p_{ij}$ in this table is the probability of developing the disease given the corresponding joint genotype at the two SNPs. One common approach to defining disease models is to restrict the value of $p_{ij}$ to two levels, e.g., 0 or 1, corresponding to low risk or high risk. With this restriction, the total number of possible epistasis patterns is $2^{9}=512$. Each model can be associated with a unique label, defined as the decimal value of the binary number $(p_{11} p_{12} p_{13} p_{21} p_{22} p_{23} p_{31} p_{32} p_{33})_2$. For example, the model $(000011011)_2$ has the label 27. Because of the symmetry in the model definition, the number of non-redundant epistasis models is less than 512. In

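| | $S_2=1$ | $S_2=2$ | $S_2=3$ |
|---|---|---|---|
| $S_1=1$ | $p_{11}$ | $p_{12}$ | $p_{13}$ |
| $S_1=2$ | $p_{21}$ | $p_{22}$ | $p_{23}$ |
| $S_1=3$ | $p_{31}$ | $p_{32}$ | $p_{33}$ |

The element $p_{ij}$ is the penetrance of the joint genotype $(S_1=i, S_2=j)$.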

| | $S_2=1$ | $S_2=2$ | $S_2=3$ |
|---|---|---|---|
| $S_1=1$ | 0 | 0 | 0 |
| $S_1=2$ | 0 | 1 | 1 |
| $S_1=3$ | 0 | 1 | 1 |

Its unique label is $27=(000011011)_2$.
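The labelling rule can be checked with a few lines of code. This is a minimal sketch; the function names `model_label` and `model_from_label` are hypothetical and only illustrate reading the nine 0/1 penetrance values row by row as a binary number.

```python
def model_label(table):
    """Read a 3x3 table of 0/1 penetrance levels row by row as a 9-bit number."""
    label = 0
    for row in table:
        for bit in row:
            label = label * 2 + bit
    return label

def model_from_label(label):
    """Inverse mapping: recover the 3x3 table from a label in 0..511."""
    bits = [(label >> (8 - k)) & 1 for k in range(9)]
    return [bits[0:3], bits[3:6], bits[6:9]]

example = [[0, 0, 0],
           [0, 1, 1],
           [0, 1, 1]]
print(model_label(example))  # 27
```

Labels 0 and 511 correspond to the two tables in which every genotype has the same risk level.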

The trivial $p_{ij}$ in Table

The test of two-locus compositional epistasis

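To identify the compositional epistasis for $S_i$ and $S_j$, a contingency table of these two SNPs and the class label $Y$ is constructed, where $n_{ijk}$ denotes the observed count in the cell with genotype $i$ at the first SNP, genotype $j$ at the second SNP, and class label $k$.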

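| | $Y=0$, $S_j=1$ | $Y=0$, $S_j=2$ | $Y=0$, $S_j=3$ | $Y=1$, $S_j=1$ | $Y=1$, $S_j=2$ | $Y=1$, $S_j=3$ |
|---|---|---|---|---|---|---|
| $S_i=1$ | $n_{110}$ | $n_{120}$ | $n_{130}$ | $n_{111}$ | $n_{121}$ | $n_{131}$ |
| $S_i=2$ | $n_{210}$ | $n_{220}$ | $n_{230}$ | $n_{211}$ | $n_{221}$ | $n_{231}$ |
| $S_i=3$ | $n_{310}$ | $n_{320}$ | $n_{330}$ | $n_{311}$ | $n_{321}$ | $n_{331}$ |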

Next, for a particular compositional epistasis model defined by a penetrance table, the contingency table can be collapsed into a 2-by-2 risk table. For the example model with label 27, the four cells of the risk table are $n_{110}+n_{120}+n_{130}+n_{210}+n_{310}$, $n_{220}+n_{230}+n_{320}+n_{330}$, $n_{111}+n_{121}+n_{131}+n_{211}+n_{311}$, and $n_{221}+n_{231}+n_{321}+n_{331}$. The risk table allows us to compare the proportions of samples in cases and controls under the assumption that the given epistasis model is true. If the proportions of samples in different rows vary significantly between columns, we conclude that the risk factors (genotypes) and the disease trait (class labels) are not independent for the given epistasis model. The significance of the difference between the two proportions can be assessed with Pearson's chi-squared test. The test statistic is defined in Eq.(4), with one degree of freedom for the 2-by-2 risk table.

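| | **Low risk** | **High risk** |
|---|---|---|
| Control ($Y=0$) | $n_{110}+n_{120}+n_{130}+n_{210}+n_{310}$ | $n_{220}+n_{230}+n_{320}+n_{330}$ |
| Case ($Y=1$) | $n_{111}+n_{121}+n_{131}+n_{211}+n_{311}$ | $n_{221}+n_{231}+n_{321}+n_{331}$ |

Risk table for the example model with label 27.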

For $S_i$ and $S_j$ and each of the 51 possible compositional epistasis models, the chi-square test statistic is calculated using Eq.(4). Models whose test statistics pass a given significance threshold are considered possible interaction patterns of $S_i$ and $S_j$.
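The collapse into a risk table and the chi-squared test can be sketched as follows. Eq.(4) is not reproduced in this text, so the sketch assumes it is the standard Pearson statistic for a 2-by-2 table; the function names, the `counts[i][j] = (controls, cases)` layout, and the example counts are hypothetical.

```python
import math

def risk_table(counts, model):
    """Collapse counts[i][j] = (controls, cases) into t[class][risk].

    model[i][j] is 0 for a low-risk genotype and 1 for a high-risk genotype.
    """
    t = [[0, 0], [0, 0]]
    for i in range(3):
        for j in range(3):
            risk = model[i][j]
            for k in (0, 1):              # 0 = control, 1 = case
                t[k][risk] += counts[i][j][k]
    return t

def pearson_chi2_2x2(t):
    """Pearson chi-square for [[a, b], [c, d]]: n (ad - bc)^2 / (row and column totals)."""
    (a, b), (c, d) = t
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def chi2_sf_1df(x):
    """Upper tail of the chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

counts = [[(50, 40), (40, 35), (10, 8)],
          [(45, 38), (30, 60), (8, 20)],
          [(9, 7), (7, 18), (1, 5)]]
model_27 = [[0, 0, 0], [0, 1, 1], [0, 1, 1]]

table = risk_table(counts, model_27)
stat = pearson_chi2_2x2(table)
p_value = chi2_sf_1df(stat)
```

Repeating the last three lines for each candidate model gives the per-model statistics described above.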

Compositional epistasis detection in GWAS

In a typical GWAS, there are hundreds of billions of pairs of SNPs to be tested. It is computationally expensive to evaluate every possible compositional epistasis model for all pairs of SNPs. However, it is widely believed that among the very large number of SNP pairs, only a small portion may be relevant to the disease trait. Testing every model on every SNP pair is therefore largely wasted effort. If we can quickly compute the best-fitting compositional epistasis model for a SNP pair given the observed data, we can first remove the pairs unlikely to be significant and then evaluate all possible compositional epistasis models only for the remaining SNP pairs. By doing so, the entire process is substantially sped up. The approach used to select the best splits for classification trees with categorical variables provides a way to identify the compositional epistasis model that best fits the observed data.
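A quick count makes the scale concrete. The sketch below only counts chi-square evaluations, using the figure of about 350,000 SNPs, the 51 models, and the 8 ordered splits that appear elsewhere in this article; it says nothing about the cost of each evaluation.

```python
from math import comb

def n_pairs(n_snps):
    """Number of unordered SNP pairs."""
    return comb(n_snps, 2)

pairs = n_pairs(350_000)      # roughly the number of SNPs kept per disease
full_tests = pairs * 51       # every pair against each of the 51 models
screen_tests = pairs * 8      # every pair against the 8 ordered splits

print(pairs)                  # 61249825000
```

With one million SNPs the same function gives roughly 5 x 10^11 pairs.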

In classification trees, leaves represent class labels, internal nodes represent features, and branches represent conjunctions of features that induce class labels. In this work, class labels are phenotypes and features are genotypes. To construct a binary classification tree, a typical method iteratively searches all features for the best split. If the feature is categorical with $M$ values, the number of candidate binary splits is on the order of $2^{M-1}$. However, for a two-class classification problem,

**Theorem 1. **

Theorem 1 only holds for the two-class problem. Some extensions to the multi-class problem have been proposed on the basis of Theorem 1 but they are only locally optimal.
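The statement of Theorem 1 is not reproduced in this text. The sketch below assumes it is the classical two-class result for categorical splits: after sorting the categories by their proportion of cases, the best binary split is one of the contiguous splits along that ordering. Under that assumption, scanning 8 splits of the nine joint genotypes should give the same maximum Pearson chi-square as enumerating all $2^9-2$ non-trivial partitions, which the brute-force function lets one check. All names and counts are hypothetical.

```python
from itertools import product

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def split_stat(cells, high):
    """cells: list of (controls, cases); high: indices assigned to high risk."""
    hc = sum(cells[i][0] for i in high)
    hk = sum(cells[i][1] for i in high)
    tc = sum(c for c, _ in cells)
    tk = sum(k for _, k in cells)
    return chi2_2x2(tc - hc, hc, tk - hk, hk)

def best_ordered_split(cells):
    """Sort cells by case proportion and scan the len(cells) - 1 contiguous splits."""
    def prop(i):
        c, k = cells[i]
        return k / (c + k) if c + k else 0.0
    order = sorted(range(len(cells)), key=prop)
    return max(split_stat(cells, order[m:]) for m in range(1, len(cells)))

def best_split_brute_force(cells):
    """Enumerate every non-trivial low/high assignment of the cells."""
    best = 0.0
    for mask in product((0, 1), repeat=len(cells)):
        if 0 < sum(mask) < len(cells):
            high = [i for i, bit in enumerate(mask) if bit]
            best = max(best, split_stat(cells, high))
    return best

cells = [(50, 40), (40, 35), (10, 8), (45, 38), (30, 60),
         (8, 20), (9, 7), (7, 18), (1, 5)]
fast = best_ordered_split(cells)
slow = best_split_brute_force(cells)
```

The agreement relies on the chi-square statistic of a 2-by-2 table being a monotone function of the Gini impurity decrease, which is the setting in which the classical result is usually stated.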

In the test of compositional epistasis, we can re-arrange the contingency table so that the nine joint genotypes are sorted by their proportion of cases $p_i$. In this table, $p_1 \le \cdots \le p_i \le \cdots \le p_9$.

Based on Theorem 1, we propose a two-stage (screening and testing) search method to find compositional epistasis in GWAS data.

• In the screening stage, the method evaluates all SNP pairs by checking 8 splits to find an upper bound and removes the pairs whose upper bound does not pass the screening threshold, a significance level on the order of $10^{-6}$ that is relatively liberal for a genome-wide study.

• In the testing stage, the method checks each selected pair using all non-redundant compositional epistasis models. The
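The two stages can be sketched on a toy data set as follows. This is an illustration under stated assumptions, not the GPU implementation: the threshold is expressed directly as a chi-square value (a hypothetical parameter), the testing stage enumerates all 510 non-trivial labels rather than only the non-redundant models, and all function names are made up for the example.

```python
from itertools import combinations

def chi2_2x2(a, b, c, d):
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def contingency(gp, gq, y):
    """Nine [controls, cases] cells for genotypes coded 1, 2, 3 and labels 0/1."""
    cells = [[0, 0] for _ in range(9)]
    for a, b, k in zip(gp, gq, y):
        cells[3 * (a - 1) + (b - 1)][k] += 1
    return cells

def model_stat(cells, label):
    """Chi-square of the risk table induced by a 9-bit model label."""
    t = [[0, 0], [0, 0]]                       # t[class][risk]
    for idx, (c, k) in enumerate(cells):
        risk = (label >> (8 - idx)) & 1
        t[0][risk] += c
        t[1][risk] += k
    return chi2_2x2(t[0][0], t[0][1], t[1][0], t[1][1])

def screening_bound(cells):
    """Best of the 8 splits along the cells sorted by case proportion."""
    def prop(i):
        c, k = cells[i]
        return k / (c + k) if c + k else 0.0
    order = sorted(range(9), key=prop)
    labels = [sum(1 << (8 - i) for i in order[m:]) for m in range(1, 9)]
    return max(model_stat(cells, lab) for lab in labels)

def two_stage_search(snps, y, threshold):
    """Return (statistic, p, q, best_label) for pairs that survive screening."""
    hits = []
    for p, q in combinations(range(len(snps)), 2):
        cells = contingency(snps[p], snps[q], y)
        if screening_bound(cells) < threshold:             # screening stage
            continue
        stat, neg = max((model_stat(cells, m), -m) for m in range(1, 511))
        hits.append((stat, p, q, -neg))                     # testing stage
    return sorted(hits, reverse=True)

# Toy data: SNP 0 and SNP 1 follow model 27, SNP 2 is unrelated to the trait.
snps, y = [[], [], []], []
for a in (1, 2, 3):
    for b in (1, 2, 3):
        n_ctrl, n_case = (5, 20) if (a >= 2 and b >= 2) else (20, 5)
        for k, n in ((0, n_ctrl), (1, n_case)):
            for _ in range(n):
                snps[0].append(a)
                snps[1].append(b)
                snps[2].append(len(y) % 3 + 1)
                y.append(k)
hits = two_stage_search(snps, y, threshold=30.0)
```

Complementary labels (for example 27 and 484) describe the same partition with the risk levels swapped and yield the same statistic, so the sketch reports the smaller label of each pair.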

GPU implementation

To accelerate the analysis process in GWAS, the proposed method is implemented using the parallel computation of graphics processing units (GPUs) (

Results

Compositional epistasis and statistical epistasis are the two most commonly considered types of epistasis. In general, there are two tests of statistical epistasis, named ‘Interaction’ and ‘Full Association’. In this section, we evaluate these three tests using both simulated data and real data. To compare their statistical power, we must also consider the issue of multiple-testing correction. For each pair of SNPs, both the interaction test and the full association test compute one statistic and conduct the hypothesis test with the corresponding degrees of freedom. In the test of compositional epistasis, each SNP pair is associated with multiple epistatic patterns and thus with multiple statistics. In our comparison experiments, we choose the maximum one. Since we need to check 8 patterns to get the maximum statistic (see Theorem 1), we need to multiply the computed
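The end of the sentence above is cut off. Assuming it describes a Bonferroni-style adjustment in which the p-value of the maximum statistic is multiplied by the 8 patterns checked, the correction can be sketched as below; the function names are hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper tail of the chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def corrected_p(max_stat, n_patterns=8):
    """Multiply the p-value of the maximum statistic by the number of patterns."""
    return min(1.0, n_patterns * chi2_sf_1df(max_stat))
```

For instance, a maximum statistic at the nominal 5% critical value of 3.84 would be reported with a corrected p-value of about 0.4.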

Simulation 1: epistasis with main effects

Data generation

In this experiment, we select four epistasis models whose odds tables are given below. The disease prevalence and the heritability $h^2$ are computed as

Table: odds tables of models 1 to 4. Rows are the genotypes AA, Aa, and aa of the first SNP; columns are the genotypes BB, Bb, and bb of the second SNP. The parameters are related to $h^2$ through Eq.(6).

where $p(D \mid g_i)$ denotes the probability of an individual being affected given its genotype combination $g_i$ (i.e., the penetrance of $g_i$). The odds of the disease for genotype $g_i$ is defined as

$$OD_{g_i} = \frac{p(D \mid g_i)}{1 - p(D \mid g_i)}.$$

Then the penetrance $p(D \mid g_i)$ of the genotype $g_i$ can be calculated using

$$p(D \mid g_i) = \frac{OD_{g_i}}{1 + OD_{g_i}}.$$
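The conversion from an odds table to penetrances, prevalence, and heritability can be sketched as follows. Several pieces are assumptions because the corresponding equations are missing from this text: genotype frequencies follow Hardy-Weinberg proportions, the two loci are independent, and heritability is taken as the variance of the penetrance divided by the variance of the binary trait. The odds table and its parameters `alpha` and `theta` are purely hypothetical.

```python
def penetrance_from_odds(odds):
    """Invert odds = p / (1 - p)."""
    return odds / (1.0 + odds)

def genotype_freqs(maf):
    """Hardy-Weinberg frequencies of (AA, Aa, aa) for minor allele frequency maf."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

def prevalence_and_heritability(odds_table, maf_a, maf_b):
    """Return (prevalence, h2) for a 3x3 odds table under the stated assumptions."""
    fa, fb = genotype_freqs(maf_a), genotype_freqs(maf_b)
    pen = [[penetrance_from_odds(o) for o in row] for row in odds_table]
    cells = [(i, j) for i in range(3) for j in range(3)]
    prev = sum(fa[i] * fb[j] * pen[i][j] for i, j in cells)
    var_g = sum(fa[i] * fb[j] * (pen[i][j] - prev) ** 2 for i, j in cells)
    return prev, var_g / (prev * (1 - prev))

# Hypothetical multiplicative odds table, for illustration only.
alpha, theta = 0.1, 0.5
odds = [[alpha * (1 + theta) ** (i + j) for j in range(3)] for i in range(3)]
prev, h2 = prevalence_and_heritability(odds, maf_a=0.2, maf_b=0.2)
```

Solving for the model parameters given a target prevalence and $h^2$ would wrap this function in a numerical root finder.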

In our simulation, the prevalence and the heritability $h^2$ are controlled by the model parameters: we specify the prevalence and $h^2$, and then numerically solve for the parameters. For example, with $h^2=0.03$ in model 1, we have

**Variance composition in the different epistasis models.** The total variance of disease traits is decomposed into two parts: the variance explained by marginal effects and the variance explained by interactions.

Performance comparison

The performance comparison of the three tests is provided in the accompanying figure.

**The performance comparison of three epistasis tests.** The significance thresholds are selected as 0.1, 0.2 and 0.3 after the Bonferroni correction.

Simulation 2: epistasis without main effects

This type of epistasis shows weak main effects but a strong interaction effect. Finding this type of epistasis is a challenging task, and the interaction test has an advantage in detecting it. We use the commonly used data sets from the Dartmouth Medical School in this experiment. The web-site,

**The power comparison between the compositional epistasis (CE) and the interaction (IA) in models without main effects.**

Simulation 3: type-1 error rate

To assess the type I error rate of our method, we conduct the following null simulation. We generate 100 null data sets. Each data set contains 2,000 SNPs and 2,000 samples. All SNPs are generated independently with MAFs uniformly distributed in [0.05,0.5]. The result is shown in Figure

**The type-I error rates in null simulation.**
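A null data set of the kind described in this simulation can be generated as sketched below. Two details are assumptions that the text does not state: genotypes are drawn under Hardy-Weinberg equilibrium, and cases and controls are balanced. The function name and the seed handling are hypothetical.

```python
import random

def simulate_null(n_snps, n_samples, seed=0):
    """Independent SNPs with MAF ~ Uniform[0.05, 0.5]; labels unrelated to genotypes."""
    rng = random.Random(seed)
    y = [0] * (n_samples // 2) + [1] * (n_samples - n_samples // 2)
    snps = []
    for _ in range(n_snps):
        maf = rng.uniform(0.05, 0.5)
        # Genotype code = 1 + number of minor alleles, matching the {1, 2, 3} coding.
        snps.append([1 + (rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n_samples)])
    return snps, y

# Small demonstration; the simulation in the text uses 2,000 SNPs and 2,000 samples.
snps, y = simulate_null(n_snps=20, n_samples=200)
```

Because the labels are assigned independently of the genotypes, any pair reported as significant on such data is a false positive.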

Experiments on seven data sets from WTCCC

The Wellcome Trust Case Control Consortium (WTCCC) is a collaboration of many British research groups. In its first phase, the WTCCC examined the genetic signals (500K SNPs) of seven common human diseases: bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D) (14,000 cases in total and 3,000 shared controls). Before we analyze these data sets, we first apply a quality control procedure similar to that suggested in (WTCCC, 2007) to pre-process the data. Next we filter out those SNPs with significant individual effects. The threshold is on the order of $10^{-7}$, which is equivalent to a significance level of 0.10 after the Bonferroni correction. The number of remaining SNPs is roughly 350,000 for each disease. The results from the three epistasis tests are reported in Table

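| | **BD** | **CAD** | **CD** | **HT** | **RA** | **T1D** | **T2D** |
|---|---|---|---|---|---|---|---|
| Compositional Epistasis | 0 | 0 | 17 | 0 | 47 | 234 | 3 |
| Interaction | 0 | 0 | 1 | 0 | 0 | 317 | 0 |
| Full Association | 0 | 0 | 0 | 0 | 10 | 346 | 0 |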

T1D

For T1D, all SNP pairs identified by the three epistasis tests are located in the major histocompatibility complex (MHC) region. The MHC region on chromosome 6 has been studied comprehensively for decades because of its high diversity and its significance in infection, inflammation, autoimmunity, and transplant medicine

**The distributions of SNP pairs among three epistasis tests in T1D.**

**Compositional epistasis patterns in T1D and RA.**

RA

For RA, the test of compositional epistasis reports 47 pairs, which include the 10 pairs reported by the full association test. The interaction test does not report any significant pairs. A careful inspection reveals that the epistatic effect of these pairs consists partly of individual effects and partly of interaction effects. Among the 47 reported pairs, 43 involve SNP rs2107191, and the paired SNPs are all located in a very gene-rich region (genome location from 29,778,109 to 30,363,351). About 31 pairs involving SNP rs2107191 display a recessive-interference pattern (M2)

Discussions

In this work, we have focused on genome-wide case-control studies, i.e., studies in which the disease phenotype can be represented as a binary variable. In its current form, the test of compositional epistasis cannot be easily extended to continuous phenotypes. Moreover, the current work only detects two-way compositional epistasis. However, we note that there is no widely accepted definition of high-order compositional epistasis. These issues are worth pursuing in the future.

Conclusions

Studying the epistasis between two loci is a natural step following traditional and well-established single-locus analysis. In this paper, we have proposed a computationally efficient and statistically sound method to test compositional epistasis in GWAS data. The method is applicable to case-control studies and consists of a two-stage (screening and testing) process. In the screening stage, only a limited number of epistatic patterns are evaluated for each pair of SNPs, and the pairs passing a specified threshold are selected to be studied more thoroughly in the testing stage, where all epistatic patterns for each selected pair are evaluated. The method is implemented using the parallel computational capability of commercially available GPUs to greatly reduce the computation time involved. We have successfully applied our method to analyze seven data sets from the WTCCC. Our experimental results demonstrate that complete compositional epistasis detection is computationally feasible in GWAS.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XW and CY designed the models and simulation studies. QY and WY initialized the study and proposed the modeling framework. YH and WY directed the evaluation of methodologies. All authors contributed to the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was partially supported by grant RPC10EG04 from the Hong Kong University of Science and Technology.