Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA

Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA

Department of Electrical Engineering, The Pennsylvania State University, University Park, PA, USA

Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA

Abstract

Background

Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted.

Results

We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding

Conclusion

This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at:

Background

Genome-wide association studies (GWAS) have been widely applied recently to identify SNPs associated with common human diseases

Novel Methods for Detecting Interacting SNPs

A variety of SNP interaction detection methods have been recently proposed. In particular, multifactor dimensionality reduction (MDR)

Evaluation of Methods to Detect Interacting SNPs

Despite strong current interest in this area and a number of recent review articles

Finally, we note that there are very few true (strict)

The aforementioned limitations of previous studies are not surprising, given the following challenges associated with comparison studies: (1) it is impractical to evaluate methods on all of the (numerous possible) interaction models; (2) multiple factors (MAF, penetrance, LD) jointly determine interaction effects, which entails extensive study design, experimentation, and computational effort; (3) many replicated data sets are required to accurately estimate power and family-wise type I error rate, further increasing the computational burden; (4) the computational costs of some methods are inherently high, making a thorough evaluation difficult; (5) fair evaluation criteria are not easily designed, because distinct methods have different inductive biases and produce different forms of output (e.g., some give P-value assessments while others only provide SNP rankings); and (6) there is no consensus definition of power when seeking to identify

Addressing the above challenges, a ground-truth based comparative study is reported in this paper. The goals are three-fold: (1) to describe and make publicly available simulation tools for evaluating performance of any technique designed to detect interactions among genetic variants in case-control studies; (2) to use these tools to compare performance of eight popular SNP detection methods; (3) to develop analytical relationships between power to detect interacting SNPs and the factors on which it depends (penetrance, MAF, main effects, LD), which support and help explain the experimental results.

Our simulation tools allow users to vary the parameters that impact performance, including interaction pattern, MAF, penetrance (which together determine the strength of the association) and the sporadic disease rate, while maintaining the normally occurring linkage disequilibrium structure. Also, the simulation tools allow users to embed multiple interaction models within each data set. These tools can be used to produce any number of test sets composed of user specified numbers of subjects and SNPs.
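As a rough sketch of what such a simulation involves (an illustrative stand-in, not the authors' tool: it draws loci independently, so it ignores the linkage disequilibrium structure the real tools preserve; all parameter values are hypothetical):

```python
# Sketch: simulating disease status from a penetrance function, MAFs, and a
# sporadic disease rate. Loci are drawn independently under Hardy-Weinberg
# equilibrium; the real simulation tools instead resample genotyped subjects
# so that LD structure is maintained.
import random

random.seed(0)

def draw_genotype(maf):
    """Number of minor alleles (0/1/2) under Hardy-Weinberg equilibrium."""
    return sum(random.random() < maf for _ in range(2))

def simulate_subject(mafs, penetrance, sporadic_rate):
    g = [draw_genotype(m) for m in mafs]
    # Disease arises either through the interaction model or sporadically.
    p_disease = penetrance(g) + (1 - penetrance(g)) * sporadic_rate
    return g, random.random() < p_disease

# Hypothetical 2-locus model: risk requires a minor allele at both loci.
pen = lambda g: 0.6 if g[0] >= 1 and g[1] >= 1 else 0.0
cases = sum(simulate_subject([0.25, 0.25], pen, 0.01)[1] for _ in range(10000))
```

Repeating this per subject, with several penetrance functions contributing independently, yields case/control data sets of any desired size.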

Our comparison study, based on these simulation tools, involves thousands of data sets and consists of three steps, as graphically illustrated in Figure


**A flowchart for the performance evaluation of interaction detection methods**.

In particular, foreshadowing our Step 1 results, we will find that

As mentioned above, Step 2 (with a variety of ground-truth interaction models present) investigates power. We formulate a more challenging, yet more realistic, situation than most previous studies by including

The main contributions and novelty of our comparison study are: (1) comprehensive comparison of state-of-the-art techniques on realistic simulated data sets, each of which includes multiple interaction models; (2) new proposed power criteria, well-matched to distinct GWAS applications (

Results

Experimental Design and Protocol

We selected eight representative methods for evaluation, based on their reported effectiveness and computational efficiency. Seven of them (MDR, FIM, IG, BEAM, SH, MECPM and LRIT) are designed to detect interacting loci; the remaining one is based on the widely used logistic regression model (LR). LR, using only main-effect terms, serves as a baseline against which all the interaction-detection methods are compared, i.e., to see whether they give any advantage over pure "main effect" methods when the goal is simply to detect the subset of SNPs that, either individually or via interactions, are predictive of the phenotype. The eight methods are described in the "Methods" section.

Simulation Data Sets

Each data set contains individuals simulated from the control subjects genotyped by the 317K-SNP Illumina HumanHap300 BeadChip as part of the New York City Cancer Control Project (NYCCCP). To facilitate this investigation


**A visual illustration of SNP "blocking" and random sampling, used for generating simulated individuals**.

The eight methods were applied to sets of 1,000 to 10,000 SNPs selected at random from the autosomal loci. This number of SNPs is consistent with a GWAS following an initial SNP screening stage, and also with pathway-based association studies. When selecting SNPs, we first removed those with genotypes that significantly deviate from Hardy-Weinberg equilibrium, and then selected the desired number of ground-truth and "null" SNPs. For each replication data set, ground-truth SNPs were randomly selected according to the MAF requirements (within a narrow window of tolerance), and "null" SNPs were chosen completely at random. The reported simulations assume that the disease risk is explained by several ground-truth interaction models together with a sporadic disease rate.

The simulation data sets embed different ground-truth interaction models and sporadic disease rates.

In Figure


**A flowchart detailing all of the steps used in producing the simulated GWAS data sets**.

The simulation approach used in this comparison study is the same as that used in

As mentioned previously, our simulation study consists of three main experimental steps, which we next more fully describe.

Step 1: assess family-wise type I error rate

An accurate family-wise type I error rate is crucial for methods that select candidate SNPs based on their P-values and for reliably comparing methods. If the family-wise type I error rate is either conservative or liberal, the P-value loses its intended meaning and does not reflect the actual false positive rate. That is, we will not be able to control how many false positives are detected by setting a (

BEAM, SH and FIM detect significant SNPs based on P-values calculated from asymptotic distributions and heuristic searches. Thus, based on the preceding discussion, evaluating the accuracy of their P-value assessments is not only of theoretical importance (how well their proposed asymptotic distributions approximate the real distribution), but also of great practical necessity in applying these methods.

To evaluate the accuracy of P-value assessment, we replicated 1,000 data sets by repeatedly randomly selecting 1,000 null SNPs from the SNP pool, i.e. to easily assess family-wise type I error rate, no ground-truth SNPs were embedded in these data sets.

Step 2: assess power

In step 2, each data set has

The 15 ground-truth SNPs each participate in one of 5 ground-truth SNP interactions, which contribute independently to the disease, as described by equation (1). There are three standard factors that determine interactions: penetrance, MAF and LD

**The 5 basic models **vary in interaction order, genetic models (dominant, recessive, or additive), incomplete/complete penetrance, MAF, and marginal effects. To indicate the strength of interaction effects and main effects for each basic model, we calculated the odds ratio by dichotomizing the genotypes of each interaction into a group with the lowest penetrance value (usually with "0" penetrance) and another group with higher penetrance values (the specific calculation can be found in section S4 of the Additional file
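The dichotomization just described can be sketched as follows; the two-locus penetrance table, MAFs, and sporadic rate below are hypothetical stand-ins, not one of the 5 basic models (the authors' exact calculation is in section S4 of the Additional file):

```python
# Sketch: odds ratio obtained by dichotomizing the genotype table into the
# lowest-penetrance group vs. the higher-penetrance group, weighting each
# genotype cell by its Hardy-Weinberg frequency. All parameters hypothetical.
from itertools import product

maf = (0.25, 0.25)
sporadic = 0.05
pen = {(g1, g2): (0.05 if g1 >= 1 and g2 >= 1 else 0.0)
       for g1, g2 in product(range(3), repeat=2)}

def hw(g, m):
    """Hardy-Weinberg probability of g copies of the minor allele."""
    return ((1 - m) ** 2, 2 * m * (1 - m), m ** 2)[g]

low = min(pen.values())
w_dis = {"low": 0.0, "high": 0.0}   # P(group and disease)
w_tot = {"low": 0.0, "high": 0.0}   # P(group)
for g, f in pen.items():
    w = hw(g[0], maf[0]) * hw(g[1], maf[1])
    p_dis = f + (1 - f) * sporadic  # model risk plus sporadic disease
    grp = "low" if f == low else "high"
    w_dis[grp] += w * p_dis
    w_tot[grp] += w

p_high = w_dis["high"] / w_tot["high"]
p_low = w_dis["low"] / w_tot["low"]
odds_ratio = (p_high / (1 - p_high)) / (p_low / (1 - p_low))
```

The sporadic rate keeps the low-penetrance group's disease odds nonzero, so the odds ratio is finite even when the lowest penetrance value is 0.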

**Supplementary information: comparative analysis of methods for detecting interactive SNPs**. This supplementary information consists of 6 sections. Section S1 presents our theoretical analysis of the relationship between association strength, joint effect, main effect, penetrance function, and MAF; it also provides some theoretical explanation of our experimental results. Section S2 presents comprehensive power evaluation results of the methods for different interaction models and parameter settings. A further section analyzes the χ² statistics applied by SH and FIM; this analysis partly explains why SH and FIM are conservative. Section S6 gives the empirical relationship between power and the false positive SNP count under a given significance threshold.


The 5 basic models are defined by the penetrance tables and MAFs below. The penetrance function is the probability of disease given the individual's genotype. Thus, the penetrance tables show the probability of developing disease given the genotypes

**Basic model 1** - two-locus interaction under a dominant model for the major allele. The model is for two very common but low-penetrance alleles. The MAFs at these two loci are both 0.25. This model is expected to generate 62 cases per 1000 subjects. The odds ratio is 1.16 for the joint interaction effect between

**Basic model 2**- two-locus interaction for common alleles under a dominant genetic model at each locus. The minor allele frequencies are 0.20 for locus

**Basic model 3**- three-locus interaction, common alleles, incomplete penetrance. The MAFs at the three loci are 0.40 for

**Basic model 4**- three-locus interaction among common alleles. The minor allele frequencies are 0.25 for

**Basic model 5** - five-locus interaction among common alleles. It assumes a MAF of 0.30 at each locus and has a penetrance value of 0.63 if the minor allele is present at each locus, and 0 otherwise. In equation form, the penetrance function is

f(g_1, ..., g_5) = 0.63 if g_i ≥ 1 for i = 1, ..., 5, and f(g_1, ..., g_5) = 0 otherwise,

where g_i ∈ {0, 1, 2} is the number of copies of the minor allele at locus i.
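The model-5 penetrance function, written as a short code sketch (locus genotypes coded as minor-allele counts):

```python
# Basic model 5's penetrance: 0.63 when the minor allele is present at every
# one of the 5 loci, and 0 otherwise.
def penetrance_model5(genotypes):
    """genotypes: list of 5 minor-allele counts, each in {0, 1, 2}."""
    assert len(genotypes) == 5
    return 0.63 if all(g >= 1 for g in genotypes) else 0.0
```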

**Three parameters** are used to assess the robustness of the various methods to variations in penetrance, MAF, and LD, because (i) as noted above, penetrance, MAF, and LD jointly define the disease model and thus determine disease status; and (ii) it is of interest, in the field of SNP interaction detection, to explore how detection power varies with these parameters

The theoretical, analytical relationship among penetrance, MAF, and statistical significance of an interaction model is investigated in the Additional file

Step 3: assess the power to detect SNPs with only main effects

Most of the interaction-detection methods are designed to find either interactions or main effects (

In Step 3, we simulated 100 replication data sets, following a similar approach as in Step 2. Each data set includes five main-effect ground-truth SNPs and 995 null SNPs. The penetrances and MAFs for the five ground-truth SNPs are:

**SNP 1. **Dominant model for the major allele, low penetrance, MAF = 0.25.

**SNP 2. **Additive model for the minor allele, MAF = 0.3.

**SNP 3. **Additive model for the minor allele, MAF = 0.4.

**SNP 4. **Recessive model for the minor allele, high penetrance, MAF = 0.25.

**SNP 5. **Dominant model for the minor allele, low penetrance, MAF = 0.3.

Although SNP 1 and SNP 5 have relatively weak effects, we still included them because (1) they affect many subjects' disease status, since a large proportion of subjects carry the disease genotype of SNP 1 or SNP 5 (simulating common-disease markers); and (2) our experimental results will show that these weak-effect SNPs differentiate the performance of the methods.

Note that we configured the methods to detect both main effects

Design of Performance Measures

The performance of the methods is evaluated by the accuracy of P-value assessment, various definitions of power, reproducibility, and computational complexity.

A. Family-wise type I error rate (the accuracy of P-value assessment)

There are 1,000 SNPs in each data set. Thus there are multiple comparison effects, and the P-values obtained by the methods are accordingly adjusted by Bonferroni correction. In this way, the accuracy of P-value assessment is represented by the family-wise type I error rate: an error event occurs on a data set with no ground truth SNPs if there are
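The logic of this evaluation can be sketched as follows (the P-values below are hypothetical; the 0.1 threshold matches the Step 1 results reported later):

```python
# Sketch of the Step-1 evaluation: Bonferroni-correct per-SNP P-values and
# declare a family-wise error on a null data set if any corrected P-value
# falls below the significance threshold.
def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def family_wise_error(pvals, alpha=0.1):
    """True if at least one (null) SNP is falsely declared significant."""
    return any(p <= alpha for p in bonferroni(pvals))

# The family-wise type I error rate is then the fraction of null replication
# data sets on which family_wise_error(...) returns True.
```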

B. Various power definitions and the ROC curve

Power can be defined in several ways, depending on what we desire to measure. We next give several power definitions experimentally evaluated in the sequel.

Power to progressively detect interactions (Power definition 1)

For a given number K of top-ranked candidate SNPs output by a method, power (definition 1) is the fraction of ground-truth SNPs that appear among the top K candidates, averaged over the replication data sets.

We can also define this power separately over the SNPs of each individual ground-truth interaction model.

Power to precisely detect interactions (power definition 2: exact interaction power)

Let the i-th ground-truth interaction involve the SNP set {s_1, ..., s_M}. The exact interaction power power_{2,i}(K) is the fraction of replication data sets in which all M SNPs of {s_1, ..., s_M} are detected within the top K candidates.

Power to detect at least 1 SNP in the ground-truth interaction (power definition 3: partial interaction power)

As revealed by the definitions of the interaction models, individual SNPs within a ground-truth interaction {s_1, ..., s_M} can carry some association on their own. The partial interaction power power_{3,i}(K) is accordingly the fraction of replication data sets in which at least one SNP of {s_1, ..., s_M} is detected within the top K candidates.

Power to detect individual SNPs (power definition 4: single SNP power)

The power definitions above ignore differences between SNPs within the same interaction. For each individual ground-truth SNP s_j, the single SNP power power_{4,j}(K) is the fraction of replication data sets in which s_j itself is detected within the top K candidates.
ROC curve

We also evaluate the methods via the ROC curve, which shows how many ground-truth SNPs are detected for a given false positive SNP count.
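For a method that outputs a ranked SNP list, this curve can be computed as in the following sketch (SNP identifiers are hypothetical):

```python
# Sketch: the ROC-style curve used here -- ground-truth SNPs detected as a
# function of the false positive SNP count, from a method's ranked output.
def detection_curve(ranked_snps, truth):
    """ranked_snps: SNP ids ordered from most to least significant."""
    curve, tp, fp = [], 0, 0
    for snp in ranked_snps:
        if snp in truth:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve

# Example: 2 of the 3 ground-truth SNPs rank above the first false positive.
curve = detection_curve(["s1", "s7", "n3", "s9", "n8"], {"s1", "s7", "s9"})
```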

C. Reproducibility

The estimated power, even if high, could deviate significantly across different data set replications, due to the inherent randomness in our simulation approach. Thus, we also want to see how reproducible the detection power is over the data set replications. To evaluate this, we measure the standard deviation of the estimated power across the replicated data sets.

D. Computational complexity

Computational complexity was measured by the execution time and memory occupancy of the methods for the same platform.

Experimental Results

In Step 1, we evaluated the three methods with asymptotic statistics (FIM, BEAM and SH). In Step 2, we evaluated all eight methods (as described in the "Methods" section) on the 1000-SNP data sets, and six methods (FIM, IG, BEAM, MECPM, SH and LR) on the 10,000-SNP data sets; we did not evaluate MDR on the 10,000-SNP data sets because the high memory occupancy of the MDR software prevented this. We also evaluated six methods (MDR, FIM, BEAM, MECPM, SH and LR) in Step 3; we did not evaluate IG and LRIT because, by design, they only output multi-locus interaction candidates and thus are inappropriate for Step 3's main-effect evaluation. Specifically, IG and LRIT will necessarily have 0 true positives, no matter how well they detect interactions involving the main-effect-only SNPs, since in Step 3 only "singlet" main effects are considered true positives. MDR, BEAM, SH and MECPM were all implemented using the authors' freely available software. LR, LRIT, FIM and IG were implemented in C++, with the software freely available. The eight methods were tested on the same platform (OS: Windows, CPU: 3 GHz, RAM: 2 GB). The parameters used by the respective methods follow their default settings wherever possible. We modified only one parameter when testing MDR: we used its heuristic search (with a 1-hour execution time limit) instead of exhaustive search on the 1000-SNP data sets in Step 2, because exhaustive search in MDR had impractically high memory and computational costs: when we ran MDR with exhaustive search, our machine crashed after running out of memory, and the projected execution time was on the order of 10^8 seconds (roughly 15 years) on our platform. Here we compare the eight peer methods along several performance fronts. The results are further evaluated and summarized in the "Discussion" section.

Accuracy of P-value assessment in step 1

Based on the definition in the subsection "Design of Performance Measures", we tested the accuracy of P-value assessment for BEAM, SH, and FIM on the 1,000 data sets in step 1. Regarding the other methods, IG and MECPM do not give significance assessments, while the significance assessment of MDR is (necessarily) accurate since it uses random permutation testing (However, it should also be noted that MDR only evaluates the significance of the

The average family-wise type I error rates (step 1) for BEAM, SH and FIM under the significance threshold of 0.1 (after Bonferroni correction). More results can be found in the Additional file

| **family-wise type I error rate** | **BEAM** | **SH** | **FIM** |
| --- | --- | --- | --- |
| 1st order | 0.094 | 0.084 | 0.097 |
| 2nd order | 0 | 0.026 | 0.032 |
| 3rd order | 0 | 0.002 | 0.006 |

Power (definition 1) and ROC curve in step 2

We measured power (


**Power evaluation (definition 1) of the eight methods on 100 replication data sets with parameter setting: θ = 1.4, β = 1, l = null**. (a) evaluates the power on the whole ground-truth SNP set, and (b)-(f) evaluate the power individually on the 5 interaction models. Blue curve - SH, magenta curve - FIM, green curve - MDR, black curve - IG, cyan curve - MECPM, grey curve - LRIT, yellow curve - LR.


**Power evaluation (definition 1) of six methods on 10 replication data sets with parameter setting: θ = 1.4, β = 1, l = null**. (a) evaluates the power on the whole ground-truth SNP set, and (b)-(f) evaluate the power individually on the 5 interaction models. In (c), all the methods have overlapping power curves at the top of the figure. Magenta curve - FIM, black curve - IG, red curve - BEAM, blue curve - SH, cyan curve - MECPM, grey curve - LRIT, yellow curve - LR.

For the 1000-SNP case (Figure

For the 10,000-SNP case (Figure

Impact of penetrance, MAF, and LD on power (definition 1)

Figure


**The impact of penetrance value (θ), MAF (β), and LD factor (l) on power for the whole ground-truth SNP set**. Blue curve - SH, magenta curve - FIM, green curve - MDR, black curve - IG, cyan curve - MECPM, yellow curve - LR.

Reproducibility of power (definition 1)

We measured the reproducibility by the standard deviation of power across the 100 replication data sets. These results are given in the Additional file

Power (definition 1) to detect interacting SNPs for a fixed significance threshold

Although the statistical significance level is unreliable for measuring performance of the methods (as illustrated in Table

Power to detect entire interactions (definition 2)

Based on power


**Power evaluation (definition 2) of the methods on 100 replication data sets with parameter setting: θ = 1.4, β = 1, l = null**. In (a), FIM, IG, MDR and LRIT have power constantly equal to 0; in (b), FIM, IG and LRIT have power constantly equal to 1; in (d), SH, FIM and MDR have power constantly equal to 0. Blue curve - SH, magenta curve - FIM, green curve - MDR, black curve - IG, grey curve - LRIT, yellow curve - LR.

We can observe that all the methods have poor performance for models 1 and 4. For models 3 and 5, all the methods fare poorly except for MECPM. For model 2, IG, LRIT and FIM have very good performance (power = 1); MECPM also performs well (power = 0.96); while the other methods still perform poorly.

Power to detect at least 1 SNP in an interaction - partial interaction detection (definition 3)

Based on power


**Power evaluation (definition 3) of the eight methods on 100 replication data sets with parameter setting: θ = 1.4, β = 1, l = null**. Blue curve - SH, magenta curve - FIM, green curve - MDR, black curve - IG, grey curve - LRIT, yellow curve - LR.

Power to detect individual SNP main effects (definition 4)

Based on Figure


**The power to detect individual SNPs, for parameter setting θ = 1.4, β = 1, l = null**. Blue curve - SH, magenta curve - FIM, green curve - MDR, black curve - IG, cyan curve - MECPM, grey curve - LRIT, yellow curve - LR.

Also, we observe similar power for SNPs participating in interactions with symmetric penetrance tables and the same MAFs. For example, all the SNPs in model 1 and model 5 have similar power; likewise for SNPs

For SNPs participating in interactions with a symmetric penetrance table but different MAFs, an interesting (and perhaps unexpected) finding is that for model 2, the power to detect SNP

Performance for step 3, the main-effect-only case

We used power


**Power evaluation of 6 methods (using power definition 1) on main-effects-only data (Step 3)**. Blue curve - SH, magenta curve - FIM, green curve - MDR, cyan curve - MECPM, yellow curve - LR.

We also evaluated whether the methods detect false positive interactions when there are

The average number of false positive interactions (step 3) for BEAM, SH and FIM under the significance threshold of 0

| **number of false positives** | **BEAM** | **SH** | **FIM** |
| --- | --- | --- | --- |
| 2nd order | 0 | 0 | 2.21 |
| 3rd order | 0 | 0 | 64.19 |

Computational complexity and memory occupancy

Computational complexity of the eight methods was evaluated on the same platform (OS: Windows, CPU: 3 GHz, RAM: 2 GB). SH, IG, FIM, LR, LRIT, MECPM and BEAM do not require much memory, but the exhaustive search used by MDR requires an impractical amount of memory for a large number of SNPs. Thus, as noted earlier, we applied the heuristic search option in the MDR software, with a 1-hour time limit, to avoid memory overflow. Figure


**Execution time (sec) of 4 methods for: (a) number of SNPs = 1,000; (b) number of subjects = 2,000**. Due to limited space in (b), we list hereby the execution time of the methods on 2000-subject 10,000-SNP data: SH - 962 seconds, IG - 18291 seconds, BEAM - 36423 seconds, FIM - 91251 seconds.

Discussion

General Summary of the Study and Its Results

We report a comparison of eight representative methods: multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR). The first seven were specifically designed to detect interactions among SNPs, and the last is a popular main-effect testing method serving as a baseline for performance evaluation. The selected methods were compared on a large number of simulated data sets, each, consistent with complex disease models, embedded with

Based on the simulation data sets used in this study, which include multiple interaction models present in each data set in Step 2, most of the methods miss some interacting SNPs, leading to only moderate power at low false positive SNP counts (Figures

Compared to the promising powers achieved for the simulation studies reported in the methods' respective papers, the degraded performance seen in this comparative study for most methods is attributed to the more difficult yet likely more realistic simulation data that we used. The methods (excepting LR and MECPM) were previously reported as powerful on simulation data sets including only a single, strong ground-truth interaction, but our study included 5 interactions present in each data set to simulate multiple genetic causes for complex diseases. The disease risk is thus effectively divided among the 5 interaction models, giving each a weaker (less easily detected) effect.

Main effects play an important role in whether a ground-truth SNP is detected at low false positive SNP counts

Another notable finding is that the main effects of the interacting SNPs affect their likelihood of being detected at low false positive SNP counts by most methods. For interaction models with very weak marginal effects (models 1 and 5), all the methods have low power (see Figure

For the same interaction model, different levels of power are achieved by the eight methods

For each interaction model, the power varies across methods because of the quite different detection principles they apply. For example, IG and LRIT, which are based on pairwise SNP statistics, can detect 2-way interaction effects well (see models 2 and 4, where model 4 can be considered as two overlapping 2-way interactions), but they get poorer results for higher-order models. For the difficult 5-way interaction, only MECPM gave promising results.

Power on the whole ground-truth SNP set - MECPM performs the best, while MDR performs the worst

From Figure

Power may be degraded by an insufficiently sensitive ranking criterion, by the heuristic search strategy used, or by a suboptimal output design of a method. The high computational complexity of MDR necessitates using its heuristic search option to keep the running time/memory usage in a reasonable range. This heuristic search forces a significantly reduced search space, and hence the performance of MDR is expected to be degraded.

The ranking criterion of IG detects pure interaction effects (see equation (4) and the definition of mutual information). However, what really affects disease risk is a combination of both pure interaction effects and main effects. Additionally, IG is only explicitly designed to detect 2-way interactions, and thus may have difficulty detecting higher order ones.

Comparatively, MECPM, BEAM, FIM and SH have less critical limitations, mainly in the sensitivity of their ranking criteria and their use of heuristic search -- e.g., the difficulty for heuristic search to pick up interactions with weak marginal effects, and high-order interactions, due to the large search space (consider a contingency table with 3^5 = 243 cells for a 5-way interaction).

The performance of the methods is sensitive to changes in penetrance value, MAF, and LD

From Figure

Most methods can partially but not exactly detect the interactions

The results for power

The P-value assessments of BEAM, SH and FIM vary across methods, and all are overly conservative

From the subsection "

For BEAM, SH, and FIM, the heuristic search strategies evaluate fewer SNP combination candidates than the number penalized in the Bonferroni correction. Moreover, SH and BEAM exclude SNPs with strong marginal effects from high-order interactions, which further decreases the number of searched SNP combinations. The effective per-test significance threshold is thus smaller than it needs to be, making the corrected test conservative. Also, some SNP combinations are dependent on others, either because they share a common SNP subset and/or because SNPs in different subsets are in LD. Such dependencies make the Bonferroni correction inherently conservative.

Besides heuristic search and dependencies, the conservativeness also derives from the summary statistics themselves. The authors of BEAM evaluated the B statistic's conservativeness with exhaustive search. In the Additional file, we analyze the χ² statistics applied by SH and FIM, considering the case where there is neither multiple testing nor heuristic search. The χ² statistics turn out to be conservative, becoming more so as the significance threshold is decreased (see the tables in the Additional file). Since the χ² statistics in SH and FIM are calculated from the discrete-valued SNP data, the χ² statistics are themselves discrete. At the tail of the χ² distribution, two consecutive attainable χ² values may correspond to very different significance levels. For example, let the P-values of consecutive attainable χ² values be p_1 and p_2 (p_1 >> p_2); when the significance threshold α_0 satisfies p_1 > α_0 > p_2, the type I error rate actually achieved corresponds to p_2, which is much less than α_0, making the results quite conservative.
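The discreteness point can be made concrete with a small numerical sketch (table margins and sample size here are hypothetical): for a 2 × 3 case/control-by-genotype table with fixed margins, the attainable Pearson χ² values, and hence the attainable asymptotic P-values (df = 2, so P = exp(-χ²/2)), form a discrete set with gaps; a threshold falling inside a gap achieves a type I error no larger than the next attainable P-value below it.

```python
# Enumerate all 2x3 tables with fixed margins, compute their Pearson
# chi-square statistics, and inspect the discrete set of attainable
# asymptotic P-values. Margins are hypothetical.
import math
from itertools import product

cases, controls = 10, 10
col_totals = (8, 8, 4)              # genotype column totals, held fixed
n = cases + controls

pvals = set()
for a0, a1 in product(range(col_totals[0] + 1), range(col_totals[1] + 1)):
    a2 = cases - a0 - a1            # case-row counts must sum to `cases`
    if not (0 <= a2 <= col_totals[2]):
        continue
    chi2 = 0.0
    for obs_case, tot in zip((a0, a1, a2), col_totals):
        e_case = tot * cases / n
        e_ctrl = tot * controls / n
        chi2 += (obs_case - e_case) ** 2 / e_case
        chi2 += ((tot - obs_case) - e_ctrl) ** 2 / e_ctrl
    pvals.add(math.exp(-chi2 / 2))  # exact chi-square tail for df = 2

p2, p1 = sorted(pvals)[:2]          # two smallest attainable P-values
alpha0 = (p1 + p2) / 2              # a threshold falling strictly in the gap
# Rejecting when the attainable P-value <= alpha0 in fact rejects only
# outcomes with P-value <= p2, so the achieved level is at most p2 < alpha0.
```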

Limitations of the Current Study and Future Work

There are a number of possible extensions of this simulation study that we intend to consider in our future work. First, our current simulation software only handles categorical traits and categorical (ternary-valued, SNP) covariates. Environmental covariates and admixture-adjusting variables could be either quantitative or ordinal-valued. Likewise, traits (phenotype) could be quantitative or ordinal. There are natural ways of extending our current simulation approach to allow for these more general covariate and trait types, which we will consider in future work. Second, we have not investigated missing SNP-values and their effect on detection power. Third, while we have chosen five plausible penetrance function models, another possibility would be to use "data-driven" penetrance functions,

Conclusions

The methods explored in this study are useful tools in the exploration of potential interacting loci. Each of the methods studied here has its strengths and weaknesses. Our comparative examination of these methods suggests that continued research into methods that test for interacting loci is necessary to expand the tools available to researchers and to achieve improved power for detecting complex interactions, along with accurate assessment of statistical significance.

Methods

Methods Tested in the Comparison Study

The eight methods are summarized in the table below in terms of their detection principles (e.g., a χ² or other statistic), search strategies, availability of an asymptotic null distribution, and software availability.

Properties of methods tested in this paper.

| **Name** | **Detection Principle** | **Heuristic search** | **Asymptotic null distribution** | **Free-accessible software** |
| --- | --- | --- | --- | --- |
| MDR | Prediction accuracy | Stochastic | No | Yes |
| FIM | Logistic regression | Deterministic | Yes | N/A |
| IG | Mutual Info. | N/A | No | N/A |
| BEAM | Bayesian model | Stochastic | Yes | Yes |
| MECPM | BIC | Deterministic | No | Yes |
| SH | χ² or other statistic | Deterministic | Yes | Yes |
| LRIT | Logistic regression | N/A | Yes | N/A |
| LR | Logistic regression | N/A | Yes | N/A |

For the methods without freely accessible software from the authors, we provide our self-written software, as well as C++ code, at

A brief summary of these eight methods follows.

(1) Multifactor dimensionality reduction (MDR)

For a set of SNPs, MDR labels a genotype combination as "high-risk" if the ratio between the number of cases and the number of controls carrying it exceeds some threshold (e.g., 1.0). A binary variable is thus formed, pooling high-risk genotypes into one group and low-risk ones into another. A subject with a high-risk genotype is predicted to be a case; otherwise, a control. The prediction error of each model is estimated by 10-fold cross-validation and serves as the measure of association between the set of SNPs and the disease.
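The core MDR pooling step can be sketched as follows (omitting the 10-fold cross-validation; the genotype counts are hypothetical):

```python
# Sketch of MDR's pooling step: label each genotype combination high- or
# low-risk by its case:control ratio, then classify with the pooled binary
# variable and report the (training) misclassification error.
def mdr_classify(counts, threshold=1.0):
    """counts: {genotype_combo: (n_cases, n_controls)} for one SNP subset."""
    high_risk = {g for g, (ca, co) in counts.items()
                 if co == 0 or ca / co > threshold}
    # High-risk combos predict "case" (so controls there are errors);
    # low-risk combos predict "control" (so cases there are errors).
    errors = sum(co if g in high_risk else ca
                 for g, (ca, co) in counts.items())
    total = sum(ca + co for ca, co in counts.values())
    return high_risk, errors / total

counts = {(0, 0): (5, 20), (1, 1): (30, 10), (2, 2): (12, 3)}
high, err = mdr_classify(counts)
```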

(2) Full Interaction Model (FIM)

In FIM, 3^d indicator variables x_j, j = 1, ..., 3^d, are used to code the 3^d possible genotype combinations of the d SNPs under test: the j-th entry of the design vector **x** equals 1 if the subject carries the j-th genotype combination, and 0 otherwise. A logistic regression model is fit on these indicators, and the association of the d SNPs with disease is assessed by a likelihood ratio test, whose statistic asymptotically follows a χ² distribution.

(3) Information Gain (IG)

Let

where the mutual information
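Equation (4) is not reproduced above; a common formulation of such an information-gain criterion, which scores the "pure" interaction effect of a SNP pair beyond the two main effects, is IG(X_1; X_2; Y) = I(X_1, X_2; Y) - I(X_1; Y) - I(X_2; Y). A sketch of this formulation (which may differ in detail from the paper's equation (4)):

```python
# Sketch of a pairwise information-gain criterion:
# IG = I(X1,X2; Y) - I(X1; Y) - I(X2; Y), with mutual information in nats.
import math
from collections import Counter

def mutual_info(xs, ys):
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log((c / n) / (cx[x] / n * cy[y] / n))
               for (x, y), c in cxy.items())

def info_gain(x1, x2, y):
    joint = list(zip(x1, x2))
    return mutual_info(joint, y) - mutual_info(x1, y) - mutual_info(x2, y)

# XOR-like pattern: neither SNP is marginally informative, but the pair
# determines the phenotype exactly, so IG = H(Y) = ln 2.
x1 = [0, 0, 1, 1] * 25
x2 = [0, 1, 0, 1] * 25
y = [a ^ b for a, b in zip(x1, x2)]
```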

(4) Bayesian Epistasis Association Mapping (BEAM)

Suppose N_d cases and N_u controls are genotyped at L SNP loci, and let the genotypes on cases be D = (d_1, ..., d_L). BEAM partitions the SNPs into three disjoint subsets S_0, S_1, and S_2, where S_0 is the subset consisting of SNPs (SNP genotype vectors) with no association, S_1 is the subset consisting of SNPs with only main effects, and S_2 is the subset consisting of SNPs with interaction effects. Likewise, let the genotypes on controls be U = (u_1, ..., u_L), and let **I** = [I_1, I_2, ..., I_L] be the membership of the SNPs within each group, e.g. I_j = k if SNP j belongs to S_k. BEAM then evaluates the posterior probability of **I** given D and U.

Based on equation (5), BEAM draws **I** using the Metropolis-Hastings algorithm. The output is the posterior probability of main-effect markers and interactions associated with the disease.

(5) SNP Harvester (SH)

This method aims to detect interactions with weak marginal effects. It includes the following steps:

5a. Remove SNPs with significant main effects;

5b. For a fixed interaction order, run a heuristic search over candidate SNP groups **A** = {X_1, X_2, ..., X_M}, checking whether a statistical score S(**A**) (e.g. a χ² statistic) is significant, and output **A** if it is. Then go back to the first step, with the optimal **A** removed as a candidate for the next run.

5c. Use L2-norm penalized logistic regression

Although SH

(6) Maximum entropy conditional probability modeling (MECPM)

MECPM builds the phenotype posterior under a maximum entropy principle; encodes constraints into the model that correspond one-to-one to interactions; flexibly allows dominant or recessive coding for each locus in a candidate interaction; searches for interactions via a greedy interaction-growing strategy that evaluates candidates up to fifth order; and uses the Bayesian information criterion (BIC) for model selection.
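The BIC trade-off that MECPM applies for model selection can be sketched as follows; the saturated per-cell Bernoulli likelihood here is an illustrative stand-in, not MECPM's maximum entropy model:

```python
# Sketch of BIC-based selection: a candidate interaction is accepted only if
# its likelihood gain beats the k*ln(n) parameter penalty.
import math

def bic(log_likelihood, n_params, n_subjects):
    """BIC = -2 log L + k ln n (lower is better)."""
    return -2 * log_likelihood + n_params * math.log(n_subjects)

def table_log_likelihood(cells):
    """Saturated Bernoulli log-likelihood; cells: [(n_cases, n_controls)]."""
    ll = 0.0
    for ca, co in cells:
        tot = ca + co
        for k in (ca, co):
            if k > 0:
                ll += k * math.log(k / tot)
    return ll

# E.g., compare a 1-SNP model (3 genotype cells) with a 2-SNP model (9 cells):
# the richer model wins only if its fit gain outweighs the extra penalty.
```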

(7) Logistic regression (LR)

LR is a generalized linear model used for binomial regression. Let x denote a SNP's genotype value and p the probability that the subject is a case. The model is

log(p / (1 - p)) = β_0 + β_1 x,

where β_0 and β_1 are the regression coefficients, learned via maximum likelihood. By a likelihood ratio test, logistic regression evaluates statistical significance for each SNP.

(8) Logistic regression with interaction term (LRIT)

LRIT aims at detecting interaction effects based on the logistic regression model. Let x_m and x_n denote the genotype values of SNPs m and n, and p the probability that the subject is a case. The model includes main-effect terms for x_m and x_n and a multiplicative interaction term x_m x_n:

log(p / (1 - p)) = β_0 + β_1 x_m + β_2 x_n + β_3 x_m x_n,

where β_0, β_1, β_2, β_3 are the regression coefficients, learned via maximum likelihood. By a likelihood ratio test, logistic regression evaluates the statistical significance for this pair of SNPs (the statistical significance reflects the joint effects of the two individual terms and the multiplicative term).
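As a sketch of this model (not the authors' C++ implementation), the following fits the four-coefficient logistic model by Newton-Raphson with NumPy and forms a likelihood ratio statistic against an intercept-only null; the simulated genotypes and effect sizes are hypothetical:

```python
# Sketch: maximum-likelihood fit of the LRIT model via Newton-Raphson (IRLS).
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """X includes an intercept column; returns the MLE coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p); tiny ridge for safety
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta

def log_likelihood(X, y, beta):
    z = X @ beta
    return float(np.sum(y * z - np.log1p(np.exp(z))))

rng = np.random.default_rng(0)
n = 4000
xm = rng.integers(0, 3, n)            # genotype 0/1/2 at SNP m
xn = rng.integers(0, 3, n)
logit = -1.0 + 0.8 * xm * xn          # risk driven by the product term only
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

X_full = np.column_stack([np.ones(n), xm, xn, xm * xn])
X_null = np.ones((n, 1))
b_full = fit_logistic(X_full, y)
b_null = fit_logistic(X_null, y)
lrt = 2 * (log_likelihood(X_full, y, b_full) - log_likelihood(X_null, y, b_null))
```

Under the null, a statistic like `lrt` would be referred to a χ² distribution with degrees of freedom equal to the number of constrained coefficients.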

Authors' contributions

LC and YW designed the experiment protocols and evaluation measures, conducted the experiments, participated in implementation of the methods and design of simulation tools, and drafted the manuscript. GY implemented the conventional (also some advanced) interaction-detection methods, and participated in design of the experiments. CL designed the simulation tools. CL, RG and XY carried out the development of simulation software. DM helped to draft and extensively edited the manuscript. DM and JR implemented MECPM. YW and DH conceived of the study, participated in its design and coordination, and helped draft the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was supported in part by the National Institutes of Health (HL090567 to D.M.H. and GM085665 to Y.W.).