Department of Health Sciences Research, Mayo Clinic, 200 First Street Southwest, Rochester, MN, 55905, USA

Department of Statistics and Probability, Michigan State University, A413 Wells Hall, East Lansing, MI, 48824, USA

Department of Psychiatry and Psychology, Mayo Clinic, 200 First Street Southwest, Rochester, MN, 55905, USA

Abstract

Background

Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional.

Results

RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions.

Conclusions

While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.

Background

Genome-wide association studies (GWAS) have been successful in detecting single locus variants with relatively large effects in some common, complex diseases

As an alternative to traditional statistical methods that fail to appropriately account for these complex genetic architectures, data-mining approaches designed to discover patterns in large amounts of data are gaining popularity for genetic association studies. Many data-mining and machine learning approaches were developed with a main goal of prediction, such as exhaustive search strategies

The performance of RFs in the context of genetic data analysis has been investigated, and RF VIMs have been shown to out-perform Fisher’s exact test as a screening tool when interactions are present

In addition to performance as a filtering/screening tool, other properties of VIMs have also been previously investigated. For instance, the bias of these measures has been assessed under linkage disequilibrium

A frequently cited benefit of RFs in the analysis of genetic data is that they capture interactions between predictor variables, because the hierarchical decision tree structure can model non-linear associations

Previous studies of RF performance have primarily applied the approach in lower-dimensional settings, or settings involving interactions with strong marginal components. Although it has been shown that in relatively small datasets RFs can detect interacting risk factors better than univariate tests of association of individual predictors

In this study we explore the ability of RF VIMs to capture interaction effects, particularly as the data becomes increasingly high-dimensional. We hypothesize that when standard RF VIMs are used to identify the best predictors, the ability to detect interacting effects will decline rapidly as the total number of studied predictor variables is increased. Focusing on analysis of binary case/control data, where RF is used for classification, this paper presents a simulation study to investigate the relationship between the number of variables in a RF and the ability of the approach to detect both marginal and interaction effects. We compare the performance of various RF measures of variable importance to p-values from univariate logistic regression under different data-generating models for complex disease, both as the number of predictors becomes large and as the strength of marginal association diminishes.

Methods

Random forests

Random Forests is an ensemble or ‘forest’ of many classification and regression tree (CART) classifiers

In general, a Random Forest of decision tree classifiers is grown as follows:

1. Select a total of

2. For each bootstrap sample, grow an unpruned classification or regression tree (CART)

a. At each node in the tree, randomly select

b. Choose the best split at each node from among the

c. An estimate of prediction error (i.e. the probability of misclassification) is obtained for each tree using the OOB individuals.

3. For a given observation, the final prediction/classification is the majority vote (the predicted class in the majority of trees) over all trees in which that observation was ‘out-of-bag’. The OOB prediction error and prediction accuracy of the RF can be calculated by considering accuracy of the OOB prediction over all subjects.
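The bookkeeping behind the bootstrap and out-of-bag (OOB) steps above can be sketched as follows. This is a minimal illustration, not the software used in the study: the function names are hypothetical, the tree-growing step itself is omitted, and each tree is assumed to supply predictions only for its own OOB observations.

```python
import random
from collections import Counter

def bootstrap_split(n_obs, rng):
    """Draw one bootstrap sample of size n_obs (with replacement).

    Returns (in_bag, oob): the sampled indices and the indices never drawn.
    """
    in_bag = [rng.randrange(n_obs) for _ in range(n_obs)]
    drawn = set(in_bag)
    oob = [i for i in range(n_obs) if i not in drawn]
    return in_bag, oob

def oob_majority_vote(tree_votes):
    """Combine per-tree OOB predictions by majority vote.

    tree_votes: one dict per tree, mapping OOB index -> predicted class.
    Returns a dict mapping index -> majority class over the trees in which
    that observation was out-of-bag.
    """
    tallies = {}
    for votes in tree_votes:
        for idx, label in votes.items():
            tallies.setdefault(idx, Counter())[label] += 1
    return {idx: counts.most_common(1)[0][0] for idx, counts in tallies.items()}

def oob_error(predictions, labels):
    """Fraction of OOB-predicted observations whose majority vote is wrong."""
    wrong = sum(1 for idx, pred in predictions.items() if pred != labels[idx])
    return wrong / len(predictions)
```

On average roughly a third of the observations are left out of each bootstrap sample, which is what makes the OOB error estimate possible without a separate test set.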

For more detail on the method, see

Variable importance measures

In addition to providing prediction, the RF method can be used to calculate a VIM for each predictor. Ranking based on VIMs can then be used as a screening tool to prioritize variables for follow-up study. A number of importance measures have been proposed, including Gini importance and mean decrease in accuracy (MDA).

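Let $PE_t$ denote the prediction error of tree $t$ estimated on its OOB individuals. To assess predictor $X_j$, its values are randomly permuted, and the prediction error $PE_t^*$ is calculated for the OOB individuals using the permuted data. The raw or unscaled MDA is

$$MDA_{raw}(X_j) = \frac{1}{n_{tree}} \sum_{t=1}^{n_{tree}} \left( PE_t^* - PE_t \right),$$

where $n_{tree}$ is the number of trees in the forest.

Variable importance can also be measured through a scaled version of MDA,

$$MDA_{scaled}(X_j) = \frac{MDA_{raw}(X_j)}{\hat{\sigma} / \sqrt{n_{tree}}},$$

where $\hat{\sigma}^2$ is the estimated variance of decrease in accuracy across trees. If $X_j$ is not used in tree $t$, the difference between $PE_t$ and $PE_t^*$ is zero by definition. Another version of MDA variable importance attributed to Meng only considers trees in which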
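A minimal sketch of the raw and scaled MDA computations is given below. It assumes that the per-tree OOB prediction errors before and after permuting one predictor have already been obtained from a fitted forest. It also assumes that raw MDA is the average per-tree increase in OOB error and that the scaled version divides this average by its standard error across trees. The function names and the use of the sample variance (denominator n - 1) are choices of this sketch, not details taken from the study.

```python
import math

def raw_mda(pe_original, pe_permuted):
    """Average over trees of (permuted OOB error - original OOB error)."""
    diffs = [p - o for o, p in zip(pe_original, pe_permuted)]
    return sum(diffs) / len(diffs)

def scaled_mda(pe_original, pe_permuted):
    """Raw MDA divided by its standard error across trees.

    Uses the sample variance (n - 1 denominator); this is an assumption
    of the sketch. Returns 0.0 in the degenerate case of zero variance.
    """
    diffs = [p - o for o, p in zip(pe_original, pe_permuted)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    std_error = math.sqrt(variance / n)
    return mean / std_error if std_error > 0 else 0.0
```

A tree that never splits on the predictor contributes a difference of zero, so predictors that appear in few trees are pulled toward zero importance under both versions.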

Interaction detection

Typically VIMs are used to rank variables, and variables with high ranks are considered as potentially associated with the phenotype. A true causative genetic factor may be considered to be identified or detected by the RF analysis if it ranks highly in terms of variable importance, above null factors that are not associated with the phenotype. In our study, we considered a SNP to be detected if it ranked within the top

In RF analysis, the importance of each variable takes into account, or is conditional on, the effects of other variables in the tree. However, RF importance measures do not specify whether an effect is marginal or due to interactions with other SNPs. Because one of the frequently cited advantages of RFs is the ability to model interactions, our primary goal was to investigate RF’s ability to detect both the effects of SNPs that act independently as well as those whose influence on the phenotype is dependent on genotypes at another locus (i.e. SNPs with interaction effects).

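In order to investigate the performance of RF VIMs for these different types of effects, we need to quantify both the strength and type of a variable’s effect on the phenotype. Heritability in the broad sense, or the proportion of phenotypic variation that can be explained by genetic variation, is a common measure of the degree of genetic determination of a trait, or the total genetic effect size, and can also be used to estimate the effect of a particular disease locus. For two loci, SNP A and SNP B, the total heritability $H^2$ due to the two loci can be defined as:

$$H^2 = \frac{\sum_{a} \sum_{b} p_{ab} \left( f_{ab} - K \right)^2}{K (1 - K)} \qquad (1)$$

where $p_{ab}$ is the population frequency of the two-locus genotype $(a, b)$, $f_{ab}$ is the corresponding penetrance, and $K = \sum_{a} \sum_{b} p_{ab} f_{ab}$ is the disease prevalence. The heritability due to the marginal effect of SNP A can be defined as:

$$H^2_{M,A} = \frac{\sum_{a} p_{a} \left( f_{a} - K \right)^2}{K (1 - K)} \qquad (2)$$

where $p_{a}$ is the frequency of genotype $a$ at SNP A and $f_{a}$ is the marginal penetrance of that genotype.

Similarly, we can define $H^2_{M,B}$ for SNP B. The heritability due to the interaction effect of SNP A and SNP B, the conditional dependence of SNPs A and B on the phenotype, can be defined as the portion of the total heritability not attributable to the marginal effects at either locus: $H^2_{I,AB} = H^2 - H^2_{M,A} - H^2_{M,B}$. We say that SNP A has a ‘marginal effect’ if $H^2_{M,A} > 0$ and an ‘interaction effect’ if $H^2_{I,AB} > 0$ for some SNP B. These ideas can easily be extended to models with more than two causative loci.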
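The decomposition into total, marginal, and interaction heritability can be illustrated with a short calculation. The sketch below assumes a 3 x 3 penetrance table for two independent loci in Hardy-Weinberg equilibrium. It also assumes the usual penetrance-based definition, in which heritability is the variance of the penetrance across genotypes divided by K(1 - K), with K the disease prevalence. The function names are hypothetical.

```python
def hwe_freqs(maf):
    """Genotype frequencies for 0, 1, 2 minor alleles under Hardy-Weinberg."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

def heritability_components(penetrance, maf_a, maf_b):
    """Total, marginal, and interaction heritability for two independent loci.

    penetrance[a][b] is P(disease | a copies at SNP A, b copies at SNP B).
    """
    pa, pb = hwe_freqs(maf_a), hwe_freqs(maf_b)
    cells = [(a, b) for a in range(3) for b in range(3)]
    # Disease prevalence K: penetrance averaged over two-locus genotypes.
    k = sum(pa[a] * pb[b] * penetrance[a][b] for a, b in cells)
    denom = k * (1 - k)
    total = sum(pa[a] * pb[b] * (penetrance[a][b] - k) ** 2 for a, b in cells) / denom
    # Marginal penetrances: average over the genotypes at the other locus.
    f_a = [sum(pb[b] * penetrance[a][b] for b in range(3)) for a in range(3)]
    f_b = [sum(pa[a] * penetrance[a][b] for a in range(3)) for b in range(3)]
    marg_a = sum(pa[a] * (f_a[a] - k) ** 2 for a in range(3)) / denom
    marg_b = sum(pb[b] * (f_b[b] - k) ** 2 for b in range(3)) / denom
    return {
        "total": total,
        "marginal_a": marg_a,
        "marginal_b": marg_b,
        "interaction": total - marg_a - marg_b,
    }
```

When the penetrance depends on only one locus, the total equals that locus's marginal heritability and the interaction term vanishes. A table that is non-additive across the two loci yields a positive interaction term.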

Simulation study design

In order to investigate the performance of RF VIMs in detecting interactions for a binary disease phenotype, we developed a sequence of three simulation studies. Data sets that included variables with main effects only (

**Supplementary information, preliminary simulations to select tuning parameters.** Descriptions and results of parameter sweeps to determine optimal values of the tuning parameters


The first two simulation studies were designed to assess the performance of RF VIMs for detecting main and interaction effects, and compare the performance of RF with p-value rankings from univariate logistic regression. In order to focus on evaluating the performance of the methods in relation to the strength and type (main vs. interaction) of effect, in these simulations all SNPs were assumed to be independent and had the same minor allele frequencies. The third simulation study investigated the impact of linkage disequilibrium (LD) on the detection of main and interaction effects using RFs. The designs of all three simulations are summarized in Table

| | **Simulation 1** | **Simulation 2** | **Simulation 3** |
|---|---|---|---|
| **Objective** | To compare RF VIMs for main and interaction effect detection. | To compare RF measures with p-values from logistic regression for main and interaction effect detection. | Examine RF performance in presence of realistic patterns of LD and MAF. |
| **Independent SNPs** | Yes | Yes | No (LD) |
| **# Total Loci (p)** | 10, 100, 500, 1000 | 10, 100, 500, 1000 | Fixed at 1000 |
| **# Causal Loci (k)** | 4 | 2 | 2 |
| **MAF** | Fixed at 0.1, 0.2, 0.3, or 0.4 | Fixed at 0.3 | Varies (0.01–0.50) |
| **# Model Scenarios** | 5 | 3 | 4 |
| **Description** | Varying effect sizes, $H^2_{X_1X_2}$ vs. $H^2_{X_3X_4}$ | Two interacting SNPs with 0, 1, or 2 having main effects. | Causal SNPs chosen in blocks of strong vs. weak LD with non-causal SNPs. |
| **Phenotype Generation** | Phenotype is a dichotomized quantitative (normally distributed) trait. | Phenotype is based on direct penetrance functions. | Phenotypes are generated as in Simulation 1. |

In all simulations, data were generated assuming that some SNPs contribute to the overall heritability only marginally (‘main’), some SNPs contribute both marginally and interactively (‘interacting’), and that some SNPs are not causally associated with the outcome and thus do not contribute to the total heritability (‘null’). The performance of RFs was evaluated for each of the three types of variables by estimating probability of detection, which is similar to the concept of power in a frequentist statistical framework. A SNP effect was considered to be detected if its rank based on the VIM was in the top

Simulation 1: comparing performance of VIMs for detecting main vs. interaction effects

The goal of the first simulation study was to compare the performance of RF VIMs for detecting marginal and interaction effects. The performance of RF was also compared to the most common GWAS analysis approach, univariate logistic regression. The comparisons were performed for independent SNP data with fixed MAF, but with varying degrees of effect size and patterns of interaction. Case/control datasets with

For each scenario, 100 replicate datasets were generated with 500 cases and 500 controls and MAF = 0.1, 0.2, 0.3, or 0.4 at all SNPs. Genotypes were generated assuming independence among SNPs and Hardy-Weinberg equilibrium. Quantitative phenotypes were generated conditional on genotypes under a linear model, to reflect an underlying quantitative trait, and affection status was assigned using a threshold. Phenotype data were thus generated under the following probit model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_3 X_4 + E$$

where $E \sim N(0, \sigma^2)$, $\beta_0 = 20$, $\sigma^2 = 10$, affection status is assigned by applying a threshold to $Y$, and $X_j = 0, 1, 2$ reflects the number of copies of the minor allele at SNP $j$, assuming additive allelic effects (on $Y$). Note that SNPs 1 and 2 have marginal effects only, whereas SNPs 3 and 4 are interacting. We quantified the strength of the simulated marginal and interacting effects in terms of heritability due to a given genetic effect (Equations 1 and 2). Data were generated under five models, where the vector $\beta = (\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)$ was chosen to reflect different effect sizes and patterns in terms of total heritability due to the main effect SNPs 1 and 2 ($H^2_{X_1X_2}$) and the interacting SNPs 3 and 4 ($H^2_{X_3X_4}$):

Model 1: Similar effects. The total effects of main and interacting SNPs are similar (

Model 2: Main effects greater. The total effects of SNPs 1 and 2 are greater than effects of SNPs 3 and 4 (

Model 3: Main effects only. The total heritability is due to SNPs 1 and 2, and SNPs 3 and 4 are not causative (

Model 4: Interaction effects greater. The total effects of SNPs 3 and 4 are greater than SNPs 1 and 2 (

Model 5: Interaction effects only. The total heritability is due to SNPs 3 and 4, and SNPs 1 and 2 are not causative (

In Models 1–5, the type of effect is a property of the heritability corresponding to the main effect SNPs 1 and 2, and the interacting SNPs 3 and 4. Models with low heritability ($H^2$ ≤ 7%) were chosen to reflect realistic effect sizes that could be expected in genetic studies and also to investigate the performance in situations with low power. The specific heritability components depend not only on the
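The Simulation 1 data-generating process can be sketched as follows. The linear predictor used here (main effects for SNPs 1 to 4 plus a product term for SNPs 3 and 4) is an assumption consistent with the description above, as is sampling individuals until the case and control quotas are filled. The coefficient values and the threshold in the usage note are hypothetical placeholders and are not the values used in the study.

```python
import random

def draw_genotype(maf, rng):
    """Number of minor alleles (0, 1 or 2) under Hardy-Weinberg equilibrium."""
    return int(rng.random() < maf) + int(rng.random() < maf)

def simulate_case_control(n_cases, n_controls, n_snps, maf, betas, threshold,
                          beta0=20.0, sigma2=10.0, seed=0):
    """Simulate independent SNPs and a dichotomized quantitative trait.

    betas = (b1, b2, b3, b4, b5): main effects of SNPs 1-4 and the SNP3 x SNP4
    interaction (assumed model form). Individuals with Y above the threshold
    are cases. Requires n_snps >= 4; the remaining SNPs are null.
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    b1, b2, b3, b4, b5 = betas
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        x = [draw_genotype(maf, rng) for _ in range(n_snps)]
        y = (beta0 + b1 * x[0] + b2 * x[1] + b3 * x[2] + b4 * x[3]
             + b5 * x[2] * x[3] + rng.gauss(0.0, sd))
        if y > threshold:
            if len(cases) < n_cases:
                cases.append(x)
        elif len(controls) < n_controls:
            controls.append(x)
    return cases, controls
```

For example, `simulate_case_control(500, 500, 100, 0.3, (0.5, 0.5, 0.2, 0.2, 0.5), threshold=21.0)` returns 500 case and 500 control genotype vectors of length 100 under these placeholder settings.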

**Supplementary information, simulations 1–3.** Additional descriptions and results for Simulations 1–3


In Simulation 1, we investigated performance of VIMs and compared variable importance rankings to p-value rankings from logistic regression. VIMs of interest in these simulations were raw MDA, scaled/Liaw MDA, standard deviation of MDA, and Gini importance. “Probability of detection” is reported for each VIM for ‘main’, ‘interacting’, and ‘null’ SNPs, where the probability of detection was estimated by the proportion of times across 100 replicates that each SNP was detected, averaged across all ‘main’, ‘interacting’, and ‘null’ SNPs, respectively.
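The probability-of-detection estimate described above can be sketched as follows. The sketch assumes that each replicate supplies one importance score per SNP, with larger scores meaning more important. The rank cutoff `top_k` is a hypothetical parameter, and the function names are illustrative.

```python
def rank_positions(importances):
    """Return 1-based ranks, where rank 1 is the largest importance."""
    order = sorted(range(len(importances)),
                   key=lambda j: importances[j], reverse=True)
    ranks = [0] * len(importances)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks

def detection_probability(replicate_importances, snp_indices, top_k):
    """Proportion of (replicate, SNP) pairs in which the SNP ranks in the top_k.

    replicate_importances: one list of importance scores per replicate dataset.
    snp_indices: positions of the SNPs of one type (e.g. all 'main' SNPs),
    so the result is averaged across those SNPs as well as across replicates.
    """
    hits = 0
    for importances in replicate_importances:
        ranks = rank_positions(importances)
        hits += sum(1 for j in snp_indices if ranks[j] <= top_k)
    return hits / (len(replicate_importances) * len(snp_indices))
```

To rank by logistic regression p-values with the same functions, the negated p-values can be passed as scores so that the smallest p-value receives rank 1.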

Simulation 2: comparing main effect and interaction detection with RF VIMs vs. logistic regression

Generally, a threshold or probit model can be approximated by a logit model. This may give logistic regression an advantage in the above Simulation 1 design, as similar models are used for simulating and analyzing the data. In order to provide an additional comparison of RF with univariate logistic regression, data were also generated directly from penetrance functions, the conditional probability of disease given genotypes. For Simulation 2, genotype data were simulated as previously described, with MAF fixed at 0.3 because the effect of MAF had already been examined. Phenotypes were generated conditional on genotypes from a specified penetrance function (


**Simulation 2 penetrance functions.** Penetrance functions for the two locus interactions in the three models used in Simulation 2, with corresponding total, marginal, and interaction heritabilities.
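One way to draw case/control genotypes directly from a two-locus penetrance function is to sample from the genotype distribution conditional on disease status, obtained by Bayes' rule. The sketch below illustrates that approach under the assumptions of independent loci, Hardy-Weinberg equilibrium, and a common MAF. It is not necessarily the procedure used in the study, and the function names are hypothetical.

```python
import random

def hwe_freqs(maf):
    """Genotype frequencies for 0, 1, 2 minor alleles under Hardy-Weinberg."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

def genotype_dist_given_status(penetrance, maf):
    """Two-locus genotype distributions among cases and among controls.

    penetrance[a][b] is P(disease | genotype a at SNP 1, genotype b at SNP 2).
    Returns (cases, controls), each a dict mapping (a, b) -> probability.
    """
    p = hwe_freqs(maf)
    joint = {(a, b): p[a] * p[b] for a in range(3) for b in range(3)}
    # Prevalence K = sum over genotypes of P(genotype) * penetrance.
    k = sum(joint[g] * penetrance[g[0]][g[1]] for g in joint)
    cases = {g: joint[g] * penetrance[g[0]][g[1]] / k for g in joint}
    controls = {g: joint[g] * (1 - penetrance[g[0]][g[1]]) / (1 - k) for g in joint}
    return cases, controls

def sample_genotypes(dist, n, rng):
    """Draw n two-locus genotypes from a distribution over (a, b) pairs."""
    genotypes = list(dist)
    weights = [dist[g] for g in genotypes]
    return rng.choices(genotypes, weights=weights, k=n)
```

Sampling 500 genotype pairs from each distribution gives the causal SNPs for a balanced case/control dataset. Null SNPs can then be drawn independently of disease status.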

Simulation 3: investigating detection when LD is present

The goal of the third simulation was to examine the performance of RF VIMs in detecting main and interaction SNP effects under a scenario with realistic patterns of LD and MAF. We therefore generated genotypes based on real data, with various degrees of LD in different regions, and compared performance of interaction detection under some of the previously assumed data-generating models of phenotype conditional on genotype.

A real genome-wide SNP dataset was used as the basis for generating genotypes for ^{2} > 0.95 with at least three SNPs) or weak LD (R^{2} < 0.3 for all SNPs) with other SNPs; either both causal SNPs were chosen to be in strong LD, both in weak LD, or one in strong and the other in weak LD. A fourth scenario was also considered for comparison where all SNPs were generated independently (i.e. no LD) with MAFs identical to those seen in the real data.

Probability of detection was defined as before, and also as the proportion of times that any SNP in high LD (R^{2} > 0.85) with the causal SNP ranks in the top

Results and discussion

Simulation 1: comparing performance of VIMs for detecting main vs. interaction effects

As expected, in general, causal SNPs (both ‘main’ and ‘interacting’) have larger VIMs than null SNPs, particularly for small

For all types of SNPs (‘main’, ‘interacting’, and ‘null’), both the estimated variable importance and the probability of detection decline as the total number of predictors increases. However, as the total number of predictors increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs (Figure ^{2}), as expected. For example, for the non-interacting SNPs, detection probability increases as $H^2_{X_1X_2}$ increases. However, for the interacting SNPs, detection probability is largely dependent upon their marginal effect (marginal heritability, $H^2_{M,i}$) rather than their total effect ($H^2_{X_3X_4}$), which includes their interaction effect. Thus, detection probability is strongest for the SNPs with the largest marginal heritability ($H^2_{M,i}$), not necessarily the largest total heritability. For example, under Model 4 with MAF of 0.3, the detection probability is higher for the main effect SNPs than for the interacting SNPs, despite the fact that the total effect of the interacting SNPs ($H^2_{X_3X_4}$) is larger than the effect of the main effect SNPs ($H^2_{X_1X_2}$). This is because under this model the main effects SNPs have a larger marginal heritability (0.015) than the interacting SNPs (0.009). Figure


**Simulation 1 results.** Probability of detection for ‘main’, ‘interacting’, and ‘null’ SNPs plotted against the number of total SNPs for select RF VIMs and logistic regression (LR). Top row shows results for the “main effects greater” Model 2; bottom row shows results for “interaction effects greater” Model 4. Results are plotted separately across MAF. Average PE estimates range between 0.430 and 0.476 (

A pattern is observed across MAF, where ‘main’ SNPs are more readily detected than ‘interacting’ SNPs for higher frequencies, corresponding to scenarios where $H^2_{M,X_1}$ and $H^2_{M,X_2}$ are high. As MAF increases, the difference in detection between ‘main’ and ‘interacting’ SNPs also increases; for low MAFs, ‘interacting’ SNPs are more frequently detected, while for more common variants the ‘main effects’ SNPs are more frequently detected. This is because under our data generating model, as MAF increases the heritability due to the marginal effects of interacting SNPs 3 and 4 ($H^2_{M,X_3}$ and $H^2_{M,X_4}$) decreases (

RF prediction errors, under the different simulation models, are shown in _{M,i}
^{2}), the SNP effects are not well detected, and consistent with this, the prediction errors are high (close to 50%).

Of importance is the fact that the probability of detection is not strongly affected by method of ranking (RF VIMs or logistic regression), and in particular the RF VIMs rarely outperform logistic regression for values of

Simulation 2: comparing main effect and interaction detection with RF VIMs vs. logistic regression

Results for Simulation 2 based on penetrance functions are portrayed in Figure _{M,i}
^{2}) is highest, which has a greater impact than method of ranking. No single RF method consistently outperforms the others; however, in general RF VIMs perform slightly better than logistic regression models.


**Simulation 2 results.** Probability of detection for SNP1 and SNP2 plotted against total number of SNPs by VIM for models with interactions and two main effects (Model 6 - left), one main effect (Model 7 - center), and no main effects (Model 8 - right). Average PE estimates range between 0.465 and 0.508 (

Because total $H^2$ is low for the assumed models, we expect prediction error estimates to be high (

Simulation 3: investigating detection when LD is present

The results of the third set of simulations (Tables

Detection probability with and without LD for main effects only Model 3 for RF VIMs and logistic regression (LR). Total number of SNPs = 1,000, MAF ≈ 0.3. Average PE estimates range from 0.458 to 0.477 (

| **Scenario** | **Level of LD** | **MAF** | **Detection Definition** | **Raw MDA** | **Liaw MDA** | **SD MDA** | **Gini** | **LR P-value** |
|---|---|---|---|---|---|---|---|---|
| 1 | Strong | 0.294 | Causal SNP | 0.13 | 0.12 | 0.14 | 0.08 | 0.21 |
| | | | Causal Region | 0.38 | 0.3 | 0.4 | 0.29 | 0.49 |
| | Strong | 0.309 | Causal SNP | 0.25 | 0.18 | 0.28 | 0.26 | 0.33 |
| | | | Causal Region | 0.56 | 0.48 | 0.55 | 0.58 | 0.48 |
| 2 | Weak | 0.294 | Causal SNP | 0.72 | 0.58 | 0.73 | 0.78 | 0.73 |
| | Weak | 0.281 | Causal SNP | 0.66 | 0.56 | 0.71 | 0.79 | 0.76 |
| 3 | Strong | 0.294 | Causal SNP | 0.15 | 0.13 | 0.11 | 0.08 | 0.21 |
| | | | Causal Region | 0.5 | 0.46 | 0.52 | 0.39 | 0.71 |
| | Weak | 0.294 | Causal SNP | 0.59 | 0.52 | 0.63 | 0.78 | 0.5 |
| 4 | None | 0.294 | Causal SNP | 0.67 | 0.57 | 0.72 | 0.73 | 0.75 |
| | None | 0.294 | Causal SNP | 0.68 | 0.6 | 0.67 | 0.7 | 0.76 |

Detection probability with and without LD for interaction effects only Model 5 for RF VIMs and logistic regression (LR). Total number of SNPs = 1,000, MAF ≈ 0.3. Average PE estimates range from 0.479 to 0.496 (

| **Scenario** | **Level of LD** | **MAF** | **Detection Definition** | **Raw MDA** | **Liaw MDA** | **SD MDA** | **Gini** | **LR P-value** |
|---|---|---|---|---|---|---|---|---|
| 1 | Strong | 0.294 | Causal SNP | 0.05 | 0.09 | 0.05 | 0.02 | 0.09 |
| | | | Causal Region | 0.2 | 0.19 | 0.21 | 0.1 | 0.3 |
| | Strong | 0.309 | Causal SNP | 0.2 | 0.17 | 0.15 | 0.16 | 0.14 |
| | | | Causal Region | 0.47 | 0.42 | 0.38 | 0.41 | 0.33 |
| 2 | Weak | 0.294 | Causal SNP | 0.43 | 0.34 | 0.49 | 0.52 | 0.4 |
| | Weak | 0.281 | Causal SNP | 0.35 | 0.25 | 0.35 | 0.42 | 0.29 |
| 3 | Strong | 0.294 | Causal SNP | 0.06 | 0.09 | 0.04 | 0.04 | 0.12 |
| | | | Causal Region | 0.32 | 0.27 | 0.27 | 0.14 | 0.41 |
| | Weak | 0.294 | Causal SNP | 0.51 | 0.45 | 0.4 | 0.62 | 0.3 |
| 4 | None | 0.294 | Causal SNP | 0.28 | 0.21 | 0.3 | 0.31 | 0.33 |
| | None | 0.294 | Causal SNP | 0.29 | 0.27 | 0.3 | 0.31 | 0.29 |

In general, the RF VIMs show improved detection over logistic regression for the SNP in weak LD if the other causative SNP is in strong LD (Scenario 3), particularly when the causative genetic factors are interacting (Table

Discussion

In this study, we investigate the ability of Random Forests to detect both marginal and interacting effects in high-dimensional data, in order to validate the claim that RF methods are well suited to describe gene-gene interactions and to determine their usefulness as filter methods or screening tools that allow for interaction effects in large datasets, assuming sample sizes and genetic effect sizes likely to be encountered in real data analysis. While RFs are often cited as an approach suitable for detecting genetic effects in the presence of interactions, McKinney et al.

In Simulation 1, we observed an inverse relationship between MAF and interaction detection probability, which is a result of the dependency of effect size (i.e. heritability) on MAF. For example, under the data-generating threshold model with stronger interaction effects (Model 4), the marginal effect of the two interacting SNPs ($H^2_{M,X_3}$ and $H^2_{M,X_4}$) decreases and the interaction effect of these two SNPs ($H^2_{I,X_3X_4}$) increases, as the MAF increases ( $H^2_{M}$), regardless of the presence of an interaction effect ($H^2_{I}$), and that VIMs are capturing marginal effects rather than interactions as originally claimed. This was also clearly demonstrated by Simulation 2, with models generated from penetrance functions with MAF fixed at 0.3. SNPs that had some level of marginal heritability had higher detection probability, whereas interacting SNPs with no marginal contribution to the total heritability were rarely detected, if at all.

In our simulations, we also observed a strong inverse relationship between estimated prediction error and detection probability. This relationship is expected: if the RF model is not predictive of phenotype, then no predictive signal is detected for any variable, and hence the true causative factors (if they exist) are not identified. Therefore we do not advocate utilizing a ranked list to screen predictors if prediction error is high, because even if true causative factors exist, they will not be highly ranked. Nevertheless, the relationship between prediction error and detection probability based on VIMs portrays a consistent story: prediction error estimates are only lower than what is expected by chance if the true causal effects are detected. As dimension becomes large, detection probability diminishes and becomes highly dependent on the strength of the marginal effects, and the poor prediction errors are a reflection of the failure of RF to model interactions in these scenarios.

The models used for simulating the data had low $H^2$ to reflect realistic effect sizes for a study of common SNP variants assessed with a genome-wide platform. It has been shown that SNPs identified thus far through GWAS explain only a small portion of the heritability and have poor predictive performance, which is consistent with the models chosen for our simulation study. Nevertheless, we also considered interaction models with stronger effect sizes and higher $H^2$ (data not shown). In these high-heritability models, models with marginal effects resulted in greatly improved prediction error, whereas interaction models without strong marginal effects still showed little if any improvement, reflecting the same general trend described in this study.

The use of alternative definitions of SNP detection and detection probability could impact the findings of this study; however, we found that a previous definition of power utilized by Bureau et al.

Notably, our simulations revealed that the advantage of RF over univariate logistic regression is lost for larger values of

The results of the study indicate that as a tool for variable selection, both RF VIMs and univariate logistic regression can detect SNPs with marginal components, but neither may be adequate for interaction detection in high dimensions. In lower dimensions, RFs capture interactive effects and may therefore outperform univariate logistic regression. However, in lower dimensions higher order logistic regression models and pair-wise scans are possible, limiting the advantage of RF. In fact, some researchers feel that interactions modeled by RF but not confirmed with logistic regression are unlikely to be real. Nevertheless, the advantages of RF reside in the ability to incorporate the effects of multiple variables simultaneously and model conditional associations in both low and high dimensional data (even if interactions may not be specifically modeled), which cannot be captured with univariate procedures. Thus RF is recommended as a complementary approach to other variable selection methods. Moreover, we note that machine learning methods such as RF were designed to improve prediction rather than variable selection; therefore if the research objective is to develop a predictive model, then RF may be more appropriate.

Bureau et al.

Increasing the sample size tends to increase the tree-depth and number of possible splits per tree, which increases the number of variables included per tree and the probability that the effects of a pair of interacting SNPs will be jointly modeled. To investigate the impact of sample size on power, we considered a difficult genetic model with low power (Model 8) and increased the sample size from

Poor SNP effect detection in high-dimensional data is exacerbated in the presence of strong LD, both for marginal and interaction effects. If the true causative SNPs are in regions of strong LD, the causative effects must compete with correlated predictors for positions in each tree, since non-causative variables may also be associated with the phenotype because of LD. The result is lower importance rankings and a reduction in the probability of SNP detection. Our results were similar to those observed in previous studies, which also found that the presence of SNPs that are highly correlated with risk SNPs reduces RF performance

The results of this study allow us to draw a number of conclusions about the performance of RF, but some inherent limitations remain. In the current study we only considered simple disease models with architectures involving marginal effects and two-locus interactions, and investigation of more complex architectures is warranted. Nevertheless, our results demonstrate difficulties with detecting even these simple lower-order interactions using RF. Furthermore, only a limited number of LD patterns were investigated. Although this was not a thorough examination of the consequences of LD on interaction detection, our results provide some insight into the impact of LD on the performance of RF for identifying interacting SNPs. While beyond the scope of the current study, these findings motivate further research into how RF should be applied in practice for different types of data.

These results call into question the applicability of RF as a variable selection and screening tool in a GWAS setting. In high-dimensional data, true causal SNPs without a strong marginal component are not highly ranked by the variable importance measures, indicating little potential improvement of RF as a filter approach over current univariate techniques. Therefore, extensions that improve the detection of interacting factors would be highly advantageous. As the RF methodology currently stands, the primary goal is not identification of interactions. Because the method incorporates conditional effects, allows for the analysis of high-dimensional data where the number of predictors far exceeds the sample size, and provides a ranking scheme to implement potential filtering, extending it to better capture interaction effects seems promising. This work provides insight into why RF variable importance measures fail to capture interactions in a high-dimensional setting, which motivates further research to develop new variable importance measures to properly account for interacting variables or to modify the approach for accurate variable selection in the presence of interactions.

Conclusions

The ability of Random Forests variable importance measures to detect interaction effects has not been previously investigated in high-dimensional data. We found that as dimensionality increases, the probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs and Random Forests no longer outperforms univariate logistic regression. Random Forests efficiently model complex relationships including interactions in low dimensional data, but in high dimensional data they only effectively identify genetic effects with a marginal component. Therefore current variable importance measures may not be useful as filter techniques to capture nonlinear effects in genome-wide data and extensions are necessary to better characterize interactions.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SJW designed the study, performed the simulations and data analysis, and drafted the manuscript. CC executed the simulations and facilitated data analysis. RRF, XW, MDA and MH participated in the design of the study and interpretation/presentation of results. JMB conceived of the study, assisted in its design, and drafted the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This study was funded by NIDA (R21 DA019570, PI Biernacka).