Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA

Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA

Genizon BioSciences Inc., Montreal, Quebec, Canada

Department of Psychiatry, Harvard Medical School, Boston, Massachusetts, USA

Abstract

Background

Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for futher study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.

Results

Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.

Conclusions

In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.

Background

Genome-wide association studies for complex diseases such as asthma, schizophrenia, diabetes, and hypertension will soon produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). Due to the large number of SNPs tested and the potential for both genetic and environmental interactions, determining which SNPs modify the risk of disease is a methodological challenge. While the number of genotypes produced by candidate gene approaches will be somewhat less daunting, on the order of hundreds to thousands of SNPs, it will still be a considerable challenge to weed out the noise and identify the SNPs contributing to complex traits.

A logical first approach to dealing with massive numbers of SNPs is to first conduct univariate association tests on each individual SNP, in order to screen-out those with no evidence for disease association. The primary goal of such a procedure is not to prove that a particular variant or set of variants influences disease risk, but to prioritize SNPs for further study. Using a univariate test at this stage will result in low power for SNPs with very small marginal effects in the population, even if the SNPs have large interaction effects. Of course, in addition to taking all individual SNPs, all SNP pairs could also be tested for association. However, when dealing with multiple thousands of SNPs at the outset, such an approach is cumbersome, and raises the question of where to stop: why not all sets of three, four, or even five SNPs as well?

Many model-building methods exist for dealing with large numbers of predictors. For example, stochastic search variable selection (SSVS)

Multivariate adaptive regression splines (MARS) models have also been explored in the context of genetic linkage and association studies

Combinatorial partitioning and multifactor dimensionality reduction

An additional concern to be considered is genetic heterogeneity. We define genetic heterogeneity to mean that there are multiple possible ways to acquire a disease or trait that can involve different subsets of genes. Traditional regression models are limited in their ability to deal with underlying genetic heterogeneity (see,

Classification trees and random forests

Tree-based methods consist of non-parametric statistical approaches for conducting regression and classification analyses by recursive partitioning (see, e.g., Hastie et al.

The ease of interpretation of classification trees, along with their flexibility in accommodating large numbers of predictors and ability to handle heterogeneity, has resulted in increasing interest in their application to genetic association and linkage studies. Classification trees have been adapted for use with sibling pairs to subdivide pairs into more homogenous subgroups defined by non-genetic covariates

Classification trees are grown by recursively partitioning the observations into subgroups with a more homogeneous categorical response

Using ensembles of trees built in this manner increases the probability that some trees will capture interactions among variables with no strong main effect. Unlike variable selection methods, interactions among predictors do not need to be explicitly specified in order to be utilized by a forest of trees. Instead, any interactions between variables serve to increase the importance of the individual interacting variables, making them more likely to be given high importance relative to other variables. Thus, random forests appear to be particularly well-suited to address a primary problem posed by large scale association studies. In preliminary studies, we have shown the potential of random forests in the context of linkage analysis

To fully understand the basis of complex disease, it is important to identify the critical genetic factors involved, and to understand the complex relationships between genotypes, environment, and phenotypes. The few successes to date in identifying genes for complex disease suggest that despite carefully collected large samples, novel approaches are needed in the pursuit to dissect the multiple and varying factors that lead to complex human traits. Ultimately, the challenge in identifying polymorphisms that modulate the risk of complex disease is to find methods that can seamlessly handle large numbers of predictors, capitalize on and identify interactions, and tease apart the multiple heterogeneous etiologies. Here, we explore the use of the Random Forest methodology

Results

Genetic models

We simulated complex diseases with sibling recurrence risk ratio for the disease (_{s}) fixed at 2.0 and population disease prevalence _{p }equal to 0.10. These values are consistent with or lower than estimates from known complex genetic traits, such as Alzheimer disease, where estimates of cumulative prevalence in siblings of affected range from 30–40%, compared to a population prevalence of 10% at age 80 _{s }= 2 and _{p }= .10 for models H4M4 and H8M2, and 32 loci are responsible for models H8M4 and H16M2. Table

Genetic models used for simulating case-control data.

Risk SNPs

Case Genotype Correlation

Allele

Marginal GRR

Penetrance Factors

Within System

Between System

Model

Number

Frequency

Het

Hom

0

1

2

H2M2

4

0.207

2.85

4.71

2.4E-02

0.51

1

0.30

-0.32

H4M2

8

0.160

1.99

2.96

3.9E-04

0.50

1

0.35

-0.12

H8M2

16

0.104

1.66

2.18

8.0E-06

0.56

1

0.32

-0.05

H16M2

32

0.069

1.46

1.78

1.4E-05

0.59

1

0.28

-0.02

H4M4

16

0.282

1.63

1.79

1.2E-08

0.79

1

0.17

-0.06

H8M4

32

0.214

1.34

1.40

2.8E-03

0.86

1

0.14

-0.02

Simulation and analysis

All analyses were performed on 100 replicate data sets of 500 cases and 500 controls. In addition to the rSNPs contributing to the trait, we simulated noise SNPs ("nSNPs"), independent of disease status, with allele frequencies distributed equally across the range .01–.99. To simulate the results of an association study, in which we do not expect to be lucky enough to genotype all polymorphisms related to a trait, we included only a subset of the total number of rSNPs in each analysis. We denote the analysis design using the shorthand KkSsNn, where K = k is the total number of rSNPs genotyped in the analysis, S = s is the number of SNPs within each interaction system genotyped, and N = n is the total number of SNPs genotyped in the design. Thus, N-K is the total number of nSNPs in the analysis. For example, suppose the genetic model is H8M4, and the design is K4S2N100. Then out of the total of 8 × 4 = 32 rSNPs that contribute to the trait, four are genotyped: two interacting SNPs from within each of two heterogeneity systems. Six heterogeneity systems are not represented at all in the analysis. In addition to the four genotyped rSNPs, 100-4 = 96 total nSNPs are also genotyped in the design.

Comparison of raw and standardized importance scores

Random forests version 5 software _{T}) and standardized (_{T}) variable importance scores (see Methods section for definitions of the scores). Little is known about the properties of importance indices under different distributions of the predictor variables. We use simulation to investigate their properties in the context of discrete predictors such as genetic polymorphisms conferring susceptibility to a complex trait.

We first compared the raw and standardized scores, in order to determine whether one might outperform the other in screening. We considered a K4S2N100 analysis design for each genetic model described in Table _{T }and _{T }are highly correlated; the average correlation coefficient over 100 replicate data sets ranged from .93 (H8M4) to >.99 (H2M2 and H4M2) (Table _{T }and _{T }for the 100 SNPs over the 100 replicate data sets was 0.98 for each of the six models (Table _{T }than for _{T}. The opposite is true for H8M4 (see Table

Summary of the correlation between _{T }and _{T }("raw") and rank(_{T}) and rank(_{T}) ("rank") for four rSNPs and 96 nSNPs over 100 replicate data sets: K4S2N100 analysis design.

H2M2

H4M2

H8M2

H4M4

H16M2

H8M4

^{2}

raw

rank

raw

rank

raw

rank

raw

rank

raw

rank

raw

rank

Mean

0.996

0.983

0.990

0.982

0.975

0.982

0.957

0.982

0.964

0.982

0.933

0.982

SD

0.001

0.006

0.003

0.006

0.009

0.006

0.013

0.007

0.013

0.006

0.018

0.007

Min

0.990

0.963

0.980

0.957

0.941

0.960

0.921

0.955

0.926

0.962

0.891

0.953

Max

0.998

0.993

0.996

0.992

0.989

0.993

0.984

0.992

0.983

0.994

0.970

0.992

Comparison of ranks based on _{T }and _{T }for the four rSNPs over 100 replicate data sets: K4S2N100 analysis design.

H2M2

H4M2

H8M2

H4M4

H16M2

H8M4

Rank:

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

_{
T
}

Mean

2.5

2.5

2.5

2.5

2.51

2.52

2.64

2.61

5.16

5.94

9.35

8.69

SD

1.12

1.12

1.12

1.12

1.13

1.16

1.74

1.62

8.06

8.33

13.67

13.9

Max

4

4

4

4

6

8

23

21.5

77

62.5

83

88.5

p-value*

0.94

1.00

0.77

0.12

1.32E-20

2.91E-12

*p-value for the Wilcoxon signed-rank test comparing the rSNP ranks based on _{T }and _{T}

Ranking SNPs based on Z_{T }and Fisher p-value

We next compared the ranking of rSNPs by importance score (Z_{T}) to ranking by Fisher Exact test p-value using K4S2N100 and K4S2N1000 analysis designs, where two SNPs from each of the first two interaction systems are in the analysis. Figure _{T }criterion ranks the four rSNPs as the most significant SNPs more often than the univariate Fisher Exact test association p-value under all genetic models. The difference between the random forest and association p-value ranking is less extreme for N1000. For the H8M4 genetic model, the results do not suggest that one ranking system is better than the other overall. Figure _{T }contain all of the rSNPs. Thus, for a given probability of retaining all of the rSNPs, more SNPs can be eliminated using the Z_{T }criterion than the Fisher exact test p-value. For example, for model H16M2, only 15 SNPs must be retained to have 80% probability that the 4 rSNPs are in the retained set, while 44 SNPs must be retained if the p-value criterion is used. The difference is less dramatic for H8M4: 37 SNPs give 80% probability that the four genotyped rSNPs are in the retained set for the Z_{T }criterion, compared to 43 for the p-value criterion. For N1000, the advantage of the Z_{T }criterion is clear for the H8M2 and H16M2 models. For H4M4, the advantage of Z_{T }is minor, while for H8M4, ranking by Z_{T }appears to give poorer results than the p-value criterion for the higher cutoff values of N. A second interpretation of Figure _{T }criterion than for the univariate p-value criterion for all but the H8M4 model with 1000 total SNPs.

Proportion of replicates for which the most significant 1, 2, 3, and 4 SNPs are all rSNPs for K4S2N100 and K4S2N1000 analysis designs

Proportion of replicates for which the most significant 1, 2, 3, and 4 SNPs are all rSNPs for K4S2N100 and K4S2N1000 analysis designs. Genetic models are listed on the plots. "RF" and "Fisher" refer to the random forest importance index Z_{T }and the Fisher Exact test p-value. See text for notation description.

Proportion of replicates for which all rSNPs are among the top-ranking N SNPs for K4S2N100 and K4S2N1000 analysis designs

Proportion of replicates for which all rSNPs are among the top-ranking N SNPs for K4S2N100 and K4S2N1000 analysis designs. Other notation as in Figure 1.

Noticing that the analyses with all SNPs from an interacting system (e.g., the H16M2K4S2 simulations) had a more substantial improvement in ranking using Z_{T }over p-value than the analyses with subsets of SNPs from an interacting system, we hypothesized that the interactions among the pairs of analyzed rSNPs influenced the improved ranking performance of the random forests over the univariate tests. To confirm this, we used the H8M4 genetic model and analyzed the data in the following manner. For a constant number of analyzed rSNPs included in the model (K = 4, 8, or 16) and a constant 96 nSNPs, we looked at the effect of increasing S, the number of rSNPs from each interaction system that were genotyped. Thus, for K8S1, along with 96 nSNPs, one SNP from each of the first 8 systems was included in the analysis, while for K8S4, all four SNPs in the first two systems were included in the analysis. For K8S3, three SNPs from the first two systems, and one from the third were included. Assuming that the random forest analysis was taking advantage of the interactions among the rSNPs, and that this was responsible for the improved performance of the random forests over the univariate tests, we expected the Fisher p-values and random forest importance Z_{T }to perform similarly when only a single rSNP was genotyped from each system, and the random forests to perform increasingly better than the univariate Fisher tests as S increased from 1 to 4. Figures _{T}, the S = 1 analyses for each K were similar to the Fisher results, while for each increase in S, the proportion of replicates for which the N most significant SNPs were rSNPs increases (Figure

Proportion of replicates for which the most significant N SNPs are all rSNPs

Proportion of replicates for which the most significant N SNPs are all rSNPs. H8M4 genetic model. Analysis designs include 96 noise SNPs; K and S are listed on the plots. Other notation as in Figure 1.

Proportion of replicates for which all rSNPs are among the top-ranking N SNPs for H8M4 genetic model

Proportion of replicates for which all rSNPs are among the top-ranking N SNPs for H8M4 genetic model. Analysis designs include 96 noise SNPs; K and S are listed on the plots.

Magnitude of difference

Beyond simply ranking SNPs, we may wish to use the magnitude of the difference in importance or p-value to determine which subset of top-ranked SNPs should be prioritized for further study. Thus, particularly for the cases where the rSNPs are among the top-ranked SNPs, we want to determine not just that Z_{T }ranks interacting rSNPs higher than the univariate test, but also that the differences in rank correspond to differences in magnitude of Z_{T }that are meaningful. In other words, we want to know how much "better" in terms of Z_{T }(or p-value) the rSNPs are than the nSNPs. Toward this goal, we computed the difference between the importance Z_{T }of the top ranked rSNP and the top ranked nSNP:

D_{max}(Z_{T}) = max_{rSNP}(_{T}) - max_{nSNP}(_{T}),

as well as the lowest ranked rSNP and the top ranked nSNP:

D_{min}(Z_{T}) = min_{rSNP}(_{T}) - max_{nSNP}(_{T}).

Thus, D_{min}(Z_{T}) is positive when the lowest ranked rSNP is larger than the highest ranked nSNP, and negative when the lowest ranked rSNP is smaller than the highest ranked nSNP.

We computed the analogous quantities, D_{max}(-log p) and D_{min}(-log p), for the -log_{10 }transformed Fisher Exact test p-values. In Figure _{T }that is significantly higher than the highest ranking nSNP for both N100 and N1000, while the difference in -log10(p) is not significantly different from 0 for N100, and is significantly less than 0 for N1000. These plots illustrate that the positive differences are typically more extreme for Z_{T }than for -log p, and that the negative differences are less extreme.

Distribution of difference in importance ZT between the top ranked rSNP and the top ranked nSNP (Dmax(ZT), and lowest ranked rSNP (Dmin(ZT)) and top ranked nSNP

Distribution of difference in importance ZT between the top ranked rSNP and the top ranked nSNP (Dmax(ZT), and lowest ranked rSNP (Dmin(ZT)) and top ranked nSNP. Dmax(-log p) and Dmin(-log p): differences using -log10 p-value from the Fisher exact test. Beside each boxplot is the p-value for the test of whether the mean difference over the 100 replicates is significantly different from 0. Genetic models listed in plot. Analysis design: K4S2, with N100 and N1000 shown on plot.

Discussion

A key advantage of the random forest approach is that the investigator does not have to propose a model of any kind. This is important in an initial genome-wide or candidate region association study, where little is known about the genetic architecture of the trait. If interactions among SNPs exist, they will be exploited within the trees, and the variable importance scores will reflect the interactions. Therefore, we expect that when unknown interactions between true risk SNPs exist, the random forest approach to screening large numbers of SNPs will outperform a univariate ranking method in finding the risk SNPs among a large number of irrelevant SNPs. Our genetic models for simulation feature both multiplicative interaction and genetic heterogeneity. The multiplicative interaction results in a marginal effect in the population, the size of which is dependent on many factors, including the amount of heterogeneity. Thus, we have highlighted models in which univariate tests still have power, and shown that the random forest analysis can outperform these tests for selecting subsets of SNPs for further study. For models with genetic heterogeneity and interactions resulting in no main effect, similar to the models described by Ritchie et al.

Our results from analyses with four risk SNPs among 1000 SNPs suggest that even when a high proportion of the analyzed SNPs are unassociated, a random forest can rank interacting SNPs considerably higher than a univariate test, and that the proportional difference in importance between the risk SNPs and the best of the noise SNPs can be larger on average for a random forest. In our scale-up from 100 to 1000 total SNPs, we kept the number of risk SNPs constant. In practice, as we increase the number of SNPs genotyped, we expect that we will also increase the number of risk SNPs (or SNPs in linkage disequilibrium with risk SNPs) that are captured in an analysis. Thus, as a larger and large proportion of the genome or candidate region is captured by a scan, the more likely we will be to have all or most of sets of SNPs that interact, and thus the more likely we are to be in situations where random forest screening will outperform univariate screening of SNP data.

It is important to consider the tuning parameters for such analyses. Consistent with the recommendations made by Breiman and Cutler

We have examined the use of random forests in the context of association studies for complex disease with uncorrelated SNP predictors. Random forests can also be used when predictors are correlated, as is the case with multiple SNPs in linkage disequilibrium within a small genetic region. For any analysis procedure, the more highly correlated variables are, the more they can serve as surrogates for each other, weakening the evidence for association for any one of the correlated variables to the outcome. In a random forest analysis, limited simulations suggest that correlated variables lead to diminished variable importance for each correlated risk SNP (data not shown). One way to limit the problems presented by SNPs in linkage disequilibrium is to use haplotypes instead of SNPs as predictor variables in a random forest. Future challenges include quantifying more completely the effect of linkage disequilibrium among SNPs submitted to a random forest analysis, and developing random forests in the context of haplotypes.

Conclusions

With the increasing size of association studies, two-stage analyses, in which in the first stage a subset of the loci are retained for further analyses, are becoming more common. The most frequently voiced concern for these analyses is that variables that interact to increase disease risk but have minimal main effects in the population will be missed. Random forest analyses address this concern by presenting a summary importance of each SNP that takes into account its interactions with other SNPs. Current implementations of random forests can accommodate up to one thousand of SNPs in one analysis with the computation of importance. Further, there is no reason to restrict the input variables to SNPs. Potential environmental covariates can also be easily accommodated, allowing for SNPs with no strong main effect, but environmental interactions, to be distinguished from unassociated SNPs. We have shown that when unknown interactions among SNPs exist in a data set consisting of hundreds to thousands of SNPs, random forest analysis can be significantly more efficient than standard univariate screening methods in ranking the true disease-associated SNPs highly. After identifying the top-ranked SNPs and other variables, and weeding out those unlikely to be associated with the phenotype, more thorough statistical analyses, including model building procedures, can be performed.

Methods

Variable importance

Rather than selecting variables for modeling, a random forest uses all available covariates to predict response. Here, we use measures of variable importance to determine which covariates (SNPs, in our case) or sets of covariates are important in the prediction. Breiman **X**_{i }represent the vector of predictor variable values, y_{i }its true class, V_{j}(**X**_{i}) the vote of tree j and t_{ij }an indicator taking value 1 when individual i is out-of-bag for tree j and 0 otherwise. Let **X**^{(A,j) }= (**X**_{1}^{(A,j)},..., **X**_{N}^{(A,j)}) represent the vector of predictor variables with the value of variable **A **randomly permuted among the out-of-bag observations for tree j, and **X**^{(A) }the collection of **X**^{(A,j) }for all trees where N is the total number of individuals in the sample. Letting 1(C) denote the indicator function taking value 1 when the condition C is true and 0 otherwise, the importance index averages over the trees of the forest, and is defined as:

where _{j }represents the number of out-of-bag individuals for tree j and T is the total number of trees.

The importance index can be standardized by dividing by a standard error derived from the between-tree variance of the raw index _{T},

The variance _{T}, rather than the variance of _{T }due to the sampling of the individuals from a population: the magnitude of _{T }increases with the number of trees in the forest, and the number of trees is limited only by computing time. Thus, this standardized index cannot be treated as a Z-score in the traditional sense.

Simulation models and methods

For simplicity, assume each locus has the same effect, and let (q_{0}, q_{1}, q_{2}) represent the penetrance factors for 0, 1, or 2 risk alleles for an individual locus in a given model. Let G = {g_{11}, g_{12}, . ., g_{HM}} be the multilocus genotype for an individual, where g_{hm}(=0, 1, 2 risk alleles) indicates the individual's genotype at locus m (=1, . ., M) of heterogeneity system h (=1, . ., H). Then the penetrance for genotype G is defined as:

For example, for model H2M2, an individual with genotype G = {0101} would have penetrance _{G }= 1 - (1 - _{0}_{1})^{2 }= 2_{0}_{1 }- (_{0}_{1})^{2}.

The penetrance factors (q_{0}, q_{1}, q_{2}) and risk allele frequencies, as well as other features of our genetic models, are listed in Table _{s }and _{p}, there is a unique allele frequency when we make the assumption that each SNP subunit has equal effect (the given penetrance factor vector) in the population. We chose penetrance factors such that the risk alleles at each locus for the H2M2, H4M2, H8M2, and H16M2 models are approximately additive in effect on the penetrance factor scale. For H4M4 and H8M4, we chose penetrance factor vectors such that the risk alleles show a moderate degree of dominance.

The marginal genotype relative risks (GRRs) listed in Table

Analysis

All analyses were performed on 100 replicate data sets of 500 cases and 500 controls. We treated the SNPs as ordinal predictors. Random forests have one primary tuning parameter: "mtry" the number of randomly picked covariates to choose among for each split. The Random Forest v5 manual

Ranking of rSNPs

The random forest analysis produces the raw and standardized importance indices (_{T}, _{T},), which can be used to rank order the importance of SNPs much as p-values from association tests are used. Using either method, the ranking of the SNPs in an analysis can be used to prioritize which sets of SNPs (or genome regions) should be followed up with further genotyping and/or additional analyses. We use the convention that a rank of 1 is the highest ranking SNP.

We compare the ranking of the raw and standardized importance measures, and further compare these with the rank based on p-values from a test of association, the Fisher Exact test, to determine whether the random forest can better discriminate susceptibility SNPs from SNPs unrelated to disease status when there is interaction and heterogeneity among and between SNPs.

Authors' contributions

KLL conceived of and led the design of the study, coordinated all phases of simulation and analysis, and drafted the manuscript. LBH and JS automated the simulation and random forest analysis procedures, and participated in the design and analysis of study results. PVE added critical insight to the development of the genetic models and participated in the interpretation of study results. All authors provided comments on a draft manuscript and read and approved the final manuscript.

Acknowledgments

We thank Kathleen Falls for her help with earlier versions of this work.