Email updates

Keep up to date with the latest news and content from BMC Genetics and BioMed Central.

Open Access Highly Accessed Research article

An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

Benjamin A Goldstein13*, Alan E Hubbard1, Adele Cutler2 and Lisa F Barcellos1*

Author Affiliations

1 Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA

2 Department of Mathematics & Statistics, Utah State University, Logan UT, USA

3 Genetic Epidemiology and Genomics Laboratory, Division of Epidemiology, School of Public Health, University of California, Berkeley, CA, USA

For all author emails, please log on.

BMC Genetics 2010, 11:49  doi:10.1186/1471-2156-11-49


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2156/11/49


Received:14 January 2010
Accepted:14 June 2010
Published:14 June 2010

© 2010 Goldstein et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discover higher order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies which are limited.

Results

Using a multiple sclerosis (MS) case-control dataset comprised of 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, four new interesting candidate MS genes are identified, MPHOSPH9, CTNNA3, PHACTR2 and IL7, by RF analysis and warrant further follow-up in independent studies.

Conclusions

This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.

Background

Genome-wide association (GWA) studies are a well-established approach for identifying genetic regions of interest for many common complex diseases and traits [1]. These studies are characterized by examining genetic information from thousands of individuals, at hundreds of thousands of loci across the human genome known as single nucleotide polymorphisms (SNPs). The standard assumption is that either variation at particular loci leads to changes in biological function, which in turn leads to disease, or that associated loci are in linkage disequilibrium (LD) with other disease causing variants. By examining genotypes derived from individuals with and without the disease or trait of interest, one can discern such variation. This is typically done by performing a marginal chi-square test with some control for multiple testing. However, since each causal SNP will confer risk under an unknown and different genetic model (i.e. additive, dominant, recessive), and may also interact with other SNPs (epistasis), a marginal test will be a less successful approach for finding the association [2]. Ideally, one would simply test all possible genetic models of association, including those for interaction. However, in the context of a GWA study, this is not computationally feasible.

Recent emphasis has been on the use of machine learning techniques to identify potential causal variants. Such techniques include logic regression [3], multi-dimensional reduction (MDR) [4], support vector machines (SVM) [5], and Random Forests (RF) [6]. While these techniques are each unique, they have a shared characteristic whereby each algorithm searches over a transformed version of the feature space attempting to find the optimal solution to the problem while minimizing some empirical risk. Importantly, the algorithms make minimal assumptions about the causal mechanism. This means these algorithms may be more suited for identifying variants where the causal mechanism is unknown and complex, as is the case with complex genetic diseases.

Each of these methods has utility for finding structure in genetic data, where the best algorithm will depend on the true nature of the underlying association. However, the focus of the current study is RF because of the ability of this method to identify variables of interest from very large datasets. Equally important, RF is a relatively straightforward algorithm, both to understand and interpret. Unsurprisingly, there has been a slow but steady use of RF in the genomic literature since its introduction in 2001 [6-12].

RF was first introduced by Leo Breiman [13] and is a natural extension of his previous work on classification and regression trees (CART) [14] and bootstrap aggregating (or bagging) [15]. CART is an effective tool for building a classifier, but tends to be data dependent, where even small data changes can result in different tree structures. Bagging is a process whereby data are sampled with replacement and the classifier is grown using this bootstrap sample. After many iterations, results are aggregated over all trees to create a less variable classifier with a lower prediction error when compared to the original classifier. In bagging, the variance reduction is limited by the correlation between trees; as correlation is decreased or minimized, the potential for reduction is increased. The RF algorithm (see Figure 1) begins by bagging CART trees. To reduce the correlation between trees, instead of searching over all p variables at each node for the optimal split, a search is performed over a random subset, m ≤ p, at each node. The algorithm continues to split the data until no further splits are possible, either because the node is pure (all of one class), or there are no more variables upon which to split. While the CART algorithm calls for the tree to be pruned for increased stability, RF leaves the tree unpruned, as bagging is used to decrease the variance created by the lack of pruning.

thumbnailFigure 1. Random Forests Algorithm. The RF algorithm begins by selecting a bootstrap sample of the data (1). A random subset of the variables is selected (2) and searched over to find the optimal split (3). This is repeated until an unpruned CART tree is formed (4). The data not part of the bootstrap sample is run down the tree to derive the error rate and measures of VI (5). This is repeated until a full forest is grown (6).

An aspect of the bagging procedure is that a natural, internal error rate is created. Within each bootstrap sample, approximately 37% of the original data will be unselected, referred to as the out-of-bag (OOB) sample [16]. RF passes OOB samples down the tree to obtain a class prediction. After the full forest is grown, the class predictions are compared to the true classes generating the out-of-bag error rate (OOB-ER). This error-rate can be used to compare the prediction accuracy of one set of inputs to another, behaving similarly to cross-validation [17].

An appeal of RF is that the forest of trees contains a large amount of information about the relationship between the variables and observations. This information can be used for prediction, clustering, imputing missing data, and detecting outliers. Of great interest to genetic epidemiologists, is the ability of RF to identify 'important' variables. After each OOB sample is passed down the tree to produce a prediction error for the sample, one then permutes each variable in the tree across samples, and passes the same observation down the tree again. Any increase in misclassification helps determine the importance of that variable. This type of variable importance (VI) can be derived from disparate variable types (categorical, ordinal, continuous), and makes no assumptions about the data generating distribution for the outcome. However, unlike a formal hypothesis, it is best to consider the output of a RF analysis as a rank ordering of important variables worthy of further investigation, not as a list of variables with a known Type I error rate.

Utilization of RF requires choosing between three tuning parameters: (1) number of trees to grow (ntree), (2) number of variables to select per-node (mtry), and in the case of classification, (3) class weights. While most applications in the literature have successfully implemented RF using default settings, applying RF to large GWA datasets is more complicated. Few studies have examined the various tuning parameters. The two most comprehensive reviews concluded that RF predictions were stable and robust to small fluctuations in tuning parameters settings, but often there were optimal settings [7,18]. While both studies provide useful information, the largest dataset examined by each contained only 9,868 predictors and 78 observations. This is obviously much smaller than the data analyzed in a typical GWA study.

Further complicating RF analysis, beyond the large feature space, is that GWA data tend to be highly correlated, with potentially, many regions of LD among SNPs. Also, the data are assumed to be highly sparse, meaning there is an apriori assumption that the vast majority of SNPs will not be associated with the disease. While many of these issues have been discussed in the literature, none have been considered in the context of a large GWA dataset. Moreover, many of the strategies one would employ with smaller data sets (e.g. permutation, cross-validation etc.) are not feasible due to computational constraints. Instead of working with simulated data which can be less realistic, we investigated the application of RF using a large multiple sclerosis (MS) GWA study dataset comprised of cases and controls. The aim of the current study was two-fold: (1) to illustrate how one would go about tuning RF for a particular GWA analysis, and (2) to determine whether RF would duplicate results found in the original MS GWA study, as well as identify any new loci of interest.

Methods

Genotypes

Data were derived from a 2007 MS case-control study conducted by the International Multiple Sclerosis Genetics Consortium [19] and were comprised of genotypes for a total of 325,807 SNPs (Affymetrix GeneChip Human Mapping 500K array) in 931 MS cases and 2,431 controls (n = 3,362). Stringent quality control (QC) analyses were applied to the dataset as previously described, including the removal of population outliers [19]. SNPs with greater than 10% missing data were removed. The genetic inflation factor was 1.06, indicating negligible population stratification [19].

Less than 1% of the genetic data contained missing values. There are a few different ways missing data can be handled within RF. However, since the data were derived from a dense SNP marker panel and had minimal missingness, any missing values were imputed with Beagle 2.13 [20]. Allelic data were then recoded into genotype format using PLINK 1.05 [21], producing three categories for each SNP (0, 1 and 2 copies of the minor allele). Since the optimal binary split is found at each node, this allows for the algorithm to be agnostic to recessive, dominant or additive effects. An allelic chi-square test (df = 1) was performed to calculate marginal associations for comparison.

RF Implementation

The RF code was originally written in Fortran by Breiman and Cutler. There is also an R package randomForest based on the same Fortran code [22]. Neither implementation could be used for the large GWA dataset in the current study. The original RF code has been licensed to Salford Systems [23], and they recently optimized the Fortran version, v.6.4.0.179, for application to large datasets. In preliminary testing of small datasets, similar results were found between the three implementations of RF (data not shown). RF was implemented in a server environment with 8 2/GHz cpus and 32 GB of memory. Run time was dependent on data size and mtry, ranging from a few seconds per tree to over 10 minutes per tree (~ 1 week for a full forest).

Tuning Parameters Considered

Number of variables to choose per node (mtry)

The primary tuning parameter in RF is the number of variables to search at each node (mtry). This parameter controls the bias-variance trade-off. Searching over fewer variables per node will produce less correlated trees, reducing the overall variance of the prediction. However, this will also decrease the accuracy of each individual tree, increasing the bias. The mtry can also be viewed as controlling the complexity of the model, with a smaller value leading to a more complex, less sparse solution (see below). Breiman originally suggested choosing the int(log 2 p + 1) of the number of predictors per node. In the R implementation, the default value is the square root of the number of predictors.

For a GWA dataset, this would entail examining approximately 550 SNPs per node. As noted by Breiman, when there are many weak predictors, this number may need to be increased. It has also been noted that mtry is more important for VI calculation than for prediction, and that with sparse data, mtry = p leads to greatest stability [18]. A coarse search for the optimal mtry was performed in the current study using mtry values of 1, , 0.1p, 0.5p and p. The parameterization that produced the lowest final OOB-ER was chosen as the optimal mtry.

Number of Trees to Grow (ntree)

Another important consideration is how many trees to grow. This is also a dataset dependent factor, where stronger predictors lead to quicker convergence. While for prediction purposes few trees are often necessary, and the OOB-ER will generally converge rapidly, for VI, more trees will generally lead to refinement and stability in VI [18].

The main trade-off with growing a larger number of trees is the computation cost required. In the current study, trees were grown until the OOB-ER stabilized. Additional trees were then grown to ensure stability.

Weighting

The final tuning parameter, which was not considered in this analysis, is weighting. In classification, with uneven classes, an unweighted classification scheme will be biased towards the majority class. The typical strategy is to re-weight the classes so that they are balanced, the practice used within the Salford Systems implementation of RF, and the default in the R implementation. Unfortunately, class weighting cannot be altered in the Salford Systems version, so it could not be tested as a tuning parameter. However, internal testing on a more flexible version of RF showed no added benefit to changing the weighting.

Data Configurations

Sparsity Pruning

As noted, it is expected that the vast majority of SNPs in a GWA study do not impact risk for disease, and therefore, are simply noise. The goal of any algorithm should be to separate noise from signal, providing a sparse solution. A sparse solution is indicated when the VI is either 0 or negative. Such a VI indicates that the variable was either never selected into a tree, or when it was selected, permutation did not increase the prediction error. Sparse solutions provide a convenient way to remove unimportant data from the analysis.

Sparsity is a function of both mtry and ntree, with a higher mtry leading to greater sparsity and a higher ntree leading to less sparsity. One proposed strategy is to sequentially remove genes by dropping the bottom 20% or 50%, and perform successive runs until there is a noticeable increase in prediction error [7]. Utilizing the natural sparseness in the dataset, the results of each RF run were examined and sparse SNPs were dropped. The RF analysis was then re-run until prediction error stabilized. While this will give a biased estimate of the prediction error for the model [24], it can still be used to judge model quality. This sub-sampling process was repeated in the current study until the final OOB error-rate stabilized or increased.

Removing Strong Associations

RF searches over multiple variables finding solutions based on joint and conditional effects. Variable(s) with strong effects may mask weaker, yet important effects. It is well established that the HLA region within the major histocompatibility complex (MHC) on chromosome 6p is strongly associated with MS [25]. Therefore, to search for weaker non-MHC effects, RF analysis was performed in the current study after removing chromosome 6p marker data.

Linkage Disequilibrium

An important consideration when applying RF to GWA data is the large degree of LD among SNPs. VI is calculated from the number of trees in which a variable appears. Therefore, two SNPs that are in perfect LD will appear in trees about half as often as each individual one may appear by itself, effectively lowering the VI of each SNP. While this does not present a problem for prediction, it can skew the VI rankings [18]. Two proposed solutions have been to calculate VI independently of the number of trees in which the variable appears [10] or as conditional on other variables in the tree [26].

PLINK [21] provides two methods of LD pruning based on r 2 and R 2. r 2 is a traditional pairwise LD measure, though not based on phased haplotypes. R 2 is the multiple correlation coefficient based on a sliding window. Using PLINK, SNPs with a multiple correlation coefficient (R 2) of 0.99, 0.90, 0.80, 0.50 and 0.33 were removed from the MS case-control dataset for comparison. This resulted in pruning between 22% and 76% of the original data which had the side benefit of increasing computational efficiency.

Reliability of Results Obtained from RF

Since RF is a Monte-Carlo process, random variation may influence VI results, particularly if enough trees are not grown. While, work has indicated that RF results are relatively stable [18] and our own internal testing has confirmed this, it is important to grow large forests and do multiple runs when possible. Reliability of final RF results was examined by re-running RF with the final dataset configuration, parameterization and sub-sampling process, changing just the seed in the random number generator. While more than one re-run would be ideal, the VI measures are unlikely to be unstable given that two runs were performed.

Comparison of RF Results to Original GWA study

The original MS GWA study identified, with replication, 16 SNPs across 13 genes as associated with MS [19]. An important consideration for the current study was whether RF could identify additional genes of interest, as well as duplicate the original findings based on univariate testing. Duplication was considered present when a SNP identified by RF was: (1) among the original 16 SNPs, or (2) a SNP that was tagged by one of the 16 SNPs identified in the original GWA study. PLINK was used to identify tagged SNPs using an r 2 threshold of 0.5.

Analysis Strategy

Figure 2 presents the analysis plan. The primary method for choosing tuning parameters was minimization of the OOB-ER, as this is the best indication of model quality. Determining how many results to report is more subjective since the VI measure does not constitute a formal hypothesis test. To help guide interpretation of RF results for follow-up, we plotted VI scores. A sloping line with an "elbow" (Figure 3) was observed most often around the top 25, so this was chosen as the cutoff for an important result.

thumbnailFigure 2. Analysis Flow. Flow Plan for RF analysis. The full MS case-control dataset was analyzed, searching for the optimal mtry &ntree, along with sparsity pruning, as necessary. Two runs were then conducted, one without any 6p genotypes, and one with data for a single 6p SNP. Finally, LD pruning was explored. After the best data configuration was found, RF analysis was re-run to examine stability of results. The final RF results were compared to the original GWA results [19].

thumbnailFigure 3. Scree Plots for top 100 RF VI measures. The three plots represent the VI measures for the full dataset with chromosome 6p data removed, the R 2 = 0.99 run and the R 2 = 0.90 run. An "elbow" is present in all three plots around 25 markers (designated with the vertical line).

Results

Tuning Parameters

Number of variables to choose per node (mtry)

The first parameter considered was mtry since this has the greatest impact on the OOB-ER. Figure 4 shows the OOB-ER for different values of mtry. The typically suggested value of mtry of around is not sufficient for GWA data, as the OOB-ER is minimized with an mtry around .1p. Among the higher mtry values, there was little distinction between them with regard to OOB-ER.

thumbnailFigure 4. Convergence of Error Rate Across Different mtrys. An examination of the error-rate across different mtrys. The larger mtrys of .1p and above clearly lead to a much lower error rate than the more traditional lower values. .1p seems to minimize the overall OOB error-rate though not by much. Convergence seems to occur around 200 - 400 trees.

Another consideration is the sparsity induced by the mtry factor. As expected, sparsity increases with mtry, though this is most dramatic after increasing to mtry = p (Figure 5).

thumbnailFigure 5. Sparsity of SNPs across mtry. As expected, sparsity increases as a function of mtry. There is the most dramatic increase after moving from an mtry of .5p to p.

Number of Trees to Grow (ntree)

Using mtry = .1p, forests of size 50, 250, 500, 1,000, 1,500 and 2,000 trees were grown. It is clear that the OOB-ER leveled off around 250 trees (see Figure 4) and 1,000 trees was used as a reliable forest size. However, for datasets without chromosome 6p and only weak predictors (see below), it took more than 4,000 and sometimes 8,000 trees for convergence. In those cases, 5,000 and 10,000 trees, respectively, were grown. More trees led to a less sparse result, as expected, with nearly a linear decrease through 2,000 trees.

Data Configurations

Sparsity Pruning

When using the full dataset for RF analysis, SNPs within the HLA region of chromosome 6p were consistently selected as the most important variables (Table 1). This is not surprising, as some SNPs in that region had a marginal Χ2-statistic as large as 274. The final error rate of 35% is identical to a simple classification based just on genotypes for the three most highly associated SNPs (rs3129900, rs3129934, rs9370986). RF results based on analysis of all SNPs from chromosome 6p resulted in the same 35% error rate.

Table 1. Top SNPs identified by Random Forests in MS case-control dataset

Removing Chromosome 6p

After removing all SNPs on chromosome 6p (p = 8,335), the initial run of 317,472 SNPs produced an error-rate of 48% after 1,000 trees, and using both an mtry of .1p and p. The resulting forest based on mtry of .1p was 74% sparse (82,527 SNPs retained). Using mtry = p, the forest was 99% sparse (4,219 SNPs retained).

For the mtry = p run, re-running RF analysis with the reduced dataset produced an error-rate of 26%, and required approximately 4,000 trees to converge. Repeating this sub-sampling process two more times produced an error-rate of 21%. After a fourth run, the OOB error-rate remained at 21%, suggesting that three sub-samples were sufficient. For the 10% run, the final OOB error-rate was 37% and contained 25,000 SNPs.

Overall, results suggest there is predictive structure (differences between MS cases and controls) beyond chromosome 6p, and that aggressive pruning of the initial mtry = p is more effective for discovering that structure. The top 25 SNPs derived from RF analysis without chromosome 6p markers are shown in Table 2.

Table 2. Top SNPs identified by Random Forests in MS case-control dataset without 6p data

Linkage Disequilibrium

The final consideration was the effect of pruning SNPs based on LD. The dataset without any markers for chromosome 6p was used and the same sub-sampling strategy was followed. Figure 6 shows final error-rates for the six LD configurations investigated, along with the full dataset. The number of SNPs in each configuration is included.

thumbnailFigure 6. Error Rate Across LD Prunes. In the red line we see the OOB error rate across the different LD prunes. There is little information lost going from the full data to pruning at 99% and even 90%. Thereafter there is more loss of information. The blue line shows the number of SNPs that were in each RF analysis.

While pruning past an R2 of 0.90 (LD90) results in a higher final error-rate and suggests a loss of information, it is hard to determine which approach is best when solutions based on full data, LD99 and LD90 are compared. Examination of the top 25 SNPs from the three configurations (full, LD99 and LD90; Tables 2, 3 and 4), reveals that most of the SNPs were located within a gene (14, 14, and 15 respectively). However, the LD90 solution identified SNPs within more unique genes (14) compared to the other configurations (9 and 11). In addition to identification of potentially functional SNPs, the majority of top results show strong marginal associations (p ≃ 10-5) but do not meet established criteria for genome-wide significance [27]. When the top 25 SNP results from each configuration were compared, both overlapping and unique genes are observed. Genes not previously associated with MS were among the top hits, specifically CTNNA3, MPHOSPH9, PHACTR2, and IL7.

Table 3. Top RF SNPs in the MS case-control dataset with LD pruning R2 = 0:99

Table 4. Top RF SNPs in the MS case-control dataset with LD pruning R2 = 0.90

Reliability of Results Obtained from RF

The final three data configurations (full data, LD99 and LD90) were re-analyzed, changing only the random number seed. For all three configurations, at least 19 of the top 25 SNPs were in the final results after sparsity pruning. This suggests that even after changing the seed, RF results are very stable.

Comparison of Results to Original GWA Study

Finally, the RF results were compared with replicated results from the original MS GWA study [19] (Additional File 1). In all, 4 of 13 MS genes were directly identified by one of the three data configurations. The strongest evidence came from SNPs in EVI5 and KANK1 with a suggestion of duplication in IL2RA. Details including SNP rs number, minor allele frequency (MAF), and univariate analysis results are shown.

Additional file 1. Comparison of RF results with original GWA screen in MS cases and controls. The top 16 SNPs identified by IMSGC [19] and corresponding rank for a tag SNPs derived from RF analysis. EVI5 and KANKI ranking the highest based on RF results, with rs10735781 having multiple tag SNPs. *The two SNPs in IL2RA serve as tag SNPs for one another.

Format: XLS Size: 12KB Download file

This file can be viewed with: Microsoft Excel ViewerOpen Data

Discussion

This study is the first application of RF, and one of few machine learning applications, to the analysis of a GWA dataset. The goals were to outline methodological considerations for applying RF to large GWA data, and to identify potential novel MS associations. Given what is currently known about the genetics of MS, it was not surprising that a strong classifier could be constructed by RF based on data for multiple SNPs within the MHC. Among the strongest effects (most important SNP predictors of MS as outcome) was rs9271366, which has been previously shown to tag DRB1*1501 with r2 = 0.98 [28]. Interestingly, once the 6p effect was removed from analyses, a strong classifier based on non-MHC data emerged. Results suggest that sparsity pruning provides a means to discover new associations with RF, although the final error-rate is biased [24].

RF analyses consistently identified four non-MHC genes as important to distinguishing MS cases from controls. These were: MPHOSPH9, CTNNA3, PHACTR2 and IL7. MPHOSPH9 up-regulates neuronal functioning [29], and interestingly, variation within this locus has recently shown suggestive evidence for association in a much larger meta-analysis that included 2,624 MS cases and 7,220 controls [30]. CTNNA3 is a cell adhesion gene that has been associated with Alzheimer's disease [31]. PHACTR2 is involved in phosphate and actin regulation and has been implicated in Parkinson's disease [32]. Finally, IL7 is an important immune system gene involved in T and B cell production and has been implicated in other autoimmune diseases, notably rheumatoid arthritis, but not MS [33]. It is important to note that although SNPs within CTNNA3, MPHOSPH9, PHACTR2 and IL7 were among the top RF results, associations for these SNPs based on univariate analyses would not meet criteria for genome-wide significance [27] (Tables 2, 3, and 4). As a point of comparison, statistical power based on univariate testing was high in our dataset (n = 931 cases and 2,431 controls) for detecting an effect size per allele (or allelic odds ratio) of 1.5 (assuming MAF = 0.15-0.50 and Α = 1.5 * 10 -7), However, power was quite limited to detect smaller effect sizes, for example, 1.3 or 1.2, where ∼ 5 - 30% and ∼ 0.5 - 3% power, respectively, was present. To date, replicated non-MHC MS genes have demonstrated very modest effects of 1.2 or even smaller [19,28,30]. New results from the current study will require further replication in a larger, independent dataset, but underscore the utility of using more than one analytical method to identify genetic associations.

RF results were also compared to findings from the original MS case-control study using the same dataset, with duplication defined as a either the original SNP or one tagged by that SNP among the top RF results. Two previously reported genes, EVI5 and KANK1, were among the top RF findings in the current study. There was also a suggestion of importance based on RF analyses for IL2RA and perhaps CBLB.

Methodologically, it was shown that RF can be applied to large GWA datasets, but certain standard assumptions cannot always be made. The OOB-ER was relied upon to guide decision making about tuning parameters and data configuration. Even though the focus of the current study was not prediction, this error-rate is valuable for determining the quality of RF results. First, when working with large, sparse data, the default value of mtry needs to be increased in order to improve learning. Even for the sub-sampled data sets, generally an mtry = .1p was the optimal setting. It was also found that the number of trees necessary to reach stability depended greatly on the strength of the inputs. For the data configurations with chromosome 6p genotype data, stability was reached within 250 trees, while for the data configurations without chromosome 6p data, stability was often not reached until at least 4,000 trees were generated. LD pruning can be an effective means of reducing data size without significant loss of information. Also removing sparse variables proved to be highly effective and resulted in much more efficient learning. It was established that some very strong effects (chromosome 6p) can mask weaker, yet potentially interesting effects. Prediction based on genetic data that did not include HLA region SNPs was surprisingly strong. Finally, one needs to consider the coding of the allelic data. Coding the data on a dosage scale allows for a flexible examination of genetic effects. Upon settling on a final configuration(s), doing multiple runs of RF is necessary to examine the reliability of the VI measures.

More work is needed to achieve a better understanding of the RF algorithm and how best to apply it to large GWA datasets. The theoretical basis for RF as a predictor is well understood, but less is known about VI. Unlike p-values, there is no strict criterion for distinguishing between important and non-important variables. Our decision to focus on the top 25 results was based on graphing results, and in that sense was fairly qualitative. Ideally, one would use permutation to assess the significance of the VI measures, however this is not feasible with these large datasets. Work is ongoing to determine valid cutoffs for VI measures. Also, only one form of VI was used in this analysis (permutation), but another general VI exists for classification based on the Gini criterion (the optimizing criteria used to construct the tree). Work is also ongoing to define more targeted measures of VI, particularly for SNP data. Furthermore, as discussed, LD between SNPs and other correlated data are problematic for RF due to the way VI is calculated and we are currently exploring alternative VI calculations. Finally, further work is also needed to leverage additional information from the forest of trees. Little work has been done on clustering observations in RF. The tree structure can also be used for identifying extensive regions of interactions and genetic networks and predictors important to specific disease phenotypes.

Conclusions

This study represents one of the first successful applications of a machine learning algorithm to GWA data. Machine learning algorithms require fewer assumptions about the data generating distribution, and therefore, offer a very flexible approach to data analysis. Our results show the RF algorithm is both computationally feasible and sensible for analyses of large GWA datasets. Computation time ranged from a few minutes to a few days depending on the number of variables. Our results support findings from previous genetic studies in MS, and more importantly, new candidates emerged that strongly warrant further investigation.

A unique approach to analyzing complex genetic data is described in the current study. As other machine learning algorithms are expanded to accommodate large GWA datasets, one can apply an array of algorithms to a large dataset, and then aggregate results across methods to determine which markers or genes may be of greatest interest for future studies. Such ensemble learners are common in the machine learning literature [17], and should soon be applicable to larger genetic datasets.

Authors' contributions

BAG, AEH, AC and LFB designed the study. BAG performed analytical work. BAG and LFB wrote the manuscript. BAG, AEH, AC and LFB contributed to discussion and edits to the manuscript. All authors have read and approve the contents of the final manuscript.

Acknowledgements

We thank the International Multiple Sclerosis Genetics Consortium (IMSGC) for collaborating on this study. A full list of members is available at https://www.imsgc.org/. We gratefully acknowledge the significant contributions made by Patricia Ramsay (UC Berkeley) for preliminary Random Forests analyses and multiple discussions related to this project, and Gary Artim for substantial programming assistance related to this project. This work was supported by grants R01AI076544 from the National Institute of Allergy and Infectious Disease and R01NS049510 from the National Institute of Neurological Disorders and Stroke (NINDS). This work was also supported by the Medical Research Council (G0700061), the National Institute of Health (RO1 NS049477 and NS032830) and the Cambridge NIHR Biomedical Research Centre. BAG is funded in part by a National Institutes of Health NRSA Trainee appointment on grant T32 HG 00047 and the Russell M. Grossman Endowment. This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113 (see Nature. 2007; 447;661-78). We gratefully acknowledge the National Institutes of Mental Health (NIMH) for generously allowing the use of their genotype data. Control subjects from the NIMH Schizophrenia Genetics Initiative (NIMH-GI), data and biomaterials are being collected by the "Molecular Genetics of Schizophrenia II" (MGS-2) collaboration. The Investigators and co-Investigators are: ENH/Northwestern University, Evanston, IL, MH059571, Pablo V. Gejman, M.D. (Collaboration Coordinator: PI), Alan R. Sanders, M.D.; Emory University School of Medicine, Atlanta, GA, MH59587, Farooq Amin, M.D. (PI); Louisiana State University Health Sciences Center; New Orleans, Louisiana, MH067257, Nancy Buccola APRN, BC, MSN (PI); University of California-Irvine, Irvine, CA, MH60870, William Byerley, M.D. (PI); Washington University, St. Louis, MO, U01, MH060879, C. Robert Cloninger, M.D. (PI); University of Iowa, Iowa, IA, MH59566, Raymond Crowe, M.D. (PI), Donald Black, M.D.; University of Colorado, Denver, CO, MH059565, Robert Freedman, M.D. (PI); University of Pennsylvania, Philadelphia, PA, MH061675, Douglas Levinson M.D. (PI); University of Queensland, Queensland, Australia, MH059588, Bryan Mowry, M.D. (PI); Mt. Sinai School of Medicine, New York, NY MH59586, Jeremy Silverman, Ph.D. (PI). The samples were collected by V L Nimgaonkar's group at the University of Pittsburgh, as part of a multi-institutional collaborative research project with J Smoller, M.D. D.Sc. and P Sklar, M.D. Ph.D. (Massachusetts General Hospital) (grant MH 63420).

References

  1. WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

    Nature 2007, 447:661-678. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Heidema AG, Boer JM, Nagelkerke N, Mariman EC, van der A DL, Feskens EJ: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases.

    BMC Genet 2006, 7:23. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  3. Kooperberg C, Ruczinski I: Identifying interacting SNPs using Monte Carlo logic regression.

    Genet Epidemiol 2005, 28:157-170. PubMed Abstract | Publisher Full Text OpenURL

  4. Motsinger AA, Ritchie MD: Multifactor dimensionality reduction: an analysis strategy for modelling and detecting gene-gene interactions in human genetics and pharmacogenomics studies.

    Hum Genomics 2006, 2:318-328. PubMed Abstract | Publisher Full Text OpenURL

  5. Yoon Y, Song J, Hong S, Kim J: Analysis of multiple single nucleotide polymorphisms of candidate genes related to coronary heart disease susceptibility by using support vector machines.

    Clin Chem Lab Med 2003, 41:529-534. PubMed Abstract | Publisher Full Text OpenURL

  6. Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van Eerdewegh P: Identifying SNPs predictive of phenotype using random forests.

    Genet Epidemiol 2005, 28:171-182. PubMed Abstract | Publisher Full Text OpenURL

  7. Díaz-Uriarte R, de Andrés Alvarez S: Gene selection and classification of microarray data using random forest.

    BMC Bioinformatics 2006, 7:3. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  8. Glaser B, Nikolov I, Chubb D, Hamshere ML, Segurado R, Moskvina V, Holmans P: Analyses of single marker and pairwise effects of candidate loci for rheumatoid arthritis using logistic regression and random forests.

    BMC Proc 2007, 1(Suppl 1):S54. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  9. Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P: Screening large-scale association study data: exploiting interactions using random forests.

    BMC Genet 2004, 5:32. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  10. Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL: Performance of random forest when SNPs are in linkage disequilibrium.

    BMC Bioinformatics 2009, 10:78. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  11. Nonyane B, Foulkes A: Application of two machine learning algorithms to genetic association studies in the presence of covariates.

    BMC Genetics 2008, 9:71. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  12. Sun YV, Cai Z, Desai K, Lawrance R, Leff R, Jawaid A, Kardia SL, Yang H: Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests.

    BMC Proc 2007, 1(Suppl 1):S62. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  13. Breiman L: Random Forests.

    Machine Learning 2001, 45:5-32. Publisher Full Text OpenURL

  14. Breiman L, Friedman J, Olshen R, Stone C: Classification and Regression Trees. New York: Chapman & Hall; 1984.

  15. Breiman L: Bagging Predictors.

    Machine Learning 1996, 24:123-140. OpenURL

  16. Breiman L: Out-Of-Bag Estimation. Tech. rep., UC Berkeley; 1996.

  17. Hastie T, Tibshirani R, Friedman J: Elements of Statistical Learning. 2nd edition. New York: Springer; 2009.

  18. Genuer R, Poggi JM, Tuleau C: Random Forests: some methodological insights. [http://hal.inria.fr/inria-00340725/en/] webcite

    Tech rep, INRIA 2008. OpenURL

  19. Hafler DA, Compston A, Sawcer S, Lander ES, Daly MJ, De Jager PL, de Bakker PI, Gabriel SB, Mirel DB, Ivinson AJ, Pericak-Vance MA, Gregory SG, Rioux JD, McCauley JL, Haines JL, Barcellos LF, Cree B, Oksenberg JR, Hauser SL: Risk alleles for multiple sclerosis identified by a genomewide study.

    N Engl J Med 2007, 357:851-862. PubMed Abstract | Publisher Full Text OpenURL

  20. Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering.

    Am J Hum Genet 2007, 81:1084-1097. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses.

    Am J Hum Genet 2007, 81:559-575. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Liaw A, Wiener M: Classification and regression by randomForest.

    Rnews 2002, 2:18-22. OpenURL

  23. Salford Systems [http://salford-systems.com/] webcite

  24. Svetnik V, Liaw A, Tong C: Variable selection in random forest with application to quantitative structureactivity relationship. In Proceedings of the 7th Course on Ensemble Methods for Learning Machines. Edited by Intrator N, Masulli F. Springer-Verlag; OpenURL

  25. Oksenberg JR, Barcellos LF: Multiple sclerosis genetics: leaving no stone unturned.

    Genes Immun 2005, 6:375-387. PubMed Abstract | Publisher Full Text OpenURL

  26. Strobl C, Boulesteix AL, Zeileis A, Hothorn T: Bias in random forest variable importance measures: illustrations, sources and a solution.

    BMC Bioinformatics 2007, 8:25. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  27. Pearson TA, Manolio TA: How to interpret a genome-wide association study.

    JAMA 2008, 299:1335-1344. PubMed Abstract | Publisher Full Text OpenURL

  28. Australia, Consortium NZMSG: Genome-wide association study identifies new multiple sclerosis susceptibility loci on chromosomes 12 and 20.

    Nature Genetics 2009, 41:824-828. PubMed Abstract | Publisher Full Text OpenURL

  29. Ward GR, Franklin SO, Gerald TM, Dempsey KT, Clodfelter DE, Krissinger DJ, Patel KM, Vrana KE, Howlett AC: Glucocorticoids plus opioids up-regulate genes that influence neuronal function.

    Cell Mol Neurobiol 2007, 27:651-660. PubMed Abstract | Publisher Full Text OpenURL

  30. deJager PL, Jia X, Wang J, deBakker PIQ, Ottoboni L, Aggarwal NT, Piccio L, Raychaudhuri S, Tran D, Aubin C, Briskin R, Romano S, IMSGC : Meta-analysis of genome scans and replication identify CD6, IRF8 and TNFRSF1A as new multiple sclerosis susceptibility loci.

    Nature Genetics 2009, 41:776-782. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Morgan AR, Hamilton G, Turic D, Jehu L, Harold D, Abraham R, Hollingworth P, Moskvina V, Brayne C, Rubinsztein DC, Lynch A, Lawlor B, Gill M, O'Donovan M, Powell J, Lovestone S, Williams J, Owen MJ: Association analysis of 528 intra-genic SNPs in a region of chromosome 10 linked to late onset Alzheimer's disease.

    Am J Med Genet B Neuropsychiatr Genet 2008, 147B:727-731. PubMed Abstract | Publisher Full Text OpenURL

  32. Wider C, Lincoln SJ, Heckman MG, Diehl NN, Stone JT, Haugarvoll K, Aasly JO, Gibson JM, Lynch T, Rajput A, Rajput ML, Uitti RJ, Wszolek ZK, Farrer MJ, Ross OA: Phactr2 and Parkinson's disease.

    Neurosci Lett 2009, 453:9-11. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  33. van Roon JA, Lafeber FP: Role of interleukin-7 in degenerative and inflammatory joint diseases.

    Arthritis Res Ther 2008, 10:107. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL