Abstract
We used the simulated data set from Genetic Analysis Workshop 15 Problem 3 to assess a twostage approach for identifying singlenucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). In the first stage, we used random forests (RF) to screen large amounts of genetic data using the variable importance measure, which takes into account SNP interaction effects as well as main effects without requiring model specification. We used the simulated 9187 SNPs mimicking a 10 K SNP chip, along with covariates DR (the simulated DRB1 gentoype), smoking, and sex as input to the RF analyses with a training set consisting of 750 unrelated RA cases and 750 controls. We used an iterative RF screening procedure to identify a smaller set of variables for further analysis. In the second stage, we used the software program CaMML for producing Bayesian networks, and developed complex etiologic models for RA risk using the variables identified by our RF screening procedure. We evaluated the performance of this method using independent test data sets for up to 100 replicates.
Background
It is commonly believed that complex diseases are caused not by single genes acting alone, but by multiple genes and nongenetic factors interacting with one another. Due to the large number of singlenucleotide polymorphisms (SNPs) now available in genomewide scans, the computational burden of testing each locus for main effects and all possible twoway, threeway, and higherorder interactions is overwhelming. One approach to reducing the number of interactions to examine is to perform a twostage analysis. In the first stage, one identifies a subset of SNPs for further analysis of interaction in the second stage. Often, a univariate test (e.g., a chisquare test) is used to identify SNPs associated with outcome in the first stage. When riskassociated SNPs have small marginal effects but large interaction effects in the population, univariate methods will result in low power for detecting these SNPs. "Multilocus" approaches consider interactions of multiple genes and environmental factors in identifying susceptibility loci for complex diseases [1]. Random Forests (RFs) [2] provide a powerful method for detecting interacting risk susceptibility SNPs (rSNPs) [3]. However, this method does not provide a model that delineates the interactions.
Bayesian networks (or directed graphical models) are graphs in which the nodes represent random variables and the arrows represent dependence relationships [4]. These methods have been successfully applied to generate a model describing the relationship among SNPs in multiple candidate genes for a complex trait [5].
Methods
We used the 100 replicates of simulated data in Problem 3 provided by the Genetic Analysis Workshop 15 (GAW15). We performed analyses with knowledge of the "real" answers but screened all of the 9187 SNPs, distributed on the genome to mimic a 10 K SNP chip without regard to the generating model. We used disease status for rheumatoid arthritis (RA) as the outcome and smoking, sex and DR genotype (the simulated DRB1 genotype) as covariates.
Subjects
To obtain biologically independent cases, for each replicate we randomly selected one affected sibling from each of 1500 nuclear families. These 1500 cases were then divided at random into a training data set of 750 affected subjects and a test data set of 750 cases. The GAW data provided 2000 unrelated unaffected individuals for use as controls. Two independent sets of 750 controls were selected at random from the 2000 for use as training data set and test data set controls. Thus, for each replicate we had independent training and test data sets consisting of 750 cases and 750 controls.
Random Forests
RFs grow a large number of classification or regression trees with no trimming or pruning. The RF method produces an importance score for each variable that quantifies the relative contribution of that variable to the prediction accuracy. We used this score to prioritize the predictor variables. The RF also produces prediction errors for the individuals, which we used for evaluation of the method.
We used Random Forests version 5 [6] to analyze the training data. We used an iterative process similar to a strategy previously proposed for gene expression analysis [7] in which, at each iteration, we built a random forest using the training data, and saved the 50% of variables with the highest importance scores to build the next forest. The random forests built at each iteration were named IT_{0}, IT_{1},..., IT_{n}, and the prediction errors of the training data set were estimated for the forest built at each iteration. We call the forest with the best prediction error IT_{bp}. The variables included in IT_{bp }were then used in secondstage analysis. We compared the performance of the iterative procedure that resulted in the forest IT_{bp}, in terms of keeping the true risk variables and removing noise variables, to a simple procedure of selecting the top 50 ranked variables by importance from iteration 0, in the test data sets. Specifically, we compared the prediction error of the IT_{bp }forest, the IT_{0 }forest (all variables used; no selection), and a forest built using only the top 50 SNPs from the IT_{0 }forest ("IT_{top50}"). Because the iterative procedure averaged 53 SNPs in the final forest, we chose 50 SNPs from IT_{0 }to yield a forest with approximately the same number of SNPs. We computed prediction error using the test data ("test"), and using the outofbag observations of the training data set ("training").
Network inference
Bayesian networks (BN) are directed acyclic graphs for representing the joint probability distribution of all variables. A network for discrete variables, e.g., Figure 1, is specified by the graph structure (nodes and arcs) and the conditional probability table (CPT) at each node (node chr6_162 is shown). Each node is a variable, and each directed arc implies association and direction of dependency between the two variables. The origin node of an arc is usually called the parental node, and nodes that an arc points to are called child nodes. A child node is conditionally independent of other nodes given its parental nodes. Thus, the joint probability of n variables can be simplified to , where x_{il}, x_{i2},..., x_{ik }in the condition are the parental nodes of x_{i }and a subset of x_{1}, x_{2}, x_{i1},..., x_{i+1}, x_{n}. BN models are useful for describing complex relationships among variables, as well as for making predictions for variables that are regarded as outcomes.
Figure 1. Bayesian network based on variables of IT_{bp }for Replicate 1.
CaMML [8], Causal Minimum Message Length (MML), is a program for generating Bayesian networks. The general goal is to find a model that maximizes the posterior probability of that model given the data. CaMML searches over all possible structures (models) using the Metropolis algorithm. It uses MML as a metric that includes a penalty on model complexity to control the resampling process. We evaluated the performance of CaMML on a set of variables used in the IT_{bp }forest described above. We used the test data set to predict case status and estimate prediction error.
Results
We identified the best surrogates for all risk loci (AG) as the SNPs with the highest linkage disequilibrium (LD) (r^{2}) with risk loci from the answer files given with the GAW15 data (Table 1). For locus C, three SNPs had r^{2 }≥ 0.2; for locus D, two SNPs had r^{2 }≥ 0.2. When analyzing the results, we considered these SNPs true positives, in that they are the best proxies for the true risk loci that were not genotyped.
Table 1. Power estimate of IT_{bp }and CaMML
Risk variables identified by RF
We compared IT_{bp }and IT_{0 }top 50 for choosing a set of variables by comparing how often the best surrogates for loci AG appeared in the variable set. DR and the best surrogates for C, D, and F were included in 94 and 98 out of 100 replicates for the IT_{bp }forest and the top 50 variables for IT_{0 }forest, respectively. The average number of variables included in the IT_{bp }forest was 53 (range 8–287). The IT_{bp }forest occurred, on average, at iteration 7.64 (range 5–10).
Estimate of prediction error
As seen in Table 2, the mean and median prediction error for the training data sets is smaller than that for the test data sets for the IT_{bp }and IT_{top50 }methods (median differences 2.77, 0.93, p < 0.0001), which may indicate overfitting. The IT_{0 }forest gives similar prediction error for test and training data.
Table 2. Prediction error for random forest analyses
For the training data sets, the mean prediction error for the IT_{bp }forests is smaller than that for the IT_{0 }forests; the IT_{top50 }forests fall in between (Table 3). For the test data sets, although both IT_{top50 }and IT_{bp }outperform IT_{0}, the IT_{bp }has larger prediction error than IT_{top50 }(difference in median = 0.43, p < 0.0001), which might be due to overfitting for the iterative method.
Table 3. Paired Wilcoxon rank test of prediction errors from three RFs, using Replicates 1–100
Network inference
We used CaMML to analyze the variables selected from IT_{bp }for Replicates 1 to 50. Due to computational limits, if more than 50 variables were selected by IT_{bp}, only the top 50 variables were used for secondstage analysis. With the maximum number of variables restricted to 50, the average number of variables used in CaMML across the 50 replicates was 40. In the estimated BNs, an average of 11 variables were connected to RA status directly or indirectly through other variables in a path of a network that included RA status. The average prediction error using the test data was 12.4% (Table 2), which is smaller than that of IT_{bp }(Table 4). An example BN with the conditional probability table (CPT) for node chr6_162, using Replicate 1 is displayed in Figure 1. In this BN, all SNPs included in the analysis with r^{2 }> 0.3 with one of the disease loci (Table 1) were connected directly or indirectly to RA. Many SNPs on chromosome 6 were interconnected due to LD between these markers. The CPT for node chr6_162 showed 5.5fold increased risk of RA for carrying allele 3 versus allele 2.
Table 4. Paired Wilcoxon rank test of prediction errors from three RFs and CaMML using test data and Replicates 1–50
Table 1 displays the frequency of variables appearing in the network for Replicates 1–50. We have 100% power to detect SNP6_152, SNP6_153, SNP6_154 (surrogates for locus C), SNP6_162 (surrogate for locus D), and SNP11_389 (surrogate for locus F), all of which have strong LD (r^{2 }≥ 0.418) with disease loci. We have lower power (66%) to detect SNP6_160, a surrogate for D that is in lower LD (r^{2 }= 0.273). Despite its low LD with locus C (r^{2 }= 0.104), the power to detect SNP6_155 is 98%. This may be due to the very strong effect of locus C. Importantly, CaMML identified all covariates (DR, sex, and smoking) and almost all surrogates in LD with disease loci (with exception of SNP6_160, which was not detected by CaMML in one replicate) as part of the RA network from variables selected from IT_{bp}.
Discussion
Using the simulated data from Problem 3, we assessed a twostage approach for identifying SNPs associated with RA that employs random forests to identify important variables, and Bayesian networks to further filter out noise SNPs by reducing prediction error. The random forest analysis reduced the number of variables for further Bayesian network analyses from 9190 to about 53. This screening strategy successfully filtered out many SNPs unassociated with the disease loci, while keeping the surrogates for risk SNPs for four out of nine of these loci (DR, C, D, and F) in 94 of 100 replicates. Although IT_{bp }seems to give lower prediction error than IT_{top50 }in training data sets, IT_{top50 }gives lower prediction error than IT_{bp }in test data sets. Therefore, the strategy of building a second forest using the top 50 SNPs from a first forest may be a better variable selection method overall. However, the effects of these loci in this data set are very strong, and it is not clear that this result will generalize to data weaker association signals. Further, it is not clear how to choose the number of variables to select if one uses the simpler procedure. Additional simulation studies are needed to determine how to generalize our results to less ideal circumstances. The fact that the difference in the median of prediction errors for training and test data sets are large for IT_{bp }suggests overfitting; however, because we removed a large (50%) proportion of "noise" in this first stage, IT_{bp }is not expected to be the optimal RF with the lowest prediction error. It is possible to remove one noise variable at a time; however, it is not practical in the context of thousands of variables. We expected the BN analysis to further reduce the number of noise SNPs and provide some guidance as to important interaction effects.
Bayesian network analysis based on a subset of the variables (≤ 50) selected from IT_{bp }captured most of the true loci and the correct dependencies among them and further decreased the test set prediction error. The network model provides a method for predicting case status and facilitates the understanding of complex relationships between the disease and genetic and environmental factors. The limitations of BN include the difficulty to discern the exact relationship between variables that are interconnected and the exponential increase in computation time with the number of variables. These make BN impractical for genomewide scan of dense SNPs. However, BN results are at least useful to generate potentially biological meaningful hypotheses to be confirmed by further statistical analyses or/and biological experiments.
Competing interests
The author(s) declare that they have no competing interests.
Acknowledgements
This work was supported by grants from Fonds de la Recherche en Santé (YM), the National Institutes of Health (R01 AG09029 to YM, R01 AG017173 to KTC, R01 NS3671104 to AD) and the National Heart, Lung and Blood Institute's Framingham Heart Study Contract N01HC25195 (QY). It utilized the Boston University Linux Cluster for Genetic Analysis (LinGA), which is funded by the NIH National Center for Research Resources Shared Instrumentation grant (S10 RR16373601A1). The authors thank Lucas Hope and Rodney O'Donnell from the Faculty of Information Technology at Monash University, Australia for their assistance with the CaMML program.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/1?issue=S1.
References

Hoh J, Ott J: Mathematical multilocus approaches to localizing complex human trait genes.
Nat Rev Genet 2003, 4:701709. PubMed Abstract  Publisher Full Text

Mach Learn 2001, 45:532. Publisher Full Text

Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P: Screening largescale association study data: exploiting interactions using random forests.
BMC Genet 2004, 5:32. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Murphy K: A brief introduction to graphical models and Bayesian networks. [http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html] webcite

Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH: Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia.
Nat Genet 2005, 37:435440. PubMed Abstract  Publisher Full Text

Breiman L, Cutler A: Random forests. Version 5. [http://www.stat.berkeley.edu/users/breiman/RandomForests/] webcite

DiazUriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest.
BMC Bioinformatics 2006, 7:3. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Wallace CS, Korb KB: Learning linear causal models by MML sampling. In Causal Models and Intelligent Data Management. Edited by Gammerman A. Berlin: SpringerVerlag; 1999:89111.