Abstract
Background
Detecting epistatic interactions plays a significant role in improving pathogenesis, prevention, diagnosis, and treatment of complex human diseases. Applying machine learning or statistical methods to epistatic interaction detection will encounter some common problems, e.g., very limited number of samples, an extremely high search space, a large number of false positives, and ways to measure the association between disease markers and the phenotype.
Results
To address the problems of computational methods in epistatic interaction detection, we propose a scorebased Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on simulation data show that our method outperforms some other commonlyused methods in terms of power and sampleefficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data.
Conclusions
We propose a Bayesian networkbased method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higherorder epistatic interactions by estimating the model complexity from data, and apply a fast BranchandBound algorithm to learn the structure of a twolayer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applications of the proposed method to some real GWAS (genomewide association studies) datasets may provide helpful insights into understanding the genetic basis of Agerelated Macular Degeneration, lateonset Alzheimer's disease, and autism.
Background
To identify genetic variants that affect susceptibility of a variety of diseases, genomewide association studies (GWAS) genotype a dense set of common SNPs (Single Nucleotide Polymorphism) and test allelic frequencies among a cohort of affected people and nonaffected people [1]. Traditional analysis methods for GWAS data only consider one SNP at a time and test its association with disease. This type of analysis strategy is only suitable for simple Mendelian disorders. Some common complex diseases such as various types of cancers, cardiovascular disease, and diabetes are influenced by multiple genetic variants. Therefore, detecting highorder epistasis, which refers to the interactive effect of two or more genetic variants on complex human diseases, can help to unravel how genetic risk factors confer susceptibility to complex diseases [2]. However, the very large number of SNPs checked in a typical GWAS and the enormous number of possible SNP combinations make detecting highorder epistatic interactions from GWAS data computationally challenging [3]. Moreover, how to measure the association between a set of SNPs and the phenotype presents another grand statistical challenge.
During the past decade, two types of heuristic computational methods have been proposed to detect epistatic interactions: prediction/classificationbased methods and associationbased methods. Prediction/classificationbased methods try to find the best set of SNPs, which can generate the highest prediction/classification accuracy including, for example, multifactor dimensionality reduction (MDR) [4], penalized logistic regression (e.g., stepPLR [5], and lassoPLR [6]), support vector machine (SVM) [7], and random forest [8]. MDR is a nonparametric and modelfree method based on constructing a risk table for every SNP combination [4]. If the case and control ratio in a cell of this risk table is larger than 1, MDR will label it as "high risk", otherwise, "low risk". By the risk table, MDR can predict disease risk and will select the SNP combination with the highest prediction accuracy. StepPLR and lassoPLR make some modifications to avoid the overfitting problems that standard logistic regression methods suffer from [9] when detecting epistatic interactions. For example, stepPLR combines the logistic regression criterion with a penalization of the L2norm of the coefficients. This modification makes stepPLR more robust to highorder epistatic interactions [5]. Two machine learning methods: SVM [7] and random forest [8] have also been applied to detecting epistatic interactions. Machine learning methods are based on binary classification (prediction) and treat cases as positives and controls as negatives in SNP data. They use SVM or random forest as a predictor and select a set of SNPs with the highest prediction/classification accuracy by feature selection. Some prediction/classificationbased methods can only be applied to smallscale analysis (i.e., a small set of SNPs) due to their computational complexity. Moreover, almost all prediction/classificationbased methods tend to introduce a large number of false positives, which may result in a huge cost for further biological validation experiments [10].
Bayesian epistasis association mapping (BEAM) is a scalable and associationbased method [11]. It partitions SNPs into three groups: group 0 is for normal SNPs, group 1 contains disease SNPs affecting disease risk independently, and group 2 contains disease SNPs that jointly contribute to the disease risk (interactions). Given a fixed partition, BEAM can get the posterior probability of this partition from SNP data based on Bayesian theory. A Markov Chain Monte Carlo method is used to reach the optimal SNP partition with maximum posterior probability in BEAM. One drawback of BEAM is that identifying both single disease SNP and SNP combinations simultaneously makes BEAM overcomplex and weakens its power.
Recently, we propose a new Markov blanketbased method, DASSOMB, to detect epistatic interactions in casecontrol studies [10]. The Markov Blanket is a minimal set of variables, which can completely shield the target variable from all other variables based on Markov condition property [12]. Thus, Markov blanket methods can detect the causal disease SNPs with the fewest false positives. Furthermore, the heuristic search strategy in Markov blanket methods can avoid the timeconsuming training process as in SVM and random forests. However, the faithfulness assumption in Markov blanket methods, which can hardly always be ensured, may hinder their applications in detecting epistatic interactions [13].
In this paper, we address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection by introducing a scorebased Bayesian network structure learning method, EpiBN (Epistatic interaction detection using Bayesian Network model), which employs a BranchandBound technique and a new scoring function. Bayesian networks provide a succinct representation of the joint probability distribution and conditional independence among a set of variables. In general, a scorebased structure learning method for Bayesian networks first defines a scoring function reflecting the fitness between each possible structure and the observed data, and then searches for a structure with the maximum score. Comparing to Markov blanket methods, the merits of applying scorebased Bayesian network structure learning method to epistatic interaction detection include: (1) the faithfulness assumption can be relaxed and (2) heuristic search method can solve the classical XOR (Exclusive or) problem [14]. We apply the EpiBN method to simulated datasets based on four disease models and three real datasets: Agerelated Macular Degeneration (AMD) dataset, lateonset Alzheimer's disease (LOAD) dataset, and autism dataset. We demonstrate that the proposed method outperforms some commonlyused methods such as SVM, MDR, and BEAM, especially when the number of samples is small.
Methods
Bayesian networks: a brief introduction
A Bayesian network is a directed acyclic graph (DAG) G consisting of nodes corresponding to a random variable set {X_{1}, X_{2}, ..., X_{n}} and edges between nodes, which determine the structure of G and therefore the joint probability distribution of the whole network [15]. For three random variables (nodes) X, Y and Z, if the probability distribution of X conditioned on both Y and Z is equal to the probability distribution of X conditioned only on Y, i.e., P(XY, Z) = P(XY), X is conditionally independent of Z given Y. This conditional independence is represented as(X ⊥ Z  Y) [16]. The DAG G encodes local Markov assumption: each variable is conditionally independent of its nondescendants, given its parents in G. By applying the local Markov assumption, the joint probability distribution J can be represented as
where Pa(X_{i}) denotes the set of parents of X_{i }in G. Therefore, there are two components in a Bayesian network. The first component is the DAG G reflecting the structure of the network. The second component, θ, describes the conditional probability distribution P(X_{i}Pa(X_{i})) to specify the unique distribution J on G.
Bayesian networks provide models of causal influence and allow us to explore causal relationships, perform explanatory analysis, and make predictions. Genomewide association studies attempt to identify the epistatic interaction among a set of SNPs, which are associated with one certain type of disease. Therefore, we can use Bayesian networks to represent the relationship between genetic variants and a phenotype (disease status). The n SNP nodes and the disease status/label node form a twolayer Bayesian network and we want to determine which SNP nodes are the parent nodes of the disease status node. In this type of Bayesian network, we only allow edges from SNP nodes to the disease status node. Edges from the disease status node to SNP nodes and edges among SNP nodes are prohibited.
By modelling the association between SNPs and the disease status based on Bayesian networks, we transform detecting epistatic interactions into structure learning of Bayesian networks from GWAS data. There are two types of structure learning methods for Bayesian networks: constraintbased methods and scoreandsearch methods. The constraintbased methods first build a skeleton of the network (undirected graph) by a set of dependence and independence relationships. Next they direct links in the undirected graph to construct a directed graph with dseparation properties corresponding to the dependence and independence determined [17,18]. Although constraintbased methods are developed with a rigorous theoretical foundation, errors in conditional dependence and independence will affect the stability of constraintbased methods, especially for small sample problems, which is also a problem of Markov Blanket methods in detecting epistatic interactions. Therefore, in this paper, we focus on scoreandsearch methods. The scoreandsearch methods view a Bayesian network as a statistical model and transform the structure learning of Bayesian networks into a model selection problem [19]. To select the best model, a scoring function is needed to indicate the fitness between a network and the data. Then the learning task is to find the network with the highest score. Thus, scoreandsearch methods typically consist of two components, (1) a scoring function, and (2) a search procedure. Next, we discuss in detail the proposed EpiBN algorithm, which consists of three components: scoring, searching, and screening.
EpiBN scoring: A new BN scoring function
One of the most important issues in scoreandsearch methods is the selection of scoring function. A natural choice of scoring function is the likelihood function. However, the maximum likelihood score often overfits the data because it does not reflect the model complexity. Therefore, a good scoring function for Bayesian networks' structure learning must have the capability of balancing between the fitness and the complexity of a selected structure. There are several existing scoring functions based on a variety of principles, such as the information theory and minimum description length (e.g. BIC score, AIC score, and MDL score) [2022] and Bayesian approach (BDe score) [23].
Suppose that a dataset D includes n variables {X_{1}, X_{2}, ..., X_{n}} and N samples, we can write a general informationbased scoring function as:
where is an estimate of parameters from the maximum likelihood method for the structure S, q_{i }is the number of configurations of the parent set Pa(X_{i}) of X_{i}, r_{i }is the number of states of X_{i}, C(S) represents the structure complexity, and f(N) is a penalization function [24]. The first term of this score scheme measures the fitness between the structure and data, and the second term reflects structure complexity. With the maximum likelihood method [19], we can get
where N_{ijk }is the number of instances where X_{i }takes its kth value and the set of variables Pa(X_{i}) takes its jth configuration; N_{ij }is the number of instances where the set of variables Pa(X_{i}) takes its jth configuration. Obviously, . Note that if we set f(N) = 1, we get the AIC score as
Alternatively, if we set f(N) = 1/2 log(N), then we obtain the BIC score as
The BIC score and AIC score are derived from some approximations when the number of samples N approaches infinity [25]. If the number of samples is small, the approximation in the inference of both AIC and BIC scores can not hold any more and the structure penalty term in Eq. (5) and Eq. (6) are not suitable [26]. When using informationbased scores in the Bayesian network model to detect epistatic interactions by GWAS data, which show a nonskewed distribution, the BIC score is too strict and prefers to select simple structures, while the AIC score prefers to select complex structures [27].
We herein describe a new informationbased scoring function to detect epistatic interactions by Bayesian network model. In the Bayesian network for epistatic interaction detection, we are only concerned with one target node, the disease status node, and we want to detect its parent SNP nodes. We represent the local structure around the disease status node as LDS (Local Disease Structure), which consists of the disease status node and edges from candidate disease SNP nodes to the disease status node. Because of the decomposability property of informationbased scoring function, the AIC score for LDS is:
where C(LDS) is the complexity of the local disease structure, q is the number of configurations of parent SNP nodes, r is the number of states of the disease status node, N_{jk }is the number of instances where the disease status node takes its kth value and the parent SNP nodes take their jth configuration, N_{j }is the number of instances where the parent SNP nodes take their jth configuration, and .
We start our search from an empty local disease structure LDS_{0}, and we can obtain the AIC score for LDS_{0}:
where N_{k }is the number of instances in which the disease status node takes its kth value, and .
For further inference, we use X for the target disease status node and use Pa(X) for its parent SNP nodes. The loglikelihood of LDS and LDS_{0 }can also be expressed as follows:
where H(X) is the entropy of X and H(XPa(X)) is the conditional entropy of X given its parent set Pa(X) [28]. Based on the concept of mutual information and Eq. (7)(10), the mutual information between X and Pa(X) is:
i.e. the mutual information between X and Pa(X) coincides with the difference between the loglikelihood of LDS and LDS_{0 }[24].
The G^{2 }test is commonly used to test independence and conditional independence between two variables for discrete data. From the general formula for G^{2}, we know that the value of G^{2 }can also be calculated from mutual information [29]. Thus, we can write the G^{2 }test value between X and Pa(X) as:
The number of degrees of freedom for G^{2 }test between X and Pa(X) is:
where Cat(V) is the number of categories of the variable V, and thus Cat(X) = r and Cat(Pa(X)) = q [18].
It is interesting to note that the difference between the complexity of LDS and LDS_{0 }is equal to the degree of freedom of G^{2}(X, Pa(X)) by
By applying Eq. (7)(14), the difference of AIC scores between LDS and LDS_{0 }is:
Thus, the AIC score becomes:
where log P(DLDS_{0}) is a constant.
The distribution of G^{2 }asymptotically approximates to that of χ^{2 }with the same number of degrees of freedom [18]. The χ^{2 }distribution with k degrees of freedom has a variance of 2k, and therefore 2DF(G^{2}(X, Pa(X))) is the variance of the corresponding G^{2 }distribution. Since G^{2}(X, Pa(X)) reflects the bias, the AIC score in Eq. (16) indicates a tradeoff between bias and variance in terms of the G^{2 }statistic G^{2}(X, Pa(X)) and its variance.
One problem for the AIC score in Eq. (5), Eq. (7), and Eq. (16) is that the penalty term (the effective number of parameters) in AIC score probably can not reflect the model complexity (or variance) especially when applied to SNP data with a nonskewed distribution. We can confirm this by comparing 2DF(G^{2}(X, Pa(X))) with the true variance of G^{2}(X, Pa(X)) from SNP data. There is a large deviation between them when Pa(X) contains more than two parent nodes. The more parent nodes Pa(X) contains, the larger the deviation is because of the increasing model complexity and hence the increasing 'difficulty in estimation' [30]. One simple but practical way to consider and estimate the model complexity in AIC score is replacing 2DF(G^{2}(X, Pa(X))) in Eq. (16) with the true variance of G^{2}(X, Pa(X)) from data and our new epistatic scoring function (EpiScore) becomes:
where Var_{D}(G^{2}(X, Pa(X))) comes from the estimation of the variance of the corresponding G^{2 }distribution from data. Our new scoring function estimates the penalty term from the data to make it consistent with the data, which is similar to the DIC (Deviance Information Criterion) score trying to identify models that best explain the observed data [30].
Due to the estimation of the variance of G^{2}(X, Pa(X)) from data in Eq. (17), EpiScore is not scoreequivalent. However, we are not very concerned about this: there are no equivalent structures in the twolayer Bayesian network for the restriction on the direction of edges we describe in the previous section.
EpiBN searching: A BranchandBound algorithm for local structure learning
The computational task in scoreandsearch methods is to find a network structure with the highest score. The searching space consists of a superexponential number of structures and thus exhaustively searching optimal structure from data for Bayesian networks is NPhard [31]. One simple heuristic search algorithm is greedy hillclimbing algorithm, where three types of operators are defined to change one edge at each step: adding an edge, removing an edge, and reversing an edge. By these three operators, we can construct the local neighbourhood of the current network. Then we select the network with the highest score in the local neighbourhood to get the maximal gain. This process can be repeated until it reaches a local maximum. However, greedy hillclimbing algorithm cannot guarantee a global maximum [19]. Other structure learning methods for Bayesian networks include BranchandBound (B&B) [28,32] and Markov chain Monte Carlo [33].
We employ B&B algorithm in our study because the B&B algorithm can guarantee the optimal results in a significantly reduced search time compared to exhaustive search. Our EpiBN method uses B&B algorithm to search a local disease structure that maximizes the EpiScore in Eq. (17). The pseudo code of EpiBN is shown in Figure 1. In EpiBN, the procedure BN_B&B starts from an empty parent node set and constructs a depthfirst search tree to find the optimal parent (disease SNPs) set for the disease status node. In our B&B search, instead of using the pruning strategy as in [28], which sets a lower bound for the MDL score to prune the search tree, we stop the recursive calls when we observe that the score will decrease on the children state of the current state. This strategy cannot guarantee global optima theoretically. However, it will significantly speed up the search process and perform well practically.
Figure 1. EpiBN Algorithm.
To guarantee to find the parent set with the highest EpiScore, we can use the upper bound of the EpiScore to prune the search tree. We notice the G^{2 }function in Eq. (12) has the property:
When adding a SNP node q into the parent set V_{1}, the variance of the corresponding G^{2 }distribution, the penalty term in Eq. (17), will increase by Var_{D}(G^{2}(X, V_{2}))  Var_{D}(G^{2}(X, V_{1})). On the other hand, the G^{2}(X, V_{1}) will increase at most by 2N*H(X) because the value of entropy H(XPa(X)) will decrease and is nonnegative. Hence, if we find
adding a SNP node q into the current parent set V_{1 }will not increase the EpiScore and thus any further search along the branch is useless. Essentially, the upper bound of the EpiScore is
EpiBN screening: MCMC screening method for real datasets
Even though the B&B algorithm uses an upper score bound to reduce the searching space, it still has an exponential time complexity in the worst case and is not feasible to be directly applied to real GWAS data. Therefore, an efficient screening method is necessary. Traditional screening methods assign a score to every single SNP and select a subset of SNPs with high scores. However, these methods ignore the joint effect of SNPs on disease and are not suitable for detecting epistatic interactions from real GWAS data.
In this paper, we use the Markov chain Monte Carlo (MCMC) method [33] to perform the screening process. In the Bayesian network for epistatic interaction detection, we use a MetropolisHastings method to build a Markov chain to get the posterior probability for each edge from the SNP nodes to the disease status node. At each step of the Markov chain, we use two types of moves: add an edge and remove an edge. The proposed move is accepted with probability
where
where #(nbd(LDS)) is the cardinality of the neighbourhood of the current local disease structure and LDS' is the candidate local disease structure in each step of the Markov chain. Since LDS and LDS' differ in one move, the ratio #(nbd(LDS))/#(nbd(LDS')) is one. In addition, the posterior probability of the local disease structure, P(LDSD), is that P(LDSD) ∝ P(DLDS) P(LDS) and we take a uniform distribution over the considered local disease structures. Therefore, the acceptance ratio in Eq. (22) becomes:
The likelihood of local disease structure, P(DLDS), can be calculated by Eq. (17). Based on the result from MCMC method, we select SNP nodes associated with edges whose posterior probabilities larger than 0. Since we consider the association of multiple SNPs with disease status at each step of the Markov chain in our MCMC method, the potential disease SNPs related with epistatic interactions will be kept in the final subset of SNPs.
Results
Analysis of Simulated Data
Simulated Data We first evaluate the proposed EpiBN method on four simulated data sets, which are generated from three commonly used twolocus epistatic models in [9] and one threelocus epistatic model developed in [11]. Model1 is a multiplicative model, model2 demonstrates twolocus interaction multiplicative effects, and model3 specifies twolocus interaction threshold effects. There are three disease loci in model4 [11]. Some certain genotype combinations can increase disease risk in model4 and there are almost no marginal effects for each disease locus.
We generate data based on the similar parameter settings as in [911] for three parameters associated with each model: the marginal effect of each disease locus (λ), the minor allele frequencies (MAF) of both disease loci, and the strength of linkage disequilibrium (LD, quantified by the squared correlation coefficient r^{2 }calculated from allele frequencies) between the unobserved disease locus and a genotyped locus [9]. For each parameter setting on each model, we generate 50 datasets and each dataset contains 100 markers genotyped for 1,000 cases and 1,000 controls. To measure the performance of each method, we use power as our evaluation criterion, which is defined as the proportion of simulated datasets in which only the true diseases associated markers are identified without any false positives.
EpiBN versus BEAM, SVM, and MDR We first compare EpiBN with three methods: BEAM, SVM, and MDR on the four simulated disease models. The BEAM software is downloaded from http://www.fas.harvard.edu/~junliu/BEAM webcite and we set the threshold of the B statistic as 0.1 [11]. For SVM, we use LIBSVM with a RBF kernel to detect genegene interactions. A grid search is used for selecting optimal parameters. Instead of using the exhaustive greedy search strategy for SNPs as in [7], which is very timeconsuming and infeasible to largescale datasets, we turn to a search strategy used in [8]. First we rank SNPs based on the mutual information between SNPs and disease status label that is 0 for the control and 1 for the case. Then, we use a sliding window sequential forward feature selection (SWSFS) algorithm in [8] based on SNPs rank. The window size in SWSFS algorithm determines how robust the algorithm could be and we set it to 20. Since MDR algorithm can not be applied to a large dataset directly, we first reduce the number of SNPs to 10 by ReliefF [34], a commonlyused feature selection algorithm, and then MDR performs an exhaustive search for a SNP set that can maximize crossvalidation consistency and prediction accuracy. When one model has the maximal crossvalidation consistency and another model has the maximal prediction accuracy, MDR follows statistical parsimony (selects the model with fewer SNPs). Our EpiBN is written in Matlab. The results on the simulated data are shown in Figure 2. As can be seen, among the four methods, the EpiBN method performs the best, and BEAM is the second best. One possible reason is that BEAM tries to detect single disease locus and epistatic interactions simultaneously. This strategy makes BEAM unnecessarily overcomplex. In most cases, the powers of both MDR and SVM are much smaller than those of the EpiBN and BEAM algorithms.
Figure 2. Performance comparison of EpiBN, BEAM, SVM, and MDR.
Our definition of power prohibits any false positives and any false negatives and reflects the ability to precisely detect whole interactions [35]. Although we consider both type I error and type II error and the performance comparison is fair for each method in Figure 2, this type of definition of power seems stringent. To explore both false positive rates and false negative rates further, we measure the detection accuracy by precision and recall. Precision is the number of true disease SNPs in the output divided by the number of detected SNPs in the output, which reflects the false positive rate. On the other hand, recall is the number of true disease SNPs in the output divided by the number of true known disease SNPs, which reflects the false negative rate. The Euclidean distance from perfect precision and recall is defined as:
which combines precision and recall [36]. Table 1 presents the average precision, recall, and distance performance about EpiBN, BEAM, and SVM on 50 datasets for each model with MAF = 0.5 and r^{2 }= 1. EpiBN achieves a higher overall accuracy than both BEAM and SVM on model2, model3, and model4. Moreover, the overall accuracy of EpiBN on model4 is perfect. On model1, EpiBN is still better than SVM while it is slightly worse than BEAM. BEAM shows the highest precision on the first three models, but it intends to miss more true positives. On the contrary, SVM demonstrates the highest recall, however, at the cost of introducing more false positives [10]. We do not evaluate the accuracy of MDR because the MDR software can only test at most 4way epistatic interactions.
Table 1. Accuracy comparison of EpiBN, BEAM, and SVM.
EpiScore versus BIC score and AIC score We also compare our EpiScore with BIC score and AIC score. For BIC score and AIC score, we use the upper bound of score in [28] and the same B&B algorithm as in EpiBN. Table 2 presents the results on datasets with MAF = 0.5 and r^{2 }= 1. Column "o" shows the times of correct detection of all disease SNPs in 50 datasets. Column "+" presents the total number of extra detected SNPs and column "" has the total number of missing disease SNPs. For model1, mode2, and model3, EpiScore performs better than both BIC score and AIC score. BIC score can not detect true disease SNPs at all and introduce many false negatives due to its heavy penalty term. Comparing to EpiScore, AIC score tends to introduce more false positives and more false negatives. It is interesting to notice that every score function can achieve perfect power on model4. The reason is that the relatively large genotypic effect, θ, in model4 can generate data with skewed distribution, which can help all scoring function detect true disease SNPs.
Table 2. Comparison of EpiScore, BIC score, and AIC score.
EpiBN versus Markov Blanket methods To demonstrate the advantages of EpiBN over Markov Blanket methods, we compare EpiBN with three Markov Blanket methods: interIAMBnPC [12], PCMB [36], and our DASSOMB [10]. For interIAMBnPC, we use the Matlab toolbox Causal Explorer which contains the interIAMBnPC algorithm [37]. We implement both PCMB and DASSOMB in Matlab. G^{2 }test is used to test dependence and independence in these three Markov Blanket methods and we set the pvalue threshold for G^{2 }test as 0.01. Figure 3 shows the results. These four methods demonstrate the similar performance on the multiplicative model: model1. On the other three interaction models: model2, model3, and model4, EpiBN is better than these three Markov Blanket methods. Disease SNPs in model1 affects the disease risk independently, which makes it easy for Markov Blanket methods to detect them. Additionally, DASSOMB is better than the other two Markov Blanket methods: interIAMBnPC and PCMB. This is because the backward phase in DASSOMB to remove false positives is not that strict as in interIAMBnPC and PCMB. Hence, DASSOMB can keep SNP nodes having strong joint effects on disease status but not showing strong marginal effects in the Markov Blanket. This also confirms that the faithfulness assumption may hinder the application of Markov Blanket methods in detecting epistatic interactions.
Figure 3. Performance comparison of EpiBN with three Markov Blanket methods: interIAMBnPC, PCMB, and DASSOMB.
Sample efficiency Typically, GWAS can not generate a large number of samples due to the high experiment cost. Thus, the performance of various computational methods for epistatic interaction detection in case of small samples is important. We explore the effect of the number of samples on the performance of EpiBN, MDR, BEAM and SVM. The parameters used are: λ = 1.1 for model1, λ = 0.9 for model2, λ = 1.8 for model3, and θ = 7 for model4. To test the scalability of EpiBN on large number of SNPs, we generate synthetic datasets containing different number of SNPs (40, 200, and 1000) genotyped for different number of samples (100, 200, 300, 400, 600, 1000, and 2000) with r^{2 }= 1 and MAF = 0.5.
The results are shown in Figure 4. We find that EpiBN is more sampleefficient than other methods in that it can achieve the highest power when the number of samples is the same. In addition, it needs fewer samples to reach the perfect power comparing to other methods. BEAM is still the second best. The powers of both MDR and SVM are still smaller than those of the EpiBN and BEAM algorithms. However, MDR and SVM demonstrate a better performance comparing to the results in Figure 2. This is perhaps due to the fact that increasing the marginal effect size λ for model1, model2, and model3 makes the detecting task suitable for the prefiltering based methods such as MDR and SVM. The result from model4 is particularly interesting: EpiBN exhibits overwhelming superiority over other three methods, as EpiBN yields a perfect power even the number of samples is small (around 600), which indicates that EpiBN is especially suitable for detecting epistatic interactions with weak or no marginal effects. From Figure 4, we can also find that increasing the number of genotyping markers, like adding some noise to the data, will impair the power of all methods, especially in case of small samples.
Figure 4. Comparison of sample efficiency on datasets with different number of SNPs: (a) 40 SNPs, (b) 200 SNPs and (c) 1000 SNPs.
Analysis of AMD Data
In this section and the following two sections, we apply EpiBN to largescale datasets in real genomewide casecontrol studies, which often require genotyping of 30,0001,000,000 common SNPs. We first make use of an Agerelated Macular Degeneration (AMD) dataset containing 116,204 SNPs genotyped with 96 cases and 50 controls [38] (i.e., high dimensionality and small sample sizes). Multiple genetic factors cause AMD, which is a complex retinal degenerative disorder. To remove inconsistently genotyped SNPs, we perform the same filtering process as in [11,38]. After filtering, there are 97,327 autosomal SNPs remained.
We first perform the screening process and select 51 potential disease SNPs related with AMD by MCMC method. Among these 51 selected SNPs, EpiBN detects two associated SNPs showing the highest EpiScore: rs380390 and rs2402053. Klein et al. demonstrated that the first SNP, rs380390, is associated with AMD [38]. The second SNP, rs2402053, is intergenic between TFEC and TES in chromosome 7q31 [39]. Even though no evidences show that rs2402053 is related with AMD, it is worth noting that mutations in some genes on 7q31q32 are revealed in patients with retinal disorders [40]. Therefore, rs2402053 may be a new genetic factor, on chromosome 7q, contributing to the underlying mechanism of AMD.
Analysis of LOAD Data
Lateonset Alzheimer's disease (LOAD) is the most common form of Alzheimer's disease and usually occurs in persons over 65. It causes patients' degeneration of the ability of thinking, memory, and behaviour. The apolipoprotein E (APOE) gene is one genetic factor that accounts for affecting the risk of LOAD. There are three common variants of the APOE gene: ε2, ε3, and ε4. The appearance of the ε4 allele in a person's APOE genotype increases the LOAD risk [41]. Rieman et al. conducted genomewide association studies to detect other generic risk factors related with LOAD [41]. They found 10 SNPs showing significant association with LOAD in the APOE ε4 carriers. All these 10 SNPs are in the GRBassociated binding protein 2 (GAB2) gene.
We download the LOAD GWAS data from http://www.tgen.org/neurogenomics/data webcite. After preprocessing, we have 287,479 SNPs and 1408 samples (857 cases and 551 controls). EpiBN keeps APOE as one parent of the disease status node and identifies two other SNPs: rs1931565 and rs4505578, which may interact with APOE and affect the LOAD risk. The rs1931565 SNP is intergenic between ABCA4 and ARHGAP29 in chromosome 1p22. ABCA4 is related with some brainrelated diseases including stargardt disease 1, earlyonset severe retinal dystrophy and agerelated macular degeneration [42]. On the other hand, some ABC transporter family genes such as ABCA1, ABCA2, ABCA7 and ABCA12 are associated with Alzheimer's disease [43]. Therefore, we can speculate that the interaction among rs1931565, rs4505578 and APOE may affect some brain functions and therefore increase the LOAD risk. Our results do not contain any of the 10 SNPs in GAB2 found in [41]. One reason is that Rieman et al. only explored twolocus interactions related with LOAD. In fact, the epistatic interactions are very complicated. If we restrict the number of genetic risk factors as two, we will miss some potential disease SNPs associated with complex diseases.
Analysis of Autism Data
Autism is a common early onset neurodevelopmental disorder, which affects the brain's normal development and impairs social interaction and communication. To pinpoint the causal SNPs and genes involved in autism, a large number of genotyping data have been generated from subjects with and without autism. Some of the genotyping data have been deposited on the AGRE (Autism Genetic Resource Exchange) website http://www.agre.org webcite for further analysis by the research community. In this paper, we analyse one of the largest genotype dataset contributed by Children's Hospital of Philadelphia (CHOP), which contains 513,312 SNPs genotyped from 1784 cases and 2441 controls [44]. To reduce the searching space and focus on more relevant SNPs, we only use SNPs in autismrelated genes. We first get information of 277 autismrelated genes from the autism genetic database (AGD) http://wren.bcf.ku.edu/ webcite[45]. Then we obtain a list of 205,589 SNPs in these autismrelated genes from UCSC genome browser http://genome.ucsc.edu/ webcite[46]. The CHOP dataset contains 9330 of these 205,589 SNPs. Samples with missing rate > 10% and SNPs with missing rate > 10%, MAF < 0.05, and pvalue of HWE < 0.001 were removed. After applying the aforementioned filtering process, our genotype dataset contains 4222 samples (1783 cases, 2439 controls) and 8198 SNPs.
Heterogeneity of phenotypic presentation in autism makes it difficult to detect epistatic interactions related with this complex disorder [47]. One proposed approach to reduce the phenotypic heterogeneity of autistic subjects is dividing them into several subgroups by clustering method on ADIR (Autism Diagnostic InterviewRevised) data [48]. The ADIR is a clinical diagnostic interview to assess autism in children and adults and contains 93 items about behaviours in three domains: quality of social interaction, communication and language ability, and repetitive, restricted and stereotyped interests and behaviour [49]. In this paper, we use an alternative method to reduce the phenotypic heterogeneity: biclustering [50]. Biclustering methods can identify subgroups of autism samples that show similar behaviour patterns on a specific subset of ADIR items. By applying the biclustering method [50], we find a bicluster of constant value in 235 subjects for 8 out of 77 numerically scored ADIR items (0 = normal; 3 = most severe). These autistic subjects have the same ADIR score (i.e., 3 which is most severe) on the 8 ADIR items which are: CCONVER, CINAPPQ, CPRON, CNEOID, CVERRIT, CINR, CSPEECH, and CFRIEND. Most of these 8 items are about verbal and nonverbal communication. Therefore these 235 autism samples may represent a subset with the most severe communication problems.
To explore the genetic basis in the identified more homogeneous subset, we use the SNP data for these 235 autistic subjects (cases) and 2439 controls in CHOP dataset. The MCMC method first selects 111 candidate SNPs. Then our EpiBN detects an epistatic interaction of three SNPs: rs706363, rs7780487, and rs12536378. The first SNP, rs706363, is on the autism candidate gene DAB1 on chromosome 1. Both rs7780487 and rs12536378 are on the autism candidate gene DPP6 on chromosome 7. If we search HPRD (Human Proteinprotein Interaction Database), we can find a pathway from DAB1 to DPP6: DAB1APLP2PRNPDPP6 [51]. This suggests a potential interaction between the detected SNPs using our EpiBN, which warrants further investigations to assess this in silico prediction by molecular functional experiments.
Discussion
Jiang et al. also tried to use scorebased Bayesian network structure learning methods to detect epistatic interactions [52]. They evaluated the performance of 22 BN scoring criteria by scoring all 1SNP, 2SNP, 3SNP, and 4SNP Bayesian networks on simulation datasets and showed that the BDeu score with large values of hyperparameters α achieved the best performance. Since the prior knowledge on the optimal α and the true number of disease SNPs is unknown, their methods can hardly address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection very well.
Conclusions
To address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection from GWAS data, several machine learning or statistical methods have been proposed during the past decade. However, these proposed machine learning or statistical methods still encounter some problems: scalability to real genomewide dataset, tending to introduce false positives, sampleefficiency, and poor performance when detecting epistatic interactions with weak or no marginal effects.
In this paper, we propose a Bayesian networkbased method, EpiBN, to detect epistatic interactions. We develop a new scoring function, which can reflect higherorder epistatic interactions by estimating the model complexity from data, and apply a fast B&B algorithm to learn the structure of a twolayer Bayesian network containing only one target node. To make our method scalable to GWAS data, we propose the use of a MCMC method to perform the screening process.
We apply the proposed method to both simulated datasets based on four disease models and three real datasets. Our experimental results demonstrate that our method outperforms some other commonlyused methods and is scalable to GWAS data.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BH designed and implemented the EpiBN method, tested the existing methods and analyzed experimental results. XWC conceived the study, designed the experiments and analyzed the results. ZT contributed in autism data analysis and assisted with analyzing experimental results. HX discussed the methods and analyzed some of the results. All authors helped in drafting the manuscript and approved the final manuscript.
Acknowledgements
This work is supported by the US National Science Foundation Award IIS0644366 and the KCALSI103 Patton Trust Grant.
This article has been published as part of BMC Systems Biology Volume 6 Supplement 3, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM)  Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcsystbiol/supplements/6/S3.
References

Hirschhorn JN, Daly MJ: Genomewide association studies for common diseases and complex traits.
Nature reviews 2005, 6:95108. PubMed Abstract  Publisher Full Text

Cordell HJ: Detecting genegene interactions that underlie human diseases.
Nature reviews 2009, 10:392404. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene x gene interactions in genomewide association studies of human population data.
Human heredity 2007, 63:6784. PubMed Abstract  Publisher Full Text

Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer.
American journal of human genetics 2001, 69:138147. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Park MY, Hastie T: Penalized logistic regression for detecting gene interactions.

Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression.
Bioinformatics (Oxford, England) 2009, 25:714721. Publisher Full Text

Chen SH, Sun J, Dimitrov L, Turner AR, Adams TS, Meyers DA, Chang BL, Zheng SL, Gronberg H, Xu J, Hsu FC: A support vector machine approach for detecting genegene interaction.
Genetic epidemiology 2008, 32:152167. PubMed Abstract  Publisher Full Text

Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in casecontrol studies.
BMC bioinformatics 2009, 10(Suppl 1):S65. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Marchini J, Donnelly P, Cardon LR: Genomewide strategies for detecting multiple loci that influence complex diseases.
Nature genetics 2005, 37:413417. PubMed Abstract  Publisher Full Text

Han B, Park M, Chen XW: A Markov blanketbased method for detecting causal SNPs in GWAS.
BMC bioinformatics 2010, 11(Suppl 3):S5. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Zhang Y, Liu JS: Bayesian inference of epistatic interactions in casecontrol studies.
Nature genetics 2007, 39:11671173. PubMed Abstract  Publisher Full Text

Tsamardinos I, Aliferis C, Statnikov A, Statnikov E: Algorithms for Large Scale Markov Blanket Discovery.

Studený M: Probabilistic conditional independence structures. London: Springer; 2005.

Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions.

Chen XW, Anantha G, Wang X: An effective structure learning method for constructing gene networks.
Bioinformatics (Oxford, England) 2006, 22:13671374. Publisher Full Text

Chen XW, Anantha G, Lin X: Improving Bayesian Network Structure Learning with Mutual InformationBased Node Ordering in the K2 Algorithm.

Pearl J: Causality: models, reasoning, and inference. 2nd edition. Cambridge, U.K.; New York: Cambridge University Press; 2009.

Spirtes P, Glymour CN, Scheines R: Causation, prediction, and search. 2nd edition. Cambridge, Mass.: MIT Press; 2000.

Heckerman D, Geiger D, Chickering DM: Learning Bayesian Networks: The Combination of Knowledge and Statistical Data.

Akaike H: A new look at the statistical model identification.
IEEE Transactions on Automatic Control 1974, 19:716723. Publisher Full Text

Schwarz G: Estimating the dimension of a model.
The Annals of Statistics 1978, 6:461464. Publisher Full Text

Rissanen J: Stochastic Complexity and Modeling.
The Annals of Statistics 1986, 14:10801100. Publisher Full Text

Cooper GF, Herskovits E: A Bayesian Method for the Induction of Probabilistic Networks from Data.

Campos LMd: A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests.

Burnham KP: Model selection and multimodel inference: a practical informationtheoretic approach. 2nd edition. New York: Springer; 2002.

Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.

Ueno M: Learning networks determined by the ratio of prior and data. In TwentySixth Conference on Uncertainty in Artificial Intelligence (UAI10); 811 July; Corvallis, Oregon. Edited by Grünwald P, Spirtes P. AUAI Press; 2010:598605.

Suzuki J: Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: An Efficient Algorithm Using the B & B Technique. In 13th International Conference on Machine Learning July 36; Bari, Italy. Morgan Kaufmann; 1996:462470.

Sokal RR, Rohlf FJ: Biometry: the principles and practice of statistics in biological research. 3rd edition. New York: Freeman; 1995.

Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A: Bayesian measures of model complexity and fit.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002, 64:583639. Publisher Full Text

Chickering DM, Heckerman D, Meek C: LargeSample Learning of Bayesian Networks is NPHard.

Chen Xw: An improved branch and bound algorithm for feature selection.
Pattern Recogn Lett 2003, 24:19251933. Publisher Full Text

Giudici P, Castelo R: Improving Markov Chain Monte Carlo Model Search for Data Mining.
Machine learning 2003, 50:127158. Publisher Full Text

RobnikŠikonja M, Kononenko I: Theoretical and empirical analysis of ReliefF and RReliefF.
Machine learning 2003, 53:2369. Publisher Full Text

Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, Raghuram J, Yuan X, Herrington DM, Wang Y: Comparative analysis of methods for detecting interacting loci.
BMC Genomics 2011, 12:344. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Peña JM, Nilsson R, Björkegren J, Tegnér J: Towards scalable and data efficient learning of Markov boundaries.
International Journal of Approximate Reasoning 2006, 45:211232.

Aliferis C, Tsamardinos I, Statnikov A, Brown L: Causal Explorer: A Causal Probabilistic Network Learning Toolkit for Biomedical Discovery. In Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Scienes, METMBS '03, June 23  26, 2003, Las Vegas, Nevada, USA. Edited by Valafar F, Valafar H. CSREA Press; 2003:371376.

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al.: Complement factor H polymorphism in agerelated macular degeneration.
Science (New York, NY) 2005, 308:385389. Publisher Full Text

Tobias ES, Hurlstone AF, MacKenzie E, McFarlane R, Black DM: The TES gene at 7q31.1 is methylated in tumours and encodes a novel growthsuppressing LIM domain protein.
Oncogene 2001, 20:28442853. PubMed Abstract  Publisher Full Text

Bowne SJ, Sullivan LS, Blanton SH, Cepko CL, Blackshaw S, Birch DG, HughbanksWheaton D, Heckenlively JR, Daiger SP: Mutations in the inosine monophosphate dehydrogenase 1 gene (IMPDH1) cause the RP10 form of autosomal dominant retinitis pigmentosa.
Hum Mol Genet 2002, 11:559568. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Reiman EM, Webster JA, Myers AJ, Hardy J, Dunckley T, Zismann VL, Joshipura KD, Pearson JV, HuLince D, Huentelman MJ, et al.: GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers.
Neuron 2007, 54:713720. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shroyer NF, Lewis RA, Yatsenko AN, Wensel TG, Lupski JR: Cosegregation and functional analysis of mutant ABCR (ABCA4) alleles in families that manifest both Stargardt disease and agerelated macular degeneration.
Hum Mol Genet 2001, 10:26712678. PubMed Abstract  Publisher Full Text

Bertram L, McQueen MB, Mullin K, Blacker D, Tanzi RE: Systematic metaanalyses of Alzheimer disease genetic association studies: the AlzGene database.
Nature genetics 2007, 39:1723. PubMed Abstract  Publisher Full Text

Wang K, Zhang H, Ma D, Bucan M, Glessner JT, Abrahams BS, Salyakina D, Imielinski M, Bradfield JP, Sleiman PM, et al.: Common genetic variants on 5p14.1 associate with autism spectrum disorders.
Nature 2009, 459:528533. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Matuszek G, Talebizadeh Z: Autism Genetic Database (AGD): a comprehensive database including autism susceptibility geneCNVs integrated with known noncoding RNAs and fragile sites.
BMC Med Genet 2009, 10:102. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC.
Genome research 2002, 12:9961006. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hu VW, Steinberg ME: Novel clustering of items from the Autism Diagnostic InterviewRevised to define phenotypes within autism spectrum disorders.
Autism Res 2009, 2:6777. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hu VW, Addington A, Hyman A: Novel autism subtypedependent genetic variants are revealed by quantitative trait and subphenotype association analyses of published GWAS data.
PLoS One 2011, 6:e19067. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lord C, Rutter M, Le Couteur A: Autism Diagnostic InterviewRevised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders.
J Autism Dev Disord 1994, 24:659685. PubMed Abstract  Publisher Full Text

Cheng Y, Church GM: Biclustering of expression data.
Proceedings/International Conference on Intelligent Systems for Molecular Biology; ISMB 2000, 8:93103. PubMed Abstract  Publisher Full Text

Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al.: Human Protein Reference Database2009 update.
Nucleic acids research 2009, 37:D767772. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jiang X, Neapolitan RE, Barmada MM, Visweswaran S: Learning genetic epistasis using Bayesian network scoring criteria.
BMC bioinformatics 2011, 12:89. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text