Bioinformatics and Computational Life-Sciences Laboratory, ITTC, Department of Electrical Engineering and Computer Science, University of Kansas, 1520 West 15th Street, Lawrence, KS 66045, USA

Department of Computer Science, Wayne State University, Detroit, MI 48202, USA

Children's Mercy Hospital and University of Missouri-Kansas City School of Medicine, 2401 Gillham Road, Kansas City, MO 64108, USA

School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

Abstract

Background

Detecting epistatic interactions plays a significant role in understanding the pathogenesis of complex human diseases and in improving their prevention, diagnosis, and treatment. Applying machine learning or statistical methods to epistatic interaction detection encounters several common problems, e.g., a very limited number of samples, an extremely large search space, a large number of false positives, and the choice of a measure of association between disease markers and the phenotype.

Results

To address the problems of computational methods in epistatic interaction detection, we propose a score-based Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on simulation data show that our method outperforms some other commonly-used methods in terms of power and sample-efficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data.

Conclusions

We propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast Branch-and-Bound algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applications of the proposed method to some real GWAS (genome-wide association studies) datasets may provide helpful insights into understanding the genetic basis of Age-related Macular Degeneration, late-onset Alzheimer's disease, and autism.

Background

To identify genetic variants that affect susceptibility to a variety of diseases, genome-wide association studies (GWAS) genotype a dense set of common single nucleotide polymorphisms (SNPs) and compare allele frequencies between a cohort of affected people and a cohort of non-affected people.

During the past decade, two types of heuristic computational methods have been proposed to detect epistatic interactions: prediction/classification-based methods and association-based methods. Prediction/classification-based methods try to find the best set of SNPs, which can generate the highest prediction/classification accuracy including, for example, multifactor dimensionality reduction (MDR)

Bayesian epistasis association mapping (BEAM) is a scalable and association-based method

Recently, we proposed a new Markov blanket-based method, DASSO-MB, to detect epistatic interactions in case-control studies

In this paper, we address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection by introducing a score-based Bayesian network structure learning method, EpiBN (Epistatic interaction detection using Bayesian Network model), which employs a Branch-and-Bound technique and a new scoring function. Bayesian networks provide a succinct representation of the joint probability distribution and the conditional independencies among a set of variables. In general, a score-based structure learning method for Bayesian networks first defines a scoring function that reflects how well each candidate structure fits the observed data, and then searches for the structure with the maximum score. Compared with Markov blanket methods, the merits of applying score-based Bayesian network structure learning to epistatic interaction detection include: (1) the faithfulness assumption can be relaxed and (2) heuristic search methods can solve the classical XOR (exclusive or) problem

Methods

Bayesian networks: a brief introduction

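A Bayesian network is a directed acyclic graph (DAG) $G$ whose nodes correspond to a set of random variables $X_1, X_2, \ldots, X_n$ and whose edges encode the dependencies among them. The joint probability distribution factorizes according to the graph:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$

where $Pa(X_i)$ denotes the set of parents of $X_i$ in $G$; each node $X_i$ is conditionally independent of its non-descendants given $Pa(X_i)$.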

Bayesian networks provide models of causal influence and allow us to explore causal relationships, perform explanatory analysis, and make predictions. Genome-wide association studies attempt to identify the epistatic interaction among a set of SNPs, which are associated with one certain type of disease. Therefore, we can use Bayesian networks to represent the relationship between genetic variants and a phenotype (disease status). The

By modelling the association between SNPs and the disease status based on Bayesian networks, we transform detecting epistatic interactions into structure learning of Bayesian networks from GWAS data. There are two types of structure learning methods for Bayesian networks: constraint-based methods and score-and-search methods. The constraint-based methods first build a skeleton of the network (undirected graph) by a set of dependence and independence relationships. Next they direct links in the undirected graph to construct a directed graph with

EpiBN scoring: A new BN scoring function

One of the most important issues in score-and-search methods is the selection of scoring function. A natural choice of scoring function is the likelihood function. However, the maximum likelihood score often overfits the data because it does not reflect the model complexity. Therefore, a good scoring function for Bayesian networks' structure learning must have the capability of balancing between the fitness and the complexity of a selected structure. There are several existing scoring functions based on a variety of principles, such as the information theory and minimum description length (e.g. BIC score, AIC score, and MDL score)

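Suppose that a dataset $D$ includes $n$ variables $X_1, X_2, \ldots, X_n$ and $N$ samples. Penalized log-likelihood scores of a structure $G$ decompose over the nodes:

$$\mathrm{Score}(G : D) = \sum_{i=1}^{n} \left[ LL(X_i \mid Pa_i) - f(N)\, C_i \right], \qquad C_i = (r_i - 1)\, q_i$$

where $Pa_i$ is the parent set of $X_i$, $q_i$ is the number of configurations of $Pa_i$, $r_i$ is the number of states of $X_i$, $C_i$ is the number of free parameters of node $X_i$, and $f(N)$ is a non-negative penalty weight. The log-likelihood term is

$$LL(X_i \mid Pa_i) = \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}}$$

where $N_{ijk}$ is the number of samples in which $X_i$ takes its $k$-th state and $Pa_i$ takes its $j$-th configuration, and $N_{ij} = \sum_{k} N_{ijk}$ is the number of samples in which $Pa_i$ takes its $j$-th configuration. If we set $f(N) = \frac{1}{2} \log N$, we obtain the BIC score.

Alternatively, if we set $f(N) = 1$, we obtain the AIC score.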

The BIC score and AIC score are derived from some approximations when the number of samples
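The trade-off between fit and complexity can be made concrete with a minimal sketch of the standard penalized log-likelihood form of the AIC and BIC scores for a single disease-status node. The function names, the genotype coding, and the default of three genotypes per SNP are assumptions made for illustration; this is not the EpiBN implementation.

```python
import math
from collections import Counter

def log_likelihood(parent_cfgs, status):
    """Maximized log-likelihood of the status node given its parent configurations.

    parent_cfgs: one hashable parent configuration per sample (() for no parents).
    status:      one disease label per sample.
    """
    n_jk = Counter(zip(parent_cfgs, status))  # samples with configuration j and state k
    n_j = Counter(parent_cfgs)                # samples with configuration j
    return sum(n * math.log(n / n_j[j]) for (j, _k), n in n_jk.items())

def penalized_score(genotypes, status, parent_idx, kind="AIC",
                    n_genotypes=3, n_states=2):
    """AIC- or BIC-style score of the local structure parent_idx -> status.

    The complexity term counts (n_states - 1) free parameters for each of the
    n_genotypes ** len(parent_idx) nominal parent configurations (an assumption).
    """
    cfgs = [tuple(row[i] for i in parent_idx) for row in genotypes]
    ll = log_likelihood(cfgs, status)
    n_params = (n_states - 1) * n_genotypes ** len(parent_idx)
    weight = 1.0 if kind == "AIC" else 0.5 * math.log(len(status))
    return ll - weight * n_params

# Toy XOR data: status depends on SNP 0 and SNP 1 jointly, on neither alone.
toy_genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
toy_status = [a ^ b for a, b, _c in toy_genotypes]
for parents in [(), (0,), (0, 1), (0, 1, 2)]:
    print(parents, round(penalized_score(toy_genotypes, toy_status, parents,
                                         n_genotypes=2), 2))
```

On this toy dataset the pair (0, 1) scores highest: a single SNP adds parameters without improving the fit, and the third SNP only adds penalty.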

We herein describe a new information-based scoring function to detect epistatic interactions by Bayesian network model. In the Bayesian network for epistatic interaction detection, we are only concerned with one target node, the disease status node, and we want to detect its parent SNP nodes. We represent the local structure around the disease status node as

where _{jk }
_{j }

We start our search from an empty local disease structure _{0}, and we can obtain the AIC score for _{0}:

where _{k }

For further inference, we use _{0 }can also be expressed as follows:

where

i.e. the mutual information between _{0 }

The ^{2 }test is commonly used to test independence and conditional independence between two variables for discrete data. From the general formula for ^{2}, we know that the value of ^{2 }can also be calculated from mutual information ^{2 }test value between

The number of degrees of freedom for ^{2 }test between

where

It is interesting to note that the difference between the complexity of _{0 }is equal to the degree of freedom of ^{2}(
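The link between mutual information, the test statistic, and the degrees of freedom can be checked numerically. The sketch below assumes the statistic in question is the likelihood-ratio statistic (commonly written G²), for which G² = 2N·I with the mutual information I measured in nats, and it assumes the usual parameter count (r − 1)·q for a status node with r states and q parent configurations. All function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    n_xy, n_x, n_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (n_x[x] * n_y[y]))
               for (x, y), c in n_xy.items())

def g2_from_mutual_information(xs, ys):
    """Likelihood-ratio statistic computed as 2 * N * I(X; Y)."""
    return 2.0 * len(xs) * mutual_information(xs, ys)

def g2_direct(xs, ys):
    """Textbook form: 2 * sum over cells of O * ln(O / E)."""
    n = len(xs)
    n_xy, n_x, n_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return 2.0 * sum(o * math.log(o / (n_x[x] * n_y[y] / n))
                     for (x, y), o in n_xy.items())

def degrees_of_freedom(n_states, n_parent_configs):
    """Degrees of freedom of the independence test between status and parents."""
    return (n_states - 1) * (n_parent_configs - 1)

def parameter_count(n_states, n_parent_configs):
    """Free parameters of the status node given its parent configurations."""
    return (n_states - 1) * n_parent_configs

# Two SNPs with three genotypes each give 9 parent configurations; the empty
# structure has a single (trivial) configuration.
extra_parameters = parameter_count(2, 9) - parameter_count(2, 1)
print(extra_parameters, degrees_of_freedom(2, 9))
```

The two printed numbers agree, which is the identity the paragraph above points out: the added complexity relative to the empty structure equals the degrees of freedom of the test.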

By applying Eq. (7)-(14), the difference of AIC scores between _{0 }is:

Thus, the AIC score becomes:

where log _{0}) is a constant.

The distribution of ^{2 }asymptotically approximates to that of ^{2 }with the same number of degrees of freedom ^{2 }distribution with ^{2}(^{2 }distribution. Since ^{2}(^{2 }statistic ^{2}(

One problem for the AIC score in Eq. (5), Eq. (7), and Eq. (16) is that the penalty term (the effective number of parameters) in AIC score probably can not reflect the model complexity (or variance) especially when applied to SNP data with a non-skewed distribution. We can confirm this by comparing 2^{2}(^{2}(^{2}(^{2}(

where _{D}
^{2}(^{2 }distribution from data. Our new scoring function estimates the penalty term from the data to make it consistent with the data, which is similar to the DIC (Deviance Information Criterion) score trying to identify models that best explain the observed data

Due to the estimation of the variance of ^{2}(

EpiBN searching: A Branch-and-Bound algorithm for local structure learning

The computational task in score-and-search methods is to find the network structure with the highest score. The search space consists of a super-exponential number of structures, and thus exhaustively searching for the optimal Bayesian network structure from data is NP-hard

We employ the B&B algorithm in our study because it guarantees an optimal result in significantly less search time than exhaustive search. Our EpiBN method uses the B&B algorithm to search for the local disease structure that maximizes the EpiScore in Eq. (17). The pseudocode of EpiBN is shown in Figure

EpiBN Algorithm

To guarantee to find the parent set with the highest EpiScore, we can use the upper bound of the EpiScore to prune the search tree. We notice the ^{2 }function in Eq. (12) has the property:

When adding a SNP node _{1}, the variance of the corresponding ^{2 }distribution, the penalty term in Eq. (17), will increase by _{D}
^{2}(_{2})) - _{D}
^{2}(_{1})). On the other hand, the ^{2}(_{1}) will increase at most by 2

adding a SNP node _{1 }will not increase the EpiScore and thus any further search along the branch is useless. Essentially, the upper bound of the EpiScore is
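The pruning idea can be illustrated with a small, self-contained Branch-and-Bound search over parent sets of the disease node. This sketch deliberately swaps in an AIC-style score and a generic bound (the log-likelihood is never positive and the penalty grows with every added SNP) in place of the EpiScore and its bound; the function names, the `max_parents` limit, and the genotype coding are assumptions for illustration only.

```python
import math
from collections import Counter

def aic_parts(genotypes, status, parents, n_genotypes=3, n_states=2):
    """Return (log-likelihood, penalty) of the local structure parents -> status."""
    cfgs = [tuple(row[i] for i in parents) for row in genotypes]
    n_jk, n_j = Counter(zip(cfgs, status)), Counter(cfgs)
    ll = sum(n * math.log(n / n_j[j]) for (j, _k), n in n_jk.items())
    return ll, (n_states - 1) * n_genotypes ** len(parents)

def branch_and_bound(genotypes, status, max_parents=3, n_genotypes=3, n_states=2):
    """Find the parent set with the highest AIC-style score, pruning hopeless branches."""
    n_snps = len(genotypes[0])
    best = {"score": -math.inf, "parents": ()}

    def visit(parents, next_idx):
        ll, penalty = aic_parts(genotypes, status, parents, n_genotypes, n_states)
        if ll - penalty > best["score"]:
            best["score"], best["parents"] = ll - penalty, parents
        if len(parents) == max_parents:
            return
        # Upper bound for every strict superset: log-likelihood <= 0 and the
        # penalty is at least that of a parent set with one more SNP.
        bound = -(n_states - 1) * n_genotypes ** (len(parents) + 1)
        if bound <= best["score"]:
            return  # no superset can beat the incumbent, so prune this branch
        for i in range(next_idx, n_snps):
            visit(parents + (i,), i + 1)

    visit((), 0)
    return best["parents"], best["score"]

# Toy XOR data: SNP 0 and SNP 1 have no marginal effect but a full joint effect.
toy_genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
toy_status = [a ^ b for a, b, _c in toy_genotypes]
print(branch_and_bound(toy_genotypes, toy_status, n_genotypes=2))
```

Because a branch is discarded only when its bound cannot exceed the incumbent score, the search still returns an optimal parent set under this score, while skipping supersets whose penalty alone already rules them out.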

EpiBN screening: MCMC screening method for real datasets

Even though the B&B algorithm uses an upper bound on the score to reduce the search space, it still has exponential time complexity in the worst case and cannot feasibly be applied directly to real GWAS data. Therefore, an efficient screening method is necessary. Traditional screening methods assign a score to every single SNP and select a subset of SNPs with high scores. However, these methods ignore the joint effect of SNPs on disease and are not suitable for detecting epistatic interactions from real GWAS data.

In this paper, we use the Markov chain Monte Carlo (MCMC) method

where

where #(

The likelihood of local disease structure,
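A screening step of this kind can be sketched as a Metropolis sampler over small SNP sets that keeps the SNPs appearing most often in the sampled local structures. The proposal (toggling one uniformly chosen SNP), the target (proportional to the exponential of an AIC-style score), and every name and default below are assumptions made for illustration; they are not the acceptance rule or likelihood used by EpiBN.

```python
import math
import random
from collections import Counter

def local_score(genotypes, status, parents, n_genotypes=3, n_states=2):
    """AIC-style score of the local structure parents -> status (stand-in score)."""
    cfgs = [tuple(row[i] for i in parents) for row in genotypes]
    n_jk, n_j = Counter(zip(cfgs, status)), Counter(cfgs)
    ll = sum(n * math.log(n / n_j[j]) for (j, _k), n in n_jk.items())
    return ll - (n_states - 1) * n_genotypes ** len(parents)

def mcmc_screen(genotypes, status, n_keep, n_iter=5000, max_parents=3,
                n_genotypes=3, seed=0):
    """Return the n_keep SNP indices visited most often by the sampler."""
    rng = random.Random(seed)
    n_snps = len(genotypes[0])
    current = frozenset()
    current_score = local_score(genotypes, status, (), n_genotypes)
    visits = Counter()
    for _ in range(n_iter):
        # Symmetric proposal: toggle the membership of one random SNP.
        proposal = current ^ {rng.randrange(n_snps)}
        if len(proposal) <= max_parents:
            score = local_score(genotypes, status, tuple(sorted(proposal)),
                                n_genotypes)
            # Metropolis acceptance with target proportional to exp(score).
            if score >= current_score or rng.random() < math.exp(score - current_score):
                current, current_score = proposal, score
        visits.update(current)  # count every SNP in the current structure
    return [snp for snp, _count in visits.most_common(n_keep)]

# Toy XOR data: only SNP 0 and SNP 1 jointly determine the status.
toy_genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
toy_status = [a ^ b for a, b, _c in toy_genotypes]
print(sorted(mcmc_screen(toy_genotypes, toy_status, n_keep=2, n_iter=2000,
                         n_genotypes=2)))
```

Unlike single-SNP screening, the sampler evaluates SNPs in the context of the other SNPs currently in the structure, so a pair with no marginal effect can still be retained.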

Results

Analysis of Simulated Data

We generate data based on the similar parameter settings as in ^{2 }calculated from allele frequencies) between the unobserved disease locus and a genotyped locus

Performance comparison of EpiBN, BEAM, SVM, and MDR

Our definition of power prohibits any false positives and any false negatives and reflects the ability to precisely detect whole interactions

which combines precision and recall ^{2 }= 1. EpiBN achieves a higher overall accuracy than both BEAM and SVM on model-2, model-3, and model-4. Moreover, the overall accuracy of EpiBN on model-4 is perfect. On model-1, EpiBN is still better than SVM but slightly worse than BEAM. BEAM shows the highest precision on the first three models, but it tends to miss more true positives. In contrast, SVM demonstrates the highest recall, but at the cost of introducing more false positives
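The evaluation measures can be sketched as follows. The strict power follows the definition given above (a dataset counts only if the detected set equals the true set exactly). The distance is an assumption: it is taken to be the Euclidean distance from the ideal point (precision = 1, recall = 1), which is consistent with the EpiBN row for model-2 in the accuracy table (0.90, 0.90, 0.14). Function names and the convention for an empty detection are hypothetical.

```python
import math

def detection_metrics(detected, truth):
    """Return (precision, recall, distance) for one dataset.

    distance is assumed to be the Euclidean distance to (precision, recall) = (1, 1);
    an empty detection is given precision 0 by convention.
    """
    detected, truth = set(detected), set(truth)
    true_positives = len(detected & truth)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    distance = math.hypot(1.0 - precision, 1.0 - recall)
    return precision, recall, distance

def strict_power(detected_sets, truth):
    """Fraction of datasets in which the detected SNP set equals the true set."""
    truth = set(truth)
    return sum(1 for d in detected_sets if set(d) == truth) / len(detected_sets)

# Hypothetical detections over four simulated datasets with true SNPs {1, 2}.
runs = [{1, 2}, {1}, {1, 2, 3}, {2, 1}]
print(strict_power(runs, {1, 2}))
print([detection_metrics(r, {1, 2}) for r in runs])
```

Averaging the per-dataset distances, rather than computing one distance from the averaged precision and recall, would explain why the reported mean distances do not always equal the distance implied by the mean precision and recall.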

Accuracy comparison of EpiBN, BEAM, and SVM.

| Model | Method | Precision | Recall | Distance |
|---|---|---|---|---|
| 1 | EpiBN | 0.76 ± 0.27 | 0.76 ± 0.27 | 0.34 ± 0.38 |
| 1 | BEAM | 0.87 ± 0.32 | 0.75 ± 0.34 | **0.32 ± 0.43** |
| 1 | SVM | 0.61 ± 0.29 | 0.91 ± 0.19 | 0.43 ± 0.31 |
| 2 | EpiBN | 0.90 ± 0.21 | 0.90 ± 0.20 | **0.14 ± 0.29** |
| 2 | BEAM | 0.91 ± 0.26 | 0.75 ± 0.31 | 0.29 ± 0.38 |
| 2 | SVM | 0.69 ± 0.29 | 0.95 ± 0.15 | 0.34 ± 0.31 |
| 3 | EpiBN | 0.78 ± 0.30 | 0.79 ± 0.30 | **0.31 ± 0.43** |
| 3 | BEAM | 0.83 ± 0.35 | 0.74 ± 0.37 | 0.34 ± 0.49 |
| 3 | SVM | 0.72 ± 0.28 | 0.88 ± 0.24 | 0.33 ± 0.35 |
| 4 | EpiBN | 1.00 ± 0.00 | 1.00 ± 0.00 | **0.00 ± 0.00** |
| 4 | BEAM | 0.41 ± 0.49 | 0.20 ± 0.29 | 1.05 ± 0.47 |
| 4 | SVM | 0.41 ± 0.32 | 0.61 ± 0.38 | 0.76 ± 0.40 |

^{2 }= 1. Column "o" shows the number of datasets, out of 50, in which all disease SNPs are correctly detected. Column "+" presents the total number of extra detected SNPs and column "-" the total number of missing disease SNPs. For model-1, model-2, and model-3, EpiScore performs at least as well as BIC score and better than AIC score. On model-1 and model-3, BIC score cannot detect the true disease SNPs at all and introduces many false negatives due to its heavy penalty term. Compared with EpiScore, AIC score tends to introduce more false positives and more false negatives. It is interesting to note that every scoring function achieves perfect power on model-4. The reason is that the relatively large genotypic effect,

Comparison of EpiScore, BIC score, and AIC score.

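| Model | Score | o | + | - |
|---|---|---|---|---|
| 1 | EpiScore | 27 | 24 | 24 |
| 1 | BIC score | 0 | 0 | 57 |
| 1 | AIC score | 12 | 55 | 31 |
| 2 | EpiScore | 40 | 11 | 10 |
| 2 | BIC score | 40 | 11 | 10 |
| 2 | AIC score | 22 | 36 | 14 |
| 3 | EpiScore | 30 | 23 | 21 |
| 3 | BIC score | 0 | 0 | 57 |
| 3 | AIC score | 10 | 53 | 20 |
| 4 | EpiScore | 50 | 0 | 0 |
| 4 | BIC score | 50 | 0 | 0 |
| 4 | AIC score | 50 | 0 | 0 |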

"o": times of correct detection of all disease SNPs in 50 datasets. "+": total number of extra detected SNPs in 50 datasets. "-": total number of missing disease SNPs in 50 datasets.

^{2 }test is used to test dependence and independence in these three Markov Blanket methods and we set the p-value threshold for ^{2 }test as 0.01. Figure

Performance comparison of EpiBN with three Markov Blanket methods: interIAMBnPC, PCMB, and DASSO-MB

^{2 }= 1 and MAF = 0.5.

The results are shown in Figure

Comparison of sample efficiency on datasets with different number of SNPs: (a) 40 SNPs, (b) 200 SNPs and (c) 1000 SNPs

Analysis of AMD Data

In this section and the following two sections, we apply EpiBN to large-scale datasets from real genome-wide case-control studies, which often require genotyping of 30,000-1,000,000 common SNPs. We first use an Age-related Macular Degeneration (AMD) dataset containing 116,204 SNPs genotyped in 96 cases and 50 controls

We first perform the screening process and select 51 potential disease SNPs associated with AMD using the MCMC method. Among these 51 selected SNPs, EpiBN detects the two associated SNPs with the highest EpiScore: rs380390 and rs2402053. Klein

Analysis of LOAD Data

Late-onset Alzheimer's disease (LOAD) is the most common form of Alzheimer's disease and usually occurs in persons over 65. It progressively impairs patients' thinking, memory, and behaviour. The apolipoprotein E (APOE) gene is one genetic factor known to affect the risk of LOAD. There are three common variants of the APOE gene:

We download the LOAD GWAS data from

Analysis of Autism Data

Autism is a common early-onset neurodevelopmental disorder, which affects the brain's normal development and impairs social interaction and communication. To pinpoint the causal SNPs and genes involved in autism, a large amount of genotyping data has been generated from subjects with and without autism. Some of the genotyping data have been deposited on the AGRE (Autism Genetic Resource Exchange) website

Heterogeneity of phenotypic presentation in autism makes it difficult to detect epistatic interactions associated with this complex disorder

To explore the genetic basis of the identified, more homogeneous subset, we use the SNP data for these 235 autistic subjects (cases) and 2439 controls in the CHOP dataset. The MCMC method first selects 111 candidate SNPs. EpiBN then detects an epistatic interaction among three SNPs: rs706363, rs7780487, and rs12536378. The first SNP, rs706363, lies in the autism candidate gene DAB1 on chromosome 1. Both rs7780487 and rs12536378 lie in the autism candidate gene DPP6 on chromosome 7. Searching HPRD (Human Protein Reference Database) reveals a pathway from DAB1 to DPP6: DAB1--APLP2--PRNP--DPP6

Discussion

Jiang

Conclusions

To address the two critical challenges (small sample sizes and high dimensionality) in epistatic interaction detection from GWAS data, several machine learning and statistical methods have been proposed during the past decade. However, these methods still face several problems: limited scalability to real genome-wide datasets, a tendency to introduce false positives, low sample efficiency, and poor performance when detecting epistatic interactions with weak or no marginal effects.

In this paper, we propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. We develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast B&B algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to GWAS data, we propose the use of an MCMC method to perform the screening process.

We apply the proposed method to both simulated datasets based on four disease models and three real datasets. Our experimental results demonstrate that our method outperforms some other commonly-used methods and is scalable to GWAS data.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BH designed and implemented the EpiBN method, tested the existing methods, and analyzed the experimental results. XWC conceived the study, designed the experiments, and analyzed the results. ZT contributed to the autism data analysis and assisted with analyzing the experimental results. HX discussed the methods and analyzed some of the results. All authors helped in drafting the manuscript and approved the final manuscript.

Acknowledgements

This work is supported by the US National Science Foundation Award IIS-0644366 and the KCALSI-10-3 Patton Trust Grant.

This article has been published as part of