Bioinformatics and Computational Life Sciences Laboratory, ITTC, Department of Electrical Engineering and Computer Science, The University of Kansas, 1520 West 15th Street, Lawrence, KS 66045, USA

Abstract

Background

Detecting epistatic interactions plays a significant role in understanding the pathogenesis and improving the prevention, diagnosis and treatment of complex human diseases. A recent study on the automatic detection of epistatic interactions shows that Markov Blanket-based methods are capable of finding genetic variants strongly associated with common diseases and of reducing false positives when the number of instances is large. Unfortunately, a typical dataset from genome-wide association studies consists of a very limited number of examples, on which current methods, including Markov Blanket-based methods, may perform poorly.

Results

To address small sample problems, we propose a Bayesian network-based approach (bNEAT) to detect epistatic interactions. The proposed method also employs a Branch-and-Bound technique for learning. We apply the proposed method to simulated datasets based on four disease models and a real dataset. Experimental results show that our method outperforms Markov Blanket-based methods and other commonly-used methods, especially when the number of samples is small.

Conclusions

Our results show that bNEAT achieves strong power regardless of the number of samples and is especially suitable for detecting epistatic interactions with slight or no marginal effects. The merits of the proposed approach lie in two aspects: a suitable score for Bayesian network structure learning that can reflect higher-order epistatic interactions, and a heuristic Bayesian network structure learning method.

Background

A genome-wide association study (GWAS) examines genetic variants that differ from individual to individual and are associated with a variety of diseases, using a cohort of cases (people with the disease) and controls (similar people without the disease)

During the past decade, several heuristic computational methods have been proposed to detect causal interacting genes or SNPs. One class of computational methods for detecting epistatic interactions is statistical methods, including multifactor dimensionality reduction (MDR)

An alternative is machine learning-based methods, which rely on binary classification (prediction) and treat cases as positives and controls as negatives in SNP data. Support vector machine-based approaches

Recently, we proposed a new Markov Blanket-based method, DASSO-MB, to detect epistatic interactions in case-control studies

In this paper, we address these problems by introducing a Bayesian network-based method, which also employs a Branch-and-Bound technique, to detect epistatic interactions. Bayesian networks provide a succinct representation of the joint probability distribution and the conditional independence among a set of variables. In general, a structure learning method for Bayesian networks first defines a score reflecting the fitness between each possible structure and the observed data, and then searches for a structure with the maximum score. Compared with Markov Blanket-based methods, the merits of applying a Bayesian network method to epistatic interaction detection include: (1) BDe, BIC or MDL scores for Bayesian network structure learning can reflect higher-order interactions and are not sample-consuming; and (2) a heuristic Bayesian network structure learning method can solve the classical XOR problem, which may hinder the application of Markov Blanket-based approaches.
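The XOR problem can be made concrete with a small sketch. The toy data below are hypothetical (two binary loci, disease status equal to their XOR), and empirical mutual information is used only as a convenient stand-in for a dependence measure; it is not the test or score used by the methods discussed here. Each locus alone carries no information about the disease, while the pair determines it completely.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: every genotype combination is equally frequent
# and the disease status is the XOR of the two loci.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25
snp1 = [r[0] for r in rows]
snp2 = [r[1] for r in rows]
disease = [r[2] for r in rows]

mi_snp1 = mutual_information(snp1, disease)                   # marginal effect of locus 1
mi_snp2 = mutual_information(snp2, disease)                   # marginal effect of locus 2
mi_pair = mutual_information(list(zip(snp1, snp2)), disease)  # joint effect of both loci
```

A method that adds loci one at a time based on marginal association would see no signal in either locus here, which is the difficulty the text attributes to greedy Markov Blanket search.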

We apply the bNEAT method to simulated datasets based on four disease models and a real dataset.

Results

Analysis of simulation data

We first evaluate the proposed bNEAT method on simulated data sets, which are generated from three commonly used two-locus epistatic models in

To compare the performance of different methods, we use the same data generation process and the similar parameter settings as in

where _{D}

We compare the bNEAT algorithm with four methods: BEAM, Support Vector Machine, MDR and DASSO-MB on the four simulated disease models. The BEAM software is downloaded from ^{2} test as 0.01 to determine (conditional) dependence and (conditional) independence.

The results on the simulated data are shown in Figures

**Performance comparison for r^{2} = 0.7** The power is defined as the proportion of simulated datasets whose results contain only disease-associated markers, without any false positives.

**Performance comparison for r^{2} = 1** The power is defined as the proportion of simulated datasets whose results contain only disease-associated markers, without any false positives.
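The power definition used in these comparisons can be sketched in a few lines. This is an illustration only: it assumes that a dataset counts as a success when the reported marker set matches the true disease markers exactly, and the function name, marker names and example results are hypothetical.

```python
def detection_power(reported_sets, disease_markers):
    """Fraction of datasets whose reported markers equal the true disease markers.

    Assumption: a run counts as a success only on an exact match, i.e. all
    disease-associated markers are reported and no false positive is included.
    """
    if not reported_sets:
        raise ValueError("at least one simulated dataset is required")
    truth = frozenset(disease_markers)
    hits = sum(1 for reported in reported_sets if frozenset(reported) == truth)
    return hits / len(reported_sets)


# Hypothetical outputs of a detection method on four simulated datasets.
example_results = [
    {"snp3", "snp7"},           # exact match: counted
    {"snp3", "snp7", "snp12"},  # contains a false positive: not counted
    {"snp3"},                   # misses one disease marker: not counted
    {"snp7", "snp3"},           # exact match: counted
]
power = detection_power(example_results, {"snp3", "snp7"})
```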

Typically, GWAS cannot generate a large number of samples because of the high experimental cost. The performance of computational methods for epistatic interaction detection on small samples is therefore important. We explore the effect of the number of samples on the performance of bNEAT, DASSO-MB, BEAM and SVM. We generate synthetic datasets containing 40 markers genotyped for different numbers of cases and controls with r^{2} = 1 and MAF = 0.5.

The results are shown in Figure

Comparison of sample efficiency

Results on AMD data

In this section, we apply bNEAT to large-scale (large number of SNPs but small samples) datasets in real genome-wide case-control studies, which often require genotyping of 30,000–1,000,000 common SNPs. We make use of an Age-related Macular Degeneration (AMD) dataset containing 116,204 SNPs genotyped with 96 cases and 50 controls

To remove inconsistently genotyped SNPs, we perform filtering process as in ^{2} test and then use bNEAT to identify disease SNPs related with AMD. bNEAT detects three associated SNPs: rs380390, rs3913094 and rs10518433. The first SNP, rs380390, is already found in

Conclusions and discussion

Compared with many computational methods used for the identification of epistatic interactions, Markov Blanket-based methods can increase power and reduce false positives. However, they are sample-consuming, and their greedy search strategy is not suitable for detecting interaction models in which no disease locus has an independent main effect. In this paper, we propose a Bayesian network method based on a Branch-and-Bound technique (bNEAT) to detect epistatic interactions. We demonstrate that the proposed bNEAT method significantly outperforms the Markov Blanket method and other commonly used methods, especially when the number of samples is small.

Even though the bNEAT method is more powerful than Markov Blanket-based methods, it cannot be directly applied to genome-wide datasets because of the large number of SNPs. Integrating Markov chain Monte Carlo or simulated annealing techniques into bNEAT to make it scalable to genome-wide datasets is one direction for future research. Moreover, we will explore different score schemes for epistatic interaction detection by Bayesian networks. For example, information-based score schemes (e.g., AIC score and BIC score) are derived in case of large number of samples

Methods

Bayesian networks

A Bayesian network is a directed acyclic graph (DAG) G whose nodes correspond to a set of random variables X_{1}, X_{2}, …, X_{n}

**Definition 1 (Conditional Independence)**

This conditional independence is represented as

**Theorem 1 (Local Markov Assumption)**

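By applying the local Markov assumption, the joint probability distribution of the variables X_{1}, X_{2}, …, X_{n} can be written in the factorized form

$$P(X_{1}, X_{2}, \ldots, X_{n}) = \prod_{i=1}^{n} P(X_{i} \mid Pa(X_{i}))$$

where Pa(X_{i}) denotes the set of parents of X_{i} in the graph.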
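The factorization implied by the local Markov assumption, in which each variable contributes one conditional probability given its parents, can be illustrated with a small sketch. The three-node network, the conditional probability tables and all names below are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical three-node network: snp1 -> disease <- snp2 (all binary).
parents = {"snp1": (), "snp2": (), "disease": ("snp1", "snp2")}

# cpt[node][parent_values][value] = P(node = value | parents = parent_values)
cpt = {
    "snp1": {(): {0: 0.7, 1: 0.3}},
    "snp2": {(): {0: 0.6, 1: 0.4}},
    "disease": {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 0.2, 1: 0.8},
        (1, 1): {0: 0.9, 1: 0.1},
    },
}


def joint_probability(assignment):
    """Multiply one local conditional probability per node."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[node][pa_values][assignment[node]]
    return prob


# P(snp1=1, snp2=0, disease=1) = 0.3 * 0.6 * 0.8
p_example = joint_probability({"snp1": 1, "snp2": 0, "disease": 1})

# Sanity check: the factorized joint distribution sums to one.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
```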

**Definition 2 (V-structure)**

**Definition 3 (D-separation)**

Bayesian networks allow us to explore causal relationships to perform explanatory analysis and make predictions. As illustrated in the figure "An Example of Genome-wide Association Studies", the goal is to identify the SNPs, SNP_{1}, SNP_{2}, …, SNP_{k}, which are associated with a disease.

**An Example of Genome-wide Association Studies.** The goal of genome-wide association studies is to identify the SNPs, SNP_{1}, SNP_{2}, …, SNP_{k}, which are associated with the disease.

Structure learning of Bayesian networks

Even though a Bayesian network can be constructed by an expert, most tasks of determining the network structure are too complex for humans. We have no choice but to learn the network structure and parameters from data. There are two types of structure learning methods for Bayesian networks: constraint-based methods and score-and-search methods.

The constraint-based methods first build the skeleton of the network (an undirected graph) from a set of dependence and independence relationships. Next, they direct the links in the undirected graph to construct a directed graph with d-separation properties corresponding to the dependence and independence determined

The score-and-search methods view a Bayesian network as a statistical model and transform the structure learning of Bayesian network into a model selection problem

One of the most important issues in score-and-search methods is the selection of the score function. A natural choice is the likelihood function. However, the maximum likelihood score often overfits the data because it does not reflect the model complexity. Therefore, a good score function for Bayesian network structure learning must balance the fitness and the complexity of a selected structure. There are several existing score functions based on a variety of principles, such as information theory and minimum description length (BIC score, AIC score, MDL score)

The general idea of BDe score is to compute the posterior probability distribution. Consider that we want to learn the structure _{i}_{i}_{i}_{i}_{i}

where _{ijk}_{i}_{i}_{ij}_{ijk}_{ijk}_{i}q_{i}

where

In (4), by setting

If we set

The BIC score is derived from a Taylor expansion and Laplace approximation when the number of samples
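As an illustration of how a penalized score balances fit and complexity, the sketch below computes the BIC score in its standard decomposable form for discrete networks: the maximized log-likelihood, summed over nodes, parent configurations and states, minus (log N)/2 times the number of free parameters. It is a minimal sketch under stated assumptions, not the implementation used in this work; the function name, the data layout and the XOR toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def bic_score(data, parents, arities):
    """Standard BIC score of a discrete Bayesian network structure.

    data:    list of dicts mapping variable name -> observed state
    parents: dict mapping variable name -> tuple of parent names
    arities: dict mapping variable name -> number of states
    """
    n = len(data)
    score = 0.0
    for node, pa in parents.items():
        # Counts per (parent configuration, state) and per parent configuration.
        n_ijk = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        loglik = sum(c * math.log(c / n_ij[j]) for (j, _), c in n_ijk.items())
        q_i = math.prod(arities[p] for p in pa)   # number of parent configurations
        penalty = 0.5 * math.log(n) * q_i * (arities[node] - 1)
        score += loglik - penalty
    return score


# Hypothetical toy data: disease status is the XOR of two binary loci.
rows = [{"snp1": a, "snp2": b, "disease": a ^ b}
        for a, b in product([0, 1], repeat=2)] * 25
arities = {"snp1": 2, "snp2": 2, "disease": 2}

score_empty = bic_score(rows, {"snp1": (), "snp2": (), "disease": ()}, arities)
score_single = bic_score(rows, {"snp1": (), "snp2": (), "disease": ("snp1",)}, arities)
score_full = bic_score(rows, {"snp1": (), "snp2": (), "disease": ("snp1", "snp2")}, arities)
```

On this toy data the structure with both loci as parents of the disease node scores highest, while adding only one locus scores slightly worse than the empty structure, because the extra parameters are penalized without any gain in likelihood.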

The computational task in score-and-search methods is to find a network structure with the highest score. The search space consists of a superexponential number of structures, 2^{O(n^{2})}, and thus exhaustively searching for the optimal Bayesian network structure from data is NP-hard

• Add an edge

• Remove an edge

• Reverse an edge

With these three operators, we can construct the local neighbourhood of the current network. We then select the network with the highest score in the local neighbourhood to obtain the maximal gain. This process is repeated until it reaches a local maximum. However, the greedy hill-climbing algorithm cannot guarantee a global maximum
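The greedy hill-climbing search over the add, remove and reverse operators can be sketched as follows. This is a generic illustration rather than the search used by bNEAT; the function names and the toy score are hypothetical. The toy score mimics a pure interaction, in which a single edge into the disease node does not help but both edges together do.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Repeatedly strip nodes without incoming edges; failure means a cycle."""
    remaining = set(nodes)
    while remaining:
        roots = {v for v in remaining
                 if not any(b == v and a in remaining for a, b in edges)}
        if not roots:
            return False
        remaining -= roots
    return True


def neighbours(nodes, edges):
    """All DAGs reachable by adding, removing or reversing one edge."""
    edges = frozenset(edges)
    candidates = []
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            candidates.append(edges - {(a, b)})                # remove an edge
            candidates.append((edges - {(a, b)}) | {(b, a)})   # reverse an edge
        elif (b, a) not in edges:
            candidates.append(edges | {(a, b)})                # add an edge
    return [c for c in candidates if is_acyclic(nodes, c)]


def hill_climb(nodes, score, start=frozenset()):
    """Move to the best-scoring neighbour until no neighbour improves the score."""
    current = frozenset(start)
    current_score = score(current)
    while True:
        best = max(neighbours(nodes, current), key=score, default=None)
        if best is None or score(best) <= current_score:
            return current, current_score
        current, current_score = best, score(best)


# Hypothetical toy score: only the two target edges together are rewarded.
target = {("snp1", "disease"), ("snp2", "disease")}


def toy_score(edges):
    hits = len(edges & target)
    extras = len(edges - target)
    return (5 if hits == 2 else -hits) - extras


nodes = ["snp1", "snp2", "disease"]
stuck, stuck_score = hill_climb(nodes, toy_score)
found, found_score = hill_climb(nodes, toy_score, start={("snp1", "disease")})
```

Started from the empty graph, the search stops immediately because every single-edge change lowers the toy score, even though the two-edge structure scores higher. Started from one correct edge, it reaches that structure.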

The proposed method uses Branch-and-Bound (B&B) to search for a structure that maximizes the BIC score. The algorithm is shown in Figure

the bNEAT algorithm

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BH designed and implemented the algorithm. XWC conceived the study and designed the experiments. Both authors drafted the manuscript and approved the final manuscript.

Acknowledgements

This work is supported by the US National Science Foundation Award IIS-0644366.

This article has been published as part of