Bioinformatics and Computational Life Sciences Laboratory, Information and Telecommunication Technology Center, Department of Electrical Engineering and Computer Science, The University of Kansas, 1520 West 15th Street, Lawrence, KS 66045, USA

Children’s Mercy Hospital and University of Missouri - Kansas City, Kansas City, MO 64108, USA

Abstract

Background

Interactions among genetic factors related to diseases are called epistasis. With the availability of genotyped data from genome-wide association studies, it is now possible to computationally unravel epistasis related to the susceptibility to common complex human diseases such as asthma, diabetes, and hypertension. However, detecting epistatic interactions is difficult because of the large number of genetic factors and the enormous number of possible combinations among them. Most computational methods for detecting epistatic interactions are predictor-based and cannot find the true causal factors. Moreover, they are both time-consuming and sample-consuming.

Results

We propose a new and fast Markov Blanket-based method, FEPI-MB (Fast EPistatic Interactions detection using Markov Blanket), for detecting epistatic interactions. The Markov Blanket is a minimal set of variables that completely shields the target variable from all other variables. Markov Blanket learning can be used to detect epistatic interactions through a heuristic search for a minimal set of SNPs that may cause the disease. Experimental results on both simulated data sets and a real data set demonstrate that FEPI-MB significantly outperforms other existing methods and is capable of finding SNPs that have a strong association with common diseases.

Conclusions

The FEPI-MB algorithm outperforms other computational methods for detecting epistatic interactions in terms of both power and sample-efficiency. Moreover, compared with other Markov Blanket learning methods, FEPI-MB is more time-efficient and achieves better performance.

Background

In recent years, the success of genome-wide association studies (GWAS) has made it possible to detect genetic factors that influence the susceptibility to particular diseases in human populations.

Recently, a number of statistical methods have been proposed to detect epistatic interactions. Among these methods, the most commonly used one is logistic regression (LR).

In this paper, we propose a new and fast Markov Blanket method, FEPI-MB (Fast EPistatic Interactions detection using Markov Blanket), to detect epistatic interactions. The Markov Blanket is a minimal set of variables, which can completely shield the target variable from all other variables. As shown in Figure


**Example of genome-wide association studies (GWAS).** The goal of genome-wide association studies is to identify the

Some Markov Blanket methods take a divide-and-conquer approach that breaks the problem of identifying the Markov Blanket of a variable T, MB(T), into two subproblems: first, identifying the parents and children of T, PC(T), and second, identifying the other parents of the children of T (spouses). The goal of epistatic interaction detection is to identify causal interacting genes or SNPs for particular diseases. It is therefore a special application of the Markov Blanket method, because we only need to detect the parents of the target variable T (the disease status label). Our new Markov Blanket method makes some simplifications to adapt to this special condition.

We apply the FEPI-MB algorithm to simulated datasets based on four disease models and to a real dataset, the Age-related Macular Degeneration (AMD) dataset. We demonstrate that the proposed method significantly outperforms other commonly used methods and is capable of finding SNPs strongly associated with diseases. Compared with other Markov Blanket learning methods, our method is faster and still achieves better performance.

Results

Simulated data generation

We first evaluate the proposed FEPI-MB on simulated data sets, which are generated from three commonly used two-locus epistatic models _{A}_{B}

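Four disease models

Model1

| | AA | Aa | aa |
|---|---|---|---|
| BB | … | … | …^{2} |
| Bb | … | …^{2} | …^{3} |
| bb | …^{2} | …^{3} | …^{4} |

Model2

| | AA | Aa | aa |
|---|---|---|---|
| BB | … | … | … |
| Bb | … | … | …^{2} |
| bb | … | …^{2} | …^{4} |

Model3

| | AA | Aa | aa |
|---|---|---|---|
| BB | … | … | … |
| Bb | … | … | … |
| bb | … | … | … |

Model4

| | | BB | Bb | bb |
|---|---|---|---|---|
| AA | CC | … | … | … |
| AA | Cc | … | … | … |
| AA | cc | … | … | … |
| Aa | CC | … | … | … |
| Aa | Cc | … | … | … |
| Aa | cc | … | … | … |
| aa | CC | … | … | … |
| aa | Cc | … | … | … |
| aa | cc | … | … | … |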

where _{A}_{B}_{A}_{B}_{A}_{B}

In Model1 the odds of disease increase in a multiplicative mode both within and between two loci. For example, an individual with Aa at locus A has larger odds, which are 1 + ^{2}. We can also find similar effects on locus B. Finally the odds of disease for each combination of genotypes at loci A and B can be obtained by the product of the two within-locus effects. Model2 demonstrates two-locus interaction multiplicative effects because at least one disease-associated allele must be present at each locus to increase the odds beyond the baseline level. Moreover the increment of the disease-associated allele at loci A or B can further increase the disease odds by the multiplicative factor 1 +
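The verbal description of Model1 and Model2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' generator: the parameter names `baseline` and `theta` are hypothetical, the lowercase alleles (a, b) are assumed to be the disease-associated ones, and the exponent rules (`a + b` for Model1, `a * b` for Model2) are inferred from the description above and from the exponents that survive in the table of disease models.

```python
GENOTYPES_A = ["AA", "Aa", "aa"]  # list index = assumed number of risk alleles at locus A
GENOTYPES_B = ["BB", "Bb", "bb"]  # list index = assumed number of risk alleles at locus B


def odds_model1(a, b, baseline=1.0, theta=0.5):
    """Multiplicative within and between loci: one factor (1 + theta) per risk allele."""
    return baseline * (1.0 + theta) ** (a + b)


def odds_model2(a, b, baseline=1.0, theta=0.5):
    """Interaction only: the odds exceed the baseline only if both loci carry a risk allele."""
    return baseline * (1.0 + theta) ** (a * b)


def odds_table(model, baseline=1.0, theta=0.5):
    """Return {(genotype at B, genotype at A): odds} for all nine genotype combinations."""
    return {
        (gb, ga): model(a, b, baseline, theta)
        for b, gb in enumerate(GENOTYPES_B)
        for a, ga in enumerate(GENOTYPES_A)
    }


if __name__ == "__main__":
    for (gb, ga), odds in odds_table(odds_model2).items():
        print(gb, ga, round(odds, 4))
```

Under these assumptions, Model1 gives exponents 2, 3 and 4 in the bb row, and Model2 gives 0, 2 and 4, which agrees with the pattern of exponents in the table.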

To generate data, we need to determine three parameters associated with each model: the marginal effect of each disease locus (^{2} calculated from allele frequencies ^{2} take two values (0.7, 1.0) for each model. For each non-disease marker, we randomly choose its MAF from a uniform distribution on [0.0, 0.5]. We first generate 50 small datasets for each parameter setting of each model; each dataset contains 100 markers genotyped for 1,000 cases and 1,000 controls. To test the scalability of FEPI-MB, we also generate 50 large datasets using the same parameter settings for each model; each of these contains 500 markers genotyped for 2,000 cases and 2,000 controls.
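As a rough illustration of the non-disease part of this procedure, the sketch below draws each marker's MAF uniformly from [0.0, 0.5] and then samples genotypes. It assumes Hardy-Weinberg equilibrium and markers that are independent of disease status and of each other; the function names are hypothetical, and the disease loci themselves (which depend on the model odds and case-control sampling) are deliberately left out.

```python
import random


def sample_genotype(maf, rng):
    """Draw one genotype as a minor-allele count (0, 1 or 2), assuming Hardy-Weinberg equilibrium."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def simulate_null_markers(n_markers, n_samples, seed=0):
    """Simulate non-disease markers only.

    Returns (mafs, genotypes) where genotypes[i][j] is the minor-allele count
    of sample i at marker j. Because these markers are assumed independent of
    disease status, the same routine can be used for cases and controls.
    """
    rng = random.Random(seed)
    mafs = [rng.uniform(0.0, 0.5) for _ in range(n_markers)]
    genotypes = [
        [sample_genotype(maf, rng) for maf in mafs]
        for _ in range(n_samples)
    ]
    return mafs, genotypes


if __name__ == "__main__":
    # One "small" dataset worth of null markers: 100 markers, 2,000 individuals.
    mafs, genotypes = simulate_null_markers(n_markers=100, n_samples=2000, seed=42)
    print(len(genotypes), "samples x", len(genotypes[0]), "markers")
```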

Epistasis detection on simulated data

We compare the FEPI-MB algorithm with three commonly used methods, BEAM, SVM and MDR, on the four simulated disease models. To measure the performance of each method, we use “power” as the criterion. Power is calculated as the fraction of the 50 simulated datasets in which the disease-associated markers are identified and demonstrate statistically significant associations (^{2} test values below a threshold for FEPI-MB) with the disease.
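The power criterion can be written down directly. The sketch below assumes that a dataset counts as a success only when every disease-associated marker appears among the markers a method reports as significant; that reading, and the function name `power`, are assumptions made for illustration.

```python
def power(detected_per_dataset, disease_markers):
    """Fraction of datasets in which all disease markers were reported.

    detected_per_dataset: one collection of reported marker names per dataset.
    disease_markers: the true disease-associated markers of the simulation.
    """
    if not detected_per_dataset:
        raise ValueError("at least one dataset is required")
    disease = set(disease_markers)
    hits = sum(1 for detected in detected_per_dataset if disease <= set(detected))
    return hits / len(detected_per_dataset)


if __name__ == "__main__":
    reported = [{"snpA", "snpB"}, {"snpA"}, {"snpA", "snpB", "snp17"}, set()]
    print(power(reported, ["snpA", "snpB"]))
```

Note that this definition does not penalize extra markers in the reported set, so false positives have to be tracked separately.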

We show the results on the simulated data in Figures

Performance comparison for small datasets containing 100 markers genotyped from 1000 cases and 1000 controls.


Performance comparison for large datasets containing 500 markers genotyped from 2000 cases and 2000 controls.


An important issue for epistatic interaction detection in genome-wide association studies is the number of available samples. Typically, the sample size is limited, and computational models consequently behave differently. We explore the effect of the number of samples on the performance of BEAM and FEPI-MB (SVM always introduces a large number of false positives and is therefore not compared here). We generate synthetic datasets containing 40 markers genotyped for different numbers of cases and controls with ^{2} = 1 and MAF = 0.5. The result is shown in Figure

Effect of number of samples on the performance of FEPI-MB and BEAM.


We also compare the performance of FEPI-MB with interIAMBnPC on the large datasets from Model1 to show the time efficiency of FEPI-MB. Among the three variants of IAMB, interIAMBnPC can achieve the best performance.

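Comparison of performance of FEPI-MB and interIAMBnPC for the large datasets of Model1

| Model | | ^{2} | MAF | Algorithm | Power | Average time (s) |
|---|---|---|---|---|---|---|
| 1 | 0.3 | 0.7 | 0.05 | FEPI-MB | 3 | 0.4574 |
| 1 | 0.3 | 0.7 | 0.05 | interIAMBnPC | 3 | 7.5505 |
| 1 | 0.3 | 0.7 | 0.1 | FEPI-MB | 6 | 0.4437 |
| 1 | 0.3 | 0.7 | 0.1 | interIAMBnPC | 5 | 9.2449 |
| 1 | 0.3 | 0.7 | 0.2 | FEPI-MB | 20 | 0.4436 |
| 1 | 0.3 | 0.7 | 0.2 | interIAMBnPC | 20 | 9.4295 |
| 1 | 0.3 | 0.7 | 0.5 | FEPI-MB | 42 | 0.4449 |
| 1 | 0.3 | 0.7 | 0.5 | interIAMBnPC | 42 | 8.2823 |
| 1 | 0.3 | 1 | 0.05 | FEPI-MB | 2 | 0.4393 |
| 1 | 0.3 | 1 | 0.05 | interIAMBnPC | 2 | 7.3610 |
| 1 | 0.3 | 1 | 0.1 | FEPI-MB | 12 | 0.4421 |
| 1 | 0.3 | 1 | 0.1 | interIAMBnPC | 12 | 9.7156 |
| 1 | 0.3 | 1 | 0.2 | FEPI-MB | 39 | 0.4431 |
| 1 | 0.3 | 1 | 0.2 | interIAMBnPC | 38 | 9.6498 |
| 1 | 0.3 | 1 | 0.5 | FEPI-MB | 45 | 0.4449 |
| 1 | 0.3 | 1 | 0.5 | interIAMBnPC | 43 | 9.1229 |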
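The timing columns of this comparison can be turned into per-setting speedups with simple arithmetic. The sketch below copies the average times from the table; the tuple layout and variable names are only illustrative.

```python
# (value of the ^{2} column, MAF, FEPI-MB average time in s, interIAMBnPC average time in s)
TIMINGS = [
    (0.7, 0.05, 0.4574, 7.5505),
    (0.7, 0.1, 0.4437, 9.2449),
    (0.7, 0.2, 0.4436, 9.4295),
    (0.7, 0.5, 0.4449, 8.2823),
    (1.0, 0.05, 0.4393, 7.3610),
    (1.0, 0.1, 0.4421, 9.7156),
    (1.0, 0.2, 0.4431, 9.6498),
    (1.0, 0.5, 0.4449, 9.1229),
]

# Speedup = time of the slower method divided by time of the faster one.
speedups = [inter_time / fepi_time for _, _, fepi_time, inter_time in TIMINGS]

if __name__ == "__main__":
    for (param, maf, _, _), s in zip(TIMINGS, speedups):
        print(f"param={param} MAF={maf} speedup={s:.1f}x")
    print(f"min={min(speedups):.1f}x max={max(speedups):.1f}x")
```

For the values listed, every ratio is above 10, which is consistent with the statement that FEPI-MB is more than ten times faster than interIAMBnPC on these datasets.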

Epistasis detection on AMD data

FEPI-MB demonstrates greater power, sample-efficiency, and time-efficiency on simulated data with up to 500 SNPs. In practice, a typical GWAS genotype dataset contains more than 30,000 common SNPs. FEPI-MB also scales to the large datasets found in real genome-wide case-control studies. We apply FEPI-MB to an Age-related Macular Degeneration (AMD) dataset, which contains 116,204 SNPs genotyped for 96 cases and 50 controls.

On an Intel Core 2 Duo T6600 (2.20 GHz) with 4 GB RAM running Windows Vista, FEPI-MB takes 96.4 s to search for AMD-related SNPs and detects two associated SNPs, rs380390 and rs2402053, which have a ^{2} test p-value of 5.36 × 10^{-10}. The first SNP, rs380390, is already found in

It is worth noting that several lines of evidence have previously shown that the long arm of chromosome 7 (7q) harbors genes implicated in retinal disorders. Among them is the mapping of a locus on 7q31-q32 for retinitis pigmentosa, another retinal disease.

The rs2402053 SNP identified in our study is not located in any of the genes previously reported to be implicated in retinal disorders. It may therefore shed light on a new genetic factor on chromosome 7q that contributes to the underlying mechanism of AMD, a complex retinal degenerative disorder. The actual mechanism of interaction between rs380390 and rs2402053 should be explored further by biological experiments.

Conclusions

Although many computational methods have been used to identify epistatic interactions, most existing methods do not consider the complexity of the genetic mechanisms causing common diseases and focus only on selecting the SNP sets that show the best classification capacity. This inevitably introduces many false positives. Furthermore, most existing methods cannot directly handle genome-wide scale problems. In this paper, we introduce a new and fast Markov Blanket-based method, FEPI-MB, to identify epistatic interactions. We compare FEPI-MB with three other methods, BEAM, SVM and MDR, on both simulated datasets and a real dataset. Our results show that the FEPI-MB algorithm outperforms the other methods in terms of power and sample-efficiency. Moreover, we compare FEPI-MB with one of the best Markov Blanket learning methods, interIAMBnPC, and find that FEPI-MB is more than ten times faster.

Methods

Markov blankets

Bayesian networks represent a joint probability distribution

**Definition 1 (Faithfulness).**

**Theorem 1**.

We can define the Markov Blanket of a variable T, MB (T), as a minimal set for which (

**Theorem 2**.

**Theorem 1** and **Theorem 2** are proven in


**The Asia network.** The gray-filled nodes are the MB(T) of node ‘TBorCancer’.

Given the definition of a Markov Blanket, the probability distribution of T is completely determined by the values of the variables in MB(T). Therefore, Markov Blanket detection can be applied to optimal variable selection and causal discovery. In this paper, we use the Markov Blanket method to detect potential causal SNPs for common complex diseases.
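When the network structure is known, MB(T) can be read off the graph as the parents, the children, and the other parents of the children (spouses) of T. The sketch below illustrates this on the benchmark network from the figure. Only the node name ‘TBorCancer’ comes from the text; the remaining node labels and the edge list are assumed to follow the standard structure of that benchmark.

```python
def markov_blanket(parents, target):
    """Return MB(target) for a DAG given as {node: set of parent nodes}."""
    children = {node for node, ps in parents.items() if target in ps}
    spouses = {p for child in children for p in parents[child]} - {target}
    return set(parents.get(target, set())) | children | spouses


# Assumed edge structure of the benchmark network; node labels are illustrative.
ASIA = {
    "VisitAsia": set(),
    "Smoking": set(),
    "Tuberculosis": {"VisitAsia"},
    "LungCancer": {"Smoking"},
    "Bronchitis": {"Smoking"},
    "TBorCancer": {"Tuberculosis", "LungCancer"},
    "XRay": {"TBorCancer"},
    "Dyspnea": {"TBorCancer", "Bronchitis"},
}

# A disease-status node that has parents only, as in the epistasis setting.
GWAS_TOY = {"SNP1": set(), "SNP2": set(), "SNP3": set(), "Disease": {"SNP1", "SNP2"}}

if __name__ == "__main__":
    print(sorted(markov_blanket(ASIA, "TBorCancer")))
    print(sorted(markov_blanket(GWAS_TOY, "Disease")))
```

In the toy GWAS graph the target has no children, so its Markov Blanket reduces to its parents, which is the simplification that the epistasis setting relies on.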

Markov blanket learning methods

There are several Markov Blanket learning methods such as: Koller-Sahami (KS) algorithm

The Koller-Sahami (KS) algorithm is the first algorithm to employ the Markov Blanket for feature selection. However, there is no theoretical guarantee that the KS algorithm finds the optimal MB set.

To overcome the data inefficient problem of IAMB and its variants, Max-Min Markov Blanket (MMMB) algorithm

Method description: FEPI-MB

Detecting gene-gene interactions is a special application of Markov Blanket learning, because we only need to detect the parents of the target variable T and do not need a complex algorithm to detect the spouses of T. Here the target variable T is the disease status label and the parents of T are the disease SNPs, so MB(T) contains only the parents of T.

All Markov Blanket learning methods are based on the following two theorems.

**Theorem 3.**

**Proof**: This is a direct consequence of **Theorem 1** because now MB(T) only contains the parents of T. □

**Theorem 4.**

**Proof**: Let ** X**,

The ^{2} test is used to test independence and conditional independence between two variables for discrete data. The null hypothesis of the ^{2} test is that the two variables are independent. As described next, the proposed FEPI-MB uses ^{2} to test the association and independence between SNPs and disease status.
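A test of this kind can be sketched on a genotype-by-status contingency table. The symbol of the statistic is not legible in this text, so the example below uses the log-likelihood-ratio statistic G^2 = 2 Σ O ln(O/E) as an assumption; Pearson's chi-square statistic would be used in the same way. A table with three genotype rows and two status columns has 2 degrees of freedom, for which the chi-square tail probability has the closed form exp(-x/2).

```python
import math


def g2_statistic(table):
    """Return (G^2, degrees of freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2.0 * observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return g2, dof


def p_value_two_dof(statistic):
    """Chi-square tail probability, valid only for exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


if __name__ == "__main__":
    # Rows: genotypes with 0, 1, 2 minor alleles; columns: cases, controls (made-up counts).
    snp_vs_status = [[50, 10], [30, 30], [10, 50]]
    g2, dof = g2_statistic(snp_vs_status)
    print(g2, dof, p_value_two_dof(g2))
```

A small p-value rejects the null hypothesis of independence, that is, it indicates an association between the SNP and disease status. A conditional test can be built from the same function by computing the statistic within each stratum of the conditioning variables and summing the statistics and degrees of freedom, in which case the 2-degree-of-freedom shortcut no longer applies.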

The detail of our FEPI-MB algorithm is shown in Figure ^{2} score and is associated with the target variable T in canMB enters MB(T) in the phase of

FEPI-MB algorithm.


Like IAMB and PCMB, the soundness of FEPI-MB is based on the assumptions of DAG-faithfulness and correct independence test.

**Theorem 5**.

**Proof**: First, each node in MB(T) enters MB(T) in the **Theorem 3**. Second, the nodes outside the MB(T) will be removed sooner or later during the **Theorem 4**. □

Even though FEPI-MB is a greedy method, **Theorem 3** and **Theorem 4** guarantee that it will not get stuck in a local optimum.

List of abbreviations used

GWAS: genome-wide association studies; FEPI-MB: Fast EPistatic Interactions detection using Markov Blanket; SNP: single nucleotide polymorphism; LR: logistic regression; MDR: multifactor dimensionality reduction; stepPLR: stepwise penalized logistic regression; BEAM: Bayesian epistasis association mapping; MCMC: Markov Chain Monte Carlo; SVM: Support Vector Machine; RFE: recursive feature elimination; RFA: recursive feature addition; GA: genetic algorithm; AMD: Age-related Macular Degeneration; MAF: minor allele frequency; LD: linkage disequilibrium; HWE: Hardy-Weinberg Equilibrium; DAG: directed acyclic graph.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BH designed and implemented the FEPI-MB method, tested the existing methods and analyzed experimental results. XWC conceived the study, designed the experiments, and analyzed experimental results. ZT analyzed experimental results. All authors helped in drafting the manuscript and approved the final manuscript.

Acknowledgements

This work is supported by the US National Science Foundation Award IIS-0644366.