Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, P.R.China

Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut 06269-2155, USA

Abstract

Background

Identifying the genetic variants that contribute to disease susceptibility is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the minor allele frequencies (MAFs) of risk variants range from common to rare. Although association studies are shifting to incorporate rare variants (RVs) that affect complex traits, existing approaches have had limited success, and further work is needed.

Results

In this article, we focus on detecting associations between multiple rare variants and traits. Similar to

Introduction

In most existing genetic variant association studies, the "common trait, common variants" hypothesis, which asserts that common genetic variants contribute to most traits (disease susceptibilities), serves as the central assumption. Researchers have successfully identified some significant associations between common single nucleotide polymorphisms (SNPs) and disease traits. Although some rare variants associated with Mendelian diseases have been identified, more often the allelic population attributable risk (PAR), which describes the reduction in incidence that would be observed if the population were entirely unexposed, compared with the actual exposure pattern, is low. The odds ratio (OR), a measure of the strength of association or non-independence between two binary data values, is also low. Moreover, based on the "common trait, rare variants" hypothesis, in many cases a set of rare variants, instead of just one variant, should be identified to fully explain the genetic influence. Both the single-variant test
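As a rough illustration of the two quantities defined above, the following sketch computes the OR and the PAR from a 2x2 table using their standard epidemiological definitions. The cohort counts and the function names are hypothetical and are not taken from this study.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio of a 2x2 table: (a * d) / (b * c)."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


def population_attributable_risk(exposed_cases, exposed_total,
                                 unexposed_cases, unexposed_total):
    """Fraction of the overall incidence that would vanish if nobody were exposed."""
    incidence_all = (exposed_cases + unexposed_cases) / (exposed_total + unexposed_total)
    incidence_unexposed = unexposed_cases / unexposed_total
    return (incidence_all - incidence_unexposed) / incidence_all


# Hypothetical cohort: 100 carriers of a rare allele and 9900 non-carriers.
a, n_exposed = 4, 100        # affected carriers, all carriers
c, n_unexposed = 198, 9900   # affected non-carriers, all non-carriers

print(odds_ratio(a, n_exposed - a, c, n_unexposed - c))
print(population_attributable_risk(a, n_exposed, c, n_unexposed))
```

With these hypothetical counts the carriers have twice the risk of the non-carriers (OR of about 2.04), yet the PAR is only about 1% because carriers make up 1% of the cohort. This is the sense in which a single rare variant tends to have a low PAR.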

Alternatively, the collapsing strategy, also called the "burden-based test", is another approach for rare variant association studies. Most collapse-based approaches build on the "recessive-set" genetic model, in which the predisposing haplotype contains mutation(s) in at least one variant

Collapse-based approaches have low statistical power when "causal", "neutral" and "protective" variants are combined
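The collapsing idea, and the loss of signal when neutral variants are collapsed together with causal ones, can be sketched as follows. This is a minimal illustration under an assumed "at least one minor allele" rule. The genotype matrix, the site labels and the function names are hypothetical, and the sketch does not reproduce any particular published test.

```python
def collapse(genotypes, selected):
    """Return 1 for each individual carrying a minor allele at any selected site."""
    return [int(any(row[s] > 0 for s in selected)) for row in genotypes]


def carrier_fraction(indicator, phenotypes, status):
    """Fraction of carriers among individuals with the given phenotype."""
    group = [x for x, y in zip(indicator, phenotypes) if y == status]
    return sum(group) / len(group)


# Hypothetical toy data: rows are individuals, columns are rare variant sites,
# entries count minor alleles (0, 1 or 2). Site 0 plays the role of a causal
# variant and site 2 the role of a neutral one.
genotypes = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 1],
]
phenotypes = [1, 1, 1, 0, 0, 0]  # 1 = case, 0 = control

for sites in ([0], [0, 2]):
    x = collapse(genotypes, sites)
    gap = carrier_fraction(x, phenotypes, 1) - carrier_fraction(x, phenotypes, 0)
    print(sites, x, round(gap, 3))
```

Collapsing site 0 alone marks two of the three cases and none of the controls as carriers. Adding the neutral site 2 makes the carrier fraction two of three in both groups, so the collapsed difference between cases and controls disappears.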

Motivated by

We adopt the "dominant" and "recessive set" genetic models, which are also used in

Methods

Notions and model overview

Suppose we are given _{i }_{i }_{i }_{i }_{i }

The core of our approach is a Markov random field (MRF) model. We first introduce four key components of modeling this MRF:

• The observed data of this MRF consist of all of the genotypes and phenotypes.

• There are two unknown states for each site: one is the causal or non-causal status and the other is the region location status. Here, we define them as the hidden states of this Markov random field. Let a latent vector _{s }_{s }_{s }_{s }

• A neighborhood system is required in the MRF model to describe the interactions among hidden states. Details of the hidden states and neighborhood system are shown in the section "Estimation of the hidden states in HMRF".

• There are two kinds of probabilities in the MRF model: emission probabilities and transition probabilities. Emission probabilities bridge the relationships among genotypes, phenotypes and hidden states. Moreover, hidden states _{s }_{s }_{s }_{s }_{s }_{s }_{s }_{s }

The central thesis of our approach is that causal rare variants, which should be collapsed together, are treated as one random vector variable with certain dimensions. Then, the probability of this set of causal rare variants becomes the probability of one variable being associated with the phenotype. Based on the Markov-Gibbs equivalence

Estimation of the hidden states in HMRF

Neighborhood system

Assume there are _{s }_{s }

where _{s }_{s}_{s,s' }

The _{s,s' }_{s }_{s' }

Hidden states

Rare variant

where _{s }_{s}_{s}

and the joint probability of latent vector _{R }

Estimation of the emission probabilities in HMRF

We now estimate the emission probabilities to relate

If _{s }_{s }_{s}_{s }_{s }

The marginal distribution of

On the other hand, if _{s }_{s }_{s }_{s }_{s}_{s }

where _{s}_{s }_{s}_{s }

Estimation of the transition probabilities in HMRF

The transition probabilities link the hidden states _{E }_{B }

and

where

where

Thus, we have the conditional probability of

and the posterior distribution of ξ given

Similarly, the posterior distribution of ζ given

Thus far, we have obtained all of the three transition probabilities of this HMRF:

Estimation of the model parameters

Based on the Gibbs-Markov Equivalence _{R}

• Step 1: Estimate _{θ }_{ρ }

Similarly, update

• Step 2: Estimate _{ξ}, β_{ξ }_{ζ}, β_{ζ }

• Step 3: Estimate Φ and Φ_{R }

and

• Step 4: Update

and

There are several ways to exit from this iteration. We measure the Euclidean distance between the current and the updated
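A stopping rule of this kind can be sketched as below, taking the compared quantities to be plain numeric vectors (an assumption). The tolerance, the iteration cap and the toy update function are hypothetical stand-ins and are not the estimation steps of this article.

```python
import math


def euclidean_distance(old, new):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))


def iterate_until_stable(update, start, tol=1e-6, max_iter=1000):
    """Apply `update` until successive vectors differ by less than `tol`."""
    current = list(start)
    for iteration in range(1, max_iter + 1):
        updated = update(current)
        if euclidean_distance(current, updated) < tol:
            return updated, iteration
        current = updated
    return current, max_iter


# Hypothetical stand-in for one pass of the update steps:
# a contraction that moves the vector halfway towards (1, 2).
def toy_update(v):
    return [0.5 * (v[0] + 1.0), 0.5 * (v[1] + 2.0)]


estimate, n_iter = iterate_until_stable(toy_update, [0.0, 0.0])
print(estimate, n_iter)
```

The iteration cap guards against updates that never settle; in that case the last vector is returned together with `max_iter`.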

Experiments and results

In this section, we apply our approach to a real dataset from

Simulation frameworks

As the simulation settings in different papers

Fixed number of causal variants

First, we generate the datasets with fixed numbers of causal variants, following previous approaches _{s}

where _{S }_{N }_{S }_{N }_{s }_{s}

Causal variants depend on PAR

The second way generates a set,

where _{D }

We use the algorithm proposed in _{s}_{S }_{N }_{s}

For those non-causal variants, we use Fu's model _{s}

Causal variants depend on regions

There are many ways to generate a dataset with regions. The simplest way is to preset the elevated regions and the background regions and to plant causal variants based on certain probabilities. An alternate way creates the regions by a Markov chain. For each site, there are two groups of states. The

To generate enough genotypes, we perform the following steps for each variant: if the process drops into _{S }_{N }

Comparisons on power

Similar to the measurements in ^{-6 }based on the Bonferroni correction assuming 20000 genes, genome-wide. We test at most 1000 datasets for each comparison experiment.
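The power measurement described here can be sketched as follows: a Bonferroni-corrected threshold is derived from the number of genes, and power is the share of simulated datasets whose p-value falls below it. The family-wise level of 0.05 is an assumption, and the p-values and function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def empirical_power(p_values, threshold):
    """Share of simulated datasets whose p-value falls below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


# alpha = 0.05 is an assumed family-wise level; 20000 genes comes from the text.
threshold = bonferroni_threshold(0.05, 20000)

# Hypothetical p-values, one per simulated dataset.
p_values = [1e-9, 3e-7, 2e-6, 4e-6, 0.01]
print(threshold, empirical_power(p_values, threshold))
```

Under the assumed level, the threshold is 0.05 / 20000, and three of the five hypothetical p-values fall below it.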

Power versus different proportions of causal variants

We compare the powers under different total numbers of variants. In the first group of experiments, we include 50 causal variants and vary the total number of variants from 100 to 5000. Thus, the proportion of causal variants decreases from 50% to 1%. In the second group of experiments, we hold the group PAR at 5% and vary the total number of variants as before. The results are compared in Table

The power comparisons at different proportions of causal variants

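| **Total** | **Causal** | **RareProb** | **RareCover** | **RWAS** | **LRT** |
|---|---|---|---|---|---|
| 100 | 50 | 100% | 100% | 100% | 100% |
| 200 | 50 | 100% | 100% | 99.6% | 99.9% |
| 400 | 50 | 100% | 100% | 85.3% | 88.6% |
| 600 | 50 | 100% | 94.6% | 54.1% | 58.8% |
| 800 | 50 | 100% | 0.0% | 33.0% | 36.5% |
| 1000 | 50 | 100% | 0.0% | 20.7% | 22.0% |
| 2000 | 50 | 100% | 0.0% | 2.0% | 2.0% |
| 3000 | 50 | 100% | 0.0% | 0.8% | 0.0% |
| 4000 | 50 | 100% | 0.0% | 0.4% | 0.0% |
| 5000 | 50 | 100% | 0.0% | 0.3% | 0.0% |
| 200 | 1* | 51.0% | 0.0% | 0.0% | 0.0% |
| 400 | 3* | 77.0% | 0.0% | 0.0% | 0.0% |
| 600 | 2* | 63.6% | 0.0% | 0.0% | 0.0% |
| 800 | 3* | 57.1% | 0.0% | 0.0% | 0.0% |
| 1000 | 3* | 59.0% | 0.0% | 0.0% | 0.0% |
| 2000 | 1* | 34.0% | 0.0% | 0.0% | 0.0% |
| 3000 | 2* | 41.2% | 0.0% | 0.0% | 0.0% |
| 4000 | 3* | 40.0% | 0.0% | 0.0% | 0.0% |
| 5000 | 2* | 29.8% | 0.0% | 0.0% | 0.0% |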

The upper section of this table shows the results with a fixed number of causal variants; the lower section shows the results with the group PAR held at 5%. The column "Causal" shows the number of causal variants, and "*" indicates that the value is an average.

**Table S1**. The power comparisons at different levels of PAR and different numbers of causal variants.

The Type I error rate is another important measurement for evaluating an approach. To compute the Type I error rate, we apply the same technique as

**Table S2**. The power comparisons for different configurations of causal variants that depend on PARs.

Power versus different configurations of regions

We compare the powers under different configurations of elevated regions and background regions and test the performance of our approach in identifying the regions. For each total number of variants, we preset the number of regions between 2 and 8, with half elevated regions and half background regions. In these datasets, the probability of a rare variant being causal is 0.1 if the variant is located in an elevated region and 0.001 if it is located in a background region. In the last group of experiments, the regions are generated by the Markov chain, where the probability of remaining in the same type of region (staying in an elevated region or in a background region) is 0.8, while the probability of switching between region types (jumping from an elevated region to a background region, or from a background region to an elevated region) is 0.2. The emission probabilities are the same as before. We test the powers and record the percentages of correctly identified regions. The results are listed in Table

The power comparisons for different configurations of regions

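| **Total** | **Causal** | **Regions** | **Length** | **RareProb** | **Correct R** |
|---|---|---|---|---|---|
| 1000 | 36* | 1 | 50 | 100% | 96% |
| 1000 | 37* | 2 | 50 | 100% | 98% |
| 1000 | 36* | 3 | 50 | 100% | 97% |
| 1000 | 35* | 4 | 50 | 100% | 98% |
| 2000 | 73* | 1 | 100 | 100% | 97% |
| 2000 | 73* | 2 | 100 | 100% | 97% |
| 2000 | 70* | 3 | 100 | 100% | 98% |
| 2000 | 71* | 4 | 100 | 100% | 96% |

| Total | Causal | Regions | Length | RareCover | Correct R |
|---|---|---|---|---|---|
| 1000 | 36* | 1 | 50 | 0.0% | 1.9% |
| 1000 | 37* | 2 | 50 | 0.0% | 1.4% |
| 1000 | 36* | 3 | 50 | 0.0% | 1.7% |
| 1000 | 35* | 4 | 50 | 0.0% | 1.6% |
| 2000 | 73* | 1 | 100 | 0.0% | 0.7% |
| 2000 | 73* | 2 | 100 | 0.0% | 0.8% |
| 2000 | 70* | 3 | 100 | 0.0% | 1.3% |
| 2000 | 71* | 4 | 100 | 0.0% | 0.8% |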

The column "Causal" represents the total number of causal variants, "Regions" denotes the total number of elevated regions, and "Length" indicates the total number of variants located in elevated regions. The column "Correct R" shows the percentage of correctly identified regions.
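The Markov-chain variant of the region simulation used in these experiments can be sketched as follows. The probabilities 0.8, 0.1 and 0.001 are those given in the text. The uniform choice of the starting region, the labels "E" and "B", the seed and the function name are assumptions made for illustration, and genotype generation is left out.

```python
import random


def simulate_regions(n_sites, stay=0.8, p_elevated=0.1, p_background=0.001, seed=0):
    """Label each site elevated ("E") or background ("B") with a two-state
    Markov chain, then plant causal variants with a region-specific probability."""
    rng = random.Random(seed)
    elevated = rng.random() < 0.5  # assumed: uniform choice of the starting region
    regions, causal = [], []
    for _ in range(n_sites):
        regions.append("E" if elevated else "B")
        p_causal = p_elevated if elevated else p_background
        causal.append(rng.random() < p_causal)
        if rng.random() >= stay:  # switch region type with probability 1 - stay
            elevated = not elevated
    return regions, causal


regions, causal = simulate_regions(1000)
print(regions.count("E"), sum(causal))
```

Because staying and switching are symmetric between the two region types, roughly half of the sites are expected to fall in elevated regions over a long sequence, and nearly all planted causal variants are expected to lie there.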

**Table S3**. The power comparisons for different configurations of regions.

RareProb on real mutation screening data

Finally, we apply our approach to a real mutation screening dataset. This dataset has been previously published by

We apply ^{-16}. As a comparison, authors in

Conclusion

In this article, we propose a probabilistic method,

The Markov random field model treats all of the variants as one vector and estimates their causal/non-causal status by globally maximizing the likelihood of genotypes instead of by local optimization. Our approach gains more power than existing group-wise collapsing approaches;

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JW and ZM conducted this research. JW designed algorithms and experiments. ZC, AY and JZ developed the software packages and participated in the performance analysis and the experiments on the real dataset. JW, ZM and JZ wrote this paper. All authors have read and approved the final manuscript.

Declarations

The publication costs for this article were funded by Xi'an Jiaotong University.

This article has been published as part of

Acknowledgements

This work was supported by the National Science Foundation [IIS-0803440], [CCF-1116175] and [IIS-0953563] and the Ph.D. Programs Foundation of Ministry of Education of China [20100201110063]. The authors thank Professor Sean Tavtigian and Professor Georgia Chenevix-Trench for sharing the