Division of Biostatistics and Epidemiology, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA

Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA

Division of Asthma Research, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA

Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA

Division of Physical Medicine and Rehabilitation, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Avenue, Cincinnati, OH 45229, USA

Abstract

We propose a nonparametric Bayes-based clustering algorithm to detect associations with rare and common single-nucleotide polymorphisms (SNPs) for quantitative traits. Unlike current methods, our approach identifies associations with rare genetic variants at the variant level, not the gene level. In this method, we use a Dirichlet process prior for the distribution of SNP-specific regression coefficients, conduct hierarchical clustering with a distance measure derived from posterior pairwise probabilities of two SNPs having the same regression coefficient, and explore data-driven approaches to select the number of clusters. SNPs falling inside the largest cluster have relatively low or close to zero estimates of regression coefficients and are considered not associated with the trait. SNPs falling outside the largest cluster have relatively high estimates of regression coefficients and are considered potential risk variants. Using the data from the Genetic Analysis Workshop 17, we successfully detected associations with both rare and common SNPs for a quantitative trait. We conclude that our method provides a novel and broadly applicable strategy for obtaining association results with a reasonably low proportion of false discovery and that it can be routinely used in resequencing studies.

Background

The two highly debated hypotheses on the genetic basis of complex human diseases are the common disease/common variant (CDCV) hypothesis and the common disease/rare variant (CDRV) hypothesis.

Methods

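Suppose that for each individual $i$ ($i = 1, \dots, n$) we observe a quantitative trait $y_i$, a vector of covariates $\mathbf{x}_i$, and a vector of SNP genotypes $\mathbf{z}_i = (z_{i1}, z_{i2}, \dots, z_{iJ})$, where $z_{ij}$ is the genotype of individual $i$ at SNP $j$. We consider the model

$$
y_i \mid \boldsymbol{\alpha}, \beta_1, \dots, \beta_J, \sigma^2 \sim N\Big(\mathbf{x}_i^{\top}\boldsymbol{\alpha} + \sum_{j=1}^{J} z_{ij}\beta_j,\; \sigma^2\Big),
$$

$$
\boldsymbol{\alpha} \sim \mathrm{MVN}(\mathbf{0}, \tau^2 \mathbf{I}), \qquad \beta_j \mid G \sim G, \qquad G \sim \mathrm{DP}(M G_0),
$$

for $i = 1, \dots, n$ and $j = 1, \dots, J$, where $\beta_j$ is the regression coefficient of SNP $j$, $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$, $\mathrm{MVN}(\boldsymbol{\mu}, \Sigma)$ denotes the multivariate normal distribution with mean vector **μ** and variance-covariance matrix Σ, **0** is a vector of zeros, **I** is an identity matrix, and $\mathrm{DP}(M G_0)$ is the Dirichlet process.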

Dirichlet process

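The Dirichlet process $\mathrm{DP}(M G_0)$ is characterized by the base distribution $G_0$ and the precision parameter $M$. If $G \sim \mathrm{DP}(M G_0)$, then $G_0$ is the prior expectation of $G$.

Sethuraman showed that if

$$
V_k \overset{\text{iid}}{\sim} \mathrm{Beta}(1, M), \qquad \phi_k \overset{\text{iid}}{\sim} G_0, \qquad k = 1, 2, \dots,
$$

and

$$
p_1 = V_1, \qquad p_k = V_k \prod_{l=1}^{k-1} (1 - V_l), \qquad k = 2, 3, \dots,
$$

then

$$
G = \sum_{k=1}^{\infty} p_k \, \delta_{\phi_k}
$$

is a random probability distribution generated from $\mathrm{DP}(M G_0)$, where $\delta_{\phi_k}$ denotes a point mass at $\phi_k$, so that each $\beta_j$ takes the value $\phi_k$ with probability $p_k$.

Ishwaran and James approximated this infinite sum by the truncation $P_N = \sum_{k=1}^{N} p_k \, \delta_{\phi_k}$, with $V_N = 1$ so that the $N$ weights sum to 1.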
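The stick-breaking construction of a Dirichlet process and its finite truncation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the precision value, the truncation level of 20 atoms, and the normal base distribution with standard deviation 0.5 are all hypothetical choices, as is the function name `truncated_stick_breaking`.

```python
import random

def truncated_stick_breaking(precision, n_atoms, base_draw, rng):
    """Draw (weights, atoms) from a truncated stick-breaking approximation.

    precision : DP precision parameter (assumed value, for illustration)
    n_atoms   : truncation level N
    base_draw : function drawing one atom from the base distribution G0
    """
    weights, atoms = [], []
    remaining = 1.0  # length of the stick not yet broken off
    for k in range(n_atoms):
        # V_k ~ Beta(1, precision); the last V is fixed at 1 so weights sum to 1.
        v = 1.0 if k == n_atoms - 1 else rng.betavariate(1.0, precision)
        weights.append(v * remaining)
        remaining *= 1.0 - v
        atoms.append(base_draw(rng))
    return weights, atoms

rng = random.Random(0)
# Hypothetical base distribution G0 = N(0, 0.5^2).
weights, atoms = truncated_stick_breaking(1.0, 20, lambda r: r.gauss(0.0, 0.5), rng)

# Coefficients drawn from the discrete distribution share values, which is
# what induces a clustering of SNP-specific coefficients.
betas = rng.choices(atoms, weights=weights, k=50)
print(len(set(betas)), "distinct values among", len(betas), "coefficients")
```

Because the random distribution is discrete, several of the 50 simulated coefficients coincide; a smaller precision value concentrates the weight on fewer atoms and therefore gives fewer distinct values.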

Clustering

Each iteration of the Gibbs sampler gives a clustering structure of SNP-specific regression coefficients such that coefficients taking the same value are clustered together. The number of clusters and the cluster membership of the coefficients vary across iterations, giving a random sample of clustering structures. Pairwise probabilities of two coefficients being equal are calculated from the posterior samples.
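The clustering step can be illustrated with a small, dependency-free Python sketch: estimate the pairwise probability that two coefficients are equal from sampled cluster labels, convert it to a distance, cluster hierarchically, and flag the SNPs outside the largest cluster. The distance 1 − probability and the average-linkage rule are assumptions made for this example (the text only says the distance is derived from the pairwise probabilities), and the label samples are toy values, not GAW17 output.

```python
from itertools import combinations

def pairwise_same_cluster_prob(label_samples):
    """label_samples[t][j] is the cluster label of coefficient j at iteration t."""
    n_iter, n_snp = len(label_samples), len(label_samples[0])
    prob = [[1.0] * n_snp for _ in range(n_snp)]
    for a, b in combinations(range(n_snp), 2):
        same = sum(1 for labels in label_samples if labels[a] == labels[b])
        prob[a][b] = prob[b][a] = same / n_iter
    return prob

def average_linkage(dist, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[j] for j in range(len(dist))]

    def between(c1, c2):
        return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, k = min(combinations(range(len(clusters)), 2),
                   key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[k]
        del clusters[k]
    return clusters

# Toy posterior samples (hypothetical): 4 Gibbs iterations, 5 SNPs.
samples = [
    [0, 0, 0, 1, 2],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 2, 2],
    [0, 0, 0, 0, 1],
]
prob = pairwise_same_cluster_prob(samples)
dist = [[1.0 - p for p in row] for row in prob]  # assumed distance: 1 - probability
clusters = average_linkage(dist, n_clusters=2)

# SNPs outside the largest cluster are treated as potential risk variants.
largest = max(clusters, key=len)
flagged = sorted(j for c in clusters if c is not largest for j in c)
print(clusters, flagged)
```

In this toy input, SNPs 0, 1, and 2 usually share a label and form the largest cluster, so SNPs 3 and 4 are the ones flagged.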

Application of the method

We illustrate our methods using the data from Genetic Analysis Workshop 17. The analyses were performed with the knowledge of the underlying simulation model ^{2} = 1,000, _{0} = 0.5, and

We evaluated our results using two thresholds. When the number of clusters was small (2–5), we defined true positives as true associations identified in at least 2 of the 10 replicates. This threshold was selected to offset the reduced power resulting from small cluster numbers; requiring at least two replications for each identified association still yielded a reasonably low proportion of false discovery (FDP). When we used the optimal cluster numbers, we defined true positives as true associations detected in at least eight replicates. We carried out sensitivity analyses on the prior specification for the SNP-specific regression coefficients with _{0} ranging from 0.1 to 0.5 and

Results and discussion

Successful identification of associations

Table

True discoveries in at least two replicates with 2 to 5 clusters

| Gene | SNP | MAF | β | F2 | F3 | F4 | F5 |
|---|---|---|---|---|---|---|---|
| | C13S523 | 0.066714 | 0.64997 | **10** | **10** | **10** | **10** |
| | C13S431 | 0.017217 | 0.74136 | **9** | **9** | **9** | **9** |
| | C13S522 | 0.027977 | 0.61830 | **8** | **8** | **8** | **8** |
| | C6S2981 | 0.002152 | 1.20645 | **6** | **6** | **6** | **7** |
| | C1S6533 | 0.011478 | 0.56190 | **5** | **6** | **7** | **7** |
| | C13S524 | 0.004304 | 0.62223 | **4** | **4** | **5** | **5** |
| | C4S1884 | 0.020803 | 0.29558 | **4** | **4** | **4** | **5** |
| | C4S1878 | 0.164993 | 0.13573 | **2** | **4** | **4** | **4** |
| | C4S1877 | 0.000717 | 1.07706 | 1 | **4** | **5** | **6** |
| | C4S1889 | 0.000717 | 0.94133 | 1 | **2** | **3** | **5** |
| | C1S6542 | 0.002152 | 0.46026 | 1 | **2** | **2** | **2** |
| | C4S1861 | 0.002152 | 0.56311 | 1 | 1 | 1 | **2** |

F2, F3, F4, F5: frequency of detection (in bold when ≥ 2) over the 10 replicates with 2, 3, 4, and 5 clusters, respectively.

Selection of optimal number of clusters

As the number of clusters increases, more associations may be detected; however, the number of false positives may also increase. To strike a balance between sensitivity and specificity, we examined receiver operating characteristic (ROC) curves (Figure

True discoveries in at least eight replicates with optimal cluster numbers

| Gene | SNP | MAF | β |
|---|---|---|---|
| | C1S6561 | 0.000717 | 0.65721 |
| | C4S1877 | 0.000717 | 1.07706 |
| | C4S1879 | 0.000717 | 0.61830 |
| | C4S1889 | 0.000717 | 0.94133 |
| | C4S4935 | 0.000717 | 1.35726 |
| | C6S2981 | 0.002152 | 1.20645 |
| | C13S431 | 0.017217 | 0.74136 |
| | C13S522 | 0.027977 | 0.61830 |
| | C13S523 | 0.066714 | 0.64997 |
| | C13S524 | 0.004304 | 0.62223 |
| | C1S3181 | 0.000717 | 0.76911 |
| | C1S3182 | 0.000717 | 0.30432 |
| | C1S6533 | 0.011478 | 0.56190 |
| | C13S399 | 0.000717 | 0.39602 |
| | C13S479 | 0.000717 | 0.75946 |
| | C4S1873 | 0.000717 | 0.58301 |
| | C4S1884 | 0.020803 | 0.29558 |
| | C5S5156 | 0.000717 | 0.43010 |
| | C13S505 | 0.000717 | 0.44850 |

The items in the bolded rows are the frequency of detection over the 10 replicates (F), the number of true positives (TP), the number of false positives (FP), and the proportion of false discovery (FDP) in each case.

**ROC curves and optimal cluster numbers.** r_{1}–r_{10} represent the first 10 replicates of the quantitative trait Q1. Numbers in parentheses and dots on the curves indicate optimal number of clusters for each replicate, which ranges from 59 to 96, with an average of 81.
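One way to pick a cluster number from ROC points such as those in the figure is sketched below. The selection criterion is an assumption: the sketch maximizes the Youden index (true-positive rate minus false-positive rate), which is a common choice but is not stated in the text. The causal set, the SNP count, and the flagged sets per cluster number are hypothetical toy values.

```python
def roc_point(flagged, causal, n_snps):
    """Return (false-positive rate, true-positive rate) for one set of flagged SNPs."""
    tp = len(flagged & causal)
    fp = len(flagged - causal)
    tpr = tp / len(causal)
    fpr = fp / (n_snps - len(causal))
    return fpr, tpr

def best_cluster_number(flagged_by_k, causal, n_snps):
    """Pick the cluster number whose ROC point maximizes TPR - FPR (assumed criterion)."""
    def youden(k):
        fpr, tpr = roc_point(flagged_by_k[k], causal, n_snps)
        return tpr - fpr
    return max(sorted(flagged_by_k), key=youden)

# Hypothetical example: 24 SNPs, 4 of them causal.
causal = {1, 2, 3, 4}
n_snps = 24
# SNPs falling outside the largest cluster for each candidate cluster number.
flagged_by_k = {
    2: {1},
    3: {1, 2},
    4: {1, 2, 3, 10},
    5: {1, 2, 3, 10, 11, 12, 13, 14, 15},
}
best_k = best_cluster_number(flagged_by_k, causal, n_snps)
print(best_k, roc_point(flagged_by_k[best_k], causal, n_snps))
```

This kind of selection requires knowing which SNPs are truly associated, so it applies to simulated data such as the workshop replicates rather than to a real resequencing study.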

We then evaluated the performance of this method using a specified number of clusters, ranging from 50 to 100. Using only associations detected in all 10 replicates (100% power), we had 8 to 10 true positives and at most 2 false positives (FDP ranging from 10% to 18%). Using associations detected in at least 90% of the replicates (90% power), we had 8 to 16 true positives and at most 4 false positives (FDP ranging from 8% to 20%). Thus cluster numbers of 50 to 100 seem reasonable.
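The replicate-based evaluation can be made concrete with a short Python sketch. It assumes the usual definition of the proportion of false discovery, FDP = FP / (TP + FP), and uses hypothetical detection frequencies and SNP names rather than the workshop results.

```python
def false_discovery_proportion(true_pos, false_pos):
    """FDP = FP / (TP + FP); defined as 0 when nothing is detected."""
    total = true_pos + false_pos
    return false_pos / total if total else 0.0

def detected_at_least(freq_by_snp, min_replicates):
    """SNPs detected in at least min_replicates of the replicates."""
    return {snp for snp, freq in freq_by_snp.items() if freq >= min_replicates}

# Hypothetical detection frequencies over 10 replicates.
freq_by_snp = {"snp_a": 10, "snp_b": 10, "snp_c": 9, "snp_d": 7,
               "null_x": 10, "null_y": 3}
causal = {"snp_a", "snp_b", "snp_c", "snp_d"}

summary = {}
for threshold in (10, 9):
    hits = detected_at_least(freq_by_snp, threshold)
    tp, fp = len(hits & causal), len(hits - causal)
    summary[threshold] = (tp, fp, false_discovery_proportion(tp, fp))
    print(threshold, tp, fp, round(summary[threshold][2], 2))
```

As a pure arithmetic check on the definition, 9 true positives with 1 false positive give an FDP of 10%, and 9 true positives with 2 false positives give 2/11, or about 18%.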

Characteristics of the true positives and false negatives

Using optimal cluster numbers and the threshold of true positives, we identified 12 of the 23 true associations with private SNPs. As we expected, true positives had overall higher effect sizes than false negatives (Figure

**Boxplot of β and MAF by false-negative or true-positive status.** (a)

Conclusions

We have demonstrated that a novel nonparametric Bayes-based clustering method can be used to identify associations with SNPs for quantitative traits. Importantly, this method is capable of detecting associations with both rare and common genetic variants. Compared with other methods that deal with rare variants, our method detects genetic risk factors directly at the SNP level. Compared with single-SNP-based methods, the proposed method is more powerful and reliable: it detects a relatively larger proportion of true associations independent of the MAF of the variants, and it produces a relatively lower proportion of false discoveries.

Competing interests

The authors declare that there are no competing interests.

Authors’ contributions

LD conceived and performed the statistical analysis and wrote the manuscript. LJM contributed to the design of the statistical analysis and the writing of the manuscript. TMB, HH, XZ, and BGK helped with the writing of the manuscript. All authors read and approved the final manuscript.

Acknowledgments

We wish to thank Siva Sivaganesan for valuable discussion and two anonymous reviewers for their insightful comments and suggestions. This work was supported by National Institutes of Health (NIH) grants K01 HL103165, K12 HD001097-14, K24 HL69712, R01 NS036695, and U19 A1070235. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.

This article has been published as part of