Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089-2910, USA

TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China

Abstract

Background

Genome-wide association studies (GWAS) have identified many common polymorphisms associated with complex traits. However, these common variants explain only a small fraction of the phenotypic variance, leaving a substantial portion of the heritability unexplained. As a result, the search for "missing" heritability is drawing increasing attention, particularly through rare variant studies, which often require large sample sizes and therefore extensive sequencing effort. Although next generation sequencing (NGS) technologies have made it possible to sequence a large number of reads economically and efficiently, it is still often cost-prohibitive to sequence the thousands of individuals generally required for association studies. A more cost-effective design pools the genetic material of multiple individuals and sequences the pools instead of the individuals. Pooled sequencing has made association studies of rare variants more feasible but, at the same time, poses a great challenge for data analysis, essentially because individual sample identity is lost and NGS sequencing errors can be hard to distinguish from low-frequency alleles.

Results

We develop a unified approach, based on an expectation-maximization (EM) algorithm, for minor allele frequency estimation, SNP calling, and association studies using pooled sequencing data. This approach makes it possible to study how the minor allele frequency, the sequencing error rate, the number of pools, the number of individuals in each pool, and the sequencing depth affect the accuracy of the estimated minor allele frequency. We show that the naive method of estimating the minor allele frequency by the fraction of observed minor alleles can be significantly biased, especially for rare variants. In contrast, our EM approach gives an approximately unbiased estimate of the minor allele frequency under all scenarios studied in this paper. We then develop a SNP calling approach for pooled sequencing data based on the EM algorithm, EM-SNP, and compare it with another recent SNP calling method, SNVer. We show that EM-SNP outperforms SNVer in terms of both the fraction of dbSNPs among the called SNPs and the transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to study the association between variants and type 1 diabetes.

Conclusions

The EM-based approach for the analysis of pooled sequencing data can accurately estimate minor allele frequencies, call SNPs, and find associations between variants and complex traits. This approach is especially useful for studies involving rare variants.

Introduction

Finding genomic variants associated with complex traits is one of the most important problems in modern genomics. Genome-wide association studies (GWAS) based on common variants have been the dominant approach to achieve this objective

Studies of rare variants are complicated by their low minor allele frequencies. The development of next generation sequencing (NGS) technologies such as Illumina and Roche 454 has made it possible to sequence a large number of reads economically. Despite this progress, sequencing a large number of individuals separately is still costly for most biological laboratories. One frequently adopted approach to reduce sequencing cost in the search for rare variants is pooled sequencing, where genetic material from several individuals is mixed to form a pool that is sequenced in a single run. While this design greatly lowers the sequencing cost, it also makes it hard to distinguish true genetic polymorphisms from sequencing errors, to estimate minor allele frequencies at the polymorphic loci, and to perform association studies on the rare variants.

Several research groups have used pooled sequencing to detect rare variants that are associated with complex traits such as retinitis pigmentosa, diabetes, cancer, and inflammatory bowel disease

Several groups have developed SNP calling methods based on pooled sequencing data

In order to estimate minor allele frequencies in pooling studies, several groups developed statistical models for the sampling of individuals and the sampling of reads from the individuals in the pools

In this paper, we develop new methods for estimating minor allele frequencies, SNP detection, and association studies using pooled sequencing data based on the models in

Materials and methods

Notation

Consider a locus along the genome. Let

Assume that a total of _{g }

Let

Conditional on

Thus, the probability of observing the data for the

Since the pools can be considered independent, the likelihood of observing the data for all the pools is

Given the above likelihood expression and the data

• Find the maximum likelihood estimate of (

• Determine whether an observed variant is a true SNP or not, i.e. SNP calling.

• Find the variants associated with a phenotype of interest.
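To make the structure of the pooled likelihood concrete, the following is a minimal sketch under a simplified model that is assumed here for illustration and is not taken verbatim from the text: each pool contains `2 * n_ind` chromosomes, the number carrying the minor allele is binomial with frequency `f`, every read samples one chromosome uniformly, and two hypothetical error rates `e01` (major allele read as minor) and `e10` (minor allele read as major) corrupt the reads. All function and parameter names are illustrative.

```python
import math

def log_binom_pmf(x, n, p):
    """Log of the Binomial(n, p) probability of x, handling p = 0 and p = 1."""
    if p <= 0.0:
        return 0.0 if x == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if x == n else -math.inf
    log_coef = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_coef + x * math.log(p) + (n - x) * math.log1p(-p)

def pool_log_likelihood(x, m, n_ind, f, e01, e10):
    """Log-probability of observing x minor-allele reads out of m in one pool.

    Assumed model: k of the 2 * n_ind chromosomes carry the minor allele,
    k ~ Binomial(2 * n_ind, f); the unobserved k is summed out.
    """
    c = 2 * n_ind
    terms = []
    for k in range(c + 1):
        q = k / c                                   # minor-allele fraction in the pool
        p_minor = q * (1.0 - e10) + (1.0 - q) * e01  # chance a read shows the minor allele
        terms.append(log_binom_pmf(k, c, f) + log_binom_pmf(x, m, p_minor))
    top = max(terms)
    if top == -math.inf:
        return -math.inf
    # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def total_log_likelihood(counts, depths, n_ind, f, e01, e10):
    """Pools are treated as independent, so their log-likelihoods add."""
    return sum(pool_log_likelihood(x, m, n_ind, f, e01, e10)
               for x, m in zip(counts, depths))
```

Because the number of minor alleles in each pool is unobserved, the per-pool likelihood is a mixture over all possible values, which is what motivates treating the problem as one with missing data.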

Computational methods

An expectation-maximization (EM) approach for allele frequency estimation

Based on the likelihood function, an approximate solution to the maximum likelihood estimation of the parameters can be obtained using the EM algorithm. We consider the following missing data:

• _{g}

• _{gi}

• _{gi}

We also use the following notation:

Based on the above notation, the complete log-likelihood is:

Suppose that the value of Θ = (^{(t) }= (^{(t)}, ^{(t)}, ^{(t)}). The maximization (M)-step gives:

Note the expectation _{(t) }is taken when the parameters are at Θ^{(t)}.

The expectation (E)-step is formulated as follows:

and

where all the parameters in the equations are of the values taken at the

From Equations 3 and 4, we are able to obtain the recursive formula for

Next we calculate _{(t)}(_{11}|Data). Note that

which does not depend on

Similarly, we can derive the formulas for _{(t)}(_{10}|Data), _{(t)}(_{01}|Data) and _{(t)}(_{00}|Data), and the recursive formulas for
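The E- and M-steps can be illustrated with a short sketch. It is written for a simplified model assumed for illustration (binomial sampling of minor alleles into each pool, uniform sampling of reads, and two hypothetical error rates `e01` and `e10`), so it should be read as an example of the technique rather than as the exact recursion derived above. The counts `n11`, `n10`, `n01`, `n00` denote reads whose true/observed alleles are minor/minor, minor/major, major/minor and major/major; this naming is an assumption of the sketch.

```python
import math

def log_binom_pmf(x, n, p):
    """Log of the Binomial(n, p) probability of x, handling p = 0 and p = 1."""
    if p <= 0.0:
        return 0.0 if x == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if x == n else -math.inf
    log_coef = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_coef + x * math.log(p) + (n - x) * math.log1p(-p)

def e_step(x, m, c, f, e01, e10):
    """One pool: observed-data log-likelihood and expected sufficient statistics."""
    log_w, probs = [], []
    for k in range(c + 1):
        q = k / c
        p = q * (1.0 - e10) + (1.0 - q) * e01   # chance a read shows the minor allele
        probs.append((q, p))
        log_w.append(log_binom_pmf(k, c, f) + log_binom_pmf(x, m, p))
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    stats = [0.0] * 5                            # E[k], E[n11], E[n10], E[n01], E[n00]
    for k, (wk, (q, p)) in enumerate(zip(w, probs)):
        if wk == 0.0:
            continue                             # this k is impossible given the data
        post = wk / z                            # posterior probability of k
        stats[0] += post * k
        if x > 0:                                # split the observed minor reads
            stats[1] += post * x * q * (1.0 - e10) / p
            stats[3] += post * x * (1.0 - q) * e01 / p
        if m > x:                                # split the observed major reads
            stats[2] += post * (m - x) * q * e10 / (1.0 - p)
            stats[4] += post * (m - x) * (1.0 - q) * (1.0 - e01) / (1.0 - p)
    return top + math.log(z), stats

def em_pooled(counts, depths, n_ind, f=0.05, e01=0.01, e10=0.01, n_iter=100):
    """Run EM; returns (f, e01, e10, log-likelihood at the start of each iteration)."""
    c = 2 * n_ind
    trace = []
    for _ in range(n_iter):
        loglik, total = 0.0, [0.0] * 5
        for x, m in zip(counts, depths):
            ll, stats = e_step(x, m, c, f, e01, e10)
            loglik += ll
            total = [a + b for a, b in zip(total, stats)]
        trace.append(loglik)
        ek, n11, n10, n01, n00 = total
        # M-step: closed-form maximizers of the expected complete log-likelihood
        f = ek / (c * len(counts))
        if n11 + n10 > 0.0:
            e10 = n10 / (n11 + n10)
        if n01 + n00 > 0.0:
            e01 = n01 / (n01 + n00)
    return f, e01, e10, trace
```

A call such as `em_pooled([0, 3, 0, 12, 1, 0, 25, 2], [200] * 8, n_ind=10)` runs the recursion on eight hypothetical pools. Starting from small error rates is advisable, since the likelihood of this simplified model has a mirror-image solution in which the roles of the two alleles are swapped.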

SNP identification using EM

Due to sequencing errors, the observed variants may contain a significant amount of false positives,

Consider a case-control study with a group of case individuals and another group of control individuals. Let _{1 }and _{0 }be the minor allele frequencies at a locus among the cases and controls, respectively. Denote **f **= (_{0}, _{1}) and **0 **= (0, 0). We can test if an observed variant is a true SNP using the likelihood ratio test for _{0 }: _{0 }= _{1 }= 0 vs. _{1 }: _{0 }≠ 0 or _{1 }≠ 0:

where _{f }is the maximum log-likelihood of the observed data for both the cases and the controls. Note that the null hypothesis **f **= **0 **is on the boundary of the region of the parameters of interest. Therefore, the asymptotic distribution of Λ is _{0 }is the point mass at 0 and

We can also test if an observed variant is a true SNP using cases or controls separately. For the control pools, we conduct a likelihood ratio test for _{0 }: _{0 }= 0 vs. _{1 }: _{0 }> 0. Similarly, we replace _{0 }by _{1 }for the case pools. We then use the statistic

to test each hypothesis, where _{i }_{i }_{i }
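Because the null hypothesis places the allele frequency on the boundary of the parameter space, p-values are computed from a mixture of chi-square distributions rather than a single one. The sketch below uses the textbook mixtures for boundary problems: one half point mass at 0 plus one half chi-square with 1 degree of freedom for a single boundary parameter, and weights 1/4, 1/2, 1/4 on 0, 1 and 2 degrees of freedom for two independent boundary parameters. These weights are assumed here from the general theory rather than taken from the text, and the function names are illustrative.

```python
import math

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def chi2_sf_2df(stat):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

def boundary_pvalue_one_param(stat):
    """P-value under the assumed 0.5 * chi2_0 + 0.5 * chi2_1 mixture."""
    if stat <= 0.0:
        return 1.0                      # the point mass at 0 never exceeds a positive cut-off
    return 0.5 * chi2_sf_1df(stat)

def boundary_pvalue_two_params(stat):
    """P-value under the assumed 0.25 * chi2_0 + 0.5 * chi2_1 + 0.25 * chi2_2 mixture."""
    if stat <= 0.0:
        return 1.0
    return 0.5 * chi2_sf_1df(stat) + 0.25 * chi2_sf_2df(stat)
```

Ignoring the boundary and using a plain chi-square reference distribution would make the test conservative, because the mixture puts part of its mass at zero.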

Testing for associations between a SNP and a phenotype in case-control studies

We test if a SNP is associated with a phenotype of interest using the likelihood ratio test again. Here we test the alternative hypothesis _{1 }: _{1 }≠ _{0 }versus the null hypothesis _{0 }: _{1 }= _{0}. This association test is conducted by the likelihood ratio test statistic:

This statistic has an asymptotic chi-square distribution with 1 degree of freedom.
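As a small worked example of the association test, the following sketch converts two maximized log-likelihoods into a likelihood ratio statistic and a p-value from the chi-square distribution with 1 degree of freedom. The inputs `loglik_null` (cases and controls sharing one allele frequency) and `loglik_alt` (separate frequencies) are hypothetical names; how they are maximized is left to the EM procedure.

```python
import math

def association_lrt(loglik_null, loglik_alt):
    """Likelihood ratio statistic and its chi-square (1 df) p-value.

    loglik_null: maximized log-likelihood with a common allele frequency.
    loglik_alt:  maximized log-likelihood with separate case/control frequencies.
    """
    # The alternative nests the null, so the statistic is non-negative in theory;
    # clip at 0 to absorb small numerical differences between the two fits.
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    # Survival function of chi-square with 1 df, written with the error function.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For instance, a log-likelihood gain of 5 units gives a statistic of 10 and a p-value of roughly 0.0016.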

Simulation studies

We use simulations to evaluate our approaches for allele frequency estimation, SNP detection, and association testing. We consider a wide range of parameter values to see how each parameter affects the performance of our methods. These parameters include minor allele frequency (

Pooled data generation

In our simulations, we set

Since the sequencing error rate can vary from locus to locus and from one pool to another, we generate 1000 _{i}_{i}_{i}_{i}
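Pooled read counts of the kind used in these simulations could be generated along the following lines. This is a sketch under assumed simplifications: a fixed error rate per call rather than one that varies across loci and pools, binomial sampling of minor alleles into each pool, and hypothetical parameter names `e01` and `e10` for the two directions of sequencing error.

```python
import random

def simulate_pool(rng, n_ind, depth, f, e01, e10):
    """Simulate one pool; returns (minor alleles in the pool, minor-allele reads)."""
    c = 2 * n_ind                                   # chromosomes in the pool
    k = sum(rng.random() < f for _ in range(c))     # sample individuals from the population
    q = k / c
    # A read shows the minor allele if it comes from a minor chromosome and is
    # read correctly, or from a major chromosome and is misread.
    p_minor = q * (1.0 - e10) + (1.0 - q) * e01
    x = sum(rng.random() < p_minor for _ in range(depth))
    return k, x

def simulate_pools(seed, n_pools, n_ind, depth, f, e01, e10):
    """Simulate independent pools with a reproducible random stream."""
    rng = random.Random(seed)
    return [simulate_pool(rng, n_ind, depth, f, e01, e10) for _ in range(n_pools)]
```

Seeding a private `random.Random` instance keeps repeated simulation runs reproducible without touching the global random state.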

Measuring the accuracy of the allele frequency estimation

For each of the 4 × 4 × 2 × 3 × 3 = 288 combinations, we do the following:

1. In the

2. Repeat Step 1 R = 1000 times.

3. Compute the mean squared error (MSE) of

4. Compute the MSE of _{frac }=

We use both MSE and Cg to compare the accuracy of the EM algorithm with the naive approach of estimating
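The MSE computation over replicates, together with the naive estimator, can be sketched as follows. The naive estimator is written here as the pooled fraction of minor-allele reads across all pools, which is an assumption of this sketch (with equal depths it coincides with averaging the per-pool fractions). The Cg measure is not reproduced.

```python
def naive_frequency(counts, depths):
    """Fraction of observed minor-allele reads, pooled over all pools."""
    return sum(counts) / sum(depths)

def mean_squared_error(estimates, truth):
    """Average squared deviation of replicate estimates from the true value."""
    return sum((est - truth) ** 2 for est in estimates) / len(estimates)
```

For example, two replicate estimates of 0.01 and 0.03 for a true frequency of 0.02 give an MSE of 1e-4.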

Generating case-control data to study the power of SNP identification and association studies using EM

In order to evaluate the power of SNP identification using EM-SNP and test for association, we simulate case-control data as follows. When generating the control data, we assume that the minor allele frequency is _{0 }= 0.01, _{0}, and λ^{2}_{0}, respectively. In our simulations, we choose

We can use the case or control samples separately or combine them for SNP detection as in the "SNP identification using EM" subsection. For example, we consider both the cases and controls jointly. The log-likelihood ratio statistic Λ (or Λ_{i }_{γ}_{γ }_{γ}_{γ}

Similar approaches can be used to study the power of association studies using the pooling design. For details, see additional file

**Supplementary materials**. Supplementary methods and results.

A pooled sequencing data set related to type 1 diabetes

We apply our method to the pooled sequencing data set related to type 1 diabetes (T1D) in

Results

We first present our results on the effects of various parameters on the estimation accuracy of the minor allele frequency using the EM algorithm. We then present the results on the power of SNP detection and association studies. Finally, we present our results on the analysis of the data in

The effects of minor allele frequency, sequencing error rate, number of individuals in the pools and number of pools on the accuracy of allele frequency estimation

We compare our EM estimate

Comparison of

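| **MSE** | **Cg** | **MSE** | **Cg** | **MSE** | **Cg** | **MSE** | **Cg** |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 5 | 4 | 3 | 0 | 0 | 0 | 0 |
| 13 | 10 | 9 | 7 | 0 | 0 | 0 | 0 |
| 17 | 16 | 17 | 16 | 12 | 12 | 7 | 7 |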

Number of scenarios where MSE_{em }> MSE_{avg }or Cg_{em }> Cg_{avg }out of 18 total scenarios for each cell.

Figure _{start }= 1%, _{frac}, while _{frac }might be responsible for the majority of the variance of _{frac }in the lower right panel quantitatively demonstrates the superiority of

**Comparison of **. An example for the comparison of performances between

The relative errors of

We measure the bias of an estimator by the relative error (RE) defined as

Comparison of

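| **f** | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 0.1% | 52.0 | 9.4 | 102.0 | 15.6 | 502.0 | 72.0 | 1000.0 | 146.0 |
| 0.5% | 10.3 | 4.0 | 20.2 | 4.9 | 99.5 | 13.3 | 199.0 | 26.5 |
| 1% | 5.0 | 3.3 | 10.0 | 3.7 | 49.2 | 5.7 | 98.3 | 9.5 |
| 5% | 1.0 | 5.4 | 1.9 | 6.0 | 9.1 | 6.7 | 18.1 | 6.3 |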

The average RE of
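A back-of-the-envelope calculation shows why the naive estimator is biased upward for rare variants. The sketch assumes a single symmetric error rate `alpha` (each allele is misread as the other with the same probability) and defines a signed relative bias as (expected estimate minus truth) divided by truth; both the symmetric-error assumption and this exact definition are illustrative choices, and the numerical values used below are hypothetical.

```python
def naive_expected_value(f, alpha):
    """Expected fraction of minor-allele reads under a symmetric error rate alpha."""
    # True minor alleles read correctly, plus true major alleles misread as minor.
    return f * (1.0 - alpha) + (1.0 - f) * alpha

def naive_relative_bias(f, alpha):
    """Signed relative bias of the naive estimator: (E[estimate] - f) / f."""
    return (naive_expected_value(f, alpha) - f) / f
```

The bias equals `alpha * (1 - 2 * f) / f`, so it grows as the error rate approaches or exceeds the allele frequency. With the hypothetical values `f = 0.001` and `alpha = 0.0005`, the naive estimator overshoots by about 50% on average, whereas for `f = 0.05` the same error rate contributes under 1%.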

Next we present our results for the effects of (

The effects of minor allele frequency f and sequencing error rate α on the estimation accuracy of

To study the effects of minor allele frequency _{frac}, we fix (_{frac}, rather than the algorithm itself, as shown in Figure S3 of the additional file _{frac}. Thus, _{frac }than of _{frac}, and its variance appears to be affected less by

_{frac}

| **f** | **MSE** | **Cg** | **MSE** | **Cg** | **MSE** | **Cg** | **MSE** | **Cg** |
|---|---|---|---|---|---|---|---|---|
| 0.1% | 9.8e-7 | 4.3e-8 | 9.7e-7 | 1.2e-7 | 3.2e-6 | 2.8e-6 | 1.1e-5 | 1.1e-5 |
| 0.5% | 5.5e-6 | 1.9e-7 | 5.4e-6 | 1.9e-7 | 6.7e-6 | 1.5e-6 | 1.0e-5 | 5.8e-6 |
| 1% | 1.2e-5 | 8.6e-7 | 1.2e-5 | 9.6e-7 | 1.3e-5 | 2.3e-6 | 1.6e-5 | 5.6e-6 |
| 5% | 5.5e-5 | 1.3e-5 | 5.9e-5 | 1.7e-5 | 8.3e-5 | 4.3e-5 | 1.0e-4 | 7.0e-5 |

The mean squared errors (MSE) and Cg's of _{frac }for various combinations of

To reduce the effect of a few outliers of

We also studied the effects of (

Results on the power of SNP calling using the likelihood ratio test

We next study the effects of (

**Power of SNP detection**. The power of detecting true SNPs at a type I error of 0.05, varying one parameter at a time while fixing all other parameters at default values. Default: (

It can be seen from Figure

Results on the power of association studies using the likelihood ratio test

We also study the effects of (

**Power of association**. The power of detecting associated SNPs at a type I error of 0.05. Each subplot displays the effect of one parameter and the number of reads

It can be seen from Figure

Results on the analysis of the type 1 diabetes data in

Allele frequency estimation and SNP calling in the control samples

We apply our approaches to analyze the pooled sequencing data in

Evaluation of the SNP calling results

A standard approach to evaluate the effectiveness of a SNP calling method is to compare the fraction of dbSNPs

**dbSNP ratio and Ti/Tv ratio**. dbSNP ratio and Ti/Tv (transition/transversion) ratio of the top 100 variants called by EM-SNP and SNVer, whose minor allele frequencies are less than the corresponding threshold labeled by the x-axis.

In terms of the dbSNP ratio for the top 100 called variants, EM-SNP consistently outperforms SNVer under all allele frequency thresholds, and EM-SNP displays significant superiority especially for low frequency variants. In Table S7 of the additional file _{em }

Another criterion to evaluate SNP calls is the transition-transversion (Ti/Tv) ratio. It is well known that transitions are much more frequent than transversions in evolution, and the number of transitions over the number of transversions, referred to as Ti/Tv ratio, in known SNPs is expected to be between 2 and 4 _{EM }< 0.2% by EM-SNP and SNVer. The effect of minor allele frequency on the relative performance of EM-SNP and SNVer in terms of Ti/Tv ratio is similar to that in terms of dbSNP ratio (Figure S10 in the additional file
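The Ti/Tv ratio of a call set is straightforward to compute from the reference and alternate alleles. The sketch below is illustrative; the function names and the representation of variants as `(ref, alt)` pairs are assumptions, not part of EM-SNP or SNVer.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("reference and alternate alleles must differ")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(variants):
    """Number of transitions divided by number of transversions.

    variants: iterable of (ref, alt) single-nucleotide substitutions.
    """
    variants = list(variants)
    transitions = sum(1 for ref, alt in variants if is_transition(ref, alt))
    transversions = len(variants) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions
```

Random errors would be expected to produce roughly twice as many transversions as transitions, because each base has one possible transition and two possible transversions, so a call set enriched in false positives tends to show a lower Ti/Tv ratio.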

We also consider the top 150 ranked SNPs and the corresponding figures and tables are shown as Figure S11-S12 and Tables S7-S8 in the additional file

Identifying SNPs associated with type 1 diabetes

We then study the association of the identified variants with type 1 diabetes (T1D). We first look at the common SNPs with estimated minor allele frequencies above 1% in the controls as in ^{-5}. The p-value obtained through the likelihood ratio test reflects the true p-value better because it takes the variation in estimating the allele frequency into account.

Association results

| **SNP** | **Gene** | | **n_{0}** | | **n_{1}** | **Fisher's p-value** | **EM p-value** |
|---|---|---|---|---|---|---|---|
| rs3184504 | SH2B3 | 0.52 | 499 | 0.41 | 394 | 1.9e-6 | 8.4e-7 |
| rs7076103 | IL2RA | 0.19 | 178 | 0.10 | 93 | 4.5e-8 | 2.7e-7 |
| rs2476601 | PTPN22 | 0.09 | 86 | 0.16 | 151 | 8.1e-6 | 9.2e-6 |

Testing for association between common SNPs

The SNP rs3184504 residing within gene SH2B3 has an EM p-value of 8.4

For rare polymorphisms

Discussion

In this paper, we developed an EM algorithm based unified approach for minor allele frequency estimation, SNP calling and association studies, applicable to pooled sequencing data where genetic materials of multiple individuals are pooled together. This study differs from previous studies in that we estimate sequencing error rate for each position while previous studies generally assume a pre-specified sequencing error rate across all sequenced regions. Since sequencing error rate depends on the genomic context, it is essential that the sequencing error rate be estimated specifically for different loci. In a pooling design without tagging, the origin of the reads is not known, and it is impossible to obtain the individual genotypes from the pooled data. Therefore, we modelled the pooled sequencing data as a "missing value" problem and designed an EM algorithm to estimate the minor allele frequency and sequencing error rate.

We first studied the effects of minor allele frequency, sequencing error rate, number of pools, number of individuals in each pool, and the sequencing depth in each pool on the estimation accuracy of the minor allele frequency. We showed that the naive approach, which estimates the minor allele frequency by the fraction of observed minor alleles in the reads, can significantly over-estimate the true minor allele frequency, and that the effect is most severe for rare variants. The EM-based algorithm, on the other hand, estimates the minor allele frequency in a relatively unbiased manner. Although the variance of this estimate appears relatively large, a major part of it comes from the sampling of individuals from the population rather than from the algorithm itself. We also showed that the estimation accuracy of the EM algorithm increases with the number of pools and the sequencing depth, as expected. However, the estimation accuracy decreases with the number of individuals in each pool, most likely because more extensive pooling induces a greater loss of information. Secondly, we used a likelihood ratio statistic based on the estimated parameters from EM to call SNPs. With the real data from

We made several simplifying assumptions in our study. First and foremost, we did not consider errors introduced by mapping the reads to the reference genome. Mapping Roche 454 reads remains challenging, particularly in regions around homopolymers, and further development of mapping algorithms is needed. Secondly, we assumed that each individual contributes the same amount of genetic material to its pool, an assumption that can be violated in practice. To overcome this problem, one approach would be to assume that the fractions of genetic material from individuals follow a Dirichlet distribution

Software

Software can be downloaded from

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Both authors participated in the development of methodology, simulations, real data application, revisions, and manuscript preparation. Both authors read and approved the final manuscript.

Declarations

The publication costs for this article were funded by US NIH 1 U01 HL108634.

This article has been published as part of

Acknowledgements

This research was supported by National Institutes of Health (P50HG002790 and 1 U01 HL108634). Q Chen was partially supported by the Viterbi Fellowship. F.S. is also supported by National Natural Science Foundation of China (60928007 and 60805010) and Tsinghua National Laboratory for Information Science and Technology (TNLIST) Cross-discipline Foundation.