Departments of Integrative Biology and Statistics, UC Berkeley, Berkeley CA 94720, USA

Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark

Beijing Genomics Institute, Shenzhen 518083, China

Department of Biology, University of Copenhagen, Copenhagen, Denmark

Beijing Institute of Genomics, Chinese Academy of Science, Beijing 101300, China

The Graduate University of Chinese Academy of Sciences, Beijing 100062, China

Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Hagedorn Research Institute, Copenhagen, Denmark

Steno Diabetes Center, Gentofte, Denmark

Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

Research Centre for Prevention and Health, Glostrup University Hospital, Glostrup, Denmark

Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark

Faculty of Health Sciences, University of Aarhus, Aarhus, Denmark

Institute of Biomedical Sciences, University of Copenhagen, Copenhagen, Denmark

Abstract

Background

Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost effective approach is to use medium or low-coverage data (e.g., < 15

Results

We evaluate a new maximum likelihood method for estimating allele frequencies in low and medium coverage next-generation sequencing data. The method is based on integrating over uncertainty in the data for each individual rather than first calling genotypes. This method can be applied to directly test for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) accuracy of the estimation of the distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data.

Conclusions

Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.

Background

The frequency of an allele in the population is a fundamental quantity in human statistical genetics. This quantity forms the basis of many population and medical genetic studies. Many evolutionary forces change allele frequencies. Consequently, allele frequencies can be used to infer past evolutionary events. For example, allele frequencies at single nucleotide polymorphisms (SNPs) can be used to infer the demographic history of a population _{ST }

Given the importance of allele frequencies in genetic studies, it is critically important to be able to estimate them reliably. Traditionally, allele frequencies were simply estimated by counting the number of times each allele had been seen in a sample from the population. This approach was often successfully used on SNP genotype data and Sanger sequencing data because the genotypes for each individual could often be unambiguously determined. However, this approach may fail when applied to data from next-generation sequencing technology. First, next-generation sequencing data has a higher error rate than traditional Sanger sequencing or SNP genotyping assays

Several different approaches have been proposed to attempt to make accurate inferences of allele frequency from next-generation sequencing technologies

Here we discuss the properties of a new likelihood approach designed to estimate the population minor allele frequency from next-generation sequencing data. We show that the new likelihood method can obtain accurate estimates of allele frequencies, even when the depth of coverage is quite shallow. Further, we show that the new likelihood method either performs as well as, or better than, genotype calling methods. Finally, we discuss the performance of the likelihood approach in testing for differences in allele frequency between cases and controls.

Results

The minor allele is the less frequent allele in the population at a variable site. We first describe two main approaches to estimate the minor allele frequency (MAF) at a particular site in the genome. The first approach involves inferring individual genotypes and treating those inferred genotypes as being completely accurate when estimating the MAF. We then examine the performance of a likelihood framework that directly takes the uncertainty in assigning genotypes into account. Throughout our work, we assume that all segregating sites are biallelic.

Estimation of MAF from called genotypes

One way to estimate the MAF from next-generation sequencing data is to first call a genotype for each individual using sequencing data, and then use those genotypes as if they are the true ones. This was the approach traditionally used for genotype data and Sanger sequencing data. It is not clear how well it will perform when applied to next-generation sequencing data.

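A maximum likelihood approach can be used to infer the genotype for each individual from the next-generation sequencing data. At each site $j$, for each individual $i$, the likelihood of each of the three possible genotypes is computed from the observed data:

$$\mathcal{L}(g_{i,j}) = P(D_{i,j} \mid g_{i,j}), \qquad g_{i,j} \in \{0, 1, 2\} \qquad (1)$$

where $D_{i,j}$ denotes the sequencing data observed for individual $i$ at site $j$, and $g_{i,j}$ is the number of copies of the minor allele carried by individual $i$ at site $j$.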

To assign a genotype to a particular individual, the likelihood of each of the three possible genotypes can be calculated for the individual. The genotype with the highest likelihood can then be assigned. However, researchers often prefer a more stringent calling criterion and will not assign a genotype to an individual unless the most likely genotype is substantially more likely than the second most likely one. Here the three possible genotypes are sorted by their likelihoods, so that $g_{(k)}$ corresponds to the genotype with the $k$-th largest likelihood. The genotype $g_{(1)}$ is called only if its likelihood exceeds that of $g_{(2)}$ by a chosen threshold; otherwise, the genotype is treated as missing.
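The calling rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `min_log_ratio` parameter are hypothetical, and the threshold is expressed as a difference of natural-log likelihoods between the best and second-best genotype.

```python
def call_genotype(log_likelihoods, min_log_ratio=0.0):
    """Call a genotype (0, 1, or 2 minor alleles) from three log-likelihoods.

    log_likelihoods: log P(data | genotype) for genotypes 0, 1, 2.
    min_log_ratio:   hypothetical filter; the best genotype must beat the
                     runner-up by at least this much (natural-log scale).
                     A value of 0.0 corresponds to calling without filtering.
    Returns the called genotype, or None when the call is filtered out.
    """
    ranked = sorted(range(3), key=lambda g: log_likelihoods[g], reverse=True)
    best, second = ranked[0], ranked[1]
    if log_likelihoods[best] - log_likelihoods[second] < min_log_ratio:
        return None  # low-confidence call, treated as missing data
    return best


def maf_from_calls(calls):
    """Estimate the MAF by allele counting over the non-missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return float("nan")
    return sum(observed) / (2.0 * len(observed))
```

With `min_log_ratio=0.0` every individual receives a call (the "Call NF" style of analysis); a positive value mimics filtering ("Call F"), at the cost of discarding individuals whose data are ambiguous.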

Maximum likelihood estimator of allele frequency

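Instead of estimating the MAF from the called genotypes, a maximum likelihood (ML) method introduced by Kim et al. estimates the MAF directly from the genotype likelihoods, taking the uncertainty in the genotypes into account.

Suppose that the three genotype likelihoods defined in Equation 1 are available. Using the same notation as above, let $D_j$ denote the sequencing data at site $j$ across all $N$ individuals, and let $p_j$ denote the MAF at that site. Assuming independence across individuals and HWE, the likelihood of $p_j$ is obtained by summing over the three possible genotypes of each individual:

$$\mathcal{L}(p_j) = P(D_j \mid p_j) = \prod_{i=1}^{N} \sum_{g=0}^{2} P(D_{i,j} \mid g_{i,j} = g)\, P(g_{i,j} = g \mid p_j) \qquad (2)$$

where $D_{i,j}$ is the data for individual $i$, $g_{i,j}$ is that individual's number of minor alleles, and $P(g_{i,j} = g \mid p_j)$ equals $(1-p_j)^2$, $2p_j(1-p_j)$, and $p_j^2$ for $g = 0$, 1, and 2, respectively.

The ML estimate of $p_j$ is the value that maximizes this likelihood.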
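As an illustration of how such an estimate can be obtained numerically, the sketch below uses an expectation-maximization (EM) iteration. The choice of EM is an assumption made for this example (the surviving text does not state which optimizer was used), and the function name `em_maf` is hypothetical. The model assumed is the one described in the text: independent individuals, HWE genotype priors, and three genotype likelihoods per individual.

```python
def em_maf(genotype_likelihoods, tol=1e-8, max_iter=1000):
    """Estimate the MAF at one site from per-individual genotype likelihoods.

    genotype_likelihoods: list of (L0, L1, L2) tuples, where Lg is
        P(data | genotype with g minor alleles) for one individual.
    Returns the estimate of p reached by EM under HWE.
    """
    n = len(genotype_likelihoods)
    p = 0.25  # arbitrary interior starting value (assumption)
    for _ in range(max_iter):
        expected_minor = 0.0
        for l0, l1, l2 in genotype_likelihoods:
            # E-step: posterior genotype weights under the HWE prior
            w0 = l0 * (1.0 - p) ** 2
            w1 = l1 * 2.0 * p * (1.0 - p)
            w2 = l2 * p ** 2
            expected_minor += (w1 + 2.0 * w2) / (w0 + w1 + w2)
        # M-step: expected number of minor alleles over 2N chromosomes
        new_p = expected_minor / (2.0 * n)
        if abs(new_p - p) < tol:
            return new_p
        p = new_p
    return p
```

When every genotype is known with certainty, the iteration reduces to simple allele counting; when an individual has no informative reads (equal likelihoods for all genotypes), that individual contributes only the prior and does not pull the estimate in either direction.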

Maximum likelihood estimator with uncertain minor allele

In practice, often the second most common nucleotide across individuals can be used as the minor allele. However, for rare SNPs (e.g., MAF < 1%), it is hard to determine which allele is the minor allele, since all four nucleotides may appear in some reads due to sequencing errors. To deal with this situation, we now describe a likelihood framework that takes the uncertainty in the determination of the minor allele into account.

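Suppose that for site $j$ the major allele $M$ is known, so that the three remaining nucleotides, $m_1$, $m_2$, and $m_3$, are the candidate minor alleles. The likelihood introduced in Equation 2 assumes a fixed major allele $M$ and a fixed minor allele $m$; we write it as $\mathcal{L}(p_j; M, m)$, where $p_j$ is the MAF at site $j$.

Further, assuming that any of the three possible minor alleles is equally likely, we obtain:

$$\mathcal{L}(p_j) = \frac{1}{3} \sum_{k=1}^{3} \mathcal{L}(p_j; M, m_k) \qquad (3)$$

For numerical stability, the sum is evaluated on the log scale. Let $l_k = \log \mathcal{L}(p_j; M, m_k)$, and sort these values as ($l_{(1)}$, $l_{(2)}$, $l_{(3)}$), where $l_{(1)}$ is the largest one. Then,

$$\log \mathcal{L}(p_j) = l_{(1)} + \log\left(1 + e^{\,l_{(2)} - l_{(1)}} + e^{\,l_{(3)} - l_{(1)}}\right) - \log 3.$$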
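Averaging a likelihood over the three candidate minor alleles is usually done on the log scale, factoring out the largest term so that the exponentials cannot underflow. The sketch below shows this standard log-sum-exp computation; the function name is hypothetical, and the inputs are assumed to be the three log-likelihoods obtained by treating each candidate minor allele as fixed.

```python
import math


def log_likelihood_unknown_minor(log_liks):
    """Log of the average likelihood over equally likely minor alleles.

    log_liks: log-likelihoods, one per candidate minor allele (three in
              the biallelic setting with a known major allele).
    """
    ordered = sorted(log_liks, reverse=True)
    largest = ordered[0]
    if largest == float("-inf"):
        return float("-inf")  # every candidate has zero likelihood
    # Factor out the largest term; the remaining exponents are all <= 0.
    scaled_sum = sum(math.exp(value - largest) for value in ordered)
    return largest + math.log(scaled_sum) - math.log(len(ordered))
```

Computing `math.log(sum(math.exp(v) for v in log_liks))` directly would fail for sites with many individuals, because each `math.exp` of a very negative log-likelihood evaluates to zero.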

In association studies, SNPs showing significant differences in allele frequency between cases and controls are said to be associated with the phenotype of interest. Association mapping can be performed using data from next-generation sequencing studies. We first discuss approaches that call individual genotypes and then test for association using the called genotypes. In this approach, a genotype is first called for each individual. The genotypes can be filtered or unfiltered. Assuming independence across individuals and HWE, a 2 × 2 contingency table can be built by counting the number of major and minor alleles in both the cases and controls. This leads to the well-known likelihood ratio test for independence, the $G$-test:

$$G = 2 \sum_{k,h} O_{k,h} \ln\left(\frac{O_{k,h}}{E_{k,h}}\right)$$

where $O_{k,h}$ is the observed count in cell $(k,h)$ of the table and $E_{k,h}$ is the count expected under the null hypothesis of independence. Under the null hypothesis, the statistic is asymptotically chi-square distributed with one degree of freedom (χ^{2}(1)). However, in our studies, we construct the
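For reference, the likelihood ratio test for independence on a 2 × 2 table of allele counts can be computed as in the sketch below. This is the textbook $G$-test written under the assumption that the table holds minor and major allele counts in cases and controls; the function names are hypothetical. The p-value uses the identity that the upper tail of a χ^{2}(1) distribution at $x$ equals $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def g_statistic(case_minor, case_major, control_minor, control_major):
    """G statistic for a 2 x 2 table of allele counts."""
    observed = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][h] + observed[1][h] for h in range(2)]
    total = sum(row_totals)
    g = 0.0
    for k in range(2):
        for h in range(2):
            o = observed[k][h]
            if o > 0:  # a cell with zero count contributes nothing
                expected = row_totals[k] * col_totals[h] / total
                g += 2.0 * o * math.log(o / expected)
    return g


def chi2_1_pvalue(statistic):
    """Upper-tail probability of a chi-square(1) variable."""
    return math.erfc(math.sqrt(statistic / 2.0))
```

Note that this asymptotic p-value is only appropriate when the counts in the table are accurate; the text argues that this assumption breaks down when the counts come from genotypes called at low coverage.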

Likelihood ratio test accounting for uncertainty in the observed genotypes for association mapping

Instead of calling genotypes, the likelihood framework allows for uncertainty in the genotypes and tests at each site $j$ the null hypothesis $H_O: p_{j,1} = p_{j,2} (= p_{j,0})$ against the alternative $H_A: p_{j,1} \neq p_{j,2}$, where $p_{j,1}$ and $p_{j,2}$ are the MAFs in cases and controls, respectively.

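Assuming that the minor ($m$) and major ($M$) alleles are known, the likelihood ratio test statistic is computed as:

$$LRT = -2 \log\left(\frac{\mathcal{L}(\hat{p}_{j,0})}{\mathcal{L}_{1}(\hat{p}_{j,1})\, \mathcal{L}_{2}(\hat{p}_{j,2})}\right)$$

where $\mathcal{L}$, $\mathcal{L}_{1}$, and $\mathcal{L}_{2}$ denote the likelihood of Equation 2 evaluated on all individuals, on cases only, and on controls only, and $\hat{p}_{j,0}$, $\hat{p}_{j,1}$, and $\hat{p}_{j,2}$ are the corresponding ML estimates of the MAF.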

If the minor allele is unknown, the likelihood under the null hypothesis is computed as in Equation 3, and the

where _{j }
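A likelihood ratio test of equal MAF in cases and controls, for the case where the minor allele is known, can be sketched as below. This is an illustrative implementation under stated assumptions rather than the authors' code: genotype likelihoods are given as (L0, L1, L2) per individual, genotypes follow HWE within each group, and the ML estimates are found with a fixed number of EM iterations. All function names are hypothetical.

```python
import math


def log_lik(p, gls):
    """Log-likelihood of MAF p given per-individual genotype likelihoods."""
    total = 0.0
    for l0, l1, l2 in gls:
        total += math.log(
            l0 * (1.0 - p) ** 2 + l1 * 2.0 * p * (1.0 - p) + l2 * p ** 2
        )
    return total


def ml_maf(gls, iterations=200):
    """ML estimate of the MAF by EM (fixed iteration count for brevity)."""
    p = 0.25
    for _ in range(iterations):
        expected_minor = 0.0
        for l0, l1, l2 in gls:
            w0 = l0 * (1.0 - p) ** 2
            w1 = l1 * 2.0 * p * (1.0 - p)
            w2 = l2 * p ** 2
            expected_minor += (w1 + 2.0 * w2) / (w0 + w1 + w2)
        p = expected_minor / (2.0 * len(gls))
    return p


def lrt_statistic(case_gls, control_gls):
    """Twice the log-likelihood gain from separate case/control MAFs."""
    pooled = case_gls + control_gls
    null_ll = log_lik(ml_maf(pooled), pooled)
    alt_ll = log_lik(ml_maf(case_gls), case_gls) + log_lik(
        ml_maf(control_gls), control_gls
    )
    return 2.0 * (alt_ll - null_ll)
```

The statistic is compared to a χ^{2}(1) distribution because the alternative has one more free parameter than the null. The sketch assumes strictly positive genotype likelihoods; inputs containing exact zeros would need guarding against taking the logarithm of zero.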

Estimating MAF in simulated data

We compare the estimates of allele frequency on simulated data using true genotypes (True), called genotypes without any filtering (Call NF), called genotypes with filtering (

**Boxplot of estimated MAFs using ML methods with known or unknown minor allele**. Boxplot of estimated MAFs of SNPs corresponding to each sample allele frequency. Assuming 1,000 individuals, 1,000 SNPs with true MAF of 0.5% were simulated at individual sequencing depth of 8

We first evaluated how well the different approaches were able to estimate the MAF in 200 individuals across a range of sequencing depths for 1,000 SNPs with a true MAF of 5%. Figure

**Estimates of allele frequency at sites with a true MAF of 5% for different depths of coverage**. At each depth, 1,000 sites were simulated using 200 individuals, and at each site, an estimate of allele frequency is computed using: (1) true genotypes (True); (2) called genotypes without filtering (Call NF); (3) called genotypes with filtering (Call F); and (4) the maximum likelihood method (ML). For more details of the estimation methods, see Methods.

The results are dramatically different for the new ML method. This method provides unbiased estimates of the MAF (median of ~4.9%) across a range of depths. Even at 2×, the estimates show only a slightly larger variance than those based on the true genotypes.

We also compared the estimated mean squared error (MSE; Expectation (

**Mean squared error (MSE; Expected $(\hat{p} - p)^{2}$, where $\hat{p}$ is the estimated and $p$ the true MAF)**. At each depth, MSE was computed from the allele frequency estimates made using four different methods: True, Call NF, Call F, and ML (for details of the methods, see the caption of Figure 1).

Overall, the new ML method outperforms genotype calling methods.

Estimating a distribution of MAFs from simulated data

We next examined how the different estimation approaches performed in estimating the proportion of SNPs at different frequencies in the population (similar to the site frequency spectrum but based on population allele frequency instead of sample frequency). Here we simulated 20,000 SNPs where the distribution of the true MAFs followed the standard stationary distribution for an effective population size of 10,000 (see Methods). Note that in practice, however, it is very difficult to distinguish a very rare SNP from a sequencing error. Therefore, for comparison purposes with real data, we discarded SNPs with an estimated MAF of less than 2%. Figure

**Distribution of allele frequencies of SNPs simulated assuming the standard stationary distribution of allele frequencies**. At each depth (each panel), 20,000 SNPs were simulated, and for each SNP, estimates of the MAF were obtained using four different methods (see the caption of Figure 1). Then, for each method (each color), only sites with estimated allele frequencies > 2% are used to generate each histogram (x-axis).

As expected, with a high depth of coverage, such as 10× per individual, all methods provide estimated MAF distributions that are similar to the expected distribution based on the true genotypes (Figure

The picture is entirely different for the ML method. The estimated MAF distribution obtained from the new ML method closely follows the true distribution even with shallow depths of coverage. Here there is almost no excess of low-frequency SNPs. At a depth of 4×, the proportion of SNPs in the second bin of the histogram is 18.4%, which is very close to the expected proportion (18%). Thus, more reliable estimates of the frequency spectrum can be made from low-coverage data by using our likelihood approach than by using the genotype calling approaches.

Association mapping in simulated data

We compare the performance of methods that treat inferred genotypes as true genotypes in tests of association (using a

With reasonably large sample sizes, standard asymptotic theory suggests that under the null hypothesis both the ^{2}(1)). Therefore, we have compared the null distribution of the ^{2}(1) distribution using QQ-plots (Figure ^{2}(1) distribution. However, the distribution of the ^{2}(1) distribution. Calling genotypes and then treating those genotypes as being accurate produces a vast excess of false-positive signals if the ^{2}(1) distribution. For example, at a depth of 2×, 11% of the SNPs had a ^{2}(1) distribution for either 2× or 5× depths of coverage.

**QQ-plots comparing the null distribution of the test statistic of interest with a χ^{2}(1) distribution**. Each column corresponds to a different test statistic: (1)

**QQ-plot comparing the null distribution of the Armitage trend test statistic with a χ^{2}(1) distribution**. QQ-plots comparing the null distribution of the test statistic of interest with a

We also generated receiver operating characteristic (ROC) curves for each of the different association tests. These curves show the power of the test at different false-positive rates. Since the distributions of some of the test statistics do not follow the χ^{2}(1) distribution under the null hypothesis, to make a fair comparison, we obtained the critical value for each false positive rate based on the empirical null distribution. The power is computed as the fraction of simulated disease loci that have a statistic exceeding the critical value. Overall, we find that the LRT

**Receiver operating characteristic (ROC) curves of four tests of association**. For the definition of the four statistics, see the caption of Figure 4. Assuming 500 cases and 500 controls, a set of 20,000 sites were simulated under the null and under the alternative at individual sequencing depths of 2×, 5×, and 10× (three columns). At each false positive rate (x-axis), the corresponding critical value was computed using the empirical null distribution. The true positive rate (power; y-axis) was obtained by computing the fraction of causative sites with test statistics that exceed the critical value.
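The procedure described in this caption, in which critical values come from the empirical null distribution rather than from χ^{2}(1), can be sketched as follows. The function names are hypothetical, and the quantile convention (the critical value is the null statistic with exactly the allowed number of larger null statistics above it) is one reasonable choice among several.

```python
def critical_value(null_stats, false_positive_rate):
    """Empirical critical value for a given false positive rate.

    Chosen so that, barring ties, a fraction `false_positive_rate` of
    the null statistics is strictly larger than the returned value.
    """
    ordered = sorted(null_stats, reverse=True)
    k = min(int(false_positive_rate * len(ordered)), len(ordered) - 1)
    return ordered[k]


def empirical_power(null_stats, alt_stats, false_positive_rate):
    """Fraction of causative sites whose statistic exceeds the critical value."""
    threshold = critical_value(null_stats, false_positive_rate)
    return sum(1 for s in alt_stats if s > threshold) / len(alt_stats)


def roc_curve(null_stats, alt_stats, rates):
    """List of (false positive rate, power) points."""
    return [(r, empirical_power(null_stats, alt_stats, r)) for r in rates]
```

Calibrating against the empirical null puts all tests on an equal footing: a statistic with an inflated null distribution is not credited with extra power simply because it produces larger values everywhere.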

**Receiver operating characteristic curve of the Armitage trend test**. Receiver operating characteristic (ROC) curves of four tests of association. For the definition of the four statistics, see the caption of Additional file

Application to real data

We analyzed 200 exomes from controls for a disease association study that have been sequenced using Illumina technology at a per-individual depth of 8×

First, we explored the accuracy of the estimates of the MAF from next-generation sequencing data for 50 SNPs by comparing them to the estimated MAFs from Sequenom genotype data. Both the estimates using the ML method and the genotype calling method without filtering are highly correlated with the estimates made from the Sequenom genotype data (i.e., a small standardized difference between the two estimates in Figure

**Estimates of allele frequency computed from 200 individuals using next-generation sequencing data vs. Sequenom genotype data**. At each site, only individuals that have both Sequenom genotype data and sequencing data were used for estimation of allele frequency. For the sequencing data, estimates of MAF were obtained using three different methods (Call NF; Call F; and ML). The standardized difference for each estimate was computed as

We next examined the distribution of MAFs computed using several approaches across a range of sequencing depths from our next-generation exome sequencing data (Figure ^{-5 }using a rank-sum-test

**Distribution of the minor allele frequency estimated from the exomes of 200 sequenced individuals**. For each site, the minor allele frequency was estimated using four different methods: (1) the ML method with unknown minor allele, (2) the ML method with a known or fixed minor allele, (3) calling genotypes without filtering (Call NF), and (4) calling genotypes with filtering (Call F). Each site is classified into bins based on the depth of coverage. Furthermore, in each histogram, sites with estimated MAF less than 2% are not considered. For the number of SNPs that were used for this analysis, see Table 1.

Table 1. Number of SNPs with estimated MAF larger than 2% using a particular method (row) within each bin (column) defined by average sequence depth across individuals.

| Method | 0.5×-3× | 3×-6× | 6×-9× | 9×-12× | 12×-25× |
|---|---|---|---|---|---|
| ML (unknown) | 18324 | 12564 | 9102 | 6778 | 11862 |
| ML (known) | 15282 | 11482 | 8742 | 6651 | 11810 |
| Call NF | 123546 | 63415 | 19516 | 9695 | 13035 |
| Call F | 391488 | 21511 | 10018 | 7145 | 12026 |

Finally, we used this exome-resequencing data to simulate a case-control association study. To examine the distribution of the association test statistics under the null hypothesis, we randomly assigned 100 individuals to a case group and the other 100 to the control group. For all SNPs on chromosome 2 with MAF estimates > 2% (based on the unknown minor allele ML method), we tested for allele frequency differences between cases and controls by computing the ^{2}(1) distribution. As seen in simulation studies, the null distribution of the ^{2}(1) distribution. However, the null distribution of the ^{2}(1) distribution. The inflation factor

**QQ-plots comparing the association test statistics for allele frequency differences between 100 cases and 100 controls to a χ^{2}(1) distribution**. Phenotypes were randomly assigned to individuals in the exome resequencing dataset such that there are 100 cases and 100 controls. For each site, three statistics were computed: the

Discussion

The likelihood method discussed here is an extension of our previous approach

Though not surprising, it is important to note that with higher sequencing coverage, the particular approach used to estimate allele frequencies does not matter as much. For depths of coverage > 10×, the genotype calling methods both with and without filtering behave appropriately and similarly to the ML approach. Thus, with high depths of coverage, the traditional and simple method of calling genotypes and then treating those genotypes as being known with certainty is still effective. The reason for this is that with such high depth, the called genotypes are likely to be accurate. With lower depths of coverage, however, there is considerable uncertainty regarding the true genotype. Often the most-likely genotype will not be the true genotype, leading to biases in estimates of allele frequency and spurious signals of association in case-control studies. In this situation, the ML method is a superior approach.

In our simulations, we compared the performance of our ML approach to a relatively simple genotype calling approach (see Methods). It is possible that more sophisticated genotype calling approaches such as SOAPsnp

We have explored whether it is better to call genotypes with filtering or without filtering when analyzing low-coverage data. Intuitively, one would expect that if there was uncertainty in the genotypes, it would be better to call genotypes only if one was very confident in that genotype and treat the other less confident genotypes as missing data. However, as discussed by Johnson et al.

Studies have suggested that genotype calling approaches that use LD information to call genotypes

As currently implemented our method does not tackle the problem of SNP calling itself. In principle, our approach could be extended to use a LRT to call SNPs. Specifically, the test could compare the probability of the data under the hypothesis that there is no SNP at a given site (_{0 }: _{A }

Finally, our likelihood method has some limitations. It cannot estimate the frequencies of very rare alleles from low-coverage data. This is not so much a deficiency with the likelihood approach, but instead, speaks to the difficulty in detecting very rare variants using little data in a background of sequencing errors. To reliably detect and estimate the frequencies of rare variants with < 1% frequency, higher-coverage sequencing will be required. However, approaches that take genotype uncertainty into account may still be important. As shown by Garner

Conclusions

We have evaluated the performance of a likelihood method and genotype calling methods to estimate the minor allele frequency from next-generation sequencing data. The likelihood method accurately estimates allele frequencies even when applied to low-coverage data (e.g., < 4× per individual) since it models the uncertainty in assigning individual genotypes. In contrast, genotype calling approaches can lead to biased inferences when applied to low-coverage data. We have also extended the likelihood approach to test for differences in estimated minor allele frequency between cases and controls. Through simulations and the analysis of exomes from 200 individuals, we have shown that our LRT has appropriate false-positive rates and higher power than genotype calling approaches when analyzing low-coverage data. Finally, we have shown that under certain circumstances, if one uses genotype calling approaches, it is better not to filter genotypes based on the call confidence score.

Methods

Simulation studies

We performed extensive simulation studies to compare the performance of the likelihood methods with methods based on called genotypes. Specifically, we simulated data to assess (1) the accuracy of the estimates of the MAF, (2) the accuracy in estimating the distribution of MAFs across the genome, and (3) statistical power in association mapping studies. Due to computational constraints, we simulated SNPs in the sequencing data directly rather than simulating raw sequencing reads.

We simulated SNPs with a specified MAF, number of individuals and per-individual sequencing depth. When simulating causal SNPs in association studies, MAFs for cases and controls were assigned using a multiplicative disease model. For this model, the prevalence of the disease was fixed at 10%. We examined two sets of MAFs and relative risks. First, the combined MAF in cases and controls was 1% and the relative risk was 2. Second, the combined MAF was 5% and the relative risk was 1.5. As an example, with a combined MAF of 1% and a relative risk of 2.0, the resulting MAFs for cases and controls are 1.98% and 0.89%, respectively. Each individual genotype was simulated assuming Hardy-Weinberg equilibrium with the given MAF. Read bases were then generated for each individual by copying each allele a Poisson-distributed number of times with mean equal to half the specified individual depth. Each read base then may have been altered to one of the other three nucleotides at a specified type-specific sequencing error rate. The type-specific error rates used to simulate the data were estimated (see below) from 200 exomes sequenced using the Illumina platform
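The case and control MAFs quoted in the example above can be reproduced with the short calculation below. It is a sketch under two assumptions that are not spelled out in the text: the "combined MAF" is interpreted as the population MAF, and the multiplicative model is taken to mean that each copy of the risk allele multiplies disease risk by the relative risk. The function name is hypothetical.

```python
def case_control_mafs(population_maf, relative_risk, prevalence):
    """Allele frequencies in cases and controls under a multiplicative model.

    With genotype penetrances f0, f0*RR, f0*RR**2 and HWE in the
    population, cases are again in HWE with allele frequency
    p*RR / (1 + p*(RR - 1)).
    """
    p, rr, k = population_maf, relative_risk, prevalence
    p_case = p * rr / (1.0 + p * (rr - 1.0))
    # The population frequency is a prevalence-weighted mixture of the
    # case and control frequencies: p = k*p_case + (1 - k)*p_control.
    p_control = (p - k * p_case) / (1.0 - k)
    return p_case, p_control


p_case, p_control = case_control_mafs(0.01, 2.0, 0.10)
print(f"cases: {100 * p_case:.2f}%  controls: {100 * p_control:.2f}%")
# prints: cases: 1.98%  controls: 0.89%
```

That the output matches the 1.98% and 0.89% stated in the text supports this reading of the model, although the original simulation code may parameterize it differently.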

We also evaluated the performance of the approaches to estimate a distribution of MAFs when the specified allele frequencies were drawn from the stationary distribution under a Wright-Fisher model with a population size of 10,000. Under such a model, the density of population allele frequencies is proportional to 1/x, where x is the frequency of the allele in the population
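Drawing frequencies from a density proportional to 1/x on a bounded interval can be done by inverting the cumulative distribution function, as in the sketch below. The interval bounds are hypothetical parameters chosen for illustration (the text does not state the range used), and the function name is an assumption.

```python
import random


def sample_allele_frequency(rng, lower, upper):
    """Draw x in [lower, upper) with density proportional to 1/x.

    The CDF is F(x) = ln(x / lower) / ln(upper / lower), so inverting
    F at a uniform draw u gives x = lower * (upper / lower) ** u.
    """
    u = rng.random()
    return lower * (upper / lower) ** u


# Usage example with illustrative bounds (assumptions, not from the text).
rng = random.Random(2011)
frequencies = [sample_allele_frequency(rng, 0.001, 0.5) for _ in range(20000)]
```

Because the density is 1/x, the draws are uniform on the log scale: half of them fall below the geometric mean of the two bounds, which is why rare variants dominate the simulated spectrum.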

Various methods have been proposed to compute genotype likelihoods from next-generation sequencing data, which recalibrate quality scores of read bases and attempt to correct for sequencing error structures and other complexities in the genome

The data consist of the observed counts of each of the four nucleotides (_{i}
_{j}
_{i}
_{j}
_{i}
_{j}
_{i}
_{j}
_{b}
_{b' }

where _{i}
_{j}
_{i}
_{j}
_{i}
_{j}
_{i}
_{j}
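One way to compute a genotype likelihood from nucleotide counts and type-specific error rates is sketched below. This is an assumption-laden illustration, not a reconstruction of the authors' formula: it assumes each read is drawn from either of the individual's two alleles with probability 1/2, that read bases are independent, and that `error[b][b2]` gives the probability of observing base `b2` when the true base is `b`. The multinomial coefficient is omitted because it is the same for every genotype. All names are hypothetical.

```python
import math

BASES = "ACGT"


def genotype_log_likelihood(counts, allele1, allele2, error):
    """Log-likelihood of the genotype (allele1, allele2) given base counts.

    counts: dict mapping observed base -> number of reads showing it.
    error:  nested dict, error[true_base][observed_base] = probability;
            each row is assumed to sum to 1.
    """
    log_lik = 0.0
    for observed in BASES:
        n = counts.get(observed, 0)
        if n:
            prob = 0.5 * error[allele1][observed] + 0.5 * error[allele2][observed]
            log_lik += n * math.log(prob)
    return log_lik


def uniform_error_matrix(rate):
    """Hypothetical helper: total error `rate` split evenly over the
    three possible substitutions, instead of estimated type-specific rates."""
    return {
        b: {b2: (1.0 - rate if b == b2 else rate / 3.0) for b2 in BASES}
        for b in BASES
    }
```

With estimated type-specific rates, the rows of the matrix would differ by substitution type; the uniform helper is only a stand-in so that the sketch can be run without such estimates.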

**Estimates of type-specific sequencing error rates**. Type-specific sequencing error rates estimated from 200 exomes

Analysis of real data

We also analyzed 200 Danish exomes that had been sequenced using Illumina technology at a coverage of about 8× per individual

We examined the performance of the ML approach as applied to this dataset in three different ways. First, we used 50 SNPs for which Sequenom genotype data were available for most individuals to compare the MAFs estimated from the Illumina sequencing data to those estimated from the genotype data. Here, for each site, we used only those individuals that had both genotype data and sequence data. For most of the sites (>95%), more than 170 individuals satisfied this condition. Second, we examined the proportion of SNPs with different frequencies when using different strategies to estimate the MAFs. Finally, we used these data to simulate a case-control association study by randomly assigning 100 exomes to a case group and the remaining 100 to a control group. We then examined the behavior of the different test statistics under the null hypothesis.

Availability of software

All the source code used for our simulation studies, estimation of parameters, and tests of association are publicly available (Additional files

**Manual of our programs: simreseq and testassoc**

**Source code of our programs: simreseq and testassoc**

Authors' contributions

SYK participated in the design of the study, carried out simulation studies and statistical analyses, and was involved in drafting the manuscript. KEL was involved in drafting the manuscript and revising it critically for important intellectual content. AA and TSK participated in processing the sequence data and helped with computational aspects of the project. YL, GT, NG, TJ, GA, DW, TJ, TH, OP and JW participated in the cohort design and data coordination. RN conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This study was supported by a grant from The Lundbeck Foundation to The Lundbeck Foundation Centre for Applied Medical Genomics in Personalized Disease Prediction, Prevention and Care (LuCamp), and by National Institutes of Health [R01-MHG084695; R01-HG003229]. Kirk Lohmueller was supported by the Miller Research Institute at UC Berkeley, and Anders Albrechtsen and Gitte Andersen were supported by the Danish Research Council.