Research article

Estimation of allele frequency and association mapping using next-generation sequencing data

Su Yeon Kim1*, Kirk E Lohmueller1, Anders Albrechtsen2, Yingrui Li3, Thorfinn Korneliussen4, Geng Tian3,5,6, Niels Grarup7, Tao Jiang3, Gitte Andersen8, Daniel Witte9, Torben Jorgensen10,11, Torben Hansen12,7, Oluf Pedersen13,14,7,8, Jun Wang3,4 and Rasmus Nielsen1,4

Author Affiliations

1 Departments of Integrative Biology and Statistics, UC Berkeley, Berkeley CA 94720, USA

2 Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark

3 Beijing Genomics Institute, Shenzhen 518083, China

4 Department of Biology, University of Copenhagen, Copenhagen, Denmark

5 Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 101300, China

6 The Graduate University of Chinese Academy of Sciences, Beijing 100062, China

7 Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

8 Hagedorn Research Institute, Copenhagen, Denmark

9 Steno Diabetes Center, Gentofte, Denmark

10 Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark

11 Research Centre for Prevention and Health, Glostrup University Hospital, Glostrup, Denmark

12 Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark

13 Faculty of Health Sciences, University of Aarhus, Aarhus, Denmark

14 Institute of Biomedical Sciences, University of Copenhagen, Copenhagen, Denmark


BMC Bioinformatics 2011, 12:231  doi:10.1186/1471-2105-12-231

Published: 11 June 2011


Background

Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost-effective approach is to use medium- or low-coverage data (e.g., <15X). However, SNP calling and allele frequency estimation in such studies are associated with substantial statistical uncertainty because of varying coverage and high error rates.

Results

We evaluate a new maximum likelihood method for estimating allele frequencies in low- and medium-coverage next-generation sequencing data. The method integrates over the uncertainty in each individual's data rather than first calling genotypes. It can also be applied to test directly for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of (1) accuracy of allele frequency estimation, (2) accuracy of the estimated distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data.
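The idea of integrating over genotype uncertainty can be illustrated with a minimal Python sketch. It is not the implementation evaluated here, and it rests on assumptions not stated in this abstract: a biallelic site with known major and minor alleles, a single symmetric per-read error rate (`err`), Hardy-Weinberg proportions for the genotype probabilities, an EM update for the frequency, and a one-degree-of-freedom likelihood ratio test for the case/control comparison. All function names are hypothetical.

```python
import math


def genotype_likelihoods(n_minor, n_major, err=0.01):
    """P(reads | g) for g = 0, 1, 2 minor-allele copies (binomial constant omitted)."""
    liks = []
    for g in (0, 1, 2):
        # Chance that one read shows the minor allele given genotype g (assumed error model).
        p = (g / 2) * (1 - err) + (1 - g / 2) * err
        liks.append(p ** n_minor * (1 - p) ** n_major)
    return liks


def hwe_prior(f):
    """Genotype probabilities for minor-allele frequency f under assumed Hardy-Weinberg proportions."""
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)


def log_likelihood(gls, f):
    """Log-likelihood of f: each individual's three genotypes are summed over, not called."""
    prior = hwe_prior(f)
    return sum(math.log(sum(l * p for l, p in zip(gl, prior))) for gl in gls)


def estimate_frequency(gls, f=0.2, tol=1e-8, max_iter=500):
    """EM estimate of the allele frequency from per-individual genotype likelihoods."""
    for _ in range(max_iter):
        prior = hwe_prior(f)
        expected_copies = 0.0
        for gl in gls:
            post = [l * p for l, p in zip(gl, prior)]
            # Posterior expected number of minor-allele copies for this individual.
            expected_copies += (post[1] + 2 * post[2]) / sum(post)
        new_f = expected_copies / (2 * len(gls))
        if abs(new_f - f) < tol:
            return new_f
        f = new_f
    return f


def association_lrt(case_gls, control_gls):
    """Likelihood ratio test of equal allele frequency in cases and controls (1 df)."""
    pooled = case_gls + control_gls
    alt = (log_likelihood(case_gls, estimate_frequency(case_gls))
           + log_likelihood(control_gls, estimate_frequency(control_gls)))
    null = log_likelihood(pooled, estimate_frequency(pooled))
    stat = max(2 * (alt - null), 0.0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


if __name__ == "__main__":
    # Made-up (minor, major) read counts per individual at low coverage.
    cases = [genotype_likelihoods(m, M) for m, M in [(2, 1), (0, 3), (3, 0), (1, 2)]]
    controls = [genotype_likelihoods(m, M) for m, M in [(0, 4), (0, 2), (1, 3), (0, 3)]]
    print(estimate_frequency(cases + controls))
    print(association_lrt(cases, controls))
```

A genotype-calling approach would instead assign each individual its most likely genotype and count alleles, which discards the information carried by the remaining likelihoods. In this sketch, that is the step the sum inside `log_likelihood` avoids.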

Conclusions

Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low- to medium-coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.