Open Access Methodology article

Calculation of Tajima’s D and other neutrality test statistics from low depth next-generation sequencing data

Thorfinn Sand Korneliussen1*, Ida Moltke2,3, Anders Albrechtsen3 and Rasmus Nielsen4

Author Affiliations

1 Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Oestervoldgade 5-7, DK-1350, Copenhagen, Denmark

2 Department of Human Genetics, University of Chicago, 920 E. 58th Street, CLSC 5th floor, Chicago, IL 60637, USA

3 The Bioinformatics Centre, Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, DK-2200, Copenhagen, Denmark

4 Departments of Integrative Biology and Statistics, UC-Berkeley, 4098 VLSB, Berkeley, California 94720, USA

BMC Bioinformatics 2013, 14:289  doi:10.1186/1471-2105-14-289

Published: 2 October 2013



A number of different statistics are used for detecting natural selection from DNA sequencing data, including statistics that summarize the site frequency spectrum, such as Tajima’s D. These statistics are now often applied in the analysis of Next Generation Sequencing (NGS) data. However, estimates of frequency spectra from NGS data are strongly affected by low sequencing coverage; the inherent technology-dependent variation in sequencing depth causes systematic differences in the value of the statistic among genomic regions.
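As an illustration of what "a summary of the frequency spectrum" means here, the following is a minimal sketch of the classical Tajima's D computed from an unfolded site frequency spectrum. It assumes the spectrum is already known or estimated; the function name `tajimas_d` and the input layout are hypothetical choices made for this example and are not taken from the authors' software.

```python
import math


def tajimas_d(sfs):
    """Classical Tajima's D from an unfolded site frequency spectrum.

    sfs[i - 1] is the (possibly fractional) number of sites at which the
    derived allele occurs i times among n sampled chromosomes, i = 1..n-1.
    The input layout is an assumption made for this sketch.
    """
    n = len(sfs) + 1
    if n < 4:
        raise ValueError("need at least 4 sampled chromosomes")

    seg_sites = sum(sfs)  # S, the number of segregating sites
    if seg_sites == 0:
        return float("nan")

    # Average number of pairwise differences (pi).
    pairs = n * (n - 1) / 2
    pi = sum(i * (n - i) * c for i, c in enumerate(sfs, start=1)) / pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    if variance <= 0:
        return float("nan")
    # Difference between pi and Watterson's estimator, standardized.
    return (pi - seg_sites / a1) / math.sqrt(variance)
```

With a spectrum proportional to 1/i (the neutral expectation) the two estimators of diversity agree and D is zero; an excess of singletons pushes D negative, and an excess of intermediate-frequency variants pushes it positive.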


We have developed an approach that accommodates the uncertainty of the data when calculating site-frequency-based neutrality test statistics. A salient feature of this approach is that it implicitly solves the problems of varying sequencing depth and missing data. It also avoids the need to infer variable sites for the analysis, and thereby avoids the ascertainment problems introduced by a SNP discovery process.


Using an empirical Bayes approach for fast computations, we show that this method produces results for low-coverage NGS data comparable to those achieved when the genotypes are known without uncertainty. We also validate the method in an analysis of data from the 1000 Genomes Project. The method is implemented in a fast framework that enables researchers to perform these neutrality tests on a genome-wide scale.
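The abstract does not spell out the computation, so the following is only a sketch of one way an empirical Bayes step of this kind could look. It assumes that, for every site, the likelihood of the sequencing data given each possible derived allele count in the sample is already available, and that a prior over allele counts (for example, a genome-wide estimate of the frequency spectrum) has been obtained beforehand. The function name `posterior_expected_sfs` and both inputs are hypothetical.

```python
def posterior_expected_sfs(site_likelihoods, prior):
    """Sum per-site posterior probabilities of the sample allele count.

    site_likelihoods[s][k] is assumed to hold P(data at site s | k derived
    alleles in the sample), and prior[k] an estimate of P(k). Both inputs
    are assumptions of this sketch, not the authors' file formats.
    Returns the expected number of sites in each allele-count category.
    """
    expected = [0.0] * len(prior)
    for lik in site_likelihoods:
        if len(lik) != len(prior):
            raise ValueError("likelihood vector and prior differ in length")
        # Bayes' rule: posterior is proportional to likelihood times prior.
        weights = [l * p for l, p in zip(lik, prior)]
        total = sum(weights)
        if total == 0:
            continue  # site carries no usable information under this prior
        for k, w in enumerate(weights):
            expected[k] += w / total
    return expected
```

Because every site contributes a full posterior distribution rather than a called genotype, no site has to be declared variable or invariable, and sites with little data simply fall back toward the prior. When the genotypes are known with certainty, the likelihood vectors are one-hot and the result reduces to plain counting.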

Keywords: Next-generation sequencing; Darwinian selection; Neutrality tests