V-Phaser 2: variant inference for viral populations
Broad Institute of MIT & Harvard, 7 Cambridge Center, Cambridge, MA 02142 USA
BMC Genomics 2013, 14:674 doi:10.1186/1471-2164-14-674
Published: 3 October 2013
Massively parallel sequencing could revolutionize the study of viral populations by providing ultra-deep sequencing (tens to hundreds of thousands-fold coverage) of complete viral genomes. However, differentiating true low-frequency variants from sequencing errors remains challenging.
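To make the difficulty concrete, the following is a minimal sketch of a generic depth-aware test. It is not V-Phaser 2's model. It assumes a single uniform per-base error rate and asks whether the number of reads supporting an alternate base at one position exceeds what errors alone would plausibly produce. The function names, the error rate of 0.001, and the significance threshold are all hypothetical choices made for illustration.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        total += math.exp(log_binom_pmf(i, n, p))
    return min(total, 1.0)


def is_candidate_variant(alt_count, depth, error_rate=0.001, alpha=1e-6):
    """Flag a position when alt_count is unlikely under errors alone.

    error_rate and alpha are illustrative assumptions, not values
    taken from the paper.
    """
    return binom_upper_tail(alt_count, depth, error_rate) < alpha


if __name__ == "__main__":
    # The same 1% observed frequency at two different depths.
    for alt, depth in [(1, 100), (100, 10000)]:
        print(alt, depth, is_candidate_variant(alt, depth))
```

Under these assumptions, one alternate read out of 100 cannot be separated from error, whereas 100 out of 10,000 can, which is why very deep coverage matters for low-frequency variants. A uniform error rate ignores the systematic, context-dependent artifacts that real sequencers produce, so this sketch understates the problem.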
We developed a software package, V-Phaser 2, for inferring intrahost diversity within viral populations. The program adds three major new methodologies to the state of the art: a technique to efficiently utilize paired-end read data for calling phased variants, a new strategy to represent and infer length polymorphisms, and an in-line filter for erroneous calls arising from systematic sequencing artifacts. We have also heavily optimized memory and run-time performance. This combination of algorithmic and technical advances allows V-Phaser 2 to fully utilize extremely deep paired-end sequencing data (such as that generated by Illumina sequencers) to accurately infer low-frequency intrahost variants in viral populations in reasonable time on a standard desktop computer. V-Phaser 2 was validated and compared to both QuRe and the original V-Phaser on three datasets obtained from two viral populations: a mixture of eight known strains of West Nile Virus (WNV) sequenced on both 454 Titanium and Illumina MiSeq, and a mixture of twenty-four known strains of WNV sequenced only on 454 Titanium. V-Phaser 2 outperformed the other two programs in both sensitivity and specificity while using more than fivefold less time and memory.
We developed V-Phaser 2, a publicly available software tool that enables the efficient analysis of ultra-deep sequencing data produced by common next-generation sequencing platforms for viral populations. V-Phaser 2 is freely available for academic use at http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-phaser-2.