This article is part of the supplement: Selected articles from the 8th International Symposium on Bioinformatics Research and Applications (ISBRA'12)

Open Access Methodology Article

Reconstruction of viral population structure from next-generation sequencing data using multicommodity flows

Pavel Skums1*, Nicholas Mancuso2, Alexander Artyomenko2, Bassam Tork2, Ion Mandoiu3, Yury Khudyakov1 and Alex Zelikovsky2

  • * Corresponding author: Pavel Skums kki8@cdc.gov

  • † Equal contributors

Author Affiliations

1 Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA

2 Department of Computer Science, Georgia State University, 34 Peachtree str., 30303, Atlanta, GA, USA

3 Department of Computer Science and Engineering, University of Connecticut, 06269, Storrs, CT, USA

BMC Bioinformatics 2013, 14(Suppl 9):S2  doi:10.1186/1471-2105-14-S9-S2

Published: 28 June 2013

Abstract

Background

Highly mutable RNA viruses exist in infected hosts as heterogeneous populations of genetically close variants known as quasispecies. Next-generation sequencing (NGS) allows for analysing a large number of viral sequences from infected patients, presenting a novel opportunity for studying the structure of a viral population and understanding virus evolution, drug resistance and immune escape. Accurate reconstruction of the genetic composition of intra-host viral populations involves assembling the NGS short reads into whole-genome sequences and estimating the frequencies of individual viral variants. Although a few approaches have been developed for this task, accurate reconstruction of quasispecies populations remains largely unresolved.

Results

Two new methods, AmpMCF and ShotMCF, for reconstruction of the whole-genome intra-host viral variants and estimation of their frequencies were developed, based on Multicommodity Flows (MCFs). AmpMCF was designed for NGS reads obtained from individual PCR amplicons and ShotMCF for NGS shotgun reads. While AmpMCF, based on a covering formulation, identifies a minimal set of quasispecies explaining all observed reads, ShotMCF, based on a packing formulation, engages the maximal number of reads to generate the most probable set of quasispecies. Both methods were evaluated on simulated data in comparison to Maximum Bandwidth and ViSpA, previously developed state-of-the-art algorithms for estimating quasispecies spectra from the NGS amplicon and shotgun reads, respectively. Both algorithms were accurate in estimation of quasispecies frequencies, especially from large datasets.
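The covering idea mentioned above (a small set of variants that explains every observed read) can be illustrated with a toy sketch. This is not the multicommodity flow formulation used by AmpMCF; it substitutes a plain greedy set cover heuristic, and the function names, the read representation as (start, sequence) pairs, and the example data are all assumptions made for illustration.

```python
# Toy illustration of a covering formulation: choose a small set of
# candidate variants such that every read is explained by at least one.
# NOTE: this is a greedy set cover heuristic, NOT the MCF-based method
# described in the article; all names and data here are hypothetical.

def explains(variant, read):
    """A read (start, seq) is explained if it matches the variant exactly."""
    start, seq = read
    return variant[start:start + len(seq)] == seq

def greedy_cover(candidates, reads):
    """Repeatedly pick the candidate explaining the most uncovered reads."""
    uncovered = set(range(len(reads)))
    chosen = []
    while uncovered:
        best = max(
            candidates,
            key=lambda v: sum(explains(v, reads[i]) for i in uncovered),
        )
        covered = {i for i in uncovered if explains(best, reads[i])}
        if not covered:
            raise ValueError("some reads are not explained by any candidate")
        chosen.append(best)
        uncovered -= covered
    return chosen

# Hypothetical candidate variants and error-free reads (start, sequence).
candidates = ["ACGTAC", "ACTTAC", "AGGTAA"]
reads = [(0, "ACG"), (2, "GTAC"), (1, "CTT"), (3, "TAC")]

selected = greedy_cover(candidates, reads)
print(selected)
```

In this toy input the third candidate explains no read, so a covering solution has no reason to include it. A packing formulation would instead ask how many reads can be assigned to a set of variants, which this sketch does not attempt.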

Conclusions

The problem of viral population reconstruction from amplicon or shotgun NGS reads was solved using the MCF formulation. The two methods, ShotMCF and AmpMCF, developed here afford accurate reconstruction of the structure of intra-host viral population from NGS reads. The implementations of the algorithms are available at http://alan.cs.gsu.edu/vira.html (AmpMCF) and http://alan.cs.gsu.edu/NGS/?q=content/shotmcf (ShotMCF).