Open Access Highly Accessed Methodology article

PVT: An Efficient Computational Procedure to Speed up Next-generation Sequence Analysis

Ranjan Kumar Maji1, Arijita Sarkar1, Sunirmal Khatua2, Subhasis Dasgupta3 and Zhumur Ghosh1*

Author Affiliations

1 Bioinformatics Centre, Bose Institute, Kolkata 700054, India

2 Department of Computer Science and Engineering, University of Calcutta, Kolkata 700009, India

3 Electronics and Communication Sciences Unit (ECSU), Indian Statistical Institute, Kolkata 700108, India

BMC Bioinformatics 2014, 15:167  doi:10.1186/1471-2105-15-167

Published: 4 June 2014

Abstract

Background

High-throughput next-generation sequencing (NGS) techniques are advancing genomics and molecular biology research. The technology generates very large datasets, which challenges researchers to find efficient, cost-effective and time-effective ways to analyse them. Moreover, the different types of NGS data share certain demanding analysis steps. Spliced alignment is one such fundamental step, and it is extremely computationally intensive and time-consuming. Even the most widely used spliced alignment tools have serious problems. TopHat is one such widely used tool: although it supports multithreading, it does not utilize computational resources efficiently in terms of CPU utilization and memory. Here we introduce PVT (Pipelined Version of TopHat), which takes a modular approach by breaking TopHat’s serial execution into a pipeline of multiple stages, thereby increasing the degree of parallelization and computational resource utilization. We thus address these inefficiencies in TopHat so that large NGS datasets can be analysed efficiently.

Results

We analysed the SRA datasets SRX026839 and SRX026838, which consist of single-end reads, and the SRA dataset SRR1027730, which consists of paired-end reads. We used TopHat v2.0.8 to analyse these datasets and recorded the CPU usage, memory footprint and execution time during spliced alignment. Based on these measurements, we designed PVT, a pipelined version of TopHat that removes the redundant computational steps during ‘spliced alignment’ and breaks the job into a pipeline of multiple stages (each comprising one or more steps) to improve resource utilization and thus reduce the execution time.
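The general idea of splitting a serial job into pipeline stages can be illustrated with a minimal Python sketch. It is not the PVT implementation: the stage functions `trim` and `tag_length`, the helper names `run_stage` and `run_pipeline`, and the use of in-process threads and queues are all assumptions made for illustration, and they do not correspond to TopHat's actual steps. Each stage runs concurrently and hands its output to the next stage, so a later stage can start on early items while an earlier stage is still processing the rest.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input stream


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(stages, items):
    """Push items through the stages, running each stage in its own thread."""
    # One queue in front of every stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for real alignment steps (illustration only).
def trim(read):
    return read.strip().upper()


def tag_length(read):
    return (read, len(read))


if __name__ == "__main__":
    print(run_pipeline([trim, tag_length], [" acgt ", "ttga\n"]))
```

Because each stage here is a single worker reading from a FIFO queue, results come out in input order. In a real alignment workflow the stages would more plausibly be separate processes or external tools, since CPU-bound Python threads do not run in parallel.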

Conclusions

PVT improves on TopHat for spliced alignment in NGS data analysis. PVT reduced the execution time to ~23% for the single-end read dataset. Further, the version of PVT designed for paired-end reads showed a ~41% improvement over TopHat in execution time for the chosen data. Moreover, we propose PVT-Cloud, which implements the PVT pipeline in a cloud computing system.

Keywords:
NGS-data analysis; RNA-Seq; Cloud computing; Big data; Parallel computing; Paired end read analysis; Single end read analysis; Pipeline architecture