Open Access Software

Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community

Konstantinos Krampis1*, Tim Booth2, Brad Chapman3, Bela Tiwari4, Mesude Bicak2, Dawn Field2 and Karen E Nelson1

Author Affiliations

1 J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850, USA

2 CEH Wallingford, Benson Lane, Wallingford, UK

3 Bioinformatics Core, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA

4 CLC Bio, Finlandsgade 10, 8200 Århus N, Denmark


BMC Bioinformatics 2012, 13:42  doi:10.1186/1471-2105-13-42

Published: 19 March 2012

Additional files

Additional file 1:

Supplementary 1. Cloud BioLinux software documentation in the form of a small, self-contained website. To view it, download and uncompress the .zip file, then open the "index.html" file in the main directory with a web browser. (ZIP 1823 kb).

Format: ZIP Size: 1.8MB

Open Data

Additional file 2:

Supplementary 2. Overall, the pipeline takes sequencing reads as input, converts them to standard FASTQ format, aligns them to a reference genome, performs SNP calling, and produces a summary PDF of the results. It uses the CloudMan server included in the Cloud BioLinux VM to provision a Sun Grid Engine (SGE) cluster composed of multiple copies of the VM on the cloud, and the RabbitMQ distributed messaging queue for data coordination and exchange between the cluster nodes. (a) In the first phase, the sequence files for each lane are split by sample barcode, and the parts are aligned to the genome in parallel using Bowtie, each submitted as a separate SGE computational task across the cluster nodes. The pipeline then sorts and merges BAM alignment files from multiple lanes when their sequences share a barcode, producing a single representative BAM file for each barcoded sub-sample. At this step, RabbitMQ enables message exchange between the cluster nodes to identify BAM alignment sets with the same barcode on different nodes, and the pipeline scripts transfer and merge those sets on a single node. (b) The second phase parallelizes the processing of each alignment file, using the same SGE-based approach across cluster nodes: read quality assessment with FastQC and Picard, variant calling with GATK, and visualization with BigWig. (JPEG 160 kb).

Format: JPEG Size: 160KB

Open Data
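
The merge step described for Additional file 2 (collecting per-lane BAM files that share a barcode so they can be merged into one file per sub-sample) can be illustrated with a minimal sketch. This is not the pipeline's actual code: the file naming scheme `<lane>_<barcode>.bam` and the function names `barcode_of` and `plan_merges` are assumptions made for illustration only, and the sketch only plans the grouping; it does not run any alignment or merging tool.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def barcode_of(bam_path):
    """Return the barcode from a hypothetical '<lane>_<barcode>.bam' file name."""
    stem = PurePosixPath(bam_path).stem          # e.g. 'lane1_ACGT'
    _lane, _sep, barcode = stem.partition("_")   # split on the first underscore
    if not barcode:
        raise ValueError(f"no barcode found in file name: {bam_path}")
    return barcode


def plan_merges(bam_paths):
    """Group per-lane BAM paths by barcode.

    Returns a dict mapping each barcode to the sorted list of BAM files
    that would be merged into one representative file for that sub-sample.
    """
    groups = defaultdict(list)
    for path in bam_paths:
        groups[barcode_of(path)].append(path)
    return {barcode: sorted(paths) for barcode, paths in sorted(groups.items())}


if __name__ == "__main__":
    # Hypothetical alignment outputs sitting on different cluster nodes.
    alignments = [
        "node1/lane1_ACGT.bam",
        "node2/lane2_ACGT.bam",
        "node1/lane1_TTAG.bam",
    ]
    for barcode, paths in plan_merges(alignments).items():
        print(barcode, "<-", ", ".join(paths))
```

In the described pipeline, the equivalent information is exchanged between nodes over RabbitMQ before the files are transferred and merged on a single node; the sketch covers only the grouping logic that decides which files belong together.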