Workflow for establishing and executing the reciprocal smallest distance (RSD) algorithm using the Elastic MapReduce framework on the Amazon Elastic Compute Cloud (EC2). (1) Preconfiguration involves the general setup and porting of the RSD program and genomes to Amazon S3, and configuration of the Mappers for executing the BLAST and RSD runs within the cluster. (2) Instantiation specifies the Amazon EC2 instance type (e.g. small, medium, or large), logging of cloud cluster performance, and preparation of the runner files as described in the Methods. (3) Job Flow Execution launches the processes across the cluster using the command-line arguments indicated in Table 1. This is done separately for the BLAST and RSD steps. (4) The All-vs-All BLAST step uses the BLAST runner and BLAST mapper to generate a complete set of results for all genomes under consideration. (5) The Ortholog computation step uses the RSD runner file and RSD mapper to estimate orthologs and evolutionary distances for all genomes under study. This step uses the stored BLAST results from step 4 and can be run asynchronously, at any time after the BLAST processes complete. The Amazon S3 storage bucket was used for persistent storage of BLAST and RSD results. The Hadoop Distributed File System (HDFS) was used for local storage of genomes and genome-specific BLAST results, giving faster I/O when running the RSD step. Additional details are provided in the Methods.
Wall et al. BMC Bioinformatics 2010 11:259 doi:10.1186/1471-2105-11-259
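
The runner/mapper pairing in steps 4 and 5 can be illustrated with a minimal sketch, assuming a Hadoop-streaming-style mapper that reads one genome pair per line from standard input and writes one tab-separated key/value record per pair. This is not the authors' code: the function names (`runner_lines`, `map_line`, `mapper`), the line format, the use of ordered genome pairs, and the `pending` placeholder are all hypothetical, and the actual BLAST or RSD invocation is deliberately left out.

```python
import sys
from itertools import permutations


def runner_lines(genomes):
    """Build hypothetical runner input: one line per ordered (query, subject) pair."""
    return [f"{query}\t{subject}" for query, subject in permutations(genomes, 2)]


def map_line(line, step):
    """Turn one runner line into a tab-separated key/value record.

    The value is only a placeholder; a real mapper would run the BLAST or
    RSD computation for this genome pair here and report its outcome.
    """
    query, subject = line.rstrip("\n").split("\t")
    return f"{query}|{subject}\t{step}:pending"


def mapper(stream=sys.stdin, out=sys.stdout, step="blast"):
    """Streaming-style loop: read pair lines, skip blanks, emit one record each."""
    for line in stream:
        if line.strip():
            out.write(map_line(line, step) + "\n")
```

In this sketch the same loop serves both job flows: calling `mapper(step="blast")` stands in for step 4, and calling `mapper(step="rsd")` on the same pair list stands in for step 5, which would read the stored BLAST results rather than recompute them.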