Table 1. Elastic MapReduce commands

EMR command | Description | Setting: (1) BLAST step; (2) ortholog computation step
--stream | Activates the "streaming" module |
--input | File(s) to be processed by EMR | (1) hdfs:///home/hadoop/blast_runner; (2) hdfs:///home/hadoop/ortho_runner
--mapper | Name of mapper file | (1) s3n://rsd_bucket/; (2) s3n://rsd_bucket/
--reducer | None required; reduction is done within the RSD algorithm |
--cache | Individual symlinks to the executables, genomes, RSD standalone package, BLAST input, and results | s3n://rsd_bucket/executables.tar.gz#executables, #genomes, #RSD_standalone, #blastinput, #results
--jobconf | Number of BLAST and ortholog calculation processes | = N
--jobconf | Total number of task trackers | = 8
--jobconf mapred.task.timeout | Time after which a running process was considered failed and restarted | 86400000 ms (24 h)
--jobconf mapred.tasktracker.expiry.interval | Time after which an unresponsive instance was declared dead | 3600000 ms (1 h; set large to avoid instances being shut down during long-running jobs)
--jobconf mapred.map.tasks.speculative.execution | If true, EMR speculates that a job is running slowly and runs the same job in parallel | false (because the time for each genome-vs-genome run varied widely, we set this to false to ensure maximal availability of the cluster)
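To make the flags concrete, here is a minimal sketch, not the authors' published invocation, of how the Table 1 settings might combine into a single elastic-mapreduce streaming step for the BLAST stage. The mapper filename blast_mapper.py, the --output path, the job name, and --reducer NONE are hypothetical placeholders (the table truncates the real mapper values), and only the first of the five cache symlinks is shown.

  # Sketch of one EMR streaming step built from the Table 1 flags;
  # placeholder names are marked in the lead-in above.
  elastic-mapreduce --create --alive --name "rsd-blast" \
    --stream \
    --input hdfs:///home/hadoop/blast_runner \
    --mapper s3n://rsd_bucket/blast_mapper.py \
    --reducer NONE \
    --output hdfs:///home/hadoop/blast_output \
    --cache "s3n://rsd_bucket/executables.tar.gz#executables" \
    --jobconf mapred.task.timeout=86400000 \
    --jobconf mapred.tasktracker.expiry.interval=3600000 \
    --jobconf mapred.map.tasks.speculative.execution=false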

Specific commands passed through the Ruby command-line client to the Elastic MapReduce (EMR) service from Amazon Web Services. The inputs specified correspond to (1) the BLAST step and (2) the ortholog computation step of the RSD cloud algorithm. These configuration settings apply to both the EMR and Hadoop frameworks, with two exceptions: in EMR, a --j parameter can be used to provide an identifier for the entire cluster, useful only when more than one cloud cluster is needed simultaneously; in Hadoop, these commands are passed directly to the streaming.jar program, obviating the need for the --stream argument.
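As the legend notes, under plain Hadoop the same settings are passed directly to the streaming jar, so no --stream flag is needed. A minimal sketch, assuming a Hadoop 0.20-era installation (the jar's exact path and name vary by version) and the same hypothetical mapper and output placeholders as above:

  # Direct Hadoop streaming equivalent of the EMR step; -file ships the
  # mapper script to the cluster, and -reducer NONE disables the reduce phase.
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input hdfs:///home/hadoop/blast_runner \
    -output hdfs:///home/hadoop/blast_output \
    -file blast_mapper.py \
    -mapper blast_mapper.py \
    -reducer NONE \
    -cacheArchive "s3n://rsd_bucket/executables.tar.gz#executables" \
    -jobconf mapred.task.timeout=86400000 \
    -jobconf mapred.tasktracker.expiry.interval=3600000 \
    -jobconf mapred.map.tasks.speculative.execution=false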

Wall et al., BMC Bioinformatics 2010, 11:259. doi:10.1186/1471-2105-11-259
