<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1471-2105-13-77</ui>
   <ji>1471-2105</ji>
   <fm>
      <dochead>Software</dochead>
      <bibl>
         <title>
            <p>Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support</p>
         </title>
         <aug>
            <au id="A1" ca="yes"><snm>Abouelhoda</snm><fnm>Mohamed</fnm><insr iid="I1"/><insr iid="I3"/><email>mabouelhoda@yahoo.com</email></au>
            <au id="A2"><snm>Issa</snm><mnm>Alaa</mnm><fnm>Shadi</fnm><insr iid="I1"/><email>salaa@nileu.edu.eg</email></au>
            <au id="A3"><snm>Ghanem</snm><fnm>Moustafa</fnm><insr iid="I1"/><insr iid="I2"/><email>mmg@doc.ic.ac.uk</email></au>
         </aug>
         <insg>
            <ins id="I1"><p>Center for Informatics Sciences, Nile University, Giza, Egypt</p></ins>
            <ins id="I2"><p>Department of Computing, Imperial College London, London, SW7 2AZ, UK</p></ins>
            <ins id="I3"><p>Faculty of Engineering, Cairo University, Giza, Egypt</p></ins>
         </insg>
         <source>BMC Bioinformatics</source>
         <issn>1471-2105</issn>
         <pubdate>2012</pubdate>
         <volume>13</volume>
         <issue>1</issue>
         <fpage>77</fpage>
         <url>http://www.biomedcentral.com/1471-2105/13/77</url>
         <xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-13-77</pubid><pubid idtype="pmpid">22559942</pubid></pubidlist></xrefbib>
      </bibl>
      <history><rec><date><day>15</day><month>8</month><year>2011</year></date></rec><acc><date><day>4</day><month>5</month><year>2012</year></date></acc><pub><date><day>4</day><month>5</month><year>2012</year></date></pub></history>
      <cpyrt><year>2012</year><collab>Abouelhoda et al.; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Over the past decade the workflow system paradigm has evolved as an efficient and user-friendly approach for developing complex bioinformatics applications. Two popular workflow systems that have gained acceptance by the bioinformatics community are Taverna and Galaxy. Each system has a large user-base and supports an ever-growing repository of application workflows. However, workflows developed for one system cannot be imported and executed easily on the other. The lack of interoperability is due to differences in the models of computation, workflow languages, and architectures of both systems. This lack of interoperability limits sharing of workflows between the user communities and leads to duplication of development efforts.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>In this paper, we present <it>Tavaxy</it>, a stand-alone system for creating and executing workflows based on using an extensible set of re-usable workflow patterns. <it>Tavaxy</it> offers a set of new features that simplify and enhance the development of sequence analysis applications: It allows the integration of existing Taverna and Galaxy workflows in a single environment, and supports the use of cloud computing capabilities. The integration of existing Taverna and Galaxy workflows is supported seamlessly at both run-time and design-time levels, based on the concepts of hierarchical workflows and workflow patterns. The use of cloud computing in <it>Tavaxy</it> is flexible, where the users can either instantiate the whole system on the cloud, or delegate the execution of certain sub-workflows to the cloud infrastructure.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p><it>Tavaxy</it> reduces the workflow development cycle by introducing the use of workflow patterns to simplify workflow creation. It enables the re-use and integration of existing (sub-) workflows from Taverna and Galaxy, and allows the creation of hybrid workflows. Its additional features exploit recent advances in high performance cloud computing to cope with the increasing data size and complexity of analysis.</p>
               <p>The system can be accessed either through a cloud-enabled web-interface or downloaded and installed to run within the user's local environment. All resources related to <it>Tavaxy</it> are available at <url>http://www.tavaxy.org</url>.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <sec>
            <st>
               <p>Increasing complexity of analysis and scientific workflow paradigm</p>
            </st>
            <p>The advent of high-throughput sequencing technologies - accompanied by recent advances in open source software tools, open access data sources, and cloud computing platforms - has enabled the genomics community to develop and use sophisticated application <it>workflows</it>. Such workflows start with voluminous raw sequences and end with detailed structural, functional, and evolutionary results. The workflows involve the use of multiple software tools and data resources in a staged fashion, with the output of one tool being passed as input to the next. As one example, a personalized medicine workflow <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp> based on <it>Next Generation Sequencing</it> (NGS) technology can start with short DNA sequences (reads) of an individual human genome and end with a diagnostic and prognostic report, or potentially even with a treatment plan if clinical data were available. This workflow involves the use of multiple software tools to assess the quality of the reads, to map them to a reference human genome, to identify the sequence variations, to query databases in order to associate variations with diseases, and to check for novel variants. As another example, consider a workflow in the area of metagenomics <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. Such a workflow can start with a large collection of sequenced reads and end with a determination of the micro-organisms present in the environmental sample and an estimation of their relative abundance. This workflow also involves different tasks and software tools, such as those used for assessing the quality of the reads, assembling them into longer DNA segments, querying them against different databases, and conducting phylogenetic and taxonomical analyses.</p>
            <p>To simplify the design and execution of complex bioinformatics workflows, especially those that use multiple software tools and data resources, a number of scientific workflow systems have been developed over the past decade. Examples include Taverna <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>, Kepler <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, Triana <abbrgrp><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp>, Galaxy <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, Conveyor <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, Pegasus <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, Pegasys <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, Gene Pattern <abbrgrp><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>, Discovery Net <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>, and OMII-BPEL <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>; see <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> for a survey and comparison of some of these tools.</p>
            <p>All such workflow systems typically adopt an abstract representation of a workflow in the form of a directed graph, where nodes represent tasks to be executed and edges represent either data flow or execution dependencies between different tasks. Based on this abstraction, and through a visual front-end, the user can intuitively build and modify complex applications with little or no programming expertise. The workflow system maps the edges and nodes in the graph to real data and software components. The <it>workflow engine</it> (also called <it>execution</it> or <it>enactment engine</it>) executes the software components either locally on the user machine or remotely at distributed locations. The engine takes care of data transfer between the nodes and can also exploit high performance computing architectures, if available, so that independent tasks run in parallel. This lets application scientists focus on the logic of their applications rather than on the technical details of invoking the software components or using distributed computing resources.</p>
            <p>Within the bioinformatics community, two workflow systems have gained increasing popularity, as reflected by their large and growing user communities. These are Galaxy <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and Taverna <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Both systems are efficient, open source, and satisfy to a great extent the requirements of the bioinformatics community. Taverna has been developed primarily to simplify the development of workflows that access and use analysis tasks deployed as remote web and grid services. It comes with an associated directory of popular remote bioinformatics services and provides an environment that coordinates their invocation and execution. Galaxy has been developed primarily to facilitate the execution of software tools on local (high performance computing) infrastructure while still simplifying access to data held on remote biological resources. Its installation includes a large library of tools and pre-made scripts for data processing. Both systems are extensible, allowing their users to integrate new services and tools easily. Each system offers log files to capture the history of experiment details. Furthermore, both systems provide web-portals allowing users to share and publish their workflows: These are the myExperiment portal for Taverna <abbrgrp><abbr bid="B23">23</abbr><abbr bid="B24">24</abbr></abbrgrp> and the <it>Public Pages</it> for Galaxy <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. The features of both systems are continuously being updated by their development teams and their user communities are active in developing and sharing new application workflows.</p>
            <p>However, since both Taverna and Galaxy have been developed with different use cases and execution environments in mind, each system tends to be suited to different styles of bioinformatics applications. The key differences between the two systems can be categorized into three major classes:</p>
            <p indent="1">1. Execution environment and system design: Taverna is oriented towards using web-services for invoking remote applications, while Galaxy is oriented towards efficient execution on local infrastructure.</p>
            <p indent="1">2. Model of computation: Taverna includes control constructs such as conditionals and iterations, and data constructs that can handle (nested) lists (in parallel) using a number of pre-defined operations. These constructs are not directly available in Galaxy, which puts a limitation on the types of workflows that can be executed on Galaxy.</p>
            <p indent="1">3. Workflow description language: Taverna uses the XML-based language SCUFL for describing the workflows, while Galaxy expresses workflows in its own language using JSON format.</p>
            <p>These differences lead to two major consequences: First, some tasks can be implemented easily on one system but would be difficult to implement on the other without considerable programming effort. Second, a (sub-) workflow developed on one system cannot be imported and re-used by the other easily (i.e., <it>lack of interoperability</it>), which limits sharing of workflows between their communities and leads to duplication of development efforts.</p>
         </sec>
         <sec>
            <st>
               <p>Our contribution</p>
            </st>
            <p>In this paper, we present <it>Tavaxy</it>, a pattern-based workflow system that can integrate the use and execution of Taverna and Galaxy workflows in a single environment. The focus of <it>Tavaxy</it> is facilitating the efficient execution of sequence analysis tasks on high performance computing infrastructures and cloud computing systems. <it>Tavaxy</it> builds on the features of Taverna and Galaxy and provides the following benefits:</p>
            <p indent="1">&#183; <it>Single entry point: Tavaxy</it> is a standalone pattern-based workflow system providing an extensible set of patterns, and allows easy integration with other workflow systems. It provides a single environment to open, edit, and execute its own workflows as well as integrate native Taverna and Galaxy whole- or sub-workflows, thus enabling users to compose hybrid workflows. Figure <figr fid="F1">1</figr> summarizes the different integration use cases at run-time and design-time levels in <it>Tavaxy</it>. (The replacement of remote calls with local tools is addressed in the next paragraph. The computation of maximal external sub-workflows is a performance optimization step discussed later in this paper.)</p>
            <p indent="1">&#183; <it>Transparent use of local and remote resources:</it> For most programs, <it>Tavaxy</it> allows its user to choose whether a task should run on local or remote computational resources. Furthermore, if a Taverna workflow is imported (e.g., from myExperiment), <it>Tavaxy</it> offers users an option to replace calls to remote web services automatically with calls to corresponding tools that run on a local computing infrastructure, or vice versa. (Note that almost all the workflows published on myExperiment are based on using remote services.) Changing the default mode of invocation in either Taverna or Galaxy requires programming knowledge, and is difficult for a scientist without programming experience.</p>
            <p indent="1">&#183; <it>Simplified control and data constructs: Tavaxy</it> supports a set of advanced control constructs (e.g., <it>conditionals</it> and <it>iterations</it>) and data constructs (e.g., nested lists) and allows their execution on the local or remote computational infrastructures. The use of these constructs, which are referred to as &#8220;patterns&#8221; in <it>Tavaxy</it>, facilitates the design of workflows and enables further parallelization, where the data items passed to a node can be processed in parallel. The user of <it>Tavaxy</it> has the extra advantages of 1) adding these constructs to imported Galaxy workflows, and 2) using these constructs on the local infrastructures; features that are available only in Taverna and only for remote tools.</p>
            <p>Beyond these integration issues, <it>Tavaxy</it> provides the following additional features that facilitate authoring and execution of workflows:</p>
            <p indent="1">&#183; <it>Enhanced usability: Tavaxy</it> uses flowchart-like elements to represent control and data constructs. The workflow nodes are annotated with icons to reflect if they are executed locally or remotely. The tool parameters can be defined either at the design- or run-time of the workflow. The data patterns offered in <it>Tavaxy</it> further facilitate the composition of workflows, making them more compact, and enable exploitation of local high performance computing infrastructure without any additional effort. Furthermore, each user has an account associated with its data and each workflow is further associated with its history as well as previously used datasets within the user account.</p>
            <p indent="1">&#183; <it>Modularity: Tavaxy</it> is modular; it separates the workflow composition and management modules from the workflow engine. Its workflow engine is a standalone application accepting both workflow definitions and data as input. This feature, as will be made clear later in the manuscript, is of crucial importance for implementing control constructs and for supporting cloud computing.</p>
            <p indent="1">&#183; <it>High performance computing infrastructure support: Tavaxy</it> can readily run on a computer cluster, once a job scheduler system (like PBS Torque or SGE) and a distributed file system (like NFS) are installed. The execution of parallel tasks is handled automatically by the workflow engine, hiding all invocation details.</p>
            <p indent="1">&#183; <it>Cloud computing support: Tavaxy</it> is cloud computing friendly, enabling users to scale up their computational infrastructure on a pay-as-you-go basis, with reduced configuration effort. Through a simple interface within the <it>Tavaxy</it> environment, a user who has a cloud computing account (e.g., at the Amazon AWS platform) can easily instantiate the whole system on the cloud, or alternatively use a mixed mode where the local version of the system delegates the execution of a sub-workflow or a single task to a <it>Tavaxy</it> cloud instance.</p>
            <p>In the remaining part of this section, we will review basic concepts of workflow interoperability and <it>workflow patterns</it> that contributed to the design and development of <it>Tavaxy</it>.</p>
            <fig id="F1"><title><p>Figure 1 </p></title><caption><p>Use diagram of integrating Taverna, Galaxy, and Tavaxy workflows.</p></caption><text>
   <p><b>Use diagram of integrating Taverna, Galaxy, and Tavaxy workflows.</b> Tavaxy is a standalone workflow system that executes Tavaxy workflows as well as integrates and executes Taverna and Galaxy workflows. Galaxy workflows are compatible with Tavaxy and can be imported and executed directly on the system. For Taverna workflows, the integration can take place at either run-time or design-time. At run-time, the Taverna (sub-) workflows can be executed as a whole by calling the Taverna engine. They can also be saved as sub-workflows and used within other Tavaxy workflows. At workflow design-time, Taverna workflows are translated to the Tavaxy language, enabling them to be edited and enhanced. In this case, the user has the option of replacing any of the remote calls in the Taverna workflow with calls to equivalent local tools. Any remaining Taverna sub-workflow fragments can be directly executed using the Taverna engine. As an optimization, sub-workflows can be encapsulated into maximal external sub-workflows so as to minimize execution overheads. The implementation section addresses the maximal external sub-workflows in more detail.</p>
</text><graphic file="1471-2105-13-77-1"/></fig>
         </sec>
         <sec>
            <st>
               <p>Related technical work</p>
            </st>
            <sec>
               <st>
                  <p>Workflow interoperability</p>
               </st>
               <p>Our approach described in this paper goes beyond the run-time &#8220;black-box&#8221; invocation of one system from the other, which was used in the work of <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B26">26</abbr></abbrgrp> to enable interoperability between Galaxy and Taverna. To highlight the difference, the Workflow Management Coalition, WfMC, <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> defines eight models, or approaches, for achieving interoperability between workflow systems. These models can be grouped broadly into two major categories: 1) Run-time interoperability, where one system invokes the other system through APIs. 2) Design-time interoperability, where the two systems are based on a) the same model of computation (MoC); or b) the same languages (or at least translation between languages is feasible), or c) the same execution environment (or at least the existence of an abstract third-party middleware). These three design-time issues are discussed in detail in the paper of Elmroth et al. <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>.</p>
               <p>As discussed earlier, Taverna and Galaxy have different models of computation and different languages. In this paper, we use ideas from the workflow interoperability literature and introduce the concept of patterns to integrate and execute Taverna and Galaxy workflows in <it>Tavaxy</it> at both run-time and design-time levels.</p>
            </sec>
            <sec>
               <st>
                  <p>Workflow patterns</p>
               </st>
               <p>Workflow patterns are a set of constructs that model a (usually recurrent) requirement (sub-process); the description of these constructs is an integral part of the pattern definition. Workflow patterns, despite being less formal than workflow languages, have become increasingly popular due to their practical relevance in comparing and understanding the features of different workflow languages. As originally introduced in <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, workflow patterns were used to characterize business workflows and were categorized into four types: <it>control flow</it>, <it>data flow</it>, <it>resource and operational</it>, and <it>exception handling</it> patterns. We note that the concept of patterns is in general applicable to scientific workflows. In <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>, we used this concept for the first time to demonstrate the feasibility of achieving interoperability between Taverna and Galaxy. Our work in this paper extends this demonstrative work by providing a larger set of patterns, and also by providing a complete implementation of them within a functional and usable system.</p>
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Implementation</p>
         </st>
         <sec>
            <st>
               <p>Tavaxy model of computation and language</p>
            </st>
            <p><it>Tavaxy</it> workflows are directed acyclic graphs (DAGs), where nodes represent computational tools and edges correspond to data flows or dependencies between them. The workflow patterns defined and used in <it>Tavaxy</it> have special meanings in this DAG, as will be explained in detail later in the pattern implementation subsection. The <it>Tavaxy</it> engine is based on a data flow model of execution <abbrgrp><abbr bid="B22">22</abbr><abbr bid="B28">28</abbr><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr><abbr bid="B34">34</abbr></abbrgrp>, in which each node (task) can start computation only once all its input data are available. The <it>Tavaxy</it> workflow engine is responsible for keeping track of the status of each node and for invoking its execution when the predecessor nodes have finished execution and its input data are completely available. When executing on a single processor, where tasks are executed sequentially, the order of task invocation can be determined in advance. This is achieved by traversing the DAG and scheduling a node (i.e., adding it to the ready queue) only if all its predecessor nodes are already scheduled. As such, this is referred to as <it>static</it> scheduling <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. When executing on multiple processors, independent ready tasks can be executed in parallel. In this case, the <it>Tavaxy</it> engine keeps looking for ready tasks and launches them concurrently. On multi-core machines, the engine uses multi-threading to handle the execution of concurrent tasks. On a computer cluster, it passes the concurrent tasks to a job-scheduler, which in turn distributes them for execution on different cluster nodes.
The default job-scheduler used in <it>Tavaxy</it> is PBS Torque, and it is set up over a shared file system (NFS for local setting and S3 for cloud infrastructure) to guarantee availability of data for all cluster nodes.</p>
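            <p>The static scheduling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the <it>Tavaxy</it> implementation: the function name static_schedule, the dictionary representation of the DAG, and the task names in the example are all hypothetical and chosen only for this sketch.</p>

```python
from collections import deque


def static_schedule(predecessors):
    """Return an invocation order for a DAG.

    `predecessors` maps each node to the set of nodes it depends on
    (a hypothetical representation chosen for this sketch).
    A node is scheduled only once all of its predecessors are scheduled.
    """
    order = []                              # nodes in invocation order
    done = set()                            # nodes already scheduled
    pending = deque(sorted(predecessors))   # sorted for a deterministic result
    stalled = 0                             # deferrals since the last progress

    while pending:
        node = pending.popleft()
        if predecessors[node].issubset(done):
            order.append(node)
            done.add(node)
            stalled = 0
        else:
            # Not ready yet: put it back and try the other nodes first.
            pending.append(node)
            stalled += 1
            if stalled == len(pending):
                # Every pending node was deferred without any progress.
                raise ValueError("cycle or unknown predecessor in workflow")
    return order


# Hypothetical four-task workflow: quality control, alignment,
# variant calling, and a report that needs both later results.
example = {
    "qc": set(),
    "align": {"qc"},
    "variants": {"align"},
    "report": {"align", "variants"},
}
print(static_schedule(example))  # ['qc', 'align', 'variants', 'report']
```

            <p>On multiple processors the same readiness test applies, but every node whose predecessors have finished could be launched at once instead of being appended to a single sequential order.</p>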
            <p>A <it>Tavaxy</it> workflow is defined and stored in tSCUFL format, which is similar in flavor to the Taverna SCUFL format. However, there are two main differences between the two formats: 1) A node&#8217;s parameters are represented in tSCUFL by default as attributes of the respective tool, whereas they are considered as input items in SCUFL. 2) The workflow patterns (e.g., conditionals and iteration) are explicitly specified in tSCUFL but implicitly defined in SCUFL.</p>
         </sec>
         <sec>
            <st>
               <p>Integrating Galaxy and Taverna workflows in Tavaxy</p>
            </st>
            <p><it>Tavaxy</it> provides an easy-to-use environment allowing the execution of <it>Tavaxy</it> workflows that integrate Taverna and Galaxy workflows as sub-workflows. Such integration can be achieved at both <it>design-time</it> and <it>run-time</it>:</p>
            <p>For run-time integration, <it>Tavaxy</it> can execute both Galaxy and Taverna (sub-) workflows &#8216;as is&#8217;, with no modification. For Galaxy workflows, this is straightforward, because the <it>Tavaxy</it> engine is compatible with the Galaxy engine and follows the same model of computation. For Taverna workflows, <it>Tavaxy</it> can execute a Taverna (sub-) workflow by invoking the Taverna engine through a command line interface that takes both the Taverna (sub-) workflow file and its data as input. The <it>Tavaxy</it> mapper component assures the correct data transfer between the Taverna engine and other nodes. This is achieved by setting source and destination directories and input/output file names in appropriate manners.</p>
            <p>For design-time integration, <it>Tavaxy</it> imports and manipulates workflows written in either Galaxy or Taverna formats. <it>Tavaxy</it> can import a Galaxy workflow file to its environment, allowing its modification and execution. The engineering work for this step includes translation of the JSON objects of the Galaxy workflow to the tSCUFL format of <it>Tavaxy</it>. For Taverna workflows, the implementation addresses the differences in the model of computation and workflow languages. Specifically, the workflow engine of <it>Tavaxy</it> is a data-flow oriented one, with no <it>explicit</it> specification of control constructs, while the Taverna engine supports both data- and control-flow constructs.</p>
            <p>The Taverna workflow language is SCUFL/t2flow, whereas that of <it>Tavaxy</it> is tSCUFL. To overcome these differences, we use the concept of <it>workflow patterns</it> to 1) simulate the execution of Taverna control and data constructs on the data-driven workflow engine of <it>Tavaxy</it>; and 2) provide a pragmatic solution to language translation, where a Taverna (sub-) workflow is decomposed into a set of patterns that are then re-written in <it>Tavaxy</it> format. The following section introduces the <it>Tavaxy</it> workflow patterns and their implementation.</p>
         </sec>
         <sec>
            <st>
               <p>Workflow patterns: Definitions and implementation</p>
            </st>
            <p>We divide the <it>Tavaxy</it> workflow patterns into two groups: control patterns and data patterns. In the remainder of this subsection, we define these patterns and their implementation on the <it>Tavaxy</it> data-flow engine.</p>
            <sec>
               <st>
                  <p>Control patterns</p>
               </st>
               <p>Control patterns specify execution dependencies between tasks. For most control patterns, data flow is still required and is defined as part of the control pattern specification itself. The following are the key control patterns used in <it>Tavaxy</it>:</p>
               <p indent="1">1. <it>Sequence:</it> In this pattern, task <it>B</it> runs after the termination of task <it>A</it>, as shown in Figure <figr fid="F2">2</figr>(a). The data produced by <it>A</it> is subsequently processed by <it>B</it> and moves over an edge whose start is an output port at <it>A</it> and whose destination is an input port at <it>B</it>. The concept of ports makes it possible to select which pieces of data produced by <it>A</it> are passed to <it>B</it>. Desired execution dependencies involving no data can be achieved on the <it>Tavaxy</it> data flow engine by a special token (dummy output) from <it>A</it> to <it>B</it>. The current engine of <it>Tavaxy</it> does not support streaming, and the tasks are stateless, according to the discussion of Lud&#228;scher et al. <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>.</p>
               <p indent="1">2. <it>Synchronous Merge:</it> A task is invoked only if all its predecessor tasks are executed; Figure <figr fid="F2">2</figr>(b) depicts this pattern with three tasks <it>A</it>, <it>B</it>, and <it>C</it>, where tasks <it>A</it> and <it>B</it> should be completed before <it>C</it>. This pattern also specifies that task <it>C</it> takes two inputs (one from <it>A</it> and another from <it>B</it>) and the data flowing from <it>A</it> and <it>B</it> to <it>C</it> goes to different input ports.</p>
               <p indent="1">3. <it>(Parallel) Synchronous fork:</it> Figure <figr fid="F2">2</figr>(c) depicts this pattern with three tasks <it>A</it>, <it>B</it>, and <it>C</it>. Tasks <it>B</it> and <it>C</it> run after the execution of <it>A</it>. The data output from <it>A</it> flows according to one of two schemes, as specified by the user through settings of ports: 1) One copy of the data output of <it>A</it> is passed to <it>B</it> and another one to <it>C</it>. 2) Different data output items of <it>A</it> are passed to <it>B</it> and <it>C</it>. The tasks <it>B</it> and <it>C</it> can always run in parallel, because their input set is already available and they are independent.</p>
               <p indent="1">4. <it>Multi-choice fork:</it> This pattern includes the use of an <it>if-else</it> construct to execute a task if a condition is satisfied. This condition is defined by the user through an implementation of a &#936; function. Figure <figr fid="F2">2</figr>(d) shows an example, where either <it>B</it> or <it>C</it> is executed, depending on the &#936; function, whose domain may include the input data coming from <it>A</it>. Note that the input data to <it>B</it> and <it>C</it>, which can come from any other node including <it>A</it>, is not specified in the Figure. Because this pattern specifies run-time execution dependencies, it is not directly defined over a data-flow engine. Therefore, we implemented this pattern on the <it>Tavaxy</it> engine by creating a special node containing a program that implements the switch function. The engine executes this node as a usual task. The program for the switch pattern takes the following as input: 1) the multi-choice condition, and 2) the data to be passed to the next tasks. It then checks the condition and passes a success signal to the branch satisfying the condition and a fail signal to the branch violating it. The success and fail signals are special tokens recognized by <it>Tavaxy</it> nodes.</p>
               <p indent="1">5. <it>Iteration:</it> This pattern specifies repetition of a workflow task. In Figure <figr fid="F2">2</figr>(e), the execution of node <it>B</it>, which could be a sub-workflow, is repeated many times. The number of iterations can be either fixed or dependent on the data produced at each step. In each iteration, an output of task <it>B</it> can replace the corresponding input. For example, a parameter file can be passed to <it>B</it> and at each iteration this parameter file is modified and passed again to <it>B</it>. Node <it>C</it>, which represents any node that uses the output of <it>B</it>, is invoked only after the iteration pattern terminates. The iteration pattern is represented by a special node in <it>Tavaxy</it>, and the associated program that implements it takes the following items as input: 1) the task (or sub-workflow) that iterates, 2) its parameters, 3) the termination criteria (defined by a Python script), and 4) information about feedback data. The iteration is implemented as a <it>do-while</it> loop, where the tasks in the body of the loop are encapsulated as a sub-workflow. <it>Tavaxy</it> is invoked recursively to execute this sub-workflow in each iteration. The output of the iteration pattern is specified by the user and is passed to the next task upon termination. The loop iterations are in general stateless, but the user can modify the included sub-workflow to keep state information.</p>
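               <p>The multi-choice fork and iteration patterns listed above can be illustrated with a short Python sketch. It is not taken from the <it>Tavaxy</it> source: the token values, the function names switch_node, run_branch and iterate, and the max_iterations safeguard are assumptions introduced only for this example. The sketch shows a switch node that evaluates a user-supplied condition and emits success and fail tokens, and a do-while loop that feeds the output of its body back as the next input.</p>

```python
SUCCESS = "success"  # hypothetical token values; the actual tokens are not given
FAIL = "fail"


def switch_node(psi, data):
    """Evaluate the user-supplied condition psi on the data.

    Returns a pair of signals (for the if-branch, for the else-branch):
    the branch satisfying the condition gets SUCCESS, the other gets FAIL.
    """
    if psi(data):
        return SUCCESS, FAIL
    return FAIL, SUCCESS


def run_branch(signal, task, data):
    """A downstream node runs its task only when it receives SUCCESS."""
    if signal == SUCCESS:
        return task(data)
    return None  # the branch is skipped


def iterate(body, value, should_continue, max_iterations=1000):
    """Do-while loop: run body at least once and feed its output back.

    `body` stands in for the encapsulated sub-workflow, and
    `should_continue` for the user-defined termination criterion.
    `max_iterations` is a safeguard added for this sketch only.
    """
    for _ in range(max_iterations):
        value = body(value)
        if not should_continue(value):
            return value
    raise RuntimeError("iteration did not terminate")


# Example: choose a branch depending on how many reads arrive from A.
reads = ["r1", "r2", "r3"]
to_b, to_c = switch_node(lambda items: len(items) >= 3, reads)
print(run_branch(to_b, len, reads))   # 3
print(run_branch(to_c, len, reads))   # None

# Example: double a value until it reaches at least 100.
print(iterate(lambda x: x * 2, 1, lambda x: 100 > x))  # 128
```

               <p>Keeping the switch as an ordinary node that only emits tokens means a plain data-flow engine needs no special control logic: each branch simply inspects the token it receives.</p>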
            </sec>
            <fig id="F2"><title><p>Figure 2 </p></title><caption><p>Workflow patterns of Tavaxy.</p></caption><text>
   <p><b>Workflow patterns of Tavaxy.</b> Workflow patterns modeling the execution of workflow tasks. The parts (<b>a</b>), (<b>b</b>), (<b>c</b>), (<b>d</b>), and (<b>e</b>) represent the sequence (pipeline) pattern, the synchronous merge, the synchronous fork, multi-choice fork, and iteration control patterns, respectively. The part (<b>f</b>) shows how a list of data items is processed, and (<b>g</b>) shows dot/cross product operation. The parts (<b>h</b>) and (<b>j</b>) represent the data select and data merge patterns, respectively.</p>
</text><graphic file="1471-2105-13-77-2"/></fig>
            <sec>
               <st>
                  <p>Advanced data patterns and types</p>
               </st>
               <p indent="1">1. <it>(Nested) Lists:</it> In this pattern, the input to a node is a list of <it>n</it> items. The program associated with the node is invoked independently <it>n</it> times on each of the list items. Figure <figr fid="F2">2</figr>(f) shows an example where a list (<it>x</it><sub><it>1</it></sub><it>,&#8230;,x</it><sub><it>n</it></sub>) is passed to <it>A</it>. The output is also a list (<it>A(x</it><sub><it>1</it></sub><it>),&#8230;,A(x</it><sub><it>n</it></sub><it>)</it>). Note that if the list option is not specified in the node, then the respective program is invoked once and the input list is handled as a single object, as in the sequence pattern. For example, a program for primer design would consider a multi-FASTA file as a list and is invoked multiple times, once on each item (sequence), while an alignment program would consider the sequences of the multi-FASTA file as a single object to build a multiple sequence alignment. In <it>Tavaxy</it>, it is possible to process the list items in parallel, without extra programming effort. Furthermore, a list can be a list of lists defined in a recursive manner, so as to support a nested collection of items, according to the notion of <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B32">32</abbr></abbrgrp>. The jobs corresponding to the processing of every list item are stateless, according to the discussion of Lud&#228;scher et al. <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. However, the script implementing the list keeps track of the running jobs, and reports an error message if any job fails.</p>
               <p>Over this list data type, we define a set of operators that can be used by the main program associated with the node.</p>
               <p indent="1">&#183; <it>Dot product:</it> Given two lists <it>A[a</it><sub><it>1</it></sub><it>,&#8230;,a</it><sub><it>n</it></sub><it>]</it> and <it>B[b</it><sub><it>1</it></sub><it>,..,b</it><sub><it>m</it></sub>], <it>n&#8201;&#8804;&#8201;m</it> as input, a dot product operation produces the <it>n</it> tuples [<it>(a</it><sub><it>1</it></sub><it>,b</it><sub><it>1</it></sub><it>),..,(a</it><sub><it>n</it></sub><it>,b</it><sub><it>n</it></sub><it>)</it>] which are processed independently by the respective program, see Figure <figr fid="F2">2</figr>(g). (lists [<it>b</it><sub><it>n+1</it></sub><it>,&#8230;,b</it><sub><it>m</it></sub>] items are ignored.) This operation can be extended to multiple lists.</p>
               <p indent="1">&#183; <it>Cross product:</it> Given two lists <it>A[a</it><sub><it>1</it></sub><it>,&#8230;,a</it><sub><it>n</it></sub><it>]</it> and <it>B</it>[<it>b</it><sub><it>1</it></sub><it>,..,b</it><sub><it>m</it></sub>], <it>n&#8201;&lt;&#8201;m</it> as input, a cross product operation produces the set of (<it>n&#8201;&#215;&#8201;m</it>) tuples <inline-formula><m:math name="1471-2105-13-77-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mo stretchy="true">{</m:mo>
   <m:mo stretchy="true">(</m:mo>
   <m:msub>
      <m:mi>a</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mo>,</m:mo>
   <m:msub>
      <m:mi>b</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo stretchy="true">)</m:mo>
   <m:mo stretchy="true">|</m:mo>
   <m:mi>i</m:mi>
   <m:mo>&#8712;</m:mo>
   <m:mo stretchy="true">[</m:mo>
   <m:mtext>1..</m:mtext>
   <m:mi>n</m:mi>
   <m:mo stretchy="true">]</m:mo>
   <m:mo>,</m:mo>
   <m:mi>j</m:mi>
   <m:mo>&#8712;</m:mo>
   <m:mo stretchy="true">[</m:mo>
   <m:mtext>1..</m:mtext>
   <m:mi>m</m:mi>
   <m:mo stretchy="true">]</m:mo>
   <m:mo stretchy="true">}</m:mo>
</m:mrow>
</m:math></inline-formula>, which are processed independently by the respective program. This option can be used, for example, for comparing two protein sets (each coming from one species) to each other to identify orthologs. If <it>A&#8201;=&#8201;B</it>, then we compare the set of proteins to themselves to identify paralogs.</p>
               <p>The list operations are implemented by a generic tool-wrapper of <it>Tavaxy</it>. As we will explain later in the sub-section describing the architecture of <it>Tavaxy</it>, this wrapper is what is invoked by the workflow engine, and it is the one that invokes the program to be executed. The wrapper pre-processes the input and can make parallel invocations on different list items if <it>Tavaxy</it> is executing on a multiprocessor machine. The data collect pattern (specified below) can then be used to combine the results back in list format.</p>
               <p indent="1">2. <it>Data select:</it> Consider Figure <figr fid="F2">2</figr>(h) with the three tasks <it>A</it>, <it>B</it>, and <it>C</it>. The data select pattern takes as input 1) the output data from <it>A</it> and <it>B</it>, denoted by <inline-formula><m:math name="1471-2105-13-77-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>a</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula> and <inline-formula><m:math name="1471-2105-13-77-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>b</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula>, respectively, and 2) an implementation of a function &#936; that operates on properties of <inline-formula><m:math name="1471-2105-13-77-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>a</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula> or <inline-formula><m:math name="1471-2105-13-77-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>b</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula>. Without loss of generality, the output of this pattern is <inline-formula><m:math name="1471-2105-13-77-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>a</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula>, if <inline-formula><m:math name="1471-2105-13-77-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>&#936;</m:mi>
   <m:mo stretchy="true">(</m:mo>
   <m:mo stretchy="true">(</m:mo>
   <m:mover accent="true">
      <m:mi>a</m:mi>
      <m:mo>&#8594;</m:mo>
   </m:mover>
   <m:mo>,</m:mo>
   <m:mover accent="true">
      <m:mi>b</m:mi>
      <m:mo>&#8594;</m:mo>
   </m:mover>
   <m:mo stretchy="true">)</m:mo>
   <m:mo stretchy="true">)</m:mo>
</m:mrow>
</m:math></inline-formula> is true, otherwise it is <inline-formula><m:math name="1471-2105-13-77-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>b</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula>. The output of the pattern can be passed to another node <it>C</it>. This pattern is implemented in a similar way to the multi-choice pattern, except that it selects a data flow rather than a task to be executed.</p>
               <p indent="1">3. <it>Data collect (Merge):</it> This pattern, which is depicted in Figure <figr fid="F2">2</figr>(j), specifies that the data outputs of <it>A</it> and <it>B</it> are collected (concatenated) together in a list; i.e., the output is <inline-formula><m:math name="1471-2105-13-77-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mo stretchy="true">[</m:mo>
   <m:mover>
      <m:mi>a</m:mi>
      <m:mo>&#8594;</m:mo>
   </m:mover>
   <m:mo>,</m:mo>
   <m:mover>
      <m:mi>b</m:mi>
      <m:mo>&#8594;</m:mo>
   </m:mover>
   <m:mo stretchy="true">]</m:mo>
</m:mrow>
</m:math></inline-formula>. Note that <inline-formula><m:math name="1471-2105-13-77-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>a</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula> or <inline-formula><m:math name="1471-2105-13-77-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mover>
   <m:mi>b</m:mi>
   <m:mo>&#8594;</m:mo>
</m:mover>
</m:math></inline-formula> could be a list of objects as well, which leads to creation of nested collections. This pattern is implemented in a similar way to the data select pattern, where data items are collected.</p>
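               <p>The list operators and the data select and data collect patterns described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, data items are plain Python objects instead of files, and a thread pool stands in for whatever parallel invocation mechanism the generic wrapper actually uses.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def dot_product(a, b):
    """Pair items position-wise; surplus items of the longer list are ignored."""
    return list(zip(a, b))


def cross_product(a, b):
    """All n x m pairs (a_i, b_j)."""
    return list(product(a, b))


def apply_to_list(tool, items, workers=4):
    """Invoke `tool` independently on every item, possibly in parallel.

    The output list follows the order of the input list.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tool, items))


def data_select(a, b, psi):
    """Return a if psi(a, b) is true, otherwise b."""
    return a if psi(a, b) else b


def data_collect(a, b):
    """Collect both outputs in one list; list-valued inputs give nested lists."""
    return [a, b]


pairs = dot_product(["a1", "a2"], ["b1", "b2", "b3"])
print(pairs)                                         # [('a1', 'b1'), ('a2', 'b2')]
print(apply_to_list(lambda t: t[0] + t[1], pairs))   # ['a1b1', 'a2b2']
print(data_collect(["x1", "x2"], "y"))               # [['x1', 'x2'], 'y']
```

               <p>In this sketch, comparing two protein sets against each other corresponds to calling apply_to_list on the result of cross_product, with the comparison program as the tool.</p>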
            </sec>
         </sec>
         <sec>
            <st>
               <p>Tavaxy architecture</p>
            </st>
            <p>Figure <figr fid="F3">3</figr> (left) shows the architecture of <it>Tavaxy</it>, which is composed of four main components: 1) workflow authoring module, 2) workflow pattern database, 3) workflow mapper, and 4) workflow engine. On top of these components, we developed user accounts to maintain users&#8217; workflows and data. We also developed a repository of public workflows that is shared between users. Figure <figr fid="F3">3</figr> (upper right) shows the main <it>Tavaxy</it> page containing links to different system parts and utilities.</p>
            <fig id="F3"><title><p>Figure 3 </p></title><caption><p>Tavaxy architecture and interface.</p></caption><text>
   <p><b>Tavaxy architecture and interface.</b> Left: Tavaxy Architecture. The authoring module (workflow editor) is where users compose, open, and import workflows into Tavaxy. The imported workflows can be in tSCUFL, SCUFL, t2flow, or JSON formats. The mapping module produces tSCUFL files to be executed by the engine. The engine invokes either local tools or remote services. Upper right: The main interface of the Tavaxy system containing links to the authoring module, user&#8217;s workflow, user&#8217;s data, workflow repository, and other utilities and cloud tools. Lower right: The workflow authoring module, where the switch pattern is depicted. The cloud symbol and the parameter port appear on the tool node. On the right-hand panel, the user can choose if a tool runs locally or on the cloud.</p>
</text><graphic file="1471-2105-13-77-3"/></fig>
            <sec>
               <st>
                  <p>Workflow authoring tool and language</p>
               </st>
               <p>The <it>Tavaxy</it> workflow authoring module (workflow editor) is a web-based drag-and-drop editor that builds on the look and feel of Galaxy with two key modifications. First, it supports a user-defined set of workflow patterns that are similar to those used in a traditional flowchart. Second, it allows users to tag which workflow nodes execute on the local infrastructure and which execute using remote resources. For each node, there is a form that can be used to set the node&#8217;s parameters. Furthermore, each node has a specific port that can accept a parameters file, which can be used to overwrite parameter values set through the web-interface. The use of a parameters file allows parameter values to be changed at run time. Figure <figr fid="F3">3</figr> (lower right) shows the <it>Tavaxy</it> authoring module and highlights some of its key features.</p>
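               <p>The precedence of a parameters file over the values set in the node form can be illustrated as follows. The file format assumed here (one name=value pair per line) and the function names are hypothetical; the actual format of the parameters file is not specified in this section.</p>

```python
def parse_parameter_file(text):
    """Parse 'name=value' lines; blank lines and '#' comments are skipped.

    The file format is an assumption made for this sketch.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def effective_parameters(form_values, file_text=None):
    """Values from the parameters file overwrite those set in the form."""
    merged = dict(form_values)
    if file_text is not None:
        merged.update(parse_parameter_file(file_text))
    return merged


form = {"evalue": "10", "db": "nr"}
print(effective_parameters(form, "# tuned at run time\nevalue = 1e-5\n"))
# {'evalue': '1e-5', 'db': 'nr'}
```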
            </sec>
            <sec>
               <st>
                  <p>Workflow mapper</p>
               </st>
               <p>The workflow mapper performs the following set of tasks:</p>
               <p indent="1">&#183; The mapper parses the input tSCUFL file and checks its syntax. It translates the Galaxy JSON format and TavernaSCUFL format to the <it>Tavaxy</it>tSCUFL format. Depending on user choices, it can replace remote Taverna calls with calls to corresponding local tools. The nodes that are still executed remotely by the Taverna engine will be encapsulated as a sub-workflow. Each sub-workflow is then associated with a <it>Tavaxy</it> node that invokes the Taverna engine so as to execute the corresponding sub-workflow. The mapper sets the names of the sub-workflow input and output files in an appropriate manner so that the data correctly flows between the nodes. Additional file <supplr sid="S1">1</supplr> (in the supplementary material) contains the re-writing rules for translating SCUFL to tSCUFL formats, including control constructs and replacement of remote services with local tools.</p>
               <p indent="1">&#183; The mapper optimizes the execution of a workflow by identifying the tasks that will be executed by the Taverna engine and aggregating them into <it>maximal external sub-workflows.</it>. A sub-workflow is called <it>external</it> if it includes only Taverna nodes and it is <it>maximal</it> if no extra external nodes can be added to it. The mapper determines the maximal external sub-workflows using a simple graph-growing algorithm, where we start with a sub-graph composed of a single Taverna node and keep adding external nodes to this sub-graph provided that there are edges connecting the new nodes to the sub-graph and no cycles are introduced. To find the next maximal external sub-workflow, we move to the next non-processed external node. After sub-workflow identification, the mapper encapsulates each maximal external sub-workflow in a new node and adjusts the input and output ports in an appropriate manner. Accordingly, the Taverna engine is invoked only once for each maximal external sub-workflow, which avoids the overhead of multiple Taverna calls. Note that Taverna uses multi-threading to handle execution of independent tasks, including remote invocations. Hence, the use of maximal external sub-workflows with remote calls entails no loss in efficiency.</p>
               <suppl id="S1">
                  <title>
                     <p>Additional file 1</p>
                  </title>
                  <text>
                     <p><b>Re-writing rules for translating SCUFL to tSCUFL.</b> A PDF file describing the re-writing rules for translating a Taverna workflow in SCUFL format into a Tavaxy workflow in tSCUFL format.</p>
                  </text>
                  <file name="1471-2105-13-77-S1.pdf">
   <p>Click here for file</p>
</file>
               </suppl>
            </sec>
            <sec>
               <st>
                  <p>Workflow engine</p>
               </st>
               <p>The <it>Tavaxy</it> engine is based on the data flow model of execution discussed earlier in this section. It is written in Python and reuses some Galaxy functions to save development time. The <it>Tavaxy</it> engine (compared to the Galaxy engine) is standalone and not tightly coupled with the web-interface and database-interface; i.e., it can be invoked programmatically or using a command line interface. Furthermore, it can invoke itself in a recursive manner, which enables the implementation of different patterns and integration of heterogeneous workflows. By building on some of the core features of the Galaxy engine, the <it>Tavaxy</it> engine can be regarded as an extended and engineered version of that of Galaxy. The Taverna engine is invoked like any other program (as a secondary engine) to achieve run-time interoperability with Taverna workflows and to invoke remote services.</p>
               <p>All local tools in <it>Tavaxy</it> are wrapped within a generic wrapper that is invoked by the engine.</p>
               <p>This wrapper is responsible for the following:</p>
               <p indent="1">&#183; The wrapper decides whether the associated tool is executed or not, depending on reception of a special token (dummy data). The special token can correspond either to 1) execution dependency or 2) &#8220;do-not-execute&#8221; or &#8220;fail&#8221; signal from the preceding node, as in the case of the multi-choice pattern. In the former case, the wrapper executes the respective computational tool, while in the latter case, it will not invoke the tool and further passes the token to the output ports.</p>
               <p indent="1">&#183; It handles the list patterns by determining the list items, executing list operations, and invoking the respective program (in parallel) on list items.</p>
               <p indent="1">&#183; It uses cloud computing APIs to execute tasks on cloud computing platforms. The use of cloud computing is discussed below in more detail.</p>
            </sec>
            <sec>
               <st>
                  <p>Workflow pattern database</p>
               </st>
               <p>The workflow pattern database stores the definition and implementation of the workflow patterns used in <it>Tavaxy</it>. It also stores how the nodes associated with these patterns are rendered on the workflow authoring client. This pattern database is extensible by the user, who can define new patterns according to the rules of the <it>Tavaxy</it> system.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Use of cloud computing</p>
            </st>
            <p>As briefly mentioned in the introduction, we provide three modes for using cloud computing: 1) whole system instantiation, 2) sub-workflow instantiation, and 3) tool (service) instantiation. To further simplify the use of the first mode, we installed an instance of <it>Tavaxy</it> (including the whole web-interface and tools) on an Amazon AWS virtual machine and deposited a public image of it at the Amazon web-site. A user who has an Amazon account can directly start the image and use it. Based on Amazon APIs, this image can establish a computer cluster upon its activation. The user can specifically define the type of nodes (e.g., large or extra large) and their number. The Amazon S3 storage is used as a shared storage for the computer cluster. We developed several interface functions that manage data transfer among the compute nodes and the shared storage of the cluster at run time. Figure <figr fid="F4">4</figr> (left) shows a screenshot of the <it>Tavaxy</it> interface page, where the user can configure the cluster and storage.</p>
            <fig id="F4"><title><p>Figure 4 </p></title><caption><p>Use of cloud computing in Tavaxy.</p></caption><text>
   <p><b>Use of cloud computing in Tavaxy.</b> Left: The web interface for setting the computer cluster on the cloud. Right: The architecture of Tavaxy showing the local and cloud versions of the system. The data flows from the local version to either the mounted disk attached to the main machine or to the persistent S3 storage. The S3 storage serves two purposes: 1) persistent storage and 2) shared storage for the computer cluster.</p>
</text><graphic file="1471-2105-13-77-4"/></fig>
            <p>In the second mode, the user already has a <it>Tavaxy</it> version installed on his local machines (called <it>local Tavaxy</it>) and delegates the execution of one or more sub-workflows to the cloud. To support this scenario, a lightweight version of <it>Tavaxy</it> has been deposited at the Amazon platform as a virtual machine image. From a simple user interface in the local <it>Tavaxy</it>, the user can start and configure a cloud cluster using the prepared <it>Tavaxy</it> image.</p>
            <p>At run-time, the local version of <it>Tavaxy</it> communicates with the cloud counterpart, using a simple asynchronous protocol (similar to the REST protocol), to send the sub-workflow, execute it, and retrieve the results. The input and output data related to such a sub-workflow flow according to one of two scenarios:</p>
            <p indent="1">1. The input data is sent to the mounted disk of the main cloud machine along with the workflow to be executed. After processing, the output is sent back to the local <it>Tavaxy</it>. After termination of the machine, the input and result data are lost, unless they are moved by the user to a persistent storage. This scenario is useful in case no computer cluster is needed.</p>
            <p indent="1">2. The input data is sent to a shared volume in the persistent S3 storage (this can be done offline), where the compute nodes of the computer cluster can easily access it. Because reads and writes to S3 require the use of Amazon APIs, we developed special scripts to facilitate this access between the local <it>Tavaxy</it> and S3 on one side and between the compute nodes and S3 on the other side. After execution of the sub-workflow, a copy of the output is maintained on the S3 and another copy is sent to the local <it>Tavaxy</it> to complete the execution of the main workflow.</p>
            <p>The third mode is a special case of the second mode, where the user can delegate the execution of only a single task to the cloud. For this mode, we also use a simple protocol to manage the data transfer and remote execution of the task on the cloud. Figure <figr fid="F4">4</figr>(right) shows the architecture of the cloud version of <it>Tavaxy</it> and the data flows among its components.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Accessing Tavaxy</p>
            </st>
            <p>There are different ways to access and use the <it>Tavaxy</it> system from its main home page:</p>
            <p indent="1">1. Downloadable version: The whole <it>Tavaxy</it> system, with all features described in this manuscript, can be downloaded for local use. The bioinformatics packages are provided in a separate compressed folder, because we assume that some users already have installed the packages of interest on their local systems and just need the <it>Tavaxy</it> system. The packages currently include about 160 open source tools, coming from EMBOSS <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>, SAMtools <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>, fastx <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>, NCBI BLAST Toolkit <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr><abbr bid="B40">40</abbr></abbrgrp>, and other individual sequence analysis programs. Addition of extra tools is explained in the <it>Tavaxy</it> manual.</p>
            <p indent="1">2. Web-based access: We offer a traditional web-based interface to a <it>Tavaxy</it> instance for running small and moderate size jobs. For large scale jobs, we recommend the use of the cloud version.</p>
            <p indent="1">3. Cloud-computing based access: In this mode, each user creates a <it>Tavaxy</it> instance with the hardware configuration of choice on the AWS cloud. The interesting feature in this model is that multiple users may have multiple <it>Tavaxy</it> systems, each with a different configuration (number and type of &#8216;virtual&#8217; machines). The <it>Tavaxy</it> instances on the cloud already include the 160 tools currently tested. They also include a number of databases to be used with the cloud machines, such as the NCBI (nucleotide and protein) and Swiss-Prot databases.</p>
         </sec>
         <sec>
            <st>
               <p>Pre-imported workflows</p>
            </st>
            <p>At the time of preparing this manuscript (June 2011), the Taverna repository myExperiment contained 557 workflows in SCUFL (Taverna1) format and 554 workflows in t2flow format (Taverna2). By manual inspection, we found that 296 workflows (96 in SCUFL format and 200 in t2flow format) are related to the sequence analysis domain, which is the main focus of this version of <it>Tavaxy</it>. To help the community, we have already imported all these workflows into the <it>Tavaxy</it> environment, and arranged them in a special web-accessible repository for public use. We also provide the user with optimized versions of the sequence analysis workflows, where many of the web-services are replaced with local invocations of the corresponding local tools distributed with <it>Tavaxy</it>. We also imported all public Galaxy workflows from the Galaxy Public Pages and added them to this repository. The workflows imported from both the Taverna and Galaxy repositories are included in the <it>Tavaxy</it> system, and will be kept up-to-date on its web-site. These workflows can serve as &#8220;design patterns&#8221; that can be used to speed up the workflow development cycle when developing more complex workflows.</p>
         </sec>
         <sec>
            <st>
               <p>Experiments overview</p>
            </st>
            <p>In the following sub-sections, we introduce two case studies that demonstrate the key features of <it>Tavaxy</it>. In the first case study, we demonstrate 1) how Taverna, Galaxy, and <it>Tavaxy</it> sub-workflows can be integrated in a single <it>Tavaxy</it> workflow, highlighting both the integration capabilities and use of workflow patterns; and 2) the optimization steps included before the execution of imported workflows and their effects on the performance of the system. In the second case study, we demonstrate 1) the use of <it>Tavaxy</it> for a metagenomics workflow based on NGS data; 2) the advantages of using advanced data patterns in facilitating the workflow design and supporting parallel execution; 3) the speed-up achieved by using local HPC infrastructure; and finally 4) the efficient and cost-saving use of cloud computing.</p>
         </sec>
         <sec>
            <st>
               <p>Case study I: Composing heterogeneous sub-workflows on Tavaxy</p>
            </st>
            <p>Figure <figr fid="F5">5</figr> shows a workflow for finding homologous protein sequences and analyzing them. The workflow starts with reading a DNA/protein sequence from the user. If the input is a DNA sequence, it is translated to a protein sequence. The input sequence is passed to BLAST <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B39">39</abbr></abbrgrp> to find similar sequences. The output of BLAST, which is a list of Genbank IDs, is then compared to a user-provided list of sequence IDs to exclude common sequences from the output list. The protein sequences of the exclusive IDs are then retrieved and passed to the programs ClustalW <abbrgrp><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp> and MUSCLE <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> for computing multiple alignment. ClustalW is also used for computing a phylogenetic tree.</p>
            <fig id="F5"><title><p>Figure 5 </p></title><caption><p>Protein analysis workflow.</p></caption><text>
   <p><b>Protein analysis workflow.</b> Workflow for finding and analyzing homologous protein sequences. The highlighted parts are extra sub-workflows from Galaxy and Tavaxy, and the remaining parts correspond to a Taverna workflow already deposited at the myExperiment web-site.</p>
</text><graphic file="1471-2105-13-77-5"/></fig>
            <p>The myExperiment repository already contains a Taverna implementation of most of the desired workflow, deposited under the name &#8220;workflow_for_protein_sequence_analysis&#8221; <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>, and Figure <figr fid="F6">6</figr> shows its implementation as it appears in the Taverna authoring module. The functionality missing from this Taverna workflow corresponds to the two parts highlighted in Figure <figr fid="F5">5</figr>: the part for translating the DNA sequence into a protein sequence and the one for MUSCLE-consensus. In the original Taverna implementation, the software tools BLAST, ClustalW, and phylogeny plotting are invoked through web-service interfaces. The other intermediate steps are executed by built-in Taverna programs.</p>
            <fig id="F6"><title><p>Figure 6 </p></title><caption><p>Taverna implementation of the protein analysis workflow.</p></caption><text>
   <p><b>Taverna implementation of the protein analysis workflow.</b> Taverna implementation of the workflow in Figure <figr fid="F5">5</figr>. All program parameters (e.g., BLAST tool to be used and UPGMA NJ option) are considered as input to the workflow. High resolution versions of the figures of this paper are available in Additional File <supplr sid="S2">2</supplr>.</p>
</text><graphic file="1471-2105-13-77-6"/></fig>
            <suppl id="S2">
               <title>
                  <p>Additional file 2 </p>
               </title>
               <text>
                  <p><b>Paper figures in original size.</b> Compressed folder containing the paper figures in original size for better visualization.</p>
               </text>
               <file name="1471-2105-13-77-S2.zip">
   <p>Click here for file</p>
</file>
            </suppl>
            <p>We downloaded the Taverna workflow and imported it into <it>Tavaxy</it>; Figure <figr fid="F7">7</figr> shows the same workflow in the <it>Tavaxy</it> environment. At this step, the user may choose to execute this workflow as it is from <it>Tavaxy</it>, or may choose to optimize the execution of the workflow and/or customize it by adding further tasks. For example, for this workflow, the user can replace web-services with equivalent locally installed tools through a simple user interface. The workflow mapper carries out this replacement and can, according to user choices, coalesce the remaining Taverna tasks into maximal sub-workflows, as described earlier in the <it>Tavaxy</it> implementation section. In this example, we decided that the ClustalW and the phylogeny analysis parts of the workflow run on the local infrastructure, while the BLAST part still runs remotely. Figure <figr fid="F8">8</figr> shows the optimized version of this workflow, where the maximal Taverna sub-workflows are computed. The functionality of this workflow can be augmented with further tasks. First, we re-used a native Galaxy (sub-) workflow that computes a multiple alignment using the MUSCLE program and computes the consensus sequence. Second, we added a <it>Tavaxy</it> sub-workflow, in which the DNA sequences are translated into protein sequences instead of being left unprocessed. To link the translated sequences to the other parts of the workflow for further analysis, the <it>data merge</it> pattern is used to pass the protein sequences. These extra parts are highlighted in Figures <figr fid="F5">5</figr> and <figr fid="F8">8</figr>.</p>
            <fig id="F7"><title><p>Figure 7 </p></title><caption><p>Imported Taverna workflow in Tavaxy.</p></caption><text>
   <p><b>Imported Taverna workflow in Tavaxy.</b> The imported Taverna workflow in Figure <figr fid="F6">6</figr>. The Tavaxy switch pattern is explicitly represented. The switch patterns are represented by diamond shapes. The upper switch pattern checks if the input sequence is DNA. If false, the lower switch pattern checks if it is a protein one. The dashed polygons mark two maximal external sub-workflows which will be encapsulated in the optimization step, as in Figure <figr fid="F8">8</figr>.</p>
</text><graphic file="1471-2105-13-77-7"/></fig>
            <fig id="F8"><title><p>Figure 8 </p></title><caption><p>Hybrid and optimized workflow in Tavaxy.</p></caption><text>
   <p><b>Hybrid and optimized workflow in Tavaxy.</b> The workflow in Figure <figr fid="F7">7</figr> after optimization and augmentation with extra components. Sub-workflows 1 and 2 are the maximal external sub-workflows marked in Figure <figr fid="F7">7</figr> by dashed polygons. The extra Galaxy workflow and Tavaxy nodes are also shown.</p>
</text><graphic file="1471-2105-13-77-8"/></fig>
            <sec>
               <st>
                  <p>Measuring the performance</p>
               </st>
               <p>We conducted an experiment to evaluate the overhead associated with invoking the Taverna engine to execute remote tasks, before and after the optimization step. We used the original Taverna workflow and its imported version (i.e., we did not use the extra Galaxy and Tavaxy sub-workflows shown in Figure <figr fid="F8">8</figr>), with the list of input protein IDs (for checking duplicates) being empty. We measured the running time of this workflow in three different execution scenarios. In the first scenario, the original Taverna workflow was executed on Taverna, where the tasks are executed remotely. In the second scenario, the workflow was executed after replacing the remote tools with equivalent local ones (except for BLAST). In the third scenario, the workflow was executed after conducting the optimization step to reduce the number of invocations of the local Taverna engine.</p>
               <p>For this experiment, we used the example protein sequence distributed with the Taverna workflow on myExperiment. We also used another set of proteins used by Kerk et al. <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> to update the Protein Phosphatase database with novel members. The basic idea of their work is to use a set of representative human proteins from different phosphatase classes to identify homologs from different genomes. It is worth mentioning that the workflow at hand automates most of the manual steps conducted in the study of Kerk et al. <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>. Hence, it can be used to systematically and automatically revisit the protein phosphatase repertoire.</p>
               <p>Table <tblr tid="T1">1</tblr> shows the average running times for the different execution scenarios specified above. The experiments were conducted on an 8-core machine (AMD Opteron 1.2&#8201;GHz processors) with 64&#8201;GB RAM. The workflow is not compute-intensive, as it handles one protein sequence at a time and the amount of data transferred over the web is modest. Therefore, it does not take much time to execute on Taverna. Running the workflow from <it>Tavaxy</it> with local tools but without optimization led to a higher execution time, due to the overhead of invoking the Taverna engine at each step. After optimization into maximal external sub-workflows, this time decreased and the overhead was minimized. We note that the running time on <it>Tavaxy</it> for the last five proteins is longer than that on Taverna. The reason is that these proteins are shorter than the others, which means shorter running times. Hence, the overhead of calling Taverna outweighs the gain from saving data transfer and using local tools. It is important to note that this overhead is proportional to the complexity of the workflow and not to the data size. This means that it would be negligible for time-consuming experiments.</p>
               <table id="T1">
                  <title>
                     <p>Table 1</p>
                  </title>
                  <caption>
                     <p>
                        <b>The average running times for the protein workflow</b>
                     </p>
                  </caption>
                  <tgroup align="left" cols="4">
                     <colspec align="left" colname="c1" colnum="1" colwidth="1*"/>
                     <colspec align="left" colname="c2" colnum="2" colwidth="1*"/>
                     <colspec align="left" colname="c3" colnum="3" colwidth="1*"/>
                     <colspec align="left" colname="c4" colnum="4" colwidth="1*"/>
                     <thead valign="top">
                        <row rowsep="1">
                           <entry align="left" colname="c1">
                              <p>
                                 <b>Protein Name</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>
                                 <b>Taverna</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>
                                 <b>Tavaxy-local</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>
                                 <b>Tavaxy-optimized</b>
                              </p>
                           </entry>
                        </row>
                     </thead>
                     <tfoot>
                        <p>The average running times in minutes for different protein sequences and for different execution scenarios of the protein homology workflow. The last protein, Example seq., is the example protein distributed with the Taverna workflow. The other proteins are from the study of Kerk et al. <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>.</p>
                     </tfoot>
                     <tbody valign="top">
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>NP_061857.3 (gi|239047414)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>4:10</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>8:05</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:04</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>NP_203747.2 (gi|37674210)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>4:17</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>8:45</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:16</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>NP_060327.2 (gi|24586675)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>4:36</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>7:36</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:21</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Q9UNH5.1 (gi|55976620)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>4:31</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>7:24</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:13</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>O60729.1 (gi|55976216)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>4:35</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>7:18</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:03</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>P30304.2 (gi|50403734)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>2:50</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>8:30</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:01</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>P30305.2 (gi|21264471)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>2:50</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>8:30</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:01</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>NP_001781.2 (gi|125625350)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>1:48</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>7:22</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:20</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>NP_054907.1 (gi|7661832)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>2:08</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>7:36</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:54</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Example seq.</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>1:21</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>7:17</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:44</p>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
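               <p>The comparison across scenarios can be reproduced directly from Table 1. The following minimal sketch converts the m:ss entries to seconds, averages each column, and lists the proteins for which the optimized <it>Tavaxy</it> run is longer than the Taverna run. The function and variable names are illustrative and not part of <it>Tavaxy</it>.</p>

```python
def to_seconds(mmss):
    """Convert an 'm:ss' string, as used in Table 1, to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (protein, Taverna, Tavaxy-local, Tavaxy-optimized), copied from Table 1.
TABLE_1 = [
    ("NP_061857.3", "4:10", "8:05", "3:04"),
    ("NP_203747.2", "4:17", "8:45", "3:16"),
    ("NP_060327.2", "4:36", "7:36", "3:21"),
    ("Q9UNH5.1", "4:31", "7:24", "3:13"),
    ("O60729.1", "4:35", "7:18", "3:03"),
    ("P30304.2", "2:50", "8:30", "3:01"),
    ("P30305.2", "2:50", "8:30", "3:01"),
    ("NP_001781.2", "1:48", "7:22", "3:20"),
    ("NP_054907.1", "2:08", "7:36", "2:54"),
    ("Example seq.", "1:21", "7:17", "2:44"),
]


def column_means(rows):
    """Mean running time in seconds for each of the three scenarios."""
    columns = zip(*[[to_seconds(t) for t in row[1:]] for row in rows])
    return [sum(col) / len(col) for col in columns]


def slower_than_taverna(rows):
    """Proteins whose optimized Tavaxy run takes longer than the Taverna run."""
    return [name for name, taverna, _local, optimized in rows
            if to_seconds(optimized) > to_seconds(taverna)]


print(column_means(TABLE_1))
print(slower_than_taverna(TABLE_1))
```

               <p>With the values of Table 1, the column means are 198.6, 470.3, and 185.7 seconds for Taverna, Tavaxy-local, and Tavaxy-optimized, respectively, and the second list contains the last five proteins of the table.</p>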
               <p>Despite the differences in the design of the Taverna, Galaxy, and <it>Tavaxy</it> engines, we performed an extra experiment to compare their performance. We used the sub-workflow in this case study, including the BLAST and ClustalW calls, as a test workflow. This sub-workflow is highlighted in Figure <figr fid="F5">5</figr> and denoted as &#8216;core workflow&#8217;. For Taverna, we used the local installation of the programs and wrote special shell scripts to run them on the local infrastructure. (This is not a usual use case for Taverna, and it is not a straightforward task for the non-programming scientist.) The results of this experiment, which are shown in Table <tblr tid="T2">2</tblr>, indicate that the performance of the three systems is very similar. We note a small overhead when using Galaxy and <it>Tavaxy</it>, because the engines of both systems are designed for multiple users, while the Taverna engine is desktop-based and serves a single user. We also note that the <it>Tavaxy</it> engine, as expected, is slightly slower than that of Galaxy. This can be attributed to the overhead associated with the extra wrapper module developed for handling the patterns and cloud functionalities. Note that these overheads are proportional to the workflow size, and would be negligible for large datasets.</p>
               <table id="T2">
                  <title>
                     <p>Table 2</p>
                  </title>
                  <caption>
                     <p>
                        <b>The average running times for the protein homology sub-workflow on the Taverna, Galaxy, and Tavaxy systems</b>
                     </p>
                  </caption>
                  <tgroup align="left" cols="5">
                     <colspec align="left" colname="c1" colnum="1" colwidth="1*"/>
                     <colspec align="left" colname="c2" colnum="2" colwidth="1*"/>
                     <colspec align="left" colname="c3" colnum="3" colwidth="1*"/>
                     <colspec align="left" colname="c4" colnum="4" colwidth="1*"/>
                     <colspec align="left" colname="c5" colnum="5" colwidth="1*"/>
                     <thead valign="top">
                        <row rowsep="1">
                           <entry align="left" colname="c1">
                              <p>
                                 <b>Database</b>
                              </p>
                           </entry>
                           <entry align="left" colname="c2">
                              <p>
                                 <b>Sequence</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>
                                 <b>Taverna (local)</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>
                                 <b>Galaxy</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>
                                 <b>Tavaxy</b>
                              </p>
                           </entry>
                        </row>
                     </thead>
                     <tfoot>
                        <p>The average running times (in minutes) of the workflow involving BLAST and ClustalW for the protein sequences in Table <tblr tid="T1">1</tblr>. The whole workflow runs on local infrastructure. The queries are performed against the swissprot and NCBI refseq databases.</p>
                     </tfoot>
                     <tbody valign="top">
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>swissprot</p>
                           </entry>
                           <entry colname="c2">
                              <p>NP_061857.3</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:32</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:38</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:37</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_203747.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:31</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:32</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:34</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_060327.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:33</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:40</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:41</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>Q9UNH5.1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:21</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:23</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:23</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>O60729.1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:17</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:23</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:23</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>P30304.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:20</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:15</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:25</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>P30305.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:20</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:25</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:28</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_001781.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:19</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:25</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:26</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_054907.1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:18</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:22</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:26</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>Example Seq.</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>0:15</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>0:17</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>0:19</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>refseq</p>
                           </entry>
                           <entry colname="c2">
                              <p>NP_061857.3</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>2:56</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:20</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>3:15</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_203747.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>2:58</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>3:17</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>3:28</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_060327.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>2:12</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:00</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>2:02</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>Q9UNH5.1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>1:59</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:01</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>2:05</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>O60729.1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>1:50</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>1:39</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>1:42</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>P30304.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>2:51</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:53</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>2:56</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>P30305.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>2:12</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:21</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>2:22</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_001781.2</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>1:50</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>1:57</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>1:59</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>NP_054907.1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>1:53</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>2:01</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>2:04</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry colname="c2">
                              <p>Example Seq.</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>1:46</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>1:43</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>1:47</p>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Case study 2: A metagenomics workflow</p>
            </st>
            <p>Figure <figr fid="F9">9</figr> (left) shows a flow chart representation of a metagenomics workflow deposited on the Galaxy public pages <abbrgrp><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>. The input to this workflow is a set of NGS reads and associated quality data. The workflow starts with a quality check of the reads and computation of their lengths. The high quality reads are queried, using the MegaBLAST tool <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>, against two different databases chosen by the user. The reads with good alignment coverage are retained for further analysis. Finally, the taxonomical information of the successful reads is extracted from the alignment file and a taxonomy tree is plotted.</p>
            <fig id="F9"><title><p>Figure 9 </p></title><caption><p>Metagenomics workflow.</p></caption><text>
   <p><b>Metagenomics workflow.</b> Left: Metagenomics workflow as originally provided by Galaxy. Right: the re-designed version of this workflow using the list pattern of Tavaxy.</p>
</text><graphic file="1471-2105-13-77-9"/></fig>
            <p>In the original implementation of this workflow on Galaxy <abbrgrp><abbr bid="B46">46</abbr><abbr bid="B47">47</abbr></abbrgrp>, as depicted in the schematic representation of Figure <figr fid="F9">9</figr>, we can identify two issues. First, the input reads are passed to MegaBLAST as a single multi-FASTA file, which implies sequential processing of the queries against the database. Second, there are two nodes for MegaBLAST: one for the NCBI_WGS database and the other for the NCBI_NT database. To query more databases in Galaxy, additional nodes must be added manually; this yields a bulky workflow for a large number of databases. In <it>Tavaxy</it>, we can enhance the design and execution of this workflow with respect to these two issues.</p>
            <p>For the first issue, we use the <it>Tavaxy list pattern</it> in association with MegaBLAST so that the input multi-FASTA file is handled as a list of items. This immediately leads to parallelization of this step. A list item can be a single FASTA sequence or a block of multiple FASTA sequences. We recommend that the input reads be divided into a list of <it>n</it> blocks, each of size <it>k</it> sequences. The parameter <it>k</it> is set by the user and it should be proportional to the number of processors available. (The list is defined by a special node and its items (blocks) are separated by a special user-defined symbol.) When the workflow with the list pattern is executed, multiple instances of MegaBLAST are invoked to handle these blocks in parallel.</p>
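            <p>The block construction described above can be sketched as follows. This is a minimal illustration that assumes the reads arrive as multi-FASTA text and returns the blocks as a Python list rather than joining them with a user-defined separator symbol; the function name is hypothetical and is not the implementation of the list node in <it>Tavaxy</it>.</p>

```python
def split_fasta_into_blocks(fasta_text, k):
    """Split multi-FASTA text into blocks of at most k records each."""
    if not (isinstance(k, int) and k >= 1):
        raise ValueError("k must be a positive integer")

    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record; store the previous one.
            if current:
                records.append("\n".join(current))
            current = [line]
        elif line.strip() and current:
            # Sequence line belonging to the record being read.
            current.append(line.strip())
    if current:
        records.append("\n".join(current))

    # Consecutive slices of k records; the last block may be shorter.
    return ["\n".join(records[i:i + k]) for i in range(0, len(records), k)]


reads = ">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n>r5\nATAT\n"
blocks = split_fasta_into_blocks(reads, 2)
print(len(blocks))
```

            <p>For five reads and <it>k</it> = 2, the sketch yields three blocks, the last of which holds a single read. Each block can then be handed to a separate MegaBLAST invocation.</p>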
            <p>For the second issue, concerning the simple integration of more databases, we use only one MegaBLAST node and create a list of input databases. This list is passed as input to the MegaBLAST node. To ensure that each read is queried against all given databases, we use the <it>cross product</it> operation defined over the list of databases and the list of input sequences. For <it>m</it> databases and <it>n</it> blocks, we have <it>n&#8201;&#215;&#8201;m</it> invocations of MegaBLAST, which can be handled in parallel without extra coding effort.</p>
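            <p>The cross product over blocks and databases can be sketched in Python as follows. This is a minimal illustration under stated assumptions: run_megablast_stub is a hypothetical placeholder that only returns a label (it does not call MegaBLAST), and the thread pool merely stands in for whatever scheduler dispatches the jobs. The block count and database names are those used in the performance experiment of this paper.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_megablast_stub(job):
    """Hypothetical placeholder for one MegaBLAST invocation."""
    block_id, database = job
    return "block{}_vs_{}".format(block_id, database)


blocks = list(range(11))                          # n = 11 blocks of reads
databases = ["NCBI_HTGS", "NCBI_NT", "NCBI_ENV"]  # m = 3 databases
jobs = list(product(blocks, databases))           # n x m independent jobs

# The jobs do not depend on each other, so they can be dispatched in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_megablast_stub, jobs))

print(len(jobs))  # 33
```

            <p>Adding a further database only extends the databases list; the workflow itself keeps a single MegaBLAST node.</p>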
            <p>Figure <figr fid="F9">9</figr> (right) shows a schematic representation of the enhanced workflow with the list pattern. Figure <figr fid="F10">10</figr> shows the implementation of the enhanced workflow in <it>Tavaxy</it>. In this figure, the special node &#8220;split_into_list&#8221; defines the list items from the multi-FASTA file.</p>
            <fig id="F10"><title><p>Figure 10 </p></title><caption><p>The enhanced metagenomics workflow.</p></caption><text>
   <p><b>The enhanced metagenomics workflow.</b> The enhanced metagenomics workflow as implemented in Tavaxy.</p>
</text><graphic file="1471-2105-13-77-10"/></fig>
            <sec>
               <st>
                  <p>Measuring the performance</p>
               </st>
               <p>We tested the performance of the enhanced metagenomics workflow on a computer cluster using two datasets. The first was the dataset used by Huson et al. <abbrgrp><abbr bid="B48">48</abbr></abbrgrp>, constituting a metagenomic survey of the Sargasso Sea <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>. This dataset, which represents four environmental samples, is composed of 20,000 Sanger reads, where 10,000 come from Sample 1 and another 10,000 come from Samples 2&#8211;4. The second dataset is the windshield dataset of <abbrgrp><abbr bid="B46">46</abbr></abbrgrp>, which is composed of two collections of 454 FLX reads. These reads came from the DNA of the organic matter on the windshield of a moving vehicle that visited two geographic locations (trips A and B). We used the reads from the left part of the windshield for both trips. The numbers of reads are 70,343 (&#8776; 15.7 Mbp) and 89,783 (18 Mbp) for trips A and B, respectively. For MegaBLAST, we used the NCBI_HTGS, NCBI_NT, and NCBI_ENV databases.</p>
               <p>Table <tblr tid="T3">3</tblr> shows the average running times on a computer cluster for different numbers of cores. (The cluster is composed of three machines, each with 8 cores (AMD Opteron 1.2&#8201;GHz processors) and 64&#8201;GB RAM, connected with a 1&#8201;Gb Ethernet switch.) In this experiment, the list pattern divided the input data into 11 blocks, each with size&#8201;&#8776;&#8201;1000 sequences in the case of the Sargasso data and&#8201;&#8776;&#8201;7000 sequences in the case of the windshield data. (For a cross product with 3 databases, we have 33 jobs in total.) The table shows that the running times decrease as the number of cores increases.</p>
               <table id="T3">
                  <title>
                     <p>Table 3</p>
                  </title>
                  <caption>
                     <p>
                        <b>The average running times of the metagenomics workflow on local infrastructure</b>
                     </p>
                  </caption>
                  <tgroup align="left" cols="7">
                     <colspec align="left" colname="c1" colnum="1" colwidth="1*"/>
                     <colspec align="left" colname="c2" colnum="2" colwidth="1*"/>
                     <colspec align="left" colname="c3" colnum="3" colwidth="1*"/>
                     <colspec align="left" colname="c4" colnum="4" colwidth="1*"/>
                     <colspec align="left" colname="c5" colnum="5" colwidth="1*"/>
                     <colspec align="left" colname="c6" colnum="6" colwidth="1*"/>
                     <colspec align="left" colname="c7" colnum="7" colwidth="1*"/>
                     <thead valign="top">
                        <row rowsep="1">
                           <entry align="left" colname="c1">
                              <p>
                                 <b>Dataset</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c2" nameend="c7" namest="c2">
                              <p>
                                 <b>Cores</b>
                              </p>
                           </entry>
                        </row>
                     </thead>
                     <tfoot>
                        <p>The average running times in minutes for varying numbers of processors (cluster cores) on the local infrastructure.</p>
                     </tfoot>
                     <tbody valign="top">
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry align="center" colname="c2">
                              <p>1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>2</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>4</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>8</p>
                           </entry>
                           <entry align="center" colname="c6">
                              <p>16</p>
                           </entry>
                           <entry align="center" colname="c7">
                              <p>32</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Windshield Trip A (left)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>163</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>86</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>48</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>31</p>
                           </entry>
                           <entry align="center" colname="c6">
                              <p>21</p>
                           </entry>
                           <entry align="center" colname="c7">
                              <p>15</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Windshield Trip B (left)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>204</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>98</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>55</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>35</p>
                           </entry>
                           <entry align="center" colname="c6">
                              <p>21</p>
                           </entry>
                           <entry align="center" colname="c7">
                              <p>16</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Sargasso Sea (Sample 1)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>109</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>59</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>35</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>21</p>
                           </entry>
                           <entry align="center" colname="c6">
                              <p>13</p>
                           </entry>
                           <entry align="center" colname="c7">
                              <p>10</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Sargasso Sea (Samples 2&#8211;4)</p>
                           </entry>
                           <entry align="center" colname="c2">
                              <p>113</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>67</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>39</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>23</p>
                           </entry>
                           <entry align="center" colname="c6">
                              <p>14</p>
                           </entry>
                           <entry align="center" colname="c7">
                              <p>10</p>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
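               <p>To make the scaling trend in Table 3 concrete, the following Python sketch derives the speedup and parallel efficiency from the reported average running times. The variable and function names are illustrative, and the derived figures are simple ratios of the table entries, not additional measurements.</p>

```python
CORES = [1, 2, 4, 8, 16, 32]

# Average running times in minutes, copied from Table 3.
RUNTIMES = {
    "Windshield Trip A (left)": [163, 86, 48, 31, 21, 15],
    "Windshield Trip B (left)": [204, 98, 55, 35, 21, 16],
    "Sargasso Sea (Sample 1)": [109, 59, 35, 21, 13, 10],
    "Sargasso Sea (Samples 2-4)": [113, 67, 39, 23, 14, 10],
}


def speedups(times):
    """Speedup relative to the single-core running time."""
    return [times[0] / t for t in times]


def efficiencies(times, cores):
    """Speedup divided by the number of cores used."""
    return [s / c for s, c in zip(speedups(times), cores)]


for name, times in RUNTIMES.items():
    s = speedups(times)[-1]
    e = efficiencies(times, CORES)[-1]
    print("{}: speedup {:.1f}, efficiency {:.2f} on 32 cores".format(name, s, e))
```

               <p>For Windshield Trip A, for example, the speedup on 32 cores is 163/15, which is roughly 10.9.</p>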
            </sec>
            <sec>
               <st>
                  <p>Use of cloud computing</p>
               </st>
               <p>We used the cloud computing features of <it>Tavaxy</it> at the sub-workflow level to execute the metagenomics workflow. The purpose is to test the use of cloud computing in terms of execution time and cost of computation. Here, we focused on the sub-workflow mode of using cloud computing and chose to run the sub-workflow involving MegaBLAST with the list pattern on the cloud, because it is the most compute-intensive part of this workflow. From the <it>Tavaxy</it> interface, we established a computer cluster on the AWS cloud. Each node includes a copy of the databases needed by MegaBLAST. The shared S3 cloud storage is attached to the cluster to maintain the output and intermediate results. For this experiment, we used Amazon instances of type &#8220;Extra Large&#8221;, with 8 cores (&#8776;1.2&#8201;GHz Xeon Processor), 15&#8201;GB RAM, and 1,690&#8201;GB storage. Establishing the cluster with the storage from the machine images took a few minutes.</p>
               <p>Table <tblr tid="T4">4</tblr> shows the execution times of the workflow for the same datasets mentioned before using different cluster sizes on the cloud. It also includes the monetary cost of running this workflow for each cluster size. Interestingly, using more machines led to both shorter running times and reduced cost. In our case, four machines (32 cores in total) working in parallel ran for less than one hour and cost $2.7 in total. This is cheaper and faster than using a single machine, which ran for about 6 hours and cost $4.1.</p>
               <table id="T4">
                  <title>
                     <p>Table 4</p>
                  </title>
                  <caption>
                     <p>
                        <b>The average running times of the metagenomics workflow on the AWS cloud</b>
                     </p>
                  </caption>
                  <tgroup align="left" cols="6">
                     <colspec align="left" colname="c1" colnum="1" colwidth="1*"/>
                     <colspec align="left" colname="c2" colnum="2" colwidth="1*"/>
                     <colspec align="left" colname="c3" colnum="3" colwidth="1*"/>
                     <colspec align="left" colname="c4" colnum="4" colwidth="1*"/>
                     <colspec align="left" colname="c5" colnum="5" colwidth="1*"/>
                     <colspec align="left" colname="c6" colnum="6" colwidth="1*"/>
                     <thead valign="top">
                        <row rowsep="1">
                           <entry align="left" colname="c1">
                              <p>
                                 <b>Dataset</b>
                              </p>
                           </entry>
                           <entry align="center" colname="c2" nameend="c6" namest="c2">
                              <p>
                                 <b>Cores</b>
                              </p>
                           </entry>
                        </row>
                     </thead>
                     <tfoot>
                        <p>The average running times in minutes for a computer cluster on the cloud. The number in parentheses is the computation cost in US Dollars for the US-East site at $0.68 per hour (2011 AWS price list). (Note that a partial computing hour of an instance is billed on Amazon as a full hour.)</p>
                     </tfoot>
                     <tbody valign="top">
                        <row rowsep="1">
                           <entry colname="c1"/>
                           <entry align="center" colname="c2">
                              <p>1</p>
                           </entry>
                           <entry align="center" colname="c3">
                              <p>4</p>
                           </entry>
                           <entry align="center" colname="c4">
                              <p>8</p>
                           </entry>
                           <entry align="center" colname="c5">
                              <p>16</p>
                           </entry>
                           <entry align="center" colname="c6">
                              <p>32</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Windshield Trip A (left)</p>
                           </entry>
                           <entry colname="c2">
                              <p>330 ($4.1)</p>
                           </entry>
                           <entry colname="c3">
                              <p>82 ($4.1)</p>
                           </entry>
                           <entry colname="c4">
                              <p>34 ($4.1)</p>
                           </entry>
                           <entry colname="c5">
                              <p>18 ($1.4)</p>
                           </entry>
                           <entry colname="c6">
                              <p>14 ($2.7)</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Windshield Trip B (left)</p>
                           </entry>
                           <entry colname="c2">
                              <p>371 ($4.8)</p>
                           </entry>
                           <entry colname="c3">
                              <p>91 ($4.8)</p>
                           </entry>
                           <entry colname="c4">
                              <p>40 ($4.1)</p>
                           </entry>
                           <entry colname="c5">
                              <p>21 ($1.4)</p>
                           </entry>
                           <entry colname="c6">
                              <p>15 ($2.7)</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Sargasso Sea (Sample 1)</p>
                           </entry>
                           <entry colname="c2">
                              <p>252 ($3.4)</p>
                           </entry>
                           <entry colname="c3">
                              <p>60 ($3.4)</p>
                           </entry>
                           <entry colname="c4">
                              <p>22 ($3.4)</p>
                           </entry>
                           <entry colname="c5">
                              <p>15 ($1.4)</p>
                           </entry>
                           <entry colname="c6">
                              <p>9 ($2.7)</p>
                           </entry>
                        </row>
                        <row rowsep="1">
                           <entry colname="c1">
                              <p>Sargasso Sea (Samples 2&#8211;4)</p>
                           </entry>
                           <entry colname="c2">
                              <p>299 ($3.4)</p>
                           </entry>
                           <entry colname="c3">
                              <p>73 ($3.4)</p>
                           </entry>
                           <entry colname="c4">
                              <p>26 ($3.4)</p>
                           </entry>
                           <entry colname="c5">
                              <p>16 ($1.4)</p>
                           </entry>
                           <entry colname="c6">
                              <p>10 ($2.7)</p>
                           </entry>
                        </row>
                     </tbody>
                  </tgroup>
               </table>
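               <p>The billing rule stated in the footnote of Table 4 can be sketched in Python as follows. This is an assumed cost model, not the authors' accounting: it takes the $0.68 hourly price from the footnote, assumes 8 cores per instance as described in the text, and rounds every started hour up. Under these assumptions it reproduces the 1-, 16-, and 32-core entries of the table; the 4- and 8-core entries are not reproduced by this simple rule. The function name cluster_cost is hypothetical.</p>

```python
import math

PRICE_PER_INSTANCE_HOUR = 0.68  # USD, from the footnote of Table 4
CORES_PER_INSTANCE = 8          # assumed: one "Extra Large" instance has 8 cores


def cluster_cost(minutes, cores):
    """Cost in USD when every started instance hour is billed in full."""
    instances = max(1, math.ceil(cores / CORES_PER_INSTANCE))
    billed_hours = math.ceil(minutes / 60)
    return instances * billed_hours * PRICE_PER_INSTANCE_HOUR


# Windshield Trip A (left), running times taken from Table 4.
print(round(cluster_cost(330, 1), 1))   # 4.1
print(round(cluster_cost(18, 16), 1))   # 1.4
print(round(cluster_cost(14, 32), 1))   # 2.7
```

               <p>Because a run that finishes within one hour is billed for exactly one hour per instance, shortening the wall-clock time by adding machines can lower the total bill, as long as the number of billed instance hours shrinks faster than the number of instances grows.</p>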
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>In this paper we introduced <it>Tavaxy</it>, a stand-alone pattern-based workflow system that can also integrate the use of Taverna and Galaxy workflows in a single environment, enabling their modification and execution. The <it>Tavaxy</it> integration approach is based on the use of hierarchical workflows and workflow patterns. <it>Tavaxy</it> also supports the use of local high-performance computing and the use of cloud computing. The focus of the current version of <it>Tavaxy</it> is on simplifying the development of sequence analysis applications, and we demonstrated its features and advantages using two sequence analysis case studies. Future versions of the system will support further applications in transcriptomics and proteomics.</p>
         <p>We also introduced a set of advanced data patterns that simplify the composition of a variety of sequence analysis tasks and simplify the use of parallel computing resources for executing them. In future work, we will extend the available patterns to support more complex sequence analysis tasks, as well as other application domains. <it>Tavaxy</it> is currently shipped with its own repository of pre-imported Taverna and Galaxy workflows to facilitate their immediate use. This repository can be regarded as a set of &#8220;design patterns&#8221; that can help in speeding up composition of more complex workflows.</p>
         <p>In the current version of <it>Tavaxy</it>, we have set up the system for use on a traditional computer cluster on the AWS cloud. We have not yet investigated other HPC options, such as the Amazon Elastic MapReduce or the use of GPUs.</p>
         <p>In future versions of <it>Tavaxy</it> we will investigate the use of these options to support efficient execution at the sub-workflow and task levels. We will also investigate the use of other cloud computing platforms.</p>
         <p>Finally, we believe that one of the key advantages of <it>Tavaxy</it> is that it provides a solution that consolidates the use of remote web-services, cloud computing, and local computing infrastructures. In our model, the use of remote web-services is limited to only those shared tools that cannot be made locally available, the use of a local infrastructure supports the execution of affordable tasks, and the use of cloud computing provides a scalable solution to compute- and data-intensive tasks.</p>
      </sec>
      <sec>
         <st>
            <p>
               <b>Availability and requirements</b>
            </p>
         </st>
         <p><b>Project name:</b> <it>Tavaxy</it>.</p>
         <p><b>Project home page:</b> http://www.tavaxy.org.</p>
         <p><b>Operating system(s):</b> Linux.</p>
         <p><b>Programming language:</b> Python, C, JavaScript, JSF.</p>
         <p><b>Other requirements:</b> Compatible with the browsers Firefox, Chrome, Safari, and Opera. See the manual for more details.</p>
         <p><b>License:</b> Free for academics. Authorization license needed for commercial usage (please contact the corresponding author for more details).</p>
         <p><b>Any restrictions to use by non-academics:</b> No restrictions.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The authors declare no conflict of interest.</p>
      </sec>
      <sec>
         <st>
            <p>Authors&#8217; contributions</p>
         </st>
         <p>MA led the <it>Tavaxy</it> project. MA and MG contributed to theoretical developments of the architecture and workflow patterns which form the basis of <it>Tavaxy</it>. SA and MA developed and tested the software and implemented the workflows. All authors wrote and approved the manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgment</p>
            </st>
            <p>We thank the Galaxy team for making the code of their system available under an open source license, which helped us re-use a number of Galaxy components within our system. We also thank the Taverna team for making their system available under an open source license and for their valuable feedback on the manuscript. We thank Peter Tonellato and Dennis Wall from Harvard Medical School, as well as the Amazon team, for providing us with AWS compute hours. We thank Mohamed Elkalioby from Nile University for his support in establishing the cloud computing infrastructure. We thank Sondos Seif from Nile University for helping us with software engineering tasks.</p>
         </sec>
      </ack>
      <refgrp><bibl id="B1"><title><p>Challenges of sequencing human genomes</p></title><aug><au><snm>Koboldt</snm><fnm>D</fnm></au><au><snm>Ding</snm><fnm>L</fnm></au><au><snm>Mardis</snm><fnm>E</fnm></au><au><snm>Wilson</snm><fnm>R</fnm></au></aug><source>Briefings in Bioinformatics</source><pubdate>2010</pubdate><volume>11</volume><issue>5</issue><fpage>484</fpage><lpage>498</lpage><xrefbib><pubid idtype="doi">10.1093/bib/bbq016</pubid></xrefbib></bibl><bibl id="B2"><title><p>Next-generation sequencing: from basic research to diagnostics</p></title><aug><au><snm>Voelkerding</snm><fnm>K</fnm></au><au><snm>Dames</snm><fnm>S</fnm></au><au><snm>Durtschi</snm><fnm>J</fnm></au></aug><source>Clin Chem</source><pubdate>2009</pubdate><volume>55</volume><issue>4</issue><fpage>641</fpage><lpage>658</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1373/clinchem.2008.112789</pubid><pubid idtype="pmpid" link="fulltext">19246620</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>GAMES identifies and annotates mutations in next-generation sequencing projects</p></title><aug><au><snm>Sana</snm><fnm>M</fnm></au><au><snm>Iascone</snm><fnm>M</fnm></au><au><snm>Marchetti</snm><fnm>D</fnm></au><au><snm>Palatini</snm><fnm>J</fnm></au><au><snm>Galasso</snm><fnm>M</fnm></au><au><snm>Volinia</snm><fnm>S</fnm></au></aug><source>Bioinformatics</source><pubdate>2010</pubdate><volume>27</volume><fpage>9</fpage><lpage>13</lpage></bibl><bibl id="B4"><title><p>A primer on metagenomics</p></title><aug><au><snm>Wooley</snm><fnm>J</fnm></au><au><snm>Godzik</snm><fnm>A</fnm></au><au><snm>Friedberg</snm><fnm>I</fnm></au></aug><source>PLoS Comput Biol.</source><pubdate>2010</pubdate><volume>146</volume><issue>2</issue><fpage>e1000667</fpage></bibl><bibl id="B5"><title><p>Recent progress and new challenges in metagenomics for biotechnology</p></title><aug><au><snm>Chistoserdova</snm><fnm>L</fnm></au></aug><source>Biotechnological 
Letters</source><pubdate>2010</pubdate><volume>32</volume><fpage>1351</fpage><lpage>1359</lpage><xrefbib><pubid idtype="doi">10.1007/s10529-010-0306-9</pubid></xrefbib></bibl><bibl id="B6"><title><p>P H: A Bioinformatician&#8217;s guide to metagenomics</p></title><aug><au><snm>Kunin</snm><fnm>V</fnm></au><au><snm>Copeland</snm><fnm>A</fnm></au><au><snm>Lapidus</snm><fnm>A</fnm></au><au><snm>Mavromatis</snm><fnm>K</fnm></au></aug><source>Microbiol. Mol. Biology Reviews</source><pubdate>2008</pubdate><volume>72</volume><issue>4</issue><fpage>557</fpage><lpage>578</lpage><xrefbib><pubid idtype="doi">10.1128/MMBR.00009-08</pubid></xrefbib></bibl><bibl id="B7"><title><p>Microbial metagenomics: beyond the genome</p></title><aug><au><snm>Gilbert</snm><fnm>J</fnm></au><au><snm>Dupont</snm><fnm>C</fnm></au></aug><source>Annual Review of Marine Science</source><pubdate>2010</pubdate><volume>3</volume><fpage>347</fpage><lpage>371</lpage></bibl><bibl id="B8"><title><p>Taverna: a tool for the composition and enactment of bioinformatics workflows</p></title><aug><au><snm>Oinn</snm><fnm>T</fnm></au><au><snm>Addis</snm><fnm>M</fnm></au><au><snm>Ferris</snm><fnm>J</fnm></au><au><snm>Marvin</snm><fnm>D</fnm></au><etal/></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><issue>17</issue><fpage>3045</fpage><lpage>3054</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bth361</pubid><pubid idtype="pmpid" link="fulltext">15201187</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>Taverna: a tool for building and running workflows of services</p></title><aug><au><snm>Hull</snm><fnm>D</fnm></au><au><snm>Wolstencroft</snm><fnm>K</fnm></au><au><snm>Stevens</snm><fnm>R</fnm></au><au><snm>Goble</snm><fnm>C</fnm></au><etal/></aug><source>Nucleic Acids Res</source><pubdate>2006</pubdate><volume>34</volume><fpage>W729</fpage><lpage>W732</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkl320</pubid><pubid 
idtype="pmcid">1538887</pubid><pubid idtype="pmpid" link="fulltext">16845108</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>D H, et al: Scientific workflow management and the Kepler system</p></title><aug><au><snm>Lud&#228;scher</snm><fnm>B</fnm></au><au><snm>Altintas</snm><fnm>I</fnm></au><au><snm>Berkley</snm><fnm>C</fnm></au></aug><source>Concurrency and Computation: Practice and Experience</source><pubdate>2006</pubdate><volume>18</volume><issue>10</issue><fpage>1039</fpage><lpage>1065</lpage><xrefbib><pubid idtype="doi">10.1002/cpe.994</pubid></xrefbib></bibl><bibl id="B11"><title><p>Visual Grid workflow in Triana</p></title><aug><au><snm>Taylor</snm><fnm>I</fnm></au><au><snm>Shields</snm><fnm>M</fnm></au><au><snm>Wang</snm><fnm>I</fnm></au><au><snm>Harrison</snm><fnm>A</fnm></au></aug><source>J. Grid Computing</source><pubdate>2005</pubdate><volume>3</volume><issue>3&#8211;4</issue><fpage>153</fpage><lpage>169</lpage></bibl><bibl id="B12"><title><p>The Triana workflow environment: Architecture and Applications</p></title><aug><au><snm>Taylor</snm><fnm>I</fnm></au><au><snm>Shields</snm><fnm>M</fnm></au><au><snm>Wang</snm><fnm>I</fnm></au><au><snm>Harrison</snm><fnm>A</fnm></au></aug><source>Workflows for e-Science</source><publisher>Springer</publisher><pubdate>2007</pubdate><fpage>320</fpage><lpage>339</lpage></bibl><bibl id="B13"><title><p>Galaxy: a platform for interactive large-scale genome analysis</p></title><aug><au><snm>Giardine</snm><fnm>B</fnm></au><au><snm>Riemer</snm><fnm>C</fnm></au><au><snm>Hardison</snm><fnm>R</fnm></au><etal/></aug><source>Genome Res</source><pubdate>2005</pubdate><volume>15</volume><issue>10</issue><fpage>1451</fpage><lpage>1455</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.4086505</pubid><pubid idtype="pmcid">1240089</pubid><pubid idtype="pmpid" link="fulltext">16169926</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Conveyor: a workflow engine for bioinformatics 
analyses</p></title><aug><au><snm>Linke</snm><fnm>B</fnm></au><au><snm>Giegerich</snm><fnm>R</fnm></au><au><snm>Goesmann</snm><fnm>A</fnm></au></aug><source>Bioinformatics</source><pubdate>2011</pubdate><volume>27</volume><issue>7</issue><fpage>903</fpage><lpage>911</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btr040</pubid><pubid idtype="pmpid" link="fulltext">21278189</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Pegasus: A framework for mapping complex scientific workflows onto distributed systems</p></title><aug><au><snm>Deelman</snm><fnm>E</fnm></au><au><snm>Singh</snm><fnm>G</fnm></au><au><snm>Su</snm><fnm>MH</fnm></au><au><snm>Blythe</snm><fnm>J</fnm></au><au><snm>Gil</snm><fnm>Y</fnm></au><au><snm>Kesselman</snm><fnm>C</fnm></au><au><snm>Mehta</snm><fnm>G</fnm></au><au><snm>Vahi</snm><fnm>K</fnm></au><au><snm>Berriman</snm><fnm>GB</fnm></au><au><snm>Good</snm><fnm>J</fnm></au><au><snm>Laity</snm><fnm>A</fnm></au><au><snm>Jacob</snm><fnm>JC</fnm></au><au><snm>Katz</snm><fnm>D</fnm></au></aug><source>Sci Program</source><pubdate>2005</pubdate><volume>3</volume><fpage>219</fpage><lpage>237</lpage></bibl><bibl id="B16"><title><p>Pegasys: software for executing and integrating analyses of biological sequences</p></title><aug><au><snm>Shah</snm><fnm>S</fnm></au><au><snm>He</snm><fnm>D</fnm></au><au><snm>Sawkins</snm><fnm>J</fnm></au><au><snm>Druce</snm><fnm>J</fnm></au><au><snm>Quon</snm><fnm>G</fnm></au><au><snm>Lett</snm><fnm>D</fnm></au><au><snm>Zheng</snm><fnm>G</fnm></au><au><snm>Xu</snm><fnm>T</fnm></au><au><snm>Ouellette</snm><fnm>B</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2004</pubdate><volume>5</volume><fpage>40</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-5-40</pubid><pubid idtype="pmcid">406494</pubid><pubid idtype="pmpid" link="fulltext">15096276</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>GenePattern 
2.0</p></title><aug><au><snm>Reich</snm><fnm>M</fnm></au><au><snm>Liefeld</snm><fnm>T</fnm></au><au><snm>Gould</snm><fnm>J</fnm></au><au><snm>Lerner</snm><fnm>J</fnm></au><au><snm>Tamayo</snm><fnm>P</fnm></au><au><snm>Mesirov</snm><fnm>J</fnm></au></aug><source>Nat Genet</source><pubdate>2006</pubdate><volume>38</volume><fpage>500</fpage><lpage>501</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng0506-500</pubid><pubid idtype="pmpid" link="fulltext">16642009</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>Using GenePattern for gene expression analysis</p></title><aug><au><snm>Kuehn</snm><fnm>H</fnm></au><au><snm>Liberzon</snm><fnm>A</fnm></au><au><snm>Reich</snm><fnm>M</fnm></au><au><snm>Mesirov</snm><fnm>JP</fnm></au></aug><source>Current Protocols Bioinformatics</source><pubdate>2008</pubdate><volume>Chapter 7</volume><issue>Unit 7</issue><fpage>12</fpage></bibl><bibl id="B19"><title><p>The discovery net system for high throughput bioinformatics</p></title><aug><au><snm>Rowe</snm><fnm>A</fnm></au><au><snm>Kalaitzopoulos</snm><fnm>D</fnm></au><au><snm>Osmond</snm><fnm>M</fnm></au><au><snm>Ghanem</snm><fnm>M</fnm></au><au><snm>Guo</snm><fnm>Y</fnm></au></aug><source>Bioinformatics</source><pubdate>2003</pubdate><volume>19</volume><issue>90001</issue><fpage>225i</fpage><lpage>231i</lpage></bibl><bibl id="B20"><aug><au><snm>Ghanem</snm><fnm>M</fnm></au><au><snm>Curcin</snm><fnm>V</fnm></au><au><snm>Wendel</snm><fnm>P</fnm></au><au><snm>Guo</snm><fnm>Y</fnm></au></aug><source>Building and using analytical workflows in discovery net</source><publisher>In Data mining on the Grid, John Wiley and Sons</publisher><pubdate>2008</pubdate></bibl><bibl id="B21"><title><p>The OMII software distribution</p></title><aug><au><snm>Bradley</snm><fnm>J</fnm></au><au><snm>Brown</snm><fnm>C</fnm></au><au><snm>Carpenter</snm><fnm>B</fnm></au><etal/></aug><source>In All Hands Meeting, Humana 
Press</source><pubdate>2006</pubdate><fpage>748</fpage><lpage>753</lpage></bibl><bibl id="B22"><aug><au><snm>Curcin</snm><fnm>V</fnm></au><au><snm>Ghanem</snm><fnm>M</fnm></au></aug><source>Scientific workflow systems - can one size fit all?</source><publisher>In Proceedings of CIBEC, IEEE</publisher><pubdate>2008</pubdate></bibl><bibl id="B23"><title><p>myExperiment: a repository and social network for the sharing of bioinformatics workflows</p></title><aug><au><snm>Goble</snm><fnm>C</fnm></au><au><snm>Bhagat</snm><fnm>J</fnm></au><au><snm>Aleksejevs</snm><fnm>S</fnm></au><au><snm>Cruickshank</snm><fnm>D</fnm></au><au><snm>Michaelides</snm><fnm>D</fnm></au><au><snm>Newman</snm><fnm>D</fnm></au><au><snm>Borkum</snm><fnm>M</fnm></au><au><snm>Bechhofer</snm><fnm>S</fnm></au><au><snm>Roos</snm><fnm>M</fnm></au><au><snm>Li</snm><fnm>P</fnm></au><au><snm>De Roure</snm><fnm>D</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2010</pubdate><volume>38</volume><issue>suppl 2</issue><fpage>W677</fpage><lpage>W682</lpage><xrefbib><pubidlist><pubid idtype="pmcid">2896080</pubid><pubid idtype="pmpid" link="fulltext">20501605</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><source>myExperiment</source><note>
   <url>http://www.myexperiment.org</url>
</note></bibl><bibl id="B25"><title><p>Meta-workflows: pattern-based interoperability between Galaxy and Taverna</p></title><aug><au><snm>Abouelhoda</snm><fnm>M</fnm></au><au><snm>Alaa</snm><fnm>S</fnm></au><au><snm>Ghanem</snm><fnm>M</fnm></au></aug><source>Wands&#8217;10: Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science</source><publisher>ACM, New York, NY, USA</publisher><pubdate>2010</pubdate><fpage>1</fpage><lpage>8</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">22889432</pubid></xrefbib></bibl><bibl id="B26"><aug><au><snm>Karasavvas</snm><fnm>K</fnm></au></aug><source>eGalaxy</source><note>
   <url>https://trac.nbic.nl/elabfactory/wiki/eGalaxy</url>
</note></bibl><bibl id="B27"><title><p>Workflow Management Coalition Workflow Standard - Interoperability abstract specification. Document Number WFMC-TC-1012. Version 1</p></title><aug><au><cnm>WfMC</cnm></au></aug><note>Tech. rep., <url>https://www.wfmc.org</url></note></bibl><bibl id="B28"><title><p>Three fundamental dimensions of scientific workflow interoperability: Model of computation, language, and execution environment</p></title><aug><au><snm>Elmroth</snm><fnm>E</fnm></au><au><snm>Hernandez</snm><fnm>F</fnm></au><au><snm>Tordsson</snm><fnm>J</fnm></au></aug><source>Futur Gener Comput Syst</source><pubdate>2010</pubdate><volume>26</volume><issue>2</issue><fpage>245</fpage><lpage>256</lpage><xrefbib><pubid idtype="doi">10.1016/j.future.2009.08.011</pubid></xrefbib></bibl><bibl id="B29"><title><p>Workflow patterns</p></title><aug><au><snm>van der Aalst</snm><fnm>W</fnm></au><au><snm>ter Hofstede</snm><fnm>A</fnm></au><au><snm>Kiepuszewski</snm><fnm>B</fnm></au><au><snm>Barros</snm><fnm>A</fnm></au></aug><source>Distributed and Parallel Databases</source><pubdate>2003</pubdate><volume>14</volume><issue>3</issue><fpage>5</fpage><lpage>51</lpage></bibl><bibl id="B30"><aug><au><snm>Shields</snm><fnm>M</fnm></au></aug><source>Control-versus data-driven workflows, In Workflows for e-Science</source><publisher>Springer</publisher><pubdate>2007</pubdate><fpage>167</fpage><lpage>173</lpage></bibl><bibl id="B31"><title><p>Scientific workflows: business as usual?</p></title><aug><au><snm>Lud&#228;scher</snm><fnm>B</fnm></au><au><snm>Weske</snm><fnm>M</fnm></au><au><snm>McPhillips</snm><fnm>T</fnm></au><au><snm>Bowers</snm><fnm>S</fnm></au></aug><source>In Proceedings of the 7th International Conference on Business Process Management</source><publisher>BPM&#8217;09 Springer-Verlag</publisher><pubdate>2009</pubdate><fpage>31</fpage><lpage>47</lpage></bibl><bibl id="B32"><title><p>Collection-oriented scientific workflows for integrating and analyzing biological 
data</p></title><aug><au><snm>McPhillips</snm><fnm>T</fnm></au><au><snm>Bowers</snm><fnm>S</fnm></au><au><snm>Lud&#228;scher</snm><fnm>B</fnm></au></aug><source>Data Integration in the Life Sciences (DILS)</source><pubdate>2006</pubdate><volume>4075</volume><fpage>248</fpage><lpage>263</lpage><xrefbib><pubid idtype="doi">10.1007/11799511_23</pubid></xrefbib></bibl><bibl id="B33"><title><p>Coroutines and networks of parallel processes</p></title><aug><au><snm>Kahn</snm><fnm>G</fnm></au><au><snm>MacQueen</snm><fnm>D</fnm></au></aug><source>Information Processing 77</source><publisher>North Holland Publishing Company</publisher><pubdate>1977</pubdate><fpage>993</fpage><lpage>998</lpage></bibl><bibl id="B34"><title><p>Static scheduling of synchronous data flow programs for digital signal processing</p></title><aug><au><snm>Lee</snm><fnm>EA</fnm></au><au><snm>Messerschmitt</snm><fnm>DG</fnm></au></aug><source>IEEE Trans Comput</source><pubdate>1987</pubdate><volume>36</volume><fpage>24</fpage><lpage>35</lpage></bibl><bibl id="B35"><title><p>EMBOSS: the European Molecular Biology Open Software Suite</p></title><aug><au><snm>Rice</snm><fnm>P</fnm></au><au><snm>Longden</snm><fnm>I</fnm></au><au><snm>Bleasby</snm><fnm>A</fnm></au></aug><source>Trends in Genetics</source><pubdate>2000</pubdate><volume>16</volume><issue>6</issue><fpage>276</fpage><lpage>277</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0168-9525(00)02024-2</pubid><pubid idtype="pmpid" link="fulltext">10827456</pubid></pubidlist></xrefbib></bibl><bibl id="B36"><title><p>The sequence alignment/map format and 
SAMtools</p></title><aug><au><snm>Li</snm><fnm>H</fnm></au><au><snm>Handsaker</snm><fnm>B</fnm></au><au><snm>Wysoker</snm><fnm>A</fnm></au><au><snm>Fennell</snm><fnm>T</fnm></au><au><snm>Ruan</snm><fnm>J</fnm></au><au><snm>Homer</snm><fnm>N</fnm></au><au><snm>Marth</snm><fnm>G</fnm></au><au><snm>Abecasis</snm><fnm>G</fnm></au><au><snm>Durbin</snm><fnm>R</fnm></au></aug><source>Bioinformatics</source><pubdate>2009</pubdate><volume>25</volume><issue>16</issue><fpage>2078</fpage><lpage>2079</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btp352</pubid><pubid idtype="pmcid">2723002</pubid><pubid idtype="pmpid" link="fulltext">19505943</pubid></pubidlist></xrefbib></bibl><bibl id="B37"><title><p>FASTX-Toolkit</p></title><note>
   <url>http://hannonlab.cshl.edu/fastx_toolkit</url>
</note></bibl><bibl id="B38"><title><p>A Basic Local Alignment Search Tool</p></title><aug><au><snm>Altschul</snm><fnm>SF</fnm></au><au><snm>Gish</snm><fnm>W</fnm></au><au><snm>Miller</snm><fnm>W</fnm></au><au><snm>Myers</snm><fnm>EW</fnm></au><au><snm>Lipman</snm><fnm>DJ</fnm></au></aug><source>J. Molecular Biology</source><pubdate>1990</pubdate><volume>215</volume><fpage>403</fpage><lpage>410</lpage></bibl><bibl id="B39"><title><p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</p></title><aug><au><snm>Altschul</snm><fnm>S</fnm></au><au><snm>Madden</snm><fnm>TL</fnm></au><au><snm>Sch&#228;ffer</snm><fnm>AA</fnm></au><etal/></aug><source>Nucleic Acids Res</source><pubdate>1997</pubdate><volume>25</volume><issue>17</issue><fpage>3389</fpage><lpage>3402</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/25.17.3389</pubid><pubid idtype="pmcid">146917</pubid><pubid idtype="pmpid" link="fulltext">9254694</pubid></pubidlist></xrefbib></bibl><bibl id="B40"><title><p>A greedy algorithm for aligning DNA sequences</p></title><aug><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Schwartz</snm><fnm>S</fnm></au><au><snm>Wagner</snm><fnm>L</fnm></au><au><snm>Miller</snm><fnm>W</fnm></au></aug><source>J Comput Biol</source><pubdate>2000</pubdate><volume>7</volume><issue>1&#8211;2</issue><fpage>203</fpage><lpage>214</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">10890397</pubid></xrefbib></bibl><bibl id="B41"><title><p>CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice</p></title><aug><au><snm>Thompson</snm><fnm>J</fnm></au><au><snm>Higgins</snm><fnm>D</fnm></au><au><snm>Gibson</snm><fnm>T</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>1994</pubdate><volume>22</volume><fpage>4673</fpage><lpage>4680</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/22.22.4673</pubid><pubid 
idtype="pmcid">308517</pubid><pubid idtype="pmpid" link="fulltext">7984417</pubid></pubidlist></xrefbib></bibl><bibl id="B42"><title><p>T-Coffee: a novel method for fast and accurate multiple sequence alignment</p></title><aug><au><snm>Notredame</snm><fnm>C</fnm></au><au><snm>Higgins</snm><fnm>D</fnm></au><au><snm>Heringa</snm><fnm>J</fnm></au></aug><source>J. Molecular Biology</source><pubdate>2000</pubdate><volume>302</volume><fpage>205</fpage><lpage>217</lpage><xrefbib><pubid idtype="doi">10.1006/jmbi.2000.4042</pubid></xrefbib></bibl><bibl id="B43"><title><p>Muscle: multiple sequence alignment with high accuracy and high throughput</p></title><aug><au><snm>Edgar</snm><fnm>R</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2004</pubdate><volume>32</volume><fpage>1792</fpage><lpage>1797</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkh340</pubid><pubid idtype="pmcid">390337</pubid><pubid idtype="pmpid" link="fulltext">15034147</pubid></pubidlist></xrefbib></bibl><bibl id="B44"><title><p>Workflow for protein sequence analysis</p></title><aug><au><cnm>Monteiro M</cnm></au></aug><note>
   <url>http://www.myexperiment.org/workflows/124.html</url>
</note></bibl><bibl id="B45"><title><p>Evolutionary radiation pattern of novel protein phosphatases revealed by analysis of protein data from the completely sequenced genomes of humans, green algae, and higher plants</p></title><aug><au><snm>Kerk</snm><fnm>D</fnm></au><au><snm>Templeton</snm><fnm>G</fnm></au><au><snm>Moorhead</snm><fnm>G</fnm></au></aug><source>Plant Physiol</source><pubdate>2008</pubdate><volume>146</volume><issue>2</issue><fpage>351</fpage><lpage>367</lpage><xrefbib><pubidlist><pubid idtype="pmcid">2245839</pubid><pubid idtype="pmpid" link="fulltext">18156295</pubid></pubidlist></xrefbib></bibl><bibl id="B46"><title><p>Windshield splatter analysis with the Galaxy metagenomic pipeline</p></title><aug><au><snm>Kosakovsky Pond</snm><fnm>S</fnm></au><au><snm>Wadhawan</snm><fnm>S</fnm></au><au><snm>Chiaromonte</snm><fnm>F</fnm></au><au><snm>Ananda</snm><fnm>G</fnm></au><au><snm>Chung</snm><fnm>W</fnm></au><au><snm>Taylor</snm><fnm>J</fnm></au><au><snm>Nekrutenko</snm><fnm>A</fnm></au><au><cnm>The Galaxy Team</cnm></au></aug><source>Genome Res</source><pubdate>2009</pubdate><volume>19</volume><issue>11</issue><fpage>2144</fpage><lpage>2153</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.094508.109</pubid><pubid idtype="pmcid">2775585</pubid><pubid idtype="pmpid" link="fulltext">19819906</pubid></pubidlist></xrefbib></bibl><bibl id="B47"><title><p>Galaxy Published Page: windshield splatter</p></title><note>
   <url>http://main.g2.bx.psu.edu/u/aun1/p/windshield-splatter</url>
</note></bibl><bibl id="B48"><title><p>MEGAN: Analysis of metagenomic data</p></title><aug><au><snm>Huson</snm><fnm>DH</fnm></au><au><snm>Auch</snm><fnm>AF</fnm></au><au><snm>Qi</snm><fnm>J</fnm></au><au><snm>Schuster</snm><fnm>S</fnm></au></aug><source>Genome Res</source><pubdate>2007</pubdate><volume>17</volume><fpage>377</fpage><lpage>386</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.5969107</pubid><pubid idtype="pmcid">1800929</pubid><pubid idtype="pmpid" link="fulltext">17255551</pubid></pubidlist></xrefbib></bibl><bibl id="B49"><title><p>Environmental genome shotgun sequencing of the Sargasso Sea</p></title><aug><au><snm>Venter</snm><fnm>J</fnm></au><au><snm>Remington</snm><fnm>K</fnm></au><au><snm>Heidelberg</snm><fnm>J</fnm></au><au><snm>Halpern</snm><fnm>A</fnm></au><au><snm>Rusch</snm><fnm>D</fnm></au><au><snm>Eisen</snm><fnm>J</fnm></au><au><snm>Wu</snm><fnm>D</fnm></au><au><snm>Paulsen</snm><fnm>I</fnm></au><au><snm>Nelson</snm><fnm>K</fnm></au><au><snm>Nelson</snm><fnm>W</fnm></au><etal/></aug><source>Science</source><pubdate>2004</pubdate><volume>304</volume><fpage>66</fpage><lpage>74</lpage></bibl></refgrp>
   </bm>
</art>