<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2164-13-S7-S28</ui><ji>1471-2164</ji><fm>
<dochead>Proceedings</dochead>
<bibl>
<title>
<p>A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework</p>
</title>
<aug>
<au ce="yes" id="A1"><snm>Chang</snm><fnm>Yu-Jung</fnm><insr iid="I1"/><email>yjchang@iis.sinica.edu.tw</email></au>
<au ce="yes" id="A2"><snm>Chen</snm><fnm>Chien-Chih</fnm><insr iid="I1"/><insr iid="I2"/><email>rocky@iis.sinica.edu.tw</email></au>
<au id="A3"><snm>Chen</snm><fnm>Chuen-Liang</fnm><insr iid="I2"/><email>clchen@csie.ntu.edu.tw</email></au>
<au ca="yes" id="A4"><snm>Ho</snm><fnm>Jan-Ming</fnm><insr iid="I1"/><email>hoho@iis.sinica.edu.tw</email></au>
</aug>
<insg>
<ins id="I1"><p>Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC</p></ins>
<ins id="I2"><p>Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC</p></ins>
</insg>
<source>BMC Genomics</source>


<supplement><title><p>Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology</p></title><editor>Shoba Ranganathan, Christian Sch&#246;nbach, Sissades Tongsima, Jonathan Chan and Tin Wee Tan</editor><sponsor><note>The articles in this supplement were supported by funding agencies as detailed in the Acknowledgement section of each article</note></sponsor><note>Proceedings</note></supplement><conference><title><p>Asia Pacific Bioinformatics Network (APBioNet) Eleventh International Conference on Bioinformatics (InCoB2012)</p></title><location>Bangkok, Thailand</location><date-range>3-5 October 2012</date-range><url>http://www.incob2012.org/</url></conference><issn>1471-2164</issn>
<pubdate>2012</pubdate>
<volume>13</volume>
<issue>Suppl 7</issue>
<fpage>S28</fpage>
<url>http://www.biomedcentral.com/1471-2164/13/S28/S28</url>
<xrefbib><pubidlist><pubid idtype="pmpid">23282094</pubid><pubid idtype="doi">10.1186/1471-2164-13-S7-S28</pubid></pubidlist></xrefbib>
</bibl>
<history><pub><date><day>13</day><month>12</month><year>2012</year></date></pub></history>
<cpyrt><year>2012</year><collab>Chang et al.; licensee BioMed Central Ltd.</collab><note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate reads longer than 150 bp and up to 600 Gbp of data per run. They produce longer reads at greater sequencing depth than previous technologies. Two major challenges exist in using high-throughput technology for <it>de novo </it>assembly of genomes. First, the amount of physical memory may be insufficient to store the data structures of the assembly algorithm, even on high-end multicore machines. Second, the graph-theoretical model used to capture the overlap relationships of the reads may contain structural defects that are not well managed by existing assembly algorithms.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<p>We developed a distributed genome assembler, CloudBrush, based on string graphs and the MapReduce framework. The assembler includes a novel edge-adjustment algorithm that detects structural defects by examining the neighboring reads of each read for sequencing errors and adjusting the edges of the string graph where necessary. CloudBrush was evaluated against the GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50 and low rates of misjoins and of indels of &gt; 5 bp. In addition, we introduced two measures, precision and recall, to assess whether contigs align faithfully to the target genome and how fully they cover it. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush produces contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at <url>https://github.com/ice91/CloudBrush</url>.</p>
</sec>
</sec>
</abs>
</fm><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>With the rapid growth of DNA sequencing throughput delivered by next-generation sequencing technologies <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>, there is a pressing need for <it>de novo </it>assemblers to efficiently handle massive sequencing data of genomes using scalable, on-demand, and inexpensive commodity cloud servers. <it>De novo </it>genome assembly is a fundamental step in analyzing a newly sequenced genome without a backbone sequence. <it>De novo </it>assembly software must deal with sequencing errors, repeat structures, and the computational complexity of processing large volumes of data <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>. The most recent assemblers use de Bruijn graphs <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B8">8</abbr>
<abbr bid="B9">9</abbr>
<abbr bid="B10">10</abbr>
</abbrgrp> or string graphs <abbrgrp>
<abbr bid="B11">11</abbr>
<abbr bid="B12">12</abbr>
<abbr bid="B13">13</abbr>
<abbr bid="B14">14</abbr>
</abbrgrp> to model and manipulate the sequence reads. Using the de Bruijn graph model of sequence assembly requires breaking reads into short k-mers <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>. Typically, de Bruijn graph-based assemblers must recover the information lost from the breaking of reads, and attempt to resolve small repeats using read threading algorithms <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp>. Using the string graph model of assembly can help avoid this issue. However, our preliminary studies show that, as read coverage deepens, the underlying string graph used to model the overlaps of reads becomes much more complex than previous assembly algorithms anticipate <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>.</p>
<p>After building the assembly graphs, algorithms based on de Bruijn graphs or string graphs manipulate the graph-theoretic models by using several operations of graph simplification to repair erroneous reads and to remove redundancy in graphs, such as removing short dead-end tips and bubbles of similar paths <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>. Erroneous reads and repeats may also produce more complex branch structures that complicate the assembly, especially as sequencing depth and error rates increase. One example is the chimerical link, also known as a chimerical connection <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>, formed by partial overlap of two unrelated contigs (Figure <figr fid="F1">1</figr>), where the partial overlaps are caused by sequencing errors. Other examples are ambiguous branching caused by short repeats and "braids" formed by shared branches (Figures <figr fid="F2">2</figr>, <figr fid="F3">3</figr>).</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>The chimerical link structure C-G in a string graph</p></caption><text>
   <p><b>The chimerical link structure C-G in a string graph</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-1"/></fig>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>The short-repeat branch D-L in a string graph</p></caption><text>
   <p><b>The short-repeat branch D-L in a string graph</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-2"/></fig>
<fig id="F3"><title><p>Figure 3</p></title><caption><p>The braid structure in a string graph</p></caption><text>
   <p><b>The braid structure in a string graph</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-3"/></fig>
<p>To unfold these complex branch patterns into correct linear paths in string graphs, we present an Edge Adjustment (EA) algorithm. The algorithm uses the sequence information of all graph neighbors of each read and eliminates the edges connecting to reads that contain rare bases. We used simulated read datasets of the <it>Escherichia coli </it>genome at varying sequencing depths and error rates to verify the effectiveness of the EA algorithm. In addition, we integrated the EA algorithm into a distributed assembly program <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp> based on string graphs and the MapReduce cloud computing framework <abbrgrp>
<abbr bid="B16">16</abbr>
<abbr bid="B17">17</abbr>
</abbrgrp>, named CloudBrush. We evaluated the method against the GAGE benchmarks established by Salzberg et al. <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp> to compare assembly quality with other <it>de novo </it>assembly tools. Moreover, we introduced a pair of novel indices to measure the quality of sequence assembly, known as precision and recall, to indicate whether the output contigs are faithfully aligned (i.e., without inversions or rearrangements) with a contiguous region in the target genome, and whether the output contigs fully cover the entire target genome. It is noteworthy that these two indices are important for follow-up annotation and analysis of the target genome. Finally, we ran CloudBrush on a nematode dataset using a computing cloud <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp> and analyzed its performance.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<sec>
<st>
<p>Structural defects in string graphs</p>
</st>
<p>In the graph-based assembly approach, sequencing errors may generate complex structures in the graph. For example, sequencing errors at the end of reads may create tips in the graph, and sequencing errors within long reads may create bubbles in the graph. Tips and bubbles are well-defined problems that can be solved using the topological features of the graph, as described in <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp> and <abbrgrp>
<abbr bid="B10">10</abbr>
</abbrgrp>. Some errors, however, create more complex structures that cannot be readily identified from the topology of the graph. In this report, we refer to these structures as "structural defects." A well-known structural defect is the chimerical link problem. Figure <figr fid="F1">1</figr> displays an example of a chimerical link caused by a sequencing error in a string graph. In this instance, the chimerical link is caused by a false overlap between node C and node G. In addition to sequencing errors, repeat regions also cause structural defects in a string graph; for example, the well-known "frayed rope" pattern <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>. Furthermore, repeats shorter than the read length may also complicate processing in string graphs. For example, suppose a short repeat exists in reads D, E, F, I, J, L, and M, where C, D, E, and F are reads from one region of the genome, while I, J, L, and M are reads from another region of the same genome (Figure <figr fid="F2">2</figr>). In the string graph, the edge between nodes D and L then forms a "branch structure", which may lead an assembly algorithm to report an erroneous contig. In addition to false overlaps, missing overlaps also introduce structural defects into the string graph; for example, a braid structure is formed when sequencing errors appear in consecutive reads (Figure <figr fid="F3">3</figr>). In this instance, two missing overlaps prevent adjacent reads from being merged: node B lost its overlap link to node C, and node D lost its overlap link to node E (Figure <figr fid="F3">3</figr>). As with the chimerical link problem, it is challenging to use topological features of the graph to deal with braid structures.</p>
</sec>
<sec>
<st>
<p>Edge Adjustment with the neighbors' contents</p>
</st>
<p>We present the Edge Adjustment (EA) algorithm to fix structural defects in string graphs. For a node <it>n </it>in the string graph <it>G</it>, the EA algorithm adjusts the edges of <it>n </it>by examining the neighbors of <it>n </it>to decide whether each neighbor has sequencing errors. Figure <figr fid="F4">4</figr> shows the pseudocode of the sequential version of the Edge Adjustment algorithm. Note that we are dealing with NGS reads of the same length. Thus the neighbors of <it>n </it>may be divided into two groups, i.e., forward neighbors and reverse neighbors. A forward neighbor of <it>n </it>overlaps with the suffix of <it>n</it>, while a reverse neighbor of <it>n </it>overlaps with the prefix of <it>n</it>. To construct the Position Weight Matrix (PWM) of node <it>n</it>'s neighbors in one of the two directions, we first align the reads of the neighbors to <it>n</it>. Then, we use the subsequence of each read ranging from the end of node <it>n </it>to the end of the second-last neighbor to define the PWM of <it>n</it>. A consensus sequence of the neighbors can be obtained from the PWM. The PWM has four rows corresponding to A, T, C and G, respectively. The element of the PWM in row <it>&#946; </it>and column <it>i </it>is the number of occurrences of <it>&#946; </it>at position <it>i</it>, where <it>&#946;</it>&#8712;{A, T, C, G}. We may then define the consensus sequence of these subsequences as follows:</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Pseudocode of the sequential version of the EA algorithm</p></caption><text>
   <p><b>Pseudocode of the sequential version of the EA algorithm</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-4"/></fig>
<p>
<display-formula id="M1">
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2164-13-S7-S28-i1"><m:mrow>
   <m:mi>C</m:mi>
   <m:mi>o</m:mi>
   <m:mi>n</m:mi>
   <m:mi>s</m:mi>
   <m:mi>e</m:mi>
   <m:mi>n</m:mi>
   <m:mi>s</m:mi>
   <m:mi>u</m:mi>
   <m:msub>
      <m:mrow>
         <m:mi>s</m:mi>
      </m:mrow>
      <m:mrow>
         <m:mi>i</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfenced close="}" open="{" separators="">
      <m:mrow>
         <m:mtable class="array" columnlines="none none none none none none none none none none none none none none none none none none none" equalcolumns="false" equalrows="false">
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#946;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">|</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#946;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-bin">/</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>S</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">&gt;</m:mo>
                  <m:mn>0.6</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center">
                  <m:mi>&#8242;</m:mi>
                  <m:mi>N</m:mi>
                  <m:mi>&#8242;</m:mi>
                  <m:mo class="MathClass-rel">|</m:mo>
                  <m:mo class="MathClass-op">&#8704;</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>&#946;</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-bin">/</m:mo>
                  <m:msub>
                     <m:mrow>
                        <m:mi>S</m:mi>
                     </m:mrow>
                     <m:mrow>
                        <m:mi>i</m:mi>
                     </m:mrow>
                  </m:msub>
                  <m:mo class="MathClass-rel">&#8804;</m:mo>
                  <m:mn>0.6</m:mn>
               </m:mtd>
            </m:mtr>
            <m:mtr>
               <m:mtd class="array" columnalign="center"/>
            </m:mtr>
         </m:mtable>
      </m:mrow>
   </m:mfenced>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <it>i </it>is the position in the consensus sequence, corresponding to a column of the PWM; <it>&#946;</it>&#8712; {A, T, C, G}; <it>&#946;<sub>i </sub>
</it>is the number of occurrences of <it>&#946; </it>at position <it>i</it>; and <it>S<sub>i </sub>
</it>is the total number of letters observed at position <it>i</it>. We use the letter 'N' at position <it>i </it>of the consensus sequence if, for every letter in {A, T, C, G}, we have <it>&#946;<sub>i </sub>/S<sub>i </sub>
</it>&#8804; 0.6. If more than 10% of the characters in the consensus sequence are 'N', the consensus sequence is rejected by the EA algorithm and all neighbors in that direction are retained. Otherwise, the consensus sequence is used to detect sequencing errors in each neighbor <it>n' </it>of <it>n </it>by comparing the subsequence of <it>n' </it>with the consensus sequence. The edge <it>(n, n') </it>is removed if the subsequence of <it>n' </it>is inconsistent with the consensus sequence. In our experiments, the subsequence of <it>n' </it>is said to be <it>consistent </it>with the consensus sequence if every character of the subsequence equals the character at the same position of the consensus sequence, ignoring positions where the consensus is 'N'. For each node of the string graph, the EA algorithm generates a consensus sequence for each direction, performs the consistency check, and removes the edges that are inconsistent with the consensus sequence. As an illustration of the EA algorithm, read 1 has three neighboring reads: 2, 3, and 4 (Figure <figr fid="F5">5</figr>). The PWM spans from the end of read 1 to the end of read 3. Since read 2 has a character 'A' that differs from the first character 'T' of the consensus sequence (Figure <figr fid="F5">5</figr>), the edge between read 1 and read 2 is removed. Next, we use the following examples to illustrate how the EA algorithm reduces structural defects in a string graph.</p>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>An illustration of the position weight matrix</p></caption><text>
   <p><b>An illustration of the position weight matrix</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-5"/></fig>
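<p>The consensus rule and the consistency check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CloudBrush implementation: the function names (pwm_width, consensus_from_extensions, is_usable, is_consistent, adjust_edges) are hypothetical, the input for one direction of node n is assumed to be a mapping from neighbor name to the part of that neighbor extending beyond the end of n, and the example reads are invented rather than taken from Figure 5.</p>

```python
from collections import Counter

BASES = "ATCG"


def pwm_width(extensions):
    """Number of PWM columns: up to the end of the second-last neighbor."""
    lengths = sorted(len(ext) for ext in extensions)
    return lengths[-2] if len(lengths) >= 2 else 0


def consensus_from_extensions(extensions, width, threshold=0.6):
    """Call one consensus character per column; 'N' if no base exceeds threshold."""
    consensus = []
    for i in range(width):
        # Reads that end before column i do not vote in that column.
        column = Counter(ext[i] for ext in extensions if len(ext) > i)
        total = sum(column[base] for base in BASES)
        call = "N"
        for base in BASES:
            if total > 0 and column[base] / total > threshold:
                call = base
        consensus.append(call)
    return "".join(consensus)


def is_usable(consensus, max_n_fraction=0.10):
    """A consensus with more than 10% 'N' characters is rejected."""
    if len(consensus) == 0:
        return False
    return max_n_fraction >= consensus.count("N") / len(consensus)


def is_consistent(extension, consensus):
    """Every character must match the consensus, ignoring 'N' positions."""
    return all(c == "N" or e == c for e, c in zip(extension, consensus))


def adjust_edges(extensions_by_neighbor):
    """Return the neighbors whose edges are kept in one direction of node n."""
    extensions = list(extensions_by_neighbor.values())
    width = pwm_width(extensions)
    consensus = consensus_from_extensions(extensions, width)
    if not is_usable(consensus):
        # Rejected consensus: all neighbors in this direction are retained.
        return sorted(extensions_by_neighbor)
    return sorted(
        name
        for name, ext in extensions_by_neighbor.items()
        if is_consistent(ext[:width], consensus)
    )


# Invented example: three forward neighbors of a node.
example = {"read2": "AC", "read3": "TCG", "read4": "TCGA"}
kept = adjust_edges(example)
print(kept)
```

<p>With these invented reads the consensus is TCG, so read2, whose extension begins with A, loses its edge, while read3 and read4 keep theirs. If the neighbors disagree in too many columns, the consensus is rejected and every edge in that direction is kept.</p>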
<p>Figures <figr fid="F6">6</figr>, <figr fid="F7">7</figr>, and <figr fid="F8">8</figr> show one example each of a chimerical link problem, a branch structure problem, and a braid problem solved with the EA algorithm. To solve the chimerical link problem, the EA algorithm generates a consensus sequence for read A (shown in red) from the neighboring reads B, C, and D (Figure <figr fid="F6">6</figr>). Since read C has one character that differs from the consensus sequence, the overlap link between reads A and C is removed. Likewise, the EA algorithm generates a consensus sequence for read G (shown in green) from the neighboring reads C, E, and F (Figure <figr fid="F6">6</figr>), and the overlap link between reads C and G is removed in the same manner.</p>
<fig id="F6"><title><p>Figure 6</p></title><caption><p>An example of a chimerical link structure resolved by Edge Adjustment</p></caption><text>
   <p><b>An example of a chimerical link structure resolved by Edge Adjustment</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-6"/></fig>
<fig id="F7"><title><p>Figure 7</p></title><caption><p>An example of a branch structure resolved by Edge Adjustment</p></caption><text>
   <p><b>An example of a branch structure resolved by Edge Adjustment</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-7"/></fig>
<fig id="F8"><title><p>Figure 8</p></title><caption><p>An example of a braid structure resolved by Edge Adjustment</p></caption><text>
   <p><b>An example of a braid structure resolved by Edge Adjustment</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-8"/></fig>
<p>To solve the branch structure problem, the EA algorithm generates a consensus sequence for read L from the neighboring reads D, I, and J (Figure <figr fid="F7">7</figr>). Read D differs from the consensus sequence, which is determined primarily by reads I and J; therefore, the overlap link between reads D and L is removed.</p>
<p>To solve the braid structure problem, in which errors in reads C and D complicate the graph structure, the EA algorithm removes the overlap links between reads C and E and between reads B and D (Figure <figr fid="F8">8</figr>). Thus, reads C and D are isolated from the main graph, and the braid structure no longer exists.</p>
</sec>
<sec>
<st>
<p>Analysis of edge adjustment</p>
</st>
<p>We prepared simulated datasets generated from the <it>E. coli </it>genome to evaluate the effectiveness of the EA algorithm. Because the reads are simulated, the position of each read on the target genome, and thus the positions of sequencing errors in the read, are known. We constructed the overlap graph of each dataset by creating a node to represent each read, and an edge between each pair of reads that have a sequence overlap of size no smaller than an integer <it>k</it>. Two attributes are associated with each edge of the overlap graph from the simulated data. For the first attribute, if the positions of the two reads overlap on the genome, the edge is designated a <it>true </it>edge; otherwise, it is designated a <it>false </it>edge. The second attribute denotes whether any sequencing error exists in the two reads of the edge. We can therefore classify the edges of the overlap graph into four classes according to these two attributes. Class I denotes the subset of <it>true </it>edges without sequencing errors; class II denotes the subset of <it>true </it>edges with sequencing errors; class III denotes the subset of <it>false </it>edges with sequencing errors; and class IV denotes the subset of <it>false </it>edges without sequencing errors. Class I edges are the most desirable because they improve the quality of the data passed to subsequent stages of sequence assembly. By contrast, class III edges are chimerical edges; class II edges contain sequencing errors; and class IV edges connect reads that intersect repeats. Edges of classes II, III, and IV may introduce errors or structural defects into the later stages of sequence assembly. Therefore, the design goal of the EA algorithm is to minimize the number of class II, III, and IV edges while retaining as many class I edges as possible.</p>
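<p>The four-way classification above can be expressed as a small function. The following Python sketch makes several assumptions that are not part of the original description: each simulated read is represented by a hypothetical SimRead record holding its 0-based start position, its length, and a flag for sequencing errors; classify_edge is a hypothetical name; and strand orientation and a circular genome are ignored for simplicity.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimRead:
    start: int       # assumed 0-based start position on the reference genome
    length: int      # read length in bases
    has_error: bool  # True if the simulator inserted any error into this read


def classify_edge(a, b):
    """Return the class (I, II, III or IV) of the overlap-graph edge (a, b)."""
    # First attribute: a true edge joins reads whose genomic intervals overlap.
    shared_end = min(a.start + a.length, b.start + b.length)
    shared_start = max(a.start, b.start)
    is_true_edge = shared_end > shared_start
    # Second attribute: any sequencing error on either read of the edge.
    any_error = a.has_error or b.has_error
    if is_true_edge:
        return "II" if any_error else "I"
    return "III" if any_error else "IV"


# Invented 36-bp reads for illustration.
clean_a = SimRead(start=0, length=36, has_error=False)
clean_b = SimRead(start=10, length=36, has_error=False)
noisy_b = SimRead(start=10, length=36, has_error=True)
far_clean = SimRead(start=1000, length=36, has_error=False)
far_noisy = SimRead(start=1000, length=36, has_error=True)

labels = [
    classify_edge(clean_a, clean_b),
    classify_edge(clean_a, noisy_b),
    classify_edge(clean_a, far_noisy),
    classify_edge(clean_a, far_clean),
]
print(labels)
```

<p>In this sketch an edge between two error-free reads from distant positions falls into class IV, which corresponds to reads that overlap only because they share a repeat.</p>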
<p>To test the effectiveness of the EA algorithm, we generated four sets of simulated data. In the first and second sets, 36-bp reads were generated at a constant coverage depth of 100&#215;, and single-base errors were inserted at rates of 0.5% and 1%, respectively. In the third and fourth sets, 150-bp reads were generated at a constant coverage depth of 200&#215;, and single-base errors were inserted at rates of 0.5% and 1%, respectively. Table <tblr tid="T1">1</tblr> shows the number of edges of the overlap graphs before and after performing the EA algorithm. Most of the edges removed by the EA algorithm were class II edges (i.e., edges with sequencing errors), while at least 99.9% of class I edges were retained. The EA algorithm was also quite effective in removing class III (chimerical) edges for the two 150-bp datasets, where about 96% were removed, and satisfactory for the two 36-bp datasets, where about 53% and 79% were removed. By contrast, only about 12% to 30% of the class IV edges (i.e., those connecting reads that intersect repeats) were removed by the EA algorithm.</p>
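<p>The bracketed percentages in Table <tblr tid="T1">1</tblr> are consistent with the fraction of edges retained after Edge Adjustment. As a worked check, the following Python snippet recomputes them from the edge counts reported for the 100 &#215; 36 bp, 0.5% error dataset; the variable names are illustrative only.</p>

```python
# Edge counts copied from Table 1 (100 x 36 bp, 0.5% error dataset).
before = {"I": 92829732, "II": 14519426, "III": 252762, "IV": 377856}
after = {"I": 92754696, "II": 322510, "III": 118542, "IV": 294110}

# Percentage of edges of each class that remain after Edge Adjustment.
retained = {cls: round(100.0 * after[cls] / before[cls], 2) for cls in before}
# Percentage of edges of each class that were removed.
removed = {cls: round(100.0 - retained[cls], 2) for cls in before}

for cls in before:
    print(cls, retained[cls], removed[cls])
```

<p>The retained values reproduce the bracketed figures of the table for this dataset (99.92%, 2.22%, 46.90% and 77.84%), which corresponds to removing about 53% of class III edges and about 22% of class IV edges.</p>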
<tbl id="T1"><title><p>Table 1</p></title><caption><p>The edge analysis of overlap graph before and after Edge Adjustment</p></caption><tblbdy cols="5">
      <r>
         <c ca="center">
            <p>
               <b>Simulated</b>
            </p>
            <p>
               <b><it>E. coli </it>Dataset</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Edge Type</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of edges before</b>
            </p>
            <p>
               <b>Edge Adjustment</b>
            </p>
         </c>
         <c ca="center" cspan="2">
            <p>
               <b># of edges after</b>
            </p>
            <p>
               <b>Edge Adjustment [% of edges retained]</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>100 &#215; 36 bp</p>
            <p>0.5% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p>Class I</p>
         </c>
         <c ca="center">
            <p>92829732</p>
         </c>
         <c ca="center">
            <p>92754696</p>
         </c>
         <c ca="center">
            <p>[99.92%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class II</p>
         </c>
         <c ca="center">
            <p>14519426</p>
         </c>
         <c ca="center">
            <p>322510</p>
         </c>
         <c ca="center">
            <p>[2.22%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class III</p>
         </c>
         <c ca="center">
            <p>252762</p>
         </c>
         <c ca="center">
            <p>118542</p>
         </c>
         <c ca="center">
            <p>[46.90%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class IV</p>
         </c>
         <c ca="center">
            <p>377856</p>
         </c>
         <c ca="center">
            <p>294110</p>
         </c>
         <c ca="center">
            <p>[77.84%]</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>100 &#215; 36 bp</p>
            <p>1% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p>Class I</p>
         </c>
         <c ca="center">
            <p>76439532</p>
         </c>
         <c ca="center">
            <p>76364264</p>
         </c>
         <c ca="center">
            <p>[99.90%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class II</p>
         </c>
         <c ca="center">
            <p>24836446</p>
         </c>
         <c ca="center">
            <p>749900</p>
         </c>
         <c ca="center">
            <p>[3.02%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class III</p>
         </c>
         <c ca="center">
            <p>358432</p>
         </c>
         <c ca="center">
            <p>76162</p>
         </c>
         <c ca="center">
            <p>[21.25%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class IV</p>
         </c>
         <c ca="center">
            <p>132412</p>
         </c>
         <c ca="center">
            <p>92834</p>
         </c>
         <c ca="center">
            <p>[70.11%]</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>200 &#215; 150 bp</p>
            <p>0.5% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p>Class I</p>
         </c>
         <c ca="center">
            <p>115230002</p>
         </c>
         <c ca="center">
            <p>115163888</p>
         </c>
         <c ca="center">
            <p>[99.94%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class II</p>
         </c>
         <c ca="center">
            <p>74214420</p>
         </c>
         <c ca="center">
            <p>438274</p>
         </c>
         <c ca="center">
            <p>[0.59%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class III</p>
         </c>
         <c ca="center">
            <p>1347100</p>
         </c>
         <c ca="center">
            <p>51988</p>
         </c>
         <c ca="center">
            <p>[3.86%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class IV</p>
         </c>
         <c ca="center">
            <p>403836</p>
         </c>
         <c ca="center">
            <p>322746</p>
         </c>
         <c ca="center">
            <p>[79.92%]</p>
         </c>
      </r>
      <r>
         <c cspan="5">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>200 &#215; 150 bp</p>
            <p>1% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p>Class I</p>
         </c>
         <c ca="center">
            <p>32604042</p>
         </c>
         <c ca="center">
            <p>32580388</p>
         </c>
         <c ca="center">
            <p>[99.93%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class II</p>
         </c>
         <c ca="center">
            <p>53758272</p>
         </c>
         <c ca="center">
            <p>554020</p>
         </c>
         <c ca="center">
            <p>[1.03%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class III</p>
         </c>
         <c ca="center">
            <p>1422472</p>
         </c>
         <c ca="center">
            <p>57494</p>
         </c>
         <c ca="center">
            <p>[4.04%]</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Class IV</p>
         </c>
         <c ca="center">
            <p>256952</p>
         </c>
         <c ca="center">
            <p>225124</p>
         </c>
         <c ca="center">
            <p>[87.61%]</p>
         </c>
      </r>
   </tblbdy></tbl>
<p>We define a braid index to provide an approximate measure of the number of braid structures in a set <it>S </it>of reads. To compute the braid index, we first constructed the overlap graph <it>G<sup>o</sup>(S) </it>of <it>S</it>. We then constructed a simplified string graph <it>G<sup>s</sup>(S) </it>of <it>S</it>, which is obtained from <it>G<sup>o</sup>(S) </it>by removing contained reads and transitive edges and by concatenating "one-in one-out" nodes. For each node <it>v </it>of <it>G<sup>o</sup>(S)</it>, we next examined its neighborhood for a pair of vertices, <it>u<sub>1 </sub></it>and <it>u<sub>2</sub></it>, and an additional vertex <it>v'</it>, such that the following properties hold: (1) (<it>u<sub>1</sub>, u<sub>2</sub></it>) is not an edge of <it>G<sup>o</sup>(S)</it>; (2) <it>u<sub>1 </sub></it>and <it>u<sub>2 </sub></it>form a consensus when both are aligned to <it>v</it>; (3) <it>(v, v') </it>is not an edge of <it>G<sup>o</sup>(S)</it>; and (4) <it>v </it>and <it>v' </it>form a consensus when aligned to <it>u<sub>1 </sub></it>and <it>u<sub>2</sub></it>. The braid index is then defined as the number of tuples <it>(v, v', u<sub>1</sub>, u<sub>2</sub>) </it>satisfying these four properties. Table <tblr tid="T2">2</tblr> shows the braid indices of the simplified string graphs of the four datasets with and without the EA algorithm. We observed that, for a given read length and coverage depth, the dataset with the higher sequencing error rate has a larger braid index and may therefore contain more complicated braid structures. The EA algorithm, in turn, is effective in removing braid structures: it reduces the braid index by more than two orders of magnitude in all four datasets.</p>
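<p>The four properties can be checked mechanically once the overlap graph is available. The following Python fragment is a minimal sketch rather than the CloudBrush implementation: it treats the overlap graph as undirected, it assumes that <it>v' </it>is searched among the common neighbours of <it>u<sub>1 </sub></it>and <it>u<sub>2</sub></it>, and it takes the consensus test as a hypothetical caller-supplied predicate named forms_consensus.</p>

```python
from itertools import combinations


def braid_index(adj, forms_consensus):
    """Count tuples (v, v2, u1, u2) that satisfy the four properties.

    adj: dict mapping each node to the set of its neighbours in the overlap
         graph (treated as undirected here, which is a simplification).
    forms_consensus(a, b, anchors): hypothetical predicate that returns True
         when reads a and b agree once both are aligned to the reads in anchors.
    """
    count = 0
    for v, neighbours in adj.items():
        for u1, u2 in combinations(sorted(neighbours), 2):
            if u2 in adj[u1]:                        # property (1) fails
                continue
            if not forms_consensus(u1, u2, (v,)):    # property (2) fails
                continue
            # Assumption: v2 is a common neighbour of u1 and u2, other than v.
            for v2 in adj[u1].intersection(adj[u2]) - {v}:
                if v2 in adj[v]:                     # property (3) fails
                    continue
                if forms_consensus(v, v2, (u1, u2)):  # property (4) holds
                    count += 1
    return count
```

<p>Because every node is tried in the role of <it>v</it>, the same braid structure can contribute more than one tuple to the count, which is consistent with the braid index being only an approximate measure.</p>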
<tbl id="T2"><title><p>Table 2</p></title><caption><p>The analysis of simplified string graph with and without Edge Adjustment</p></caption><tblbdy cols="4">
      <r>
         <c ca="center">
            <p>
               <b>Simulated Data</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Graph feature</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>without</b>
            </p>
            <p>
               <b>Edge Adjustment</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>with</b>
            </p>
            <p>
               <b>Edge Adjustment</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="4">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>100 &#215; 36 bp</p>
            <p>0.5% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p># of nodes</p>
         </c>
         <c ca="center">
            <p>2502312</p>
         </c>
         <c ca="center">
            <p>1572470</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p># of edges</p>
         </c>
         <c ca="center">
            <p>2220162</p>
         </c>
         <c ca="center">
            <p>26079</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>braid index</p>
         </c>
         <c ca="center">
            <p>342736</p>
         </c>
         <c ca="center">
            <p>750</p>
         </c>
      </r>
      <r>
         <c cspan="4">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>100 &#215; 36 bp</p>
            <p>1% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p># of nodes</p>
         </c>
         <c ca="center">
            <p>4418943</p>
         </c>
         <c ca="center">
            <p>2964253</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p># of edges</p>
         </c>
         <c ca="center">
            <p>4051264</p>
         </c>
         <c ca="center">
            <p>46649</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>braid index</p>
         </c>
         <c ca="center">
            <p>873835</p>
         </c>
         <c ca="center">
            <p>802</p>
         </c>
      </r>
      <r>
         <c cspan="4">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>200 &#215; 150 bp</p>
            <p>0.5% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p># of nodes</p>
         </c>
         <c ca="center">
            <p>3839687</p>
         </c>
         <c ca="center">
            <p>2680727</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p># of edges</p>
         </c>
         <c ca="center">
            <p>6618017</p>
         </c>
         <c ca="center">
            <p>7739</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>braid index</p>
         </c>
         <c ca="center">
            <p>1750824</p>
         </c>
         <c ca="center">
            <p>242</p>
         </c>
      </r>
      <r>
         <c cspan="4">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>200 &#215; 150 bp</p>
            <p>1% error</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p># of nodes</p>
         </c>
         <c ca="center">
            <p>5085964</p>
         </c>
         <c ca="center">
            <p>4245557</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p># of edges</p>
         </c>
         <c ca="center">
            <p>8501560</p>
         </c>
         <c ca="center">
            <p>16767</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>braid index</p>
         </c>
         <c ca="center">
            <p>2350695</p>
         </c>
         <c ca="center">
            <p>413</p>
         </c>
      </r>
   </tblbdy></tbl>
</sec>
<sec>
<st>
<p>Evaluation of assembly accuracy</p>
</st>
<p>Ideally, a perfect assembly produces nothing but subsequences of the reference sequences; in particular, no contig contains a rearrangement. To distinguish superior assembly results from those containing collapsed repetitive regions or rearrangements, we designed a strict measurement scheme based on precision and recall, both of which focus on the quality of the contigs. A contig is considered valid only if it aligns to the reference along its whole length with a base similarity of at least 95%. The reference bases covered by the union of all valid contig alignments were treated as true positives, and the recall was defined using the following formula:</p>
<p>
<display-formula id="M2">
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2164-13-S7-S28-i2"><m:mrow>
   <m:mstyle class="text">
      <m:mtext class="textsf" mathvariant="sans-serif">Recall</m:mtext>
   </m:mstyle>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">number&#160;of&#160;true&#160;positive&#160;bases&#160;in&#160;reference&#160;</m:mtext>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">total&#160;length&#160;of&#160;reference&#160;sequence</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Similarly, the contig bases covered by the union of all valid contig alignments were treated as true positives in the contigs, and the precision was defined using the following formula:</p>
<p>
<display-formula id="M3">
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="1471-2164-13-S7-S28-i3"><m:mrow>
   <m:mstyle class="text">
      <m:mtext class="textsf" mathvariant="sans-serif">Precision</m:mtext>
   </m:mstyle>
   <m:mo class="MathClass-rel">=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">number&#160;of&#160;true&#160;positive&#160;bases&#160;in&#160;contigs</m:mtext>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle class="text">
            <m:mtext class="textsf" mathvariant="sans-serif">total&#160;length&#160;of&#160;contigs</m:mtext>
         </m:mstyle>
      </m:mrow>
   </m:mfrac>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Importantly, we evaluated only contigs with a length of &#8805; 200 bp.</p>
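<p>To make the two measures concrete, the following Python fragment sketches how they could be computed from per-contig alignment records. It is an illustration under stated assumptions, not the evaluation script used in this study: the record layout (length, aligned_length, identity, ref_start, ref_end) is hypothetical, reference coordinates are taken as half-open intervals, and each contig is assumed to have a single best alignment.</p>

```python
def merged_length(intervals):
    """Number of bases covered by the union of half-open [start, end) intervals."""
    covered, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start >= current_end:
            covered += end - start          # disjoint from everything seen so far
            current_end = end
        elif end > current_end:
            covered += end - current_end    # count only the non-overlapping tail
            current_end = end
    return covered


def precision_recall(contigs, reference_length, min_length=200, min_identity=0.95):
    """Return (precision, recall) for a list of contig alignment records.

    Each record is a dict with the hypothetical keys length, aligned_length,
    identity, ref_start and ref_end.
    """
    # Only contigs of at least min_length are evaluated.
    evaluated = [c for c in contigs if c["length"] >= min_length]
    # A contig is valid if it aligns along its whole length with enough identity.
    valid = [c for c in evaluated
             if c["aligned_length"] == c["length"] and c["identity"] >= min_identity]
    total_contig_bases = sum(c["length"] for c in evaluated)
    true_positive_contig_bases = sum(c["length"] for c in valid)
    true_positive_reference_bases = merged_length(
        (c["ref_start"], c["ref_end"]) for c in valid)
    precision = (true_positive_contig_bases / total_contig_bases
                 if total_contig_bases else 0.0)
    recall = true_positive_reference_bases / reference_length
    return precision, recall
```

<p>Taking the union on the reference side means that two valid contigs covering the same reference region are not counted twice in the recall.</p>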
<p>We used three real and two simulated datasets to test CloudBrush and the other assemblers. The first real dataset was a set of short reads from an <it>E. coli </it>library (NCBI Short Read Archive, accession no. SRX000429) consisting of 20.8 M 36-bp reads. The second real dataset, released by Illumina, included 12 M paired-end 150-bp reads from a well-characterized <it>E. coli </it>strain K-12 MG1655 library sequenced on an Illumina MiSeq platform. For each of these two real datasets, we selected the first half of the reads to evaluate the assemblers; the resulting coverage depths were 81&#215; and 197&#215;, respectively. We used D1 and D2 to denote the 36-bp and 150-bp datasets, respectively. Furthermore, we downloaded <it>Caenorhabditis elegans </it>sequence reads (strain N2) from the NCBI SRA (accession no. SRX026594) as the D3 dataset, which consists of 33.8 M read pairs sequenced using the Illumina Genome Analyzer II, with a coverage depth of 67&#215;. The two simulated datasets were generated at random from the <it>E. coli </it>K-12 genome: one using 36-bp reads with 100&#215; coverage depth and 1% mismatch errors, and the other using 150-bp reads with 200&#215; coverage depth and 1% mismatch errors.</p>
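<p>The reported coverage depths follow from the usual definition of coverage as the total number of read bases divided by the genome length. The short Python check below illustrates this for D1; the genome length of 4,639,675 bp for <it>E. coli </it>K-12 MG1655 is an assumption that is not stated in the text.</p>

```python
def coverage_depth(num_reads, read_length, genome_length):
    """Average number of reads covering each base: total read bases / genome length."""
    return num_reads * read_length / genome_length


# D1: the first half of the 20.8 M 36-bp reads.
# Assumption: a genome length of 4,639,675 bp for E. coli K-12 MG1655.
d1 = coverage_depth(20_800_000 // 2, 36, 4_639_675)
print(round(d1))  # 81
```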
<p>We performed assemblies on these datasets using the Edena <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp>, Velvet <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>, Contrail <abbrgrp>
<abbr bid="B10">10</abbr>
</abbrgrp> and CloudBrush assemblers. Edena is the first string graph-based assembler for short-read data. Velvet is one of the first de Bruijn graph-based assemblers for short reads and is often used as a standard tool for assembling small- to medium-sized genomes. Contrail is the first de Bruijn graph-based assembler using the MapReduce framework. Each assembler requires the parameter <it>k </it>to be set, i.e., the minimum length of overlap for two contigs to form a longer contig. Considering the relationship between parameter <it>k </it>and coverage depth <abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp>, we used <it>k </it>= 21 on dataset D1 and the 100&#215; simulated data, <it>k </it>= 75 on dataset D2 and the 200&#215; simulated data, and <it>k </it>= 51 on dataset D3. Importantly, we did not use paired-end information in this experiment.</p>
<p>Figure <figr fid="F9">9</figr> shows the precision and recall of contigs with different length thresholds on the two simulated datasets of the <it>E. coli </it>genome with a 1% error rate and on datasets D1 and D2. We observed that CloudBrush outperforms the other assemblers on the two simulated datasets; the other assemblers generated more misassembled contigs as the read length increased from 36 bp to 150 bp (Figures <figr fid="F9">9a</figr> and <figr fid="F9">9b</figr>). For datasets D1 and D2, CloudBrush achieves precision and recall comparable to those of the leading assemblers (Figures <figr fid="F9">9c</figr> and <figr fid="F9">9d</figr>). Since longer reads and a higher error rate may generate more complex structural defects, CloudBrush may be better able to handle complicated graph structures by using the EA algorithm.</p>
<fig id="F9"><title><p>Figure 9</p></title><caption><p>The variation of precision and recall with different lower bounds of length on simulated data and datasets D1 and D2</p></caption><text>
   <p><b>The variation of precision and recall with different lower bounds of length on simulated data and datasets D1 and D2</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-9"/></fig>
<p>We considered a number of different evaluation criteria, which are summarized in Tables <tblr tid="T3">3</tblr> and <tblr tid="T4">4</tblr>. It is noteworthy that CloudBrush and Contrail ran on a cluster of 150 nodes, each with a 2-core CPU and 4 GB of RAM, whereas Edena and Velvet ran on a single machine with a 16-core CPU and 128 GB of RAM. In addition, Edena failed to run on the longer-read datasets D2 and D3; therefore, no results were generated for them. Furthermore, we computed precision and recall by parsing the output of MegaBLAST <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp>.</p>
<tbl id="T3"><title><p>Table 3</p></title><caption><p>Evaluation of assemblies of the simulated dataset (100&#215;, 36 bp, 1% error) and dataset D1 with CloudBrush, Contrail, Velvet, and Edena</p></caption><tblbdy cols="10">
      <r>
         <c ca="center">
            <p>
               <b>Dataset</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Assembler</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of contigs<sup>1</sup></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N50</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Largest contig size</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Precision</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Recall</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of valid</b>
            </p>
            <p>
               <b>contigs<sup>1</sup></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of invalid contigs<sup>1</sup></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Runtime</b>
            </p>
            <p>
               <b>(sec)</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>100 &#215; 36 bp</p>
            <p>1% error</p>
         </c>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>447</p>
         </c>
         <c ca="center">
            <p>17907</p>
         </c>
         <c ca="center">
            <p>95387</p>
         </c>
         <c ca="center">
            <p>99.79%</p>
         </c>
         <c ca="center">
            <p>97.51%</p>
         </c>
         <c ca="center">
            <p>420</p>
         </c>
         <c ca="center">
            <p>27</p>
         </c>
         <c ca="center">
            <p>6218</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Contrail</p>
         </c>
         <c ca="center">
            <p>906</p>
         </c>
         <c ca="center">
            <p>8982</p>
         </c>
         <c ca="center">
            <p>40066</p>
         </c>
         <c ca="center">
            <p>99.72%</p>
         </c>
         <c ca="center">
            <p>96.76%</p>
         </c>
         <c ca="center">
            <p>858</p>
         </c>
         <c ca="center">
            <p>48</p>
         </c>
         <c ca="center">
            <p>5499</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>507</p>
         </c>
         <c ca="center">
            <p>15632</p>
         </c>
         <c ca="center">
            <p>100501</p>
         </c>
         <c ca="center">
            <p>99.68%</p>
         </c>
         <c ca="center">
            <p>96.95%</p>
         </c>
         <c ca="center">
            <p>498</p>
         </c>
         <c ca="center">
            <p>9</p>
         </c>
         <c ca="center">
            <p>590</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Edena</p>
         </c>
         <c ca="center">
            <p>4012</p>
         </c>
         <c ca="center">
            <p>1436</p>
         </c>
         <c ca="center">
            <p>11264</p>
         </c>
         <c ca="center">
            <p>98.84%</p>
         </c>
         <c ca="center">
            <p>91.85%</p>
         </c>
         <c ca="center">
            <p>3868</p>
         </c>
         <c ca="center">
            <p>144</p>
         </c>
         <c ca="center">
            <p>2524</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>D1 dataset</p>
         </c>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>521</p>
         </c>
         <c ca="center">
            <p>15149</p>
         </c>
         <c ca="center">
            <p>66832</p>
         </c>
         <c ca="center">
            <p>99.26%</p>
         </c>
         <c ca="center">
            <p>97.10%</p>
         </c>
         <c ca="center">
            <p>481</p>
         </c>
         <c ca="center">
            <p>40</p>
         </c>
         <c ca="center">
            <p>5555</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Contrail</p>
         </c>
         <c ca="center">
            <p>930</p>
         </c>
         <c ca="center">
            <p>8605</p>
         </c>
         <c ca="center">
            <p>40066</p>
         </c>
         <c ca="center">
            <p>99.73%</p>
         </c>
         <c ca="center">
            <p>96.81%</p>
         </c>
         <c ca="center">
            <p>886</p>
         </c>
         <c ca="center">
            <p>44</p>
         </c>
         <c ca="center">
            <p>4789</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>505</p>
         </c>
         <c ca="center">
            <p>15862</p>
         </c>
         <c ca="center">
            <p>73042</p>
         </c>
         <c ca="center">
            <p>99.62%</p>
         </c>
         <c ca="center">
            <p>96.90%</p>
         </c>
         <c ca="center">
            <p>494</p>
         </c>
         <c ca="center">
            <p>11</p>
         </c>
         <c ca="center">
            <p>452</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Edena</p>
         </c>
         <c ca="center">
            <p>889</p>
         </c>
         <c ca="center">
            <p>9045</p>
         </c>
         <c ca="center">
            <p>44942</p>
         </c>
         <c ca="center">
            <p>99.18%</p>
         </c>
         <c ca="center">
            <p>96.34%</p>
         </c>
         <c ca="center">
            <p>823</p>
         </c>
         <c ca="center">
            <p>66</p>
         </c>
         <c ca="center">
            <p>1401</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p><sup>1 </sup>Contigs with lengths &gt; 200 bp are counted.</p>
   </tblfn></tbl>
<tbl id="T4"><title><p>Table 4</p></title><caption><p>Evaluation of assemblies of the simulated dataset (200 &#215; 150 bp, 1% error) and datasets D2 and D3 with CloudBrush, Contrail, and Velvet</p></caption><tblbdy cols="10">
      <r>
         <c ca="center">
            <p>
               <b>Dataset</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Assembler</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of contigs<sup>1</sup></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N50</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Largest</b>
            </p>
            <p>
               <b>contig size</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Precision</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Recall</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of valid</b>
            </p>
            <p>
               <b>contigs<sup>1</sup></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of invalid</b>
            </p>
            <p>
               <b>contigs<sup>1</sup></b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Runtime</b>
            </p>
            <p>
               <b>(sec)</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>200 &#215; 150 bp</p>
            <p>1% error</p>
         </c>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>229</p>
         </c>
         <c ca="center">
            <p>112531</p>
         </c>
         <c ca="center">
            <p>327245</p>
         </c>
         <c ca="center">
            <p>99.20%</p>
         </c>
         <c ca="center">
            <p>96.00%</p>
         </c>
         <c ca="center">
            <p>152</p>
         </c>
         <c ca="center">
            <p>77</p>
         </c>
         <c ca="center">
            <p>10616</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Contrail</p>
         </c>
         <c ca="center">
            <p>2540</p>
         </c>
         <c ca="center">
            <p>7554</p>
         </c>
         <c ca="center">
            <p>36335</p>
         </c>
         <c ca="center">
            <p>90.12%</p>
         </c>
         <c ca="center">
            <p>95.92%</p>
         </c>
         <c ca="center">
            <p>957</p>
         </c>
         <c ca="center">
            <p>1583</p>
         </c>
         <c ca="center">
            <p>15823</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>209</p>
         </c>
         <c ca="center">
            <p>78642</p>
         </c>
         <c ca="center">
            <p>327101</p>
         </c>
         <c ca="center">
            <p>99.63%</p>
         </c>
         <c ca="center">
            <p>98.10%</p>
         </c>
         <c ca="center">
            <p>168</p>
         </c>
         <c ca="center">
            <p>41</p>
         </c>
         <c ca="center">
            <p>1317</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>D2</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>361</p>
         </c>
         <c ca="center">
            <p>52961</p>
         </c>
         <c ca="center">
            <p>156592</p>
         </c>
         <c ca="center">
            <p>98.10%</p>
         </c>
         <c ca="center">
            <p>98.15%</p>
         </c>
         <c ca="center">
            <p>230</p>
         </c>
         <c ca="center">
            <p>131</p>
         </c>
         <c ca="center">
            <p>8622</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Contrail</p>
         </c>
         <c ca="center">
            <p>300</p>
         </c>
         <c ca="center">
            <p>43609</p>
         </c>
         <c ca="center">
            <p>124089</p>
         </c>
         <c ca="center">
            <p>98.47%</p>
         </c>
         <c ca="center">
            <p>96.98%</p>
         </c>
         <c ca="center">
            <p>250</p>
         </c>
         <c ca="center">
            <p>50</p>
         </c>
         <c ca="center">
            <p>7200</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>189</p>
         </c>
         <c ca="center">
            <p>71764</p>
         </c>
         <c ca="center">
            <p>174184</p>
         </c>
         <c ca="center">
            <p>93.60%</p>
         </c>
         <c ca="center">
            <p>92.20%</p>
         </c>
         <c ca="center">
            <p>164</p>
         </c>
         <c ca="center">
            <p>25</p>
         </c>
         <c ca="center">
            <p>927</p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>D3</p>
            <p>dataset</p>
         </c>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>37064</p>
         </c>
         <c ca="center">
            <p>8880</p>
         </c>
         <c ca="center">
            <p>114585</p>
         </c>
         <c ca="center">
            <p>93.65%</p>
         </c>
         <c ca="center">
            <p>92.41%</p>
         </c>
         <c ca="center">
            <p>24603</p>
         </c>
         <c ca="center">
            <p>10387</p>
         </c>
         <c ca="center">
            <p>48603</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Contrail</p>
         </c>
         <c ca="center">
            <p>31870</p>
         </c>
         <c ca="center">
            <p>8274</p>
         </c>
         <c ca="center">
            <p>105244</p>
         </c>
         <c ca="center">
            <p>96.99%</p>
         </c>
         <c ca="center">
            <p>90.89%</p>
         </c>
         <c ca="center">
            <p>25236</p>
         </c>
         <c ca="center">
            <p>6116</p>
         </c>
         <c ca="center">
            <p>44619</p>
         </c>
      </r>
      <r>
         <c>
            <p/>
         </c>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>23565</p>
         </c>
         <c ca="center">
            <p>10847</p>
         </c>
         <c ca="center">
            <p>106863</p>
         </c>
         <c ca="center">
            <p>95.55%</p>
         </c>
         <c ca="center">
            <p>89.01%</p>
         </c>
         <c ca="center">
            <p>20187</p>
         </c>
         <c ca="center">
            <p>2838</p>
         </c>
         <c ca="center">
            <p>13963</p>
         </c>
      </r>
   </tblbdy><tblfn>
      <p><sup>1 </sup>Contigs with lengths &gt; 200 bp are counted.</p>
   </tblfn></tbl>
</sec>
<sec>
<st>
<p>Comparison with other tools using GAGE benchmarks</p>
</st>
<p>To provide a comprehensive comparison, we used the GAGE benchmarks <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp> to evaluate CloudBrush and compared it with the eight assemblers that were evaluated in GAGE. Since GAGE provides the assembly results of each assembler, we computed the precision and recall of each assembly to complement the GAGE evaluation. Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr> summarize the validation results for the two genomes, <it>Staphylococcus aureus </it>and <it>Rhodobacter sphaeroides</it>.</p>
<tbl id="T5"><title><p>Table 5</p></title><caption><p>Evaluation of <it>S. aureus </it>(genome size 2,872,915 bp)</p></caption><tblbdy cols="10">
      <r>
         <c ca="center">
            <p>
               <b>Assembler</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Num</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N50 (kb)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N50</b>
            </p>
            <p>
               <b>corr. (kb)</b>
            </p>
         </c>
         <c ca="left">
            <p>
               <b>Indel</b>
            </p>
            <p>
               <b>&gt; 5 bp</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Misjoins</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Precision</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Recall</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of valid</b>
            </p>
            <p>
               <b>contigs (&gt; 200 bp)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of invalid</b>
            </p>
            <p>
               <b>contigs</b>
            </p>
            <p>
               <b>(&gt; 200 bp)</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ABySS</p>
         </c>
         <c ca="center">
            <p>302</p>
         </c>
         <c ca="center">
            <p>29.2</p>
         </c>
         <c ca="center">
            <p>24.8</p>
         </c>
         <c ca="left">
            <p>9</p>
         </c>
         <c ca="center">
            <p>5</p>
         </c>
         <c ca="center">
            <p>75.06%</p>
         </c>
         <c ca="center">
            <p>94.31%</p>
         </c>
         <c ca="center">
            <p>219</p>
         </c>
         <c ca="center">
            <p>83</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ALLPATHS-LG</p>
         </c>
         <c ca="center">
            <p>60</p>
         </c>
         <c ca="center">
            <p>96.7</p>
         </c>
         <c ca="center">
            <p>66.2</p>
         </c>
         <c ca="left">
            <p>12</p>
         </c>
         <c ca="center">
            <p>4</p>
         </c>
         <c ca="center">
            <p>93.35%</p>
         </c>
         <c ca="center">
            <p>92.28%</p>
         </c>
         <c ca="center">
            <p>55</p>
         </c>
         <c ca="center">
            <p>5</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Bambus2</p>
         </c>
         <c ca="center">
            <p>109</p>
         </c>
         <c ca="center">
            <p>50.2</p>
         </c>
         <c ca="center">
            <p>16.7</p>
         </c>
         <c ca="left">
            <p>164</p>
         </c>
         <c ca="center">
            <p>13</p>
         </c>
         <c ca="center">
            <p>63.20%</p>
         </c>
         <c ca="center">
            <p>61.69%</p>
         </c>
         <c ca="center">
            <p>90</p>
         </c>
         <c ca="center">
            <p>19</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>MSR-CA</p>
         </c>
         <c ca="center">
            <p>94</p>
         </c>
         <c ca="center">
            <p>59.2</p>
         </c>
         <c ca="center">
            <p>48.2</p>
         </c>
         <c ca="left">
            <p>10</p>
         </c>
         <c ca="center">
            <p>12</p>
         </c>
         <c ca="center">
            <p>90.14%</p>
         </c>
         <c ca="center">
            <p>88.96%</p>
         </c>
         <c ca="center">
            <p>79</p>
         </c>
         <c ca="center">
            <p>15</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>SGA</p>
         </c>
         <c ca="center">
            <p>1252</p>
         </c>
         <c ca="center">
            <p>4</p>
         </c>
         <c ca="center">
            <p>4</p>
         </c>
         <c ca="left">
            <p>2</p>
         </c>
         <c ca="center">
            <p>4</p>
         </c>
         <c ca="center">
            <p>97.95%</p>
         </c>
         <c ca="center">
            <p>95.61%</p>
         </c>
         <c ca="center">
            <p>1134</p>
         </c>
         <c ca="center">
            <p>118</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>SOAPdenovo</p>
         </c>
         <c ca="center">
            <p>107</p>
         </c>
         <c ca="center">
            <p>288.2</p>
         </c>
         <c ca="center">
            <p>62.7</p>
         </c>
         <c ca="left">
            <p>31</p>
         </c>
         <c ca="center">
            <p>17</p>
         </c>
         <c ca="center">
            <p>60.22%</p>
         </c>
         <c ca="center">
            <p>60.35%</p>
         </c>
         <c ca="center">
            <p>59</p>
         </c>
         <c ca="center">
            <p>48</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>162</p>
         </c>
         <c ca="center">
            <p>48.4</p>
         </c>
         <c ca="center">
            <p>41.5</p>
         </c>
         <c ca="left">
            <p>14</p>
         </c>
         <c ca="center">
            <p>14</p>
         </c>
         <c ca="center">
            <p>82.66%</p>
         </c>
         <c ca="center">
            <p>81.08%</p>
         </c>
         <c ca="center">
            <p>136</p>
         </c>
         <c ca="center">
            <p>26</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>527</p>
         </c>
         <c ca="center">
            <p>9.7</p>
         </c>
         <c ca="center">
            <p>9.5</p>
         </c>
         <c ca="left">
            <p>2</p>
         </c>
         <c ca="center">
            <p>10</p>
         </c>
         <c ca="center">
            <p>96.72%</p>
         </c>
         <c ca="center">
            <p>96.00%</p>
         </c>
         <c ca="center">
            <p>447</p>
         </c>
         <c ca="center">
            <p>80</p>
         </c>
      </r>
   </tblbdy></tbl>
<tbl id="T6"><title><p>Table 6</p></title><caption><p>Evaluation of <it>R. sphaeroides </it>(genome size 4,603,060 bp)</p></caption><tblbdy cols="10">
      <r>
         <c ca="center">
            <p>
               <b>Assembler</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Num</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N50 (kb)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>N50</b>
            </p>
            <p>
               <b>corr. (kb)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Indel</b>
            </p>
            <p>
               <b>&gt; 5 bp</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Misjoins</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Precision</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b>Recall</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of valid</b>
            </p>
            <p>
               <b>contig</b>
            </p>
            <p>
               <b>(&gt; 200 bp)</b>
            </p>
         </c>
         <c ca="center">
            <p>
               <b># of invalid</b>
            </p>
            <p>
               <b>contig</b>
            </p>
            <p>
               <b>(&gt; 200 bp)</b>
            </p>
         </c>
      </r>
      <r>
         <c cspan="10">
            <hr/>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ABySS</p>
         </c>
         <c ca="center">
            <p>1915</p>
         </c>
         <c ca="center">
            <p>5.9</p>
         </c>
         <c ca="center">
            <p>4.2</p>
         </c>
         <c ca="center">
            <p>34</p>
         </c>
         <c ca="center">
            <p>21</p>
         </c>
         <c ca="center">
            <p>79.78%</p>
         </c>
         <c ca="center">
            <p>86.13%</p>
         </c>
         <c ca="center">
            <p>1744</p>
         </c>
         <c ca="center">
            <p>171</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>ALLPATHS-LG</p>
         </c>
         <c ca="center">
            <p>204</p>
         </c>
         <c ca="center">
            <p>42.5</p>
         </c>
         <c ca="center">
            <p>34.4</p>
         </c>
         <c ca="center">
            <p>37</p>
         </c>
         <c ca="center">
            <p>6</p>
         </c>
         <c ca="center">
            <p>81.49%</p>
         </c>
         <c ca="center">
            <p>81.22%</p>
         </c>
         <c ca="center">
            <p>183</p>
         </c>
         <c ca="center">
            <p>21</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Bambus2</p>
         </c>
         <c ca="center">
            <p>177</p>
         </c>
         <c ca="center">
            <p>93.2</p>
         </c>
         <c ca="center">
            <p>12.8</p>
         </c>
         <c ca="center">
            <p>363</p>
         </c>
         <c ca="center">
            <p>5</p>
         </c>
         <c ca="center">
            <p>48.65%</p>
         </c>
         <c ca="center">
            <p>46.21%</p>
         </c>
         <c ca="center">
            <p>129</p>
         </c>
         <c ca="center">
            <p>48</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>CABOG</p>
         </c>
         <c ca="center">
            <p>322</p>
         </c>
         <c ca="center">
            <p>20.2</p>
         </c>
         <c ca="center">
            <p>17.9</p>
         </c>
         <c ca="center">
            <p>24</p>
         </c>
         <c ca="center">
            <p>10</p>
         </c>
         <c ca="center">
            <p>92.55%</p>
         </c>
         <c ca="center">
            <p>85.21%</p>
         </c>
         <c ca="center">
            <p>310</p>
         </c>
         <c ca="center">
            <p>12</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>MSR-CA</p>
         </c>
         <c ca="center">
            <p>395</p>
         </c>
         <c ca="center">
            <p>22.1</p>
         </c>
         <c ca="center">
            <p>19.1</p>
         </c>
         <c ca="center">
            <p>32</p>
         </c>
         <c ca="center">
            <p>10</p>
         </c>
         <c ca="center">
            <p>93.35%</p>
         </c>
         <c ca="center">
            <p>90.55%</p>
         </c>
         <c ca="center">
            <p>363</p>
         </c>
         <c ca="center">
            <p>32</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>SGA</p>
         </c>
         <c ca="center">
            <p>3066</p>
         </c>
         <c ca="center">
            <p>4.5</p>
         </c>
         <c ca="center">
            <p>2.9</p>
         </c>
         <c ca="center">
            <p>4</p>
         </c>
         <c ca="center">
            <p>4</p>
         </c>
         <c ca="center">
            <p>97.23%</p>
         </c>
         <c ca="center">
            <p>94.56%</p>
         </c>
         <c ca="center">
            <p>2758</p>
         </c>
         <c ca="center">
            <p>308</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>SOAPdenovo</p>
         </c>
         <c ca="center">
            <p>204</p>
         </c>
         <c ca="center">
            <p>131.7</p>
         </c>
         <c ca="center">
            <p>14.3</p>
         </c>
         <c ca="center">
            <p>406</p>
         </c>
         <c ca="center">
            <p>8</p>
         </c>
         <c ca="center">
            <p>70.86%</p>
         </c>
         <c ca="center">
            <p>70.75%</p>
         </c>
         <c ca="center">
            <p>134</p>
         </c>
         <c ca="center">
            <p>70</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>Velvet</p>
         </c>
         <c ca="center">
            <p>583</p>
         </c>
         <c ca="center">
            <p>15.7</p>
         </c>
         <c ca="center">
            <p>14.5</p>
         </c>
         <c ca="center">
            <p>27</p>
         </c>
         <c ca="center">
            <p>8</p>
         </c>
         <c ca="center">
            <p>94.41%</p>
         </c>
         <c ca="center">
            <p>92.37%</p>
         </c>
         <c ca="center">
            <p>545</p>
         </c>
         <c ca="center">
            <p>38</p>
         </c>
      </r>
      <r>
         <c ca="center">
            <p>CloudBrush</p>
         </c>
         <c ca="center">
            <p>661</p>
         </c>
         <c ca="center">
            <p>12.8</p>
         </c>
         <c ca="center">
            <p>12.7</p>
         </c>
         <c ca="center">
            <p>10</p>
         </c>
         <c ca="center">
            <p>2</p>
         </c>
         <c ca="center">
            <p>96.21%</p>
         </c>
         <c ca="center">
            <p>95.85%</p>
         </c>
         <c ca="center">
            <p>567</p>
         </c>
         <c ca="center">
            <p>94</p>
         </c>
      </r>
   </tblbdy></tbl>
<p>As described in <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>, a more aggressive assembler is prone to generate more segmental indels as it strives to maximize the length of its contigs, while a conservative assembler minimizes errors at the expense of contig size. We observed that the SGA assemblies have the fewest errors (misjoins plus indels of &gt; 5 bp), but also the shortest N50 (Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr>). CloudBrush generated the second-fewest errors while achieving a longer N50, which identifies CloudBrush as a conservative assembler that can still assemble longer contigs.</p>
<p>A caveat on the use of the assembly precision and recall for contigs is required. When a misjoin error occurs in a very long contig, the whole contig is invalidated, and the precision and recall decrease in proportion to the contig length. By contrast, when a misjoin error occurs in a shorter contig, the precision and recall may decrease only slightly. We observed that SGA and CloudBrush produced the highest precisions and recalls (Tables <tblr tid="T5">5</tblr> and <tblr tid="T6">6</tblr>), indicating that their contigs contain very few artificial breakpoints introduced by the assembler, which in turn reduces noisy interruptions in subsequent genome annotation and comparative genomic analysis. It is noteworthy that some assemblers, e.g., Bambus2 <abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp> and SOAPdenovo <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp>, have lower precision and recall because their misjoin errors and longer indels occur in longer contigs.</p>
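<p>The caveat above can be illustrated with a small calculation. The sketch below assumes a length-weighted form of the two measures (bases in valid contigs divided by all contig bases for precision, and by the genome size for recall); this form, the function name <it>precision_recall</it>, and the contig lengths are illustrative assumptions, not values taken from the evaluation.</p>

```python
def precision_recall(contigs, genome_size):
    """contigs: list of (length_in_bp, is_valid) pairs.

    Assumed length-weighted definitions (illustrative only):
      precision = valid contig bases / all contig bases
      recall    = valid contig bases / genome size
    """
    total = sum(length for length, _ in contigs)
    valid = sum(length for length, ok in contigs if ok)
    return valid / total, valid / genome_size


# Hypothetical 100 kb genome covered by three contigs.
GENOME = 100000
long_contig_misjoined = [(50000, False), (30000, True), (20000, True)]
short_contig_misjoined = [(50000, True), (30000, True), (20000, False)]

if __name__ == "__main__":
    print(precision_recall(long_contig_misjoined, GENOME))
    print(precision_recall(short_contig_misjoined, GENOME))
```

<p>Under these assumptions, a single misjoin in the longest contig removes half of the valid bases, whereas the same error in the shortest contig removes only one fifth.</p>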
</sec>
<sec>
<st>
<p>Run time analysis</p>
</st>
<p>To evaluate the performance of our approach, we ran CloudBrush on three different sizes of Hadoop clusters using machines leased from hicloud <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp>. The three clusters consisted of 20, 50, and 80 nodes, respectively. Each node had 2 virtual CPUs (each equivalent to a 1 GHz 2007 Xeon processor) and 4 GB of RAM. We used the dataset D3 of <it>C. elegans </it>as the benchmark to analyze the runtime of CloudBrush. The runtime of CloudBrush was measured separately for its two phases: Graph Construction and Graph Simplification. We observed that Graph Construction is the primary bottleneck of CloudBrush with 20, 50, or 80 nodes (Figure <figr fid="F10">10</figr>). However, as the number of nodes increases, the computation time of Graph Construction decreases substantially, while the runtime of Graph Simplification decreases only slightly. Using 20 nodes as a baseline, when the number of nodes is increased 2.5-fold, the construction time decreases 2.3-fold and the simplification time decreases 1.3-fold. When the number of nodes is increased 4-fold, the reductions in runtime are 3.2- and 1.5-fold for construction and simplification, respectively. These experiments show that Graph Construction scales considerably better than Graph Simplification in MapReduce.</p>
<fig id="F10"><title><p>Figure 10</p></title><caption><p>Runtime analysis of Dataset D3 (<it>C. elegans</it>) by CloudBrush</p></caption><text>
   <p><b>Runtime analysis of Dataset D3 (<it>C. elegans</it>) by CloudBrush</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-10"/></fig>
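<p>The scaling figures quoted above can be restated as parallel efficiency, i.e., the observed speedup divided by the factor by which the cluster grew. The short sketch below only re-expresses the fold changes reported in the text; the names <it>parallel_efficiency</it>, <it>reported</it>, and <it>efficiency</it> are illustrative.</p>

```python
def parallel_efficiency(node_ratio, speedup):
    """Observed speedup divided by the growth in cluster size."""
    return round(speedup / node_ratio, 3)


# Node ratio relative to the 20-node baseline -> fold reduction in runtime,
# as reported in the text for the two phases.
reported = {
    2.5: {"construction": 2.3, "simplification": 1.3},  # 50 nodes
    4.0: {"construction": 3.2, "simplification": 1.5},  # 80 nodes
}

efficiency = {
    ratio: {phase: parallel_efficiency(ratio, s) for phase, s in phases.items()}
    for ratio, phases in reported.items()
}

if __name__ == "__main__":
    for ratio, phases in efficiency.items():
        print(ratio, phases)
```

<p>Expressed this way, Graph Construction retains an efficiency of 0.92 on 50 nodes and 0.8 on 80 nodes, whereas Graph Simplification drops to 0.52 and 0.375.</p>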
</sec>
</sec>
<sec>
<st>
<p>Discussion and conclusions</p>
</st>
<p>With the rapid growth of sequence data, genome assembly remains one of the most challenging computational problems in genomics. String graph-based approaches have the benefits of read coherence <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp>, lower memory requirements, and a successful track record in analyzing Sanger sequence data <abbrgrp>
<abbr bid="B23">23</abbr>
</abbrgrp>. In this report, we identify several types of structural defects in string graphs resulting from sequencing errors and short repeats. To remedy the structural defects in string graphs, we developed the EA algorithm that utilizes information from the consensus of graphical neighbors. To validate the effectiveness of the EA algorithm, we used simulated data to define four types of edges and a braid index to help evaluate the structural defects in string graphs. The experimental results show that the EA algorithm efficiently minimizes structural defects in string graphs. Thus far, the EA algorithm is not suitable for studies on SNPs, because it only removes the edges. We suggest that correcting the edges with sequence logos will maintain information for SNP analysis; this is the subject of a future study.</p>
<p>To demonstrate the validity of CloudBrush, we used GAGE benchmarks <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp> to compare CloudBrush with other state-of-the-art assembly tools. The evaluation results show that CloudBrush is a conservative assembler that nevertheless generates precise contigs with moderate N50 lengths, which helps avoid error propagation in downstream analysis. We also tested the scalability of CloudBrush using three different sizes of Hadoop clusters to assemble ~7 Gbp of <it>C. elegans </it>data on the hicloud&#8482; computing service <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp>. The results show that graph construction is the primary performance bottleneck and that it scales well in the MapReduce framework (a 3.2-fold reduction in runtime on four times as many nodes).</p>
<p>In future studies, we will incorporate scaffolding and mate-pair analysis into the MapReduce pipeline. Combining state-of-the-art error correction with our edge analysis is another subject worthy of investigation. We believe that CloudBrush will achieve a better contig N50 with fewer misjoin errors once these two issues are resolved. Adapting the pipeline to third-generation sequencing technologies is also an important direction of investigation.</p>
</sec>
<sec>
<st>
<p>Methods</p>
</st>
<p>We previously described a string graph-based assembly algorithm using MapReduce called CloudBrush <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>. A MapReduce program can readily be organized as a modular pipeline, so it can be extended as improved algorithms are developed. In this study, we expanded CloudBrush by revising its pipeline and adding the EA algorithm. Below, we introduce the principles of graph processing in MapReduce and the CloudBrush pipeline. The code is written in Java, and readers may refer to <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp> for further details concerning the implementation of the procedures in the MapReduce framework.</p>
<sec>
<st>
<p>Distributed graph processing in MapReduce</p>
</st>
<p>Genome assembly has been modelled as a graph-theoretic problem. Graph models of particular interest include de Bruijn graphs and string graphs, in either directed or bidirected form. Here we use a bidirected string graph to model the genome assembly problem.</p>
<p>In a bidirected string graph, nodes represent reads and edges represent the overlaps between reads. To model the double-stranded nature of DNA, a read can be interpreted in either forward or reverse-complement directions. For each edge that represents an ordered pair of nodes with overlapping reads, four possible types exist, according to the directions of the two reads: forward-forward, reverse-reverse, forward-reverse, and reverse-forward. The type attribute is incorporated into each edge of the bidirected string graph. It is noteworthy that traversing the bidirected string graph should follow a consistent rule, i.e., the directions of in-links and out-links of the same node should be consistent. In other words, the read of a specific node can only be interpreted in a unique direction in one path of traversal.</p>
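<p>The traversal rule can be made concrete with a minimal sketch. It assumes, purely for illustration, that an edge type is written as two letters, the first giving the direction in which the source read is interpreted and the second the direction of the target read ("F" for forward, "R" for reverse complement). The function names are hypothetical and are not taken from the CloudBrush code.</p>

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def oriented(seq, direction):
    """Interpret a read in the forward ("F") or reverse-complement ("R") direction."""
    return seq if direction == "F" else reverse_complement(seq)


def is_consistent(path):
    """Check the traversal rule on a path of (source, target, edge_type) edges.

    Two consecutive edges are consistent when they share the middle node and
    that node is read in the same direction on its in-link and its out-link.
    """
    for (_, mid_in, type_in), (mid_out, _, type_out) in zip(path, path[1:]):
        if mid_in != mid_out or type_in[1] != type_out[0]:
            return False
    return True


if __name__ == "__main__":
    print(oriented("ACGG", "R"))
    print(is_consistent([("r1", "r2", "FR"), ("r2", "r3", "RF")]))
    print(is_consistent([("r1", "r2", "FF"), ("r2", "r3", "RF")]))
```

<p>In the second path, r2 is entered in the forward direction but left in the reverse-complement direction, so the path violates the rule.</p>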
<p>The MapReduce framework <abbrgrp>
<abbr bid="B16">16</abbr>
<abbr bid="B17">17</abbr>
</abbrgrp> uses <it>key-value </it>pairs as the only data type to distribute the computations. To manipulate a bidirected string graph in MapReduce, we use a <it>node adjacency list </it>to represent the graph, which stores the <it>node id </it>(i.e., the identifier of a node) as the <it>key </it>and the <it>node data structure </it>as the <it>value</it>. The <it>node data structure </it>contains features of the node as well as a list of its outgoing edges and their features. The <it>node adjacency list </it>is a compact representation and allows easy traversal along the outgoing links. In MapReduce, a basic unit of computation is usually localized to a node's internal state and its neighbors in the graph. The results of computations on a node are emitted as <it>values</it>, each <it>keyed </it>with the identifier of a neighbor node. Conceptually, we can think of this process as "passing" the results of computation along out-links. In the reducer, the algorithm receives all partial results having the same destination <it>node id </it>and performs the computation. Subsequently, the data structure corresponding to each node is updated and written back to the distributed file system.</p>
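<p>One round of this message passing can be sketched in a few lines. The example below is an in-memory simulation written in Python for brevity, not Hadoop code and not the Java implementation; the record layout ("seq", "out", "incoming") and the function names are assumptions made for illustration. Each node forwards its identifier and overlap size along its out-links, and the reducer attaches the received messages to the destination node.</p>

```python
from collections import defaultdict


def map_phase(adjacency):
    """adjacency: node id -> {"seq": str, "out": [(neighbor id, overlap)]}."""
    for node_id, data in adjacency.items():
        # Re-emit the node itself so the reducer can rebuild its record.
        yield node_id, ("node", data)
        # Pass a message along every out-link, keyed by the neighbor's id.
        for neighbor_id, overlap in data["out"]:
            yield neighbor_id, ("msg", (node_id, overlap))


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Update each node with the messages addressed to it.

    Assumes every key also carries its own "node" record.
    """
    updated = {}
    for node_id, values in grouped.items():
        data = next(v for tag, v in values if tag == "node")
        incoming = sorted(v for tag, v in values if tag == "msg")
        updated[node_id] = dict(data, incoming=incoming)
    return updated


graph = {
    "r1": {"seq": "ACGTAC", "out": [("r2", 3)]},
    "r2": {"seq": "TACGGA", "out": []},
}
result = reduce_phase(shuffle(map_phase(graph)))

if __name__ == "__main__":
    print(result["r2"]["incoming"])
```

<p>After the round, r2 knows that r1 links to it with an overlap of 3, which is the kind of neighbor information the later steps rely on.</p>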
</sec>
<sec>
<st>
<p>CloudBrush: string graph assembly using MapReduce</p>
</st>
<p>Since Edge Adjustment can effectively and efficiently manage complex graph structures (see Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr>), we removed the path search and SNEA operation modules, which were used to manage braid structures and were the scalability bottleneck in the previous version. The new pipeline of CloudBrush is summarized as follows. First, we construct the string graph in four steps: retaining non-redundant reads as vertices, finding overlaps between reads, performing edge adjustment, and removing redundant transitive edges. Second, we simplify the string graph by compressing non-branching paths, removing tips and bubbles using algorithms similar to those used by Contrail <abbrgrp>
<abbr bid="B10">10</abbr>
</abbrgrp>, and reusing Edge Adjustment as an option to simplify the graph further. Figure <figr fid="F11">11</figr> displays the workflow of CloudBrush.</p>
<fig id="F11"><title><p>Figure 11</p></title><caption><p>Workflow of CloudBrush assembler with Edge Adjustment</p></caption><text>
   <p><b>Workflow of CloudBrush assembler with Edge Adjustment</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-11"/></fig>
</sec>
<sec>
<st>
<p>Graph construction in MapReduce</p>
</st>
<sec>
<st>
<p>1. Retaining non-redundant reads as vertices</p>
</st>
<p>A sequence read may have several redundant copies in the dataset because of oversampling in Solexa or SOLiD sequencing. The first step in graph construction is to merge redundant copies of the same read into a single node. We implemented a distributed prefix tree in MapReduce to extend Edena's prefix-tree approach <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp>.</p>
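<p>The effect of this step can be sketched without the prefix tree. The example below deliberately swaps in a simpler technique, grouping reads by their exact sequence, to show only the result of the merge; it is not the distributed prefix tree described above. The "coverage" field and the choice of the smallest read identifier as the node id are assumptions for illustration.</p>

```python
from collections import defaultdict


def merge_duplicate_reads(reads):
    """reads: iterable of (read id, sequence) pairs.

    Returns node id -> {"seq": sequence, "coverage": number of copies}.
    """
    groups = defaultdict(list)
    # Map side: key every read by its sequence.
    for read_id, seq in reads:
        groups[seq].append(read_id)
    # Reduce side: keep one node per distinct sequence.
    nodes = {}
    for seq, ids in groups.items():
        nodes[min(ids)] = {"seq": seq, "coverage": len(ids)}
    return nodes


sample = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "GGCA")]
nodes = merge_duplicate_reads(sample)

if __name__ == "__main__":
    print(nodes)
```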
</sec>
<sec>
<st>
<p>2. Finding pairwise overlaps between reads</p>
</st>
<p>Read-read overlaps are basic clues in connecting reads to contigs; however, finding overlaps between reads is often the most computationally intensive step in string graph-based assemblies. To find all the pairs of read-read overlaps, we adopted a prefix-and-extend strategy to speed up construction of the string graph <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>. The strategy consists of two phases, the prefix phase and the extend phase. In the prefix phase, a pair of reads is reported if the prefix of one of the reads exactly matches a substring of the other read at the given seed length. The pair is then said to have a "brush." In the extend phase, pairs of reads having a brush are further validated starting from the brush. If the exact match extends to one end of the second read, then an edge containing the two nodes of the two reads is created.</p>
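<p>The two phases can be sketched as follows. This is a single-machine illustration on the forward strand only, assuming exact matches; the function name <it>find_overlaps</it>, the seed index, and the decision to skip matches that start at position 0 (containments) are assumptions of the sketch rather than details of the published implementation.</p>

```python
from collections import defaultdict


def find_overlaps(reads, seed_len):
    """reads: read id -> sequence. Returns sorted (source, target, overlap) edges."""
    # Prefix phase: index every substring of seed length by (read id, position),
    # then look up the prefix of each read to find candidate pairs ("brushes").
    index = defaultdict(list)
    for read_id, seq in reads.items():
        for pos in range(len(seq) - seed_len + 1):
            index[seq[pos:pos + seed_len]].append((read_id, pos))

    edges = []
    for target_id, target_seq in reads.items():
        for source_id, pos in index.get(target_seq[:seed_len], []):
            if source_id == target_id or pos == 0:
                continue  # skip self matches and containments
            # Extend phase: the match must run to the end of the source read.
            source_seq = reads[source_id]
            overlap = len(source_seq) - pos
            if source_seq[pos:] == target_seq[:overlap]:
                edges.append((source_id, target_id, overlap))
    return sorted(edges)


reads = {"r1": "ACGTACGGA", "r2": "ACGGATTC", "r3": "GATTCAAG"}
edges = find_overlaps(reads, seed_len=4)

if __name__ == "__main__":
    print(edges)
```

<p>Here the prefix ACGG of r2 occurs inside r1, and the match extends to the end of r1, giving an edge from r1 to r2 with an overlap of 5; the same holds for r2 and r3.</p>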
</sec>
<sec>
<st>
<p>3. Edge Adjustment</p>
</st>
<p>After finding overlaps as edges, we used the EA algorithm on the graph structure. To perform the EA algorithm in the MapReduce framework, we passed the neighbors' edges for each node r<it>
<sub>i </sub>
</it>such that r<it>
<sub>i </sub>
</it>knows all of the neighboring nodes in the reducer. Once a node possesses all of its neighbors' information, the EA algorithm can easily compute the consensus sequence from the neighbors' content and perform the edge adjustment as described in the Results section. Figure <figr fid="F12">12</figr> shows the pseudo code of the MapReduce version of the Edge Adjustment algorithm. It is noteworthy that, in the MapReduce framework, each node computes its own consensus sequence in parallel.</p>
<fig id="F12"><title><p>Figure 12</p></title><caption><p>The pseudo code of the EA algorithm in MapReduce version</p></caption><text>
   <p><b>The pseudo code of the EA algorithm in MapReduce version</b>.</p>
</text><graphic file="1471-2164-13-S7-S28-12"/></fig>
</sec>
<sec>
<st>
<p>4. Reducing transitive edges</p>
</st>
<p>After the EA algorithm, the graph still has superfluous edges due to oversampling in sequencing. Consider two paths of consecutively overlapping nodes r<it>
<sub>a</sub>
</it>&#8594;r<it>
<sub>b</sub>
</it>&#8594;r<it>
<sub>c </sub>
</it>and r<it>
<sub>a</sub>
</it>&#8594;r<it>
<sub>c</sub>
</it>; r<it>
<sub>a</sub>
</it>&#8594;r<it>
<sub>c </sub>
</it>is transitive because it spells the same sequence as r<it>
<sub>a</sub>
</it>&#8594;r<it>
<sub>b</sub>
</it>&#8594;r<it>
<sub>c</sub>
</it>, but ignores the middle node r<it>
<sub>b</sub>
</it>.</p>
<p>To perform the transitive reduction in the MapReduce framework, we passed the neighbors' edges for each node r<it>
<sub>i </sub>
</it>such that r<it>
<sub>i </sub>
</it>knows all the neighboring nodes in the reducer. Unlike in de Bruijn graphs, overlap size information is attached to each edge of our bidirected string graph. Therefore, we can sort the neighbors by overlap size and remove transitive edges in order.</p>
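<p>A simplified version of this step can be sketched as follows. The example works on a forward-strand graph held in memory and assumes that a larger overlap means a nearer neighbor, so neighbors are visited in decreasing order of overlap and any node reachable through a nearer kept neighbor is treated as transitive. The function name <it>reduce_transitive</it> and the graph layout are illustrative assumptions; the sketch omits edge types and the length checks a full implementation would need.</p>

```python
def reduce_transitive(graph):
    """graph: node id -> list of (neighbor id, overlap size).

    Returns a graph of the same shape with transitive edges removed.
    """
    reduced = {}
    for node, out_edges in graph.items():
        # Larger overlap first: the nearest neighbor is examined first.
        ordered = sorted(out_edges, key=lambda edge: edge[1], reverse=True)
        transitive = set()
        for neighbor, _ in ordered:
            if neighbor in transitive:
                continue
            # Anything the kept neighbor links to is reachable through it.
            for two_hops, _ in graph.get(neighbor, []):
                transitive.add(two_hops)
        reduced[node] = [(n, o) for n, o in ordered if n not in transitive]
    return reduced


graph = {
    "ra": [("rc", 4), ("rb", 7)],
    "rb": [("rc", 6)],
    "rc": [],
}
reduced = reduce_transitive(graph)

if __name__ == "__main__":
    print(reduced)
```

<p>In this example the edge from ra to rc is dropped because rc is already reached through rb, which overlaps ra more extensively.</p>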
</sec>
</sec>
<sec>
<st>
<p>Graph simplification in MapReduce</p>
</st>
<p>After constructing the string graph, we used several techniques to simplify the graph, including path compression, tip and bubble removal, and low coverage node removal. Path compression merges a chain of nodes, each having one in-link and one out-link along a specific strand direction, into a single node. After path compression, tips and bubbles are easily recognized locally. Our MapReduce implementations of path compression, tip and bubble removal, and low coverage node removal are similar to those of Contrail <abbrgrp>
<abbr bid="B10">10</abbr>
</abbrgrp>, except that we add a field for overlap size to the data structure used for message passing between nodes, tailored for string graphs. Additionally, we provide an option to reuse the EA algorithm in graph simplification. In this study, we only performed the EA algorithm on nodes whose neighbors were dead ends of the graph; more broadly, the EA algorithm can be performed iteratively until no dead-end neighbors can be removed.</p>
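<p>Path compression on a string graph can be sketched as follows. This is a serial, forward-strand illustration, not the MapReduce implementation; the function name <it>compress_paths</it> and the data layout are assumptions. It shows why the overlap size must travel with each edge: when two nodes are merged, only the part of the second read beyond the overlap is appended.</p>

```python
def compress_paths(seqs, edges):
    """seqs: node id -> sequence; edges: node id -> list of (successor, overlap).

    Returns the sequences of the merged non-branching paths.
    Chains that form an isolated cycle are not handled in this sketch.
    """
    preds = {node: [] for node in seqs}
    for node, outs in edges.items():
        for succ, _ in outs:
            preds[succ].append(node)

    def extends_predecessor(node):
        # One in-link whose source has a single out-link: the node is
        # absorbed into the chain that reaches it.
        return len(preds[node]) == 1 and len(edges.get(preds[node][0], [])) == 1

    contigs = []
    for start in sorted(seqs):
        if extends_predecessor(start):
            continue
        merged, cur = seqs[start], start
        while len(edges.get(cur, [])) == 1:
            nxt, overlap = edges[cur][0]
            if len(preds[nxt]) != 1 or nxt == start:
                break
            # Append only the bases beyond the overlap with the previous node.
            merged += seqs[nxt][overlap:]
            cur = nxt
        contigs.append(merged)
    return contigs


seqs = {"r1": "ACGTAC", "r2": "TACGGA", "r3": "GGATTC"}
chain = {"r1": [("r2", 3)], "r2": [("r3", 3)]}
contigs = compress_paths(seqs, chain)

if __name__ == "__main__":
    print(contigs)
```

<p>The three reads collapse into one node spelling ACGTACGGATTC. If r2 had a second out-link, the chain would stop at r2 and the successors would remain separate nodes.</p>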
</sec>
</sec>
<sec>
<st>
<p>Competing interests</p>
</st>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>YJC and CCC were equal contributors in developing the whole idea and writing the manuscript. CLC and JMH were leaders of the team and participated in the design of the study and revising the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>The authors wish to thank Chunghwa Telecom Co. and the National Communication Project of Taiwan for providing the cloud computing resources and technical support. They also wish to thank Jazz Yao-Tsung Wang at the National Center for High-Performance Computing for his help with the efficient deployment of Hadoop clusters. YJC, CCC, and JMH were partially supported by National Science Council grant NSC 99-2321-B-001-025-.</p>
<p>This article has been published as part of <it>BMC Genomics </it>Volume 13 Supplement 7, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/bmcgenomics/supplements/13/S7</url>.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>The case for cloud computing in genome informatics</p></title><aug><au><snm>Stein</snm><fnm>LD</fnm></au></aug><source>Genome Biology</source><pubdate>2010</pubdate><volume>11</volume><fpage>207</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2010-11-5-207</pubid><pubid idtype="pmcid">2898083</pubid><pubid idtype="pmpid" link="fulltext">20441614</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Assembly algorithms for next-generation sequencing data</p></title><aug><au><snm>Miller</snm><fnm>JR</fnm></au><au><snm>Koren</snm><fnm>S</fnm></au><au><snm>Sutton</snm><fnm>G</fnm></au></aug><source>Genomics</source><pubdate>2010</pubdate><volume>95</volume><fpage>315</fpage><lpage>327</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.ygeno.2010.03.001</pubid><pubid idtype="pmcid">2874646</pubid><pubid idtype="pmpid" link="fulltext">20211242</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Fragment assembly with double-barreled data</p></title><aug><au><snm>Pevzner</snm><fnm>P</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Waterman</snm><fnm>M</fnm></au></aug><source>Proceedings of the National Academy of Sciences</source><pubdate>2001</pubdate><volume>98</volume><issue>17</issue><fpage>9748</fpage><lpage>9753</lpage><xrefbib><pubid idtype="doi">10.1073/pnas.171285098</pubid></xrefbib></bibl><bibl id="B4"><title><p>Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs</p></title><aug><au><snm>Zerbino</snm><fnm>D</fnm></au><au><snm>Birney</snm><fnm>E</fnm></au></aug><source>Genome Research</source><pubdate>2008</pubdate></bibl><bibl id="B5"><title><p>Short read fragment assembly of bacterial genomes</p></title><aug><au><snm>Chaisson</snm><fnm>MJ</fnm></au><au><snm>Pevzner</snm><fnm>PA</fnm></au></aug><source>Genome Research</source><pubdate>2008</pubdate><volume>18</volume><fpage>324</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.7088808</pubid><pubid 
idtype="pmcid">2203630</pubid><pubid idtype="pmpid" link="fulltext">18083777</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>High-Quality Draft Assemblies of Mammalian Genomes from Massively Parallel Sequence Data</p></title><aug><au><snm>Gnerre</snm><fnm>S</fnm></au><au><snm>MacCallum</snm><fnm>I</fnm></au><au><snm>Przybylski</snm><fnm>D</fnm></au><au><snm>Ribeiro</snm><fnm>FJ</fnm></au><au><snm>Burton</snm><fnm>JN</fnm></au><au><snm>Walker</snm><fnm>BJ</fnm></au><au><snm>Sharpe</snm><fnm>T</fnm></au><au><snm>Hall</snm><fnm>G</fnm></au><au><snm>Shea</snm><fnm>TP</fnm></au><au><snm>Sykes</snm><fnm>S</fnm></au><au><snm>Berlin</snm><fnm>AM</fnm></au><au><snm>Aird</snm><fnm>D</fnm></au><au><snm>Costello</snm><fnm>M</fnm></au><au><snm>Daza</snm><fnm>R</fnm></au><au><snm>Williams</snm><fnm>L</fnm></au><au><snm>Nicol</snm><fnm>R</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Nusbaum</snm><fnm>C</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au><au><snm>Jaffe</snm><fnm>DB</fnm></au></aug><source>PNAS</source><pubdate>2011</pubdate><volume>108</volume><fpage>1513</fpage><lpage>1518</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.1017351108</pubid><pubid idtype="pmcid">3029755</pubid><pubid idtype="pmpid" link="fulltext">21187386</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>ABySS: A parallel assembler for short read sequence data</p></title><aug><au><snm>Simpson</snm><fnm>JT</fnm></au><au><snm>Wong</snm><fnm>K</fnm></au><au><snm>Jackman</snm><fnm>SD</fnm></au><au><snm>Schein</snm><fnm>JE</fnm></au><au><snm>Jones</snm><fnm>SJM</fnm></au><au><snm>Birol</snm><fnm>&#304;</fnm></au></aug><source>Genome Research</source><pubdate>2009</pubdate><volume>19</volume><fpage>1117</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.089532.108</pubid><pubid idtype="pmcid">2694472</pubid><pubid idtype="pmpid" link="fulltext">19251739</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>De novo assembly of human genomes with 
massively parallel short read sequencing</p></title><aug><au><snm>Li</snm><fnm>R</fnm></au><au><snm>Zhu</snm><fnm>H</fnm></au><au><snm>Ruan</snm><fnm>J</fnm></au><au><snm>Qian</snm><fnm>W</fnm></au><au><snm>Fang</snm><fnm>X</fnm></au><au><snm>Shi</snm><fnm>Z</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Li</snm><fnm>S</fnm></au><au><snm>Shan</snm><fnm>G</fnm></au><au><snm>Kristiansen</snm><fnm>K</fnm></au><au><snm>Li</snm><fnm>S</fnm></au><au><snm>Yang</snm><fnm>H</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au></aug><source>Genome Research</source><pubdate>2010</pubdate><volume>20</volume><fpage>265</fpage><lpage>272</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.097261.109</pubid><pubid idtype="pmcid">2813482</pubid><pubid idtype="pmpid" link="fulltext">20019144</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>IDBA &#8211; A Practical Iterative de Bruijn Graph De Novo Assembler</p></title><aug><au><snm>Peng</snm><fnm>Y</fnm></au><au><snm>Leung</snm><fnm>H</fnm></au><au><snm>Yiu</snm><fnm>S</fnm></au><au><snm>Chin</snm><fnm>F</fnm></au></aug><source>Research in Computational Molecular Biology (RECOMB 2010)</source><pubdate>2010</pubdate><fpage>426</fpage><lpage>440</lpage></bibl><bibl id="B10"><title><p>Contrail: Assembly of Large Genomes using Cloud Computing</p></title><aug><au><snm>Schatz</snm><fnm>M</fnm></au><au><snm>Sommer</snm><fnm>D</fnm></au><au><snm>Kelley</snm><fnm>D</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au></aug><url>http://contrail-bio.sf.net/</url></bibl><bibl id="B11"><title><p>The fragment assembly string graph</p></title><aug><au><snm>Myers</snm><fnm>E</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><fpage>ii79</fpage><lpage>ii85</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti1114</pubid><pubid idtype="pmpid" link="fulltext">16204131</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>De novo bacterial genome 
sequencing: Millions of very short reads assembled on a desktop computer</p></title><aug><au><snm>Hernandez</snm><fnm>D</fnm></au><au><snm>Francois</snm><fnm>P</fnm></au><au><snm>Farinelli</snm><fnm>L</fnm></au><au><snm>Osteras</snm><fnm>M</fnm></au><au><snm>Schrenzel</snm><fnm>J</fnm></au></aug><source>Genome Research</source><pubdate>2008</pubdate><volume>18</volume><fpage>802</fpage><lpage>809</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.072033.107</pubid><pubid idtype="pmcid">2336802</pubid><pubid idtype="pmpid" link="fulltext">18332092</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>Parallel short sequence assembly of transcriptomes</p></title><aug><au><snm>Jackson</snm><fnm>B</fnm></au><au><snm>Schnable</snm><fnm>P</fnm></au><au><snm>Aluru</snm><fnm>S</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2009</pubdate><volume>10</volume><fpage>S14</fpage><xrefbib><pubidlist><pubid idtype="pmcid">2762063</pubid><pubid idtype="pmpid" link="fulltext">19828074</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Efficient De Novo Assembly of Large Genomes Using Compressed Data Structures</p></title><aug><au><snm>Simpson</snm><fnm>JT</fnm></au><au><snm>Durbin</snm><fnm>R</fnm></au></aug><source>Genome Research</source><pubdate>2012</pubdate><volume>22</volume><fpage>549</fpage><lpage>556</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.126953.111</pubid><pubid idtype="pmcid">3290790</pubid><pubid idtype="pmpid" link="fulltext">22156294</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs</p></title><aug><au><snm>Chang</snm><fnm>Y-J</fnm></au><au><snm>Chen</snm><fnm>C-C</fnm></au><au><snm>Chen</snm><fnm>C-L</fnm></au><au><snm>Ho</snm><fnm>J-M</fnm></au></aug><source>Proceedings of IEEE International Conference on Cloud Computing (CLOUD 2012)</source><publisher>Hawaii, 
USA</publisher><pubdate>2012</pubdate></bibl><bibl id="B16"><title><p>MapReduce: Simplified data processing on large clusters</p></title><aug><au><snm>Dean</snm><fnm>J</fnm></au><au><snm>Ghemawat</snm><fnm>S</fnm></au></aug><source>Communications of the ACM</source><pubdate>2008</pubdate><volume>51</volume><fpage>107</fpage><lpage>113</lpage></bibl><bibl id="B17"><title><p>Hadoop: The Definitive Guide</p></title><aug><au><snm>White</snm><fnm>T</fnm></au></aug><source>O'Reilly Media</source><pubdate>2009</pubdate><xrefbib><pubid idtype="pmpid" link="fulltext">23175732</pubid></xrefbib></bibl><bibl id="B18"><title><p>GAGE: A Critical Evaluation of Genome Assemblies and Assembly Algorithms</p></title><aug><au><snm>Salzberg</snm><fnm>SL</fnm></au><au><snm>Phillippy</snm><fnm>AM</fnm></au><au><snm>Zimin</snm><fnm>A</fnm></au><au><snm>Puiu</snm><fnm>D</fnm></au><au><snm>Magoc</snm><fnm>T</fnm></au><au><snm>Koren</snm><fnm>S</fnm></au><au><snm>Treangen</snm><fnm>TJ</fnm></au><au><snm>Schatz</snm><fnm>MC</fnm></au><au><snm>Delcher</snm><fnm>AL</fnm></au><au><snm>Roberts</snm><fnm>M</fnm></au><au><snm>Mar&#231;ais</snm><fnm>G</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au><au><snm>Yorke</snm><fnm>JA</fnm></au></aug><source>Genome Research</source><pubdate>2012</pubdate><volume>22</volume><fpage>557</fpage><lpage>567</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.131383.111</pubid><pubid idtype="pmcid">3290791</pubid><pubid idtype="pmpid" link="fulltext">22147368</pubid></pubidlist></xrefbib></bibl><bibl id="B19"><title><p>Hicloud computer-as-a-service (CaaS)</p></title><url>http://hicloud.hinet.net/</url></bibl><bibl id="B20"><title><p>Enhancing De Novo Transcriptome Assembly by Incorporating Multiple Overlap Sizes</p></title><aug><au><snm>Chen</snm><fnm>C-C</fnm></au><au><snm>Lin</snm><fnm>W-D</fnm></au><au><snm>Chang</snm><fnm>Y-J</fnm></au><au><snm>Chen</snm><fnm>C-L</fnm></au><au><snm>Ho</snm><fnm>J-M</fnm></au></aug><source>ISRN 
Bioinformatics</source><pubdate>2012</pubdate><volume>2012</volume><fpage>1</fpage><lpage>9</lpage></bibl><bibl id="B21"><title><p>A greedy algorithm for aligning DNA sequences</p></title><aug><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Schwartz</snm><fnm>S</fnm></au><au><snm>Wagner</snm><fnm>L</fnm></au><au><snm>Miller</snm><fnm>W</fnm></au></aug><source>Journal of Computational Biology</source><pubdate>2000</pubdate><volume>7</volume><fpage>203</fpage><lpage>214</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/10665270050081478</pubid><pubid idtype="pmpid" link="fulltext">10890397</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Bambus 2: scaffolding metagenomes</p></title><aug><au><snm>Koren</snm><fnm>S</fnm></au><au><snm>Treangen</snm><fnm>TJ</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au></aug><source>Bioinformatics</source><pubdate>2011</pubdate><volume>27</volume><fpage>2964</fpage><lpage>2971</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btr520</pubid><pubid idtype="pmcid">3198580</pubid><pubid idtype="pmpid" link="fulltext">21926123</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>Assembly of large genomes using second-generation sequencing</p></title><aug><au><snm>Schatz</snm><fnm>MC</fnm></au><au><snm>Delcher</snm><fnm>AL</fnm></au><au><snm>Salzberg</snm><fnm>SL</fnm></au></aug><source>Genome Research</source><pubdate>2010</pubdate></bibl></refgrp>
</bm></art>