<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1471-2105-13-189</ui><ji>1471-2105</ji><fm><dochead>Methodology article</dochead><bibl><title><p>AGORA: Assembly Guided by Optical Restriction Alignment</p></title><aug><au id="A1"><snm>Lin</snm><mi>C</mi><fnm>Henry</fnm><insr iid="I1"/><email>henrylin@umiacs.umd.edu</email></au><au id="A2"><snm>Goldstein</snm><fnm>Steve</fnm><insr iid="I2"/><insr iid="I3"/><insr iid="I4"/><email>sgoldstein@wisc.edu</email></au><au id="A3"><snm>Mendelowitz</snm><fnm>Lee</fnm><insr iid="I1"/><insr iid="I5"/><email>lmendelo@math.umd.edu</email></au><au id="A4"><snm>Zhou</snm><fnm>Shiguo</fnm><insr iid="I2"/><insr iid="I3"/><insr iid="I4"/><email>szhou@wisc.edu</email></au><au id="A5"><snm>Wetzel</snm><fnm>Joshua</fnm><insr iid="I6"/><email>jlwetzel@cs.princeton.edu</email></au><au id="A6"><snm>Schwartz</snm><mi>C</mi><fnm>David</fnm><insr iid="I2"/><insr iid="I3"/><insr iid="I4"/><email>dcschwartz@wisc.edu</email></au><au id="A7" ca="yes"><snm>Pop</snm><fnm>Mihai</fnm><insr iid="I1"/><email>mpop@umiacs.umd.edu</email></au></aug><insg><ins id="I1"><p>Center for Bioinformatics and Computational Biology, University of Maryland-College Park, College Park, MD, USA</p></ins><ins id="I2"><p>Laboratory for Molecular and Computational Genomics, University of Wisconsin-Madison, Madison, WI, USA</p></ins><ins id="I3"><p>Laboratory of Genetics, University of Wisconsin-Madison, Madison, WI, USA</p></ins><ins id="I4"><p>Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA</p></ins><ins id="I5"><p>Applied Mathematics and Scientific Computation Program, University of Maryland-College Park, College Park, MD, USA</p></ins><ins id="I6"><p>Department of Computer Science, and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA</p></ins></insg><source>BMC 
Bioinformatics</source><issn>1471-2105</issn><pubdate>2012</pubdate><volume>13</volume><issue>1</issue><fpage>189</fpage><url>http://www.biomedcentral.com/1471-2105/13/189</url><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-13-189</pubid><pubid idtype="pmpid">22856673</pubid></pubidlist></xrefbib></bibl><history><rec><date><day>28</day><month>3</month><year>2012</year></date></rec><acc><date><day>28</day><month>6</month><year>2012</year></date></acc><pub><date><day>2</day><month>8</month><year>2012</year></date></pub></history><cpyrt><year>2012</year><collab>Lin et al.; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt><abs><sec><st><p>Abstract</p></st><sec><st><p>Background</p></st><p>Genome assembly is difficult due to repeated sequences within the genome, which create ambiguities and cause the final assembly to be broken up into many separate sequences (contigs). Long range linking information, such as mate-pairs or mapping data, is necessary to help assembly software resolve repeats, thereby leading to a more complete reconstruction of genomes. Prior work has used optical maps for validating assemblies and scaffolding contigs, after an initial assembly has been produced. However, optical maps have not previously been used within the genome assembly process. Here, we use optical map information within the popular de Bruijn graph assembly paradigm to eliminate paths in the de Bruijn graph which are not consistent with the optical map and help determine the correct reconstruction of the genome.</p></sec><sec><st><p>Results</p></st><p>We developed a new algorithm called AGORA: Assembly Guided by Optical Restriction Alignment. 
AGORA is the first algorithm to use optical map information directly within the de Bruijn graph framework to help produce an accurate assembly of a genome that is consistent with the optical map information provided. Our simulations on bacterial genomes show that AGORA is effective at producing assemblies closely matching the reference sequences.</p><p>Additionally, we show that noise in the optical map can have a strong impact on the final assembly quality for some complex genomes, and we also measure how various characteristics of the starting de Bruijn graph may impact the quality of the final assembly. Lastly, we show that a proper choice of restriction enzyme for the optical map may substantially improve the quality of the final assembly.</p></sec><sec><st><p>Conclusions</p></st><p>Our work shows that optical maps can be used effectively to assemble genomes within the de Bruijn graph assembly framework. Our experiments also provide insights into the characteristics of the mapping data that most affect the performance of our algorithm, indicating the potential benefit of more accurate optical mapping technologies, such as nano-coding.</p></sec></sec></abs></fm><bdy><sec><st><p>Background</p></st><p>Although next-generation genome sequencing approaches have improved greatly over the last decade, genome sequencing and assembly still rely primarily on shotgun sequencing <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>. Genome assembly, the process of reconstructing the original genome sequence from sequence reads, is made difficult by the fact that the most commonly used sequencing technologies only produce reads between 35 base pairs (bp) and 1 kilobase pair (kbp) long. Repetitive sequences longer than the read length lead to ambiguities in the assembly, and additional information from paired-end reads <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> is required to resolve those ambiguities. 
However, information from paired-end reads is often still insufficient for a comprehensive reconstruction of the original genome sequence <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>.</p><p>Genome assembly is aided by Optical Mapping--a single molecule system <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp> for the construction of genome-wide ordered restriction maps through the assembly of (400&#8211;500 kbp) genomic DNA, restriction digested and mapped <it>in situ.</it> The optical mapping system provides estimates of the locations of restriction-enzyme recognition sequences within a genome. Although optical maps have been used previously to provide a means for scaffolding and validation, in addition to discernment of structural variants <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B11">11</abbr></abbrgrp>, optical map data is commonly used only <it>after</it> a nascent sequence is produced <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> by a genome assembler.</p><p>Here, we explore an alternative approach for genome assembly leveraging optical map data within the popular de Bruijn graph assembly paradigm, developing an algorithm we call AGORA: Assembly Guided by Optical Restriction Alignment. We analyze the advantages of utilizing AGORA with optical map information in constructing accurate and comprehensive assemblies. Our algorithm and analysis present the first results showing the benefits of using optical maps <it>within</it> the de Bruijn graph assembly paradigm.</p><p>Initial simulations show that our algorithm is effective at providing comprehensive assemblies of bacterial genomes, given an optical map with simulated errors and an error-free de Bruijn graph with k-mer size 100. The majority of our assemblies match the original reference sequences very closely. 
We also measure how the complexity of a genome's repeat structure, reflected in characteristics of the de Bruijn graph, impacts AGORA's assembly accuracy. In addition, we investigate how optical mapping error and the choice of restriction enzyme can affect the quality of the final sequence assembly. Moreover, we verify that AGORA works with an experimentally determined optical map from the <it>Yersinia pestis</it> KIM genome <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Finally, we also explore the applicability of our methods to assembly graphs produced from real sequence reads with errors, and provide a comparison of our results to what can be achieved through the use of mate-pairs (as described in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>).</p><sec><st><p>Optical mapping</p></st><p>The Optical Mapping system was first described in 1993 <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp> as a single molecule platform capable of whole genome analysis and as a way to quickly construct physical maps to aid in genome assembly. Optical Mapping produces ordered restriction maps constructed from individual molecules (Rmaps), each consisting of an ordered list of restriction fragments identified within the molecule after digestion with a restriction enzyme. The construction of a genome-wide optical map employs assembly techniques akin to those used for sequence assembly <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B16">16</abbr></abbrgrp>, modified to account for error in the Rmaps <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr></abbrgrp>. 
The resulting genome-wide optical map produced by this process provides a globally ordered list of restriction fragment sizes across the entire genome.</p><p>Previously, algorithms have been developed <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp> to use optical maps to verify and scaffold contigs (partial segments of genome sequence). This scaffolding and validation process is done by first computing for each contig an <it>in silico</it> map, which is an ordered restriction map (represented as an ordered list of fragment sizes) constructed computationally by finding all occurrences of the restriction enzyme recognition sequence within each contig. The size and order of the fragments within the <it>in silico</it> map are then compared to the sequence of fragments within the optical map of the genome (in a manner analogous to sequence alignment <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>), with the goal of assigning the contig to a single location within the optical map. Recent work by Nagarajan, et al. <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> employs a dynamic programming algorithm to align contigs to an optical map to form a scaffold for the contigs. The validation of contigs is similarly performed by comparing an <it>in silico</it> map of each contig with an experimentally determined optical map.</p></sec><sec><st><p>De Bruijn graph assembly</p></st><p>In this paper, we explore the benefits of using optical maps within the de Bruijn graph genome assembly framework first proposed by Pevzner et al. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. A de Bruijn graph is a graph whose nodes correspond to k-mers (sequences of length k) and edges correspond to (k&#8201;+&#8201;1)-mers; an edge may join two nodes if one of the nodes is a prefix of the edge and the other is a suffix. 
In the context of genome assembly, a node is created for each k-mer in the set of reads and an edge for each (k&#8201;+&#8201;1)-mer. In this formulation, genome assembly is reduced to finding a &#8220;Chinese postman path&#8221; <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>, a path through the de Bruijn graph that visits all edges at least once, which represents the true genome sequence. A full description of this approach is beyond the scope of our paper. Readers interested in more details should refer to <abbrgrp><abbr bid="B25">25</abbr><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>.</p><p>Implementations of the de Bruijn graph assembly paradigm have been used successfully in practice <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>, and must tackle two major challenges: the presence of sequencing errors, which induce false k-mers in the graph, and the presence of repeats. Due to repeats, the number of Chinese postman paths in the de Bruijn graph can be exponential in the number of nodes and edges <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>, making it infeasible to identify the one path that correctly matches the sequence of the genome being assembled. Furthermore, imposing additional constraints on the reconstruction of the genome leads to computationally intractable formulations (see, e.g., <abbrgrp><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>).</p><p>In practice, implementations of this approach forgo the ultimate goal of correctly reconstructing the entire genome sequence and instead attempt to reconstruct a collection of contigs, which generally represent repeat-free sub-paths in the graph. 
Once these segments have been constructed, additional information from paired-end read data is typically used to resolve repeats and generate scaffolds.</p><p>Although paired-end information is generally used only after an initial assembly is produced, Narzisi and Mishra recently proposed a new algorithm, SUTTA <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>, which uses paired-end read information within the genome assembly process. The algorithm uses paired-end information to prioritize a greedy branch and bound traversal of the assembly graph according to paired-end constraints, thereby resolving repeats and potentially generating longer contigs. They suggest that optical mapping information could be used in a similar way, but do not provide the details of such an implementation. Here, we design and implement the first algorithm that uses optical mapping data during assembly, employing a framework similar in spirit to the one used by SUTTA, along with several additional improvements. We also use AGORA to explore the effect of other parameters, such as noise in the optical mapping process, on our ability to effectively reconstruct the sequence of bacterial genomes.</p></sec><sec><st><p>Overview of AGORA</p></st><p>As outlined above, genome assembly can be effectively formulated as the search for a path within a de Bruijn graph that &#8220;spells&#8221; the same sequence as the genome being assembled. Optical map information can guide the search for this correct path by eliminating alternate paths that are not consistent with the optical map. To guide the search, an <it>in silico</it> map of the sequence corresponding to a partially completed path can be compared to the optical map. If the two maps disagree, we can discard the path as incorrect. As a result, we can quickly prune the set of possible paths, and find a Chinese postman path matching the optical map, which is likely to represent the true reconstruction of the genome. 
Although imposing map-based constraints on the traversals of the graph leads to computationally intractable problems similar to the Longest Path Problem (a well known problem in computational graph theory, see e.g., <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>) and the Edge Disjoint Paths Problem <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, we show that appropriately chosen heuristics lead to a practical implementation that solves the map-guided assembly problem effectively for bacterial genomes.</p><p>A key idea for making our search tractable in practice is the identification of edges within the de Bruijn graph which only match at one location in the genome optical map. These <it>landmark</it> edges seed our search, and dramatically reduce the number of paths that need to be investigated. After identifying landmark edges, we then proceed to search for paths connecting pairs of consecutive landmark edges, ensuring that these paths are consistent with the optical map. Although finding a suitable path between consecutive landmark edges may still require exponential time, our experiments on bacterial genomes show that the search process between landmark edges is generally solvable in practice.</p><p>To search for paths between landmark edges, we use a refined version of depth first search. As the depth first search proceeds, we check if the <it>in silico</it> map of the current path matches the optical map, and if so, we proceed with the depth first search. Otherwise, we backtrack and proceed along a different path until we find a path to the next landmark matching the optical map. With a few additional modifications to the algorithm to improve efficiency (described in the Methods section), AGORA was generally able to find a path in the de Bruijn graph with a sequence and corresponding <it>in silico</it> map consistent with the optical map. 
Although there may be multiple paths in the de Bruijn graph that yield a sequence with an <it>in silico</it> map matching the genome optical map, our simulations show that these paths typically yield very similar sequences, differing only in the reconstruction of small complex repeat regions.</p></sec></sec><sec><st><p>Results and discussion</p></st><sec><st><p>Experimental setup</p></st><p>We analyzed the performance of AGORA on 369 sequenced bacterial genomes, using error-free de Bruijn graphs generated from the complete genome sequences as previously described in <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> and optical maps simulated from the sequences. In addition, we also tested AGORA on a published optical map of the <it>Y. pestis</it> KIM genome <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. Note that the error-free de Bruijn graph of a genome sequence of order k is identical to the de Bruijn graph constructed from a collection of error-free sequence reads where every k-mer in the genome is covered by at least one read. The de Bruijn graph of each sequence was simplified by replacing unipaths (a path in which all of the nodes have in-degree&#8201;= out-degree&#8201;=&#8201;1 <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B32">32</abbr></abbrgrp>) with a single edge representing the longer sequence, along with other de Bruijn graph simplifications, which preserve all the information relevant for genome reconstruction from the original de Bruijn graph (see <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> for further details on the simplification procedures). Moreover, we collapsed parallel edges with greater than 99% sequence similarity, as long as the difference in the sequences did not create or remove any restriction sites (see Methods for more details).</p><p>To simulate optical maps from a genome sequence, we first compute an <it>in silico</it> map of the sequence and then perturb the fragments within this map by sampling from an error distribution. 
We modeled three different error levels -- <it>high</it>, <it>medium</it>, and <it>low</it> -- and simulated one optical map from each of these distributions to measure the effect of optical mapping error on assembly quality. Although the error simulation is a simple process which may not capture the full characteristics of experimentally generated optical maps (see Methods for details), the results nonetheless show the impact of noise on the final assembly quality. The high error setting has characteristics matching the maximum fragment sizing error and maximum size of small fragments lost observed in the experimental <it>Y. pestis</it> KIM optical map, while the low error setting corresponds to what might be achievable with the new nano-coding technology <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. The low error setting can noticeably improve the performance of our algorithm since it does not remove small fragments from the optical map. A more precise description of the three levels of noise used in our simulations can be found in the Methods section.</p><p>After running AGORA, we used four different metrics to measure the quality of the final path through the de Bruijn graph and the sequence associated with it. (See Methods for detailed descriptions of these measures.) Our first metric, <it>sequence correctness</it>, roughly corresponds to the percentage of the final sequence that was assembled in the correct order. Our second metric, <it>edge correctness</it>, is the number of graph edges placed in the correct order divided by the total number of edges in the de Bruijn graph. The last two metrics, the final <it>N50 size</it> and <it>number of contigs</it> produced by our algorithm, are computed after breaking the reconstructed sequence (from the path found by AGORA) wherever an error occurs, and treating sequence segments between errors as independent contigs produced by our assembler. This approach is consistent with the one used by Salzberg et al. 
<abbrgrp><abbr bid="B41">41</abbr></abbrgrp> in the context of assembly evaluation. These final contig statistics are then compared against the original N50 size and number of contigs that would arise if one were to treat each edge in the starting de Bruijn graph as a separate contig.</p></sec><sec><st><p>Assembly of bacterial genomes with simulated optical maps</p></st><p>We start by measuring the performance of AGORA on assembling 369 bacterial genomes, providing as input their simplified de Bruijn graphs of order k&#8201;=&#8201;100, which are equivalent to the graphs that can be obtained in an error-free sequencing experiment generating reads longer than 100&#8201;bp, covering each k-mer of length 100 in the genome. For each genome, we computationally generated BamHI (recognition sequence G^GATCC) <it>in silico</it> maps for simulating optical maps. Statistics summarizing the number of restriction sites in the <it>in silico</it> map of each genome and the characteristics of the de Bruijn graphs of the genomes are shown in Table <tblr tid="T1">1</tblr>.</p><table id="T1"><title><p>Table 1</p></title><caption><p><b>Statistics of the de Bruijn graphs and optical maps used in our simulations</b></p></caption><tgroup align="left" cols="5"><colspec align="left" colname="c1" colnum="1" colwidth="1*"/><colspec align="left" colname="c2" colnum="2" colwidth="1*"/><colspec align="left" colname="c3" colnum="3" colwidth="1*"/><colspec align="left" colname="c4" colnum="4" colwidth="1*"/><colspec align="left" colname="c5" colnum="5" colwidth="1*"/><thead valign="top"><row rowsep="1"><entry colname="c1"/><entry colname="c2"><p><b>Min</b></p></entry><entry colname="c3"><p><b>Median</b></p></entry><entry colname="c4"><p><b>Mean</b></p></entry><entry colname="c5"><p><b>Max</b></p></entry></row></thead><tfoot><p>The de Bruijn graphs for 369 bacterial genomes were generated with k-mer size 100 from the known sequences from <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> (without 
errors and without bubble collapsing), and the N50 size was computed for each genome, treating each edge in the de Bruijn graph as a separate contig. The row &#8220;Restriction Sites&#8221; refers to the number of cuts within the genome when using the restriction enzyme BamHI.</p></tfoot><tbody valign="top"><row rowsep="1"><entry colname="c1"><p>Nodes</p></entry><entry colname="c2"><p>1</p></entry><entry colname="c3"><p>35</p></entry><entry colname="c4"><p>63.63</p></entry><entry colname="c5"><p>1,023</p></entry></row><row rowsep="1"><entry colname="c1"><p>Edges</p></entry><entry colname="c2"><p>3</p></entry><entry colname="c3"><p>110</p></entry><entry colname="c4"><p>324.4</p></entry><entry colname="c5"><p>14,251</p></entry></row><row rowsep="1"><entry colname="c1"><p>N50 Size (kbp)</p></entry><entry colname="c2"><p>14</p></entry><entry colname="c3"><p>212.1</p></entry><entry colname="c4"><p>419.2</p></entry><entry colname="c5"><p>3,587</p></entry></row><row rowsep="1"><entry colname="c1"><p>Genome Length (Mbp)</p></entry><entry colname="c2"><p>0.34</p></entry><entry colname="c3"><p>2.91</p></entry><entry colname="c4"><p>3.2</p></entry><entry colname="c5"><p>9.14</p></entry></row><row rowsep="1"><entry colname="c1"><p>Restriction Sites</p></entry><entry colname="c2"><p>6</p></entry><entry colname="c3"><p>334</p></entry><entry colname="c4"><p>491.7</p></entry><entry colname="c5"><p>9,668</p></entry></row></tbody></tgroup></table><p>Table <tblr tid="T1">1</tblr> provides some indication of the complexity of the genomes in our test data set, as measured by their corresponding de Bruijn graphs. The number of nodes in each de Bruijn graph roughly represents the number of distinct repeat sequences longer than 100&#8201;bp occurring in the genome, while the number of edges roughly represents the number of times those repeated sequences occur in the genome. 
Genomes with more nodes and edges in the de Bruijn graph are generally more difficult to assemble, since they contain more repeat sequences.</p><p>To determine how well we could assemble the 369 bacterial genomes with the help of optical maps, we ran our algorithm on each de Bruijn graph and optical maps simulated with the three different error settings described above. We then measured the quality of our assemblies based on sequence correctness, edge correctness, N50 size, and number of contigs. The results are aggregated in Figure <figr fid="F1">1</figr>, and per-genome information is provided in Additional file <supplr sid="S1">1</supplr>.</p><suppl id="S1"><title><p>Additional file 1</p></title><text><p><b>Per-genome results of assembling each genome.</b> An excel spreadsheet detailing the results of assembling 369 bacterial genomes with AGORA, given optical maps simulated with three different error rates.</p></text><file name="1471-2105-13-189-S1.xls">
   <p>Click here for file</p>
</file></suppl><fig id="F1"><title><p>Figure 1</p></title><caption><p>Assessment of the quality of bacterial genome assemblies</p></caption><text>
   <p><b>Assessment of the quality of bacterial genome assemblies.</b> Measurements of the quality of assemblies produced by our algorithm on 369 bacterial genomes under three different optical map error rates. In each boxplot, we extend the whiskers beyond the upper and lower quartiles for 1.5 times the interquartile range, and omit outliers beyond the whiskers. (<b>a</b>) Sequence correctness of the assemblies, measuring the percent of the genome that was correctly assembled. (76, 72, and 65 outliers are not shown in the high, medium, and low error settings, respectively.) Over &#190; of the genomes are assembled with greater than 98% sequence correctness, even in the high error setting. (<b>b</b>) Percent of edges assembled in the correct order by our algorithm on the 369 genomes, over three error rates. The percent of edges correct is generally lower than the sequence correctness percentages, but the difference is mostly due to short edges misplaced by the algorithm. (26, 33, and 37 outliers are not shown in the high, medium, and low error settings, respectively.) (<b>c</b>) N50 size of the final contigs produced by our algorithm, after breaking genomic segments at assembly errors, normalized by genome size. (54 outliers were omitted from the first bar, measuring assembly without a map.) (<b>d</b>) Number of contigs that would be produced with no optical map (and only the de Bruijn graph), and with optical maps simulated with three different levels of noise. (We omit 42, 53, 50, and 45 outliers in the no map, high error, medium error, and low error settings, respectively.) We see a substantial improvement in both the final number of contigs and the final N50 size, when given an optical map with any one of the three error rates.</p>
</text><graphic file="1471-2105-13-189-1"/></fig><p>As we can see in Figure <figr fid="F1">1</figr>a, AGORA assembles over &#190; of all genomes with greater than 98% sequence correctness for all three error settings. The mean sequence correctness values in the high, medium, and low error settings were 89.2%, 91.9%, and 95.9%, respectively. The means were lower than the median sequence correctness values due to a few very complex genomes for which the algorithm could only assemble a small fraction of the genome, producing outliers which are not shown in Figure <figr fid="F1">1</figr>. When measuring the number of edges assembled in the correct order as shown in Figure <figr fid="F1">1</figr>b, the edge correctness percentages are lower than the sequence correctness percentages, primarily because edges with short sequences (typically under 1 kbp in length) may be misplaced by our algorithm due to a lack of restriction sites. Although AGORA may misplace 10%-20% of the de Bruijn graph edges, these edges typically account for less than 2% of the assembled genome, as indicated by the sequence correctness boxplot shown in Figure <figr fid="F1">1</figr>a.</p><p>In Figure <figr fid="F1">1</figr>c and <figr fid="F1">1</figr>d, we plot statistics on the final N50 size and number of contigs that would result if we were to break the final path produced by AGORA wherever a mistake is made, and compare these values with the initial quality of the assembly before using mapping data. We can see in the figures that our algorithm substantially improves the N50 size and number of contigs, even after errors are accounted for. When measuring the overall improvement in N50 size we found that, in the median case, the N50 size increased by a factor of between 3.61 and 4.09, while the mean improvement was between 5.44 and 5.77, depending on the level of mapping error simulated. 
Similarly, the number of contigs decreased by a factor of between 3.48 and 5.15 in the median case, and the mean improvement was between 6.67 and 10.74. In addition, we found that we had assembled 43, 52, and 69 genomes perfectly into a single contig representing the entire genome sequence in the high, medium, and low error settings, respectively.</p><p>AGORA finished in under one minute for &#190; of instances, while the longest runtime was around 20 minutes. Note that we forced the algorithm to skip to the next landmark if no path could be found within one minute (see Methods for more details), since our tests required running the algorithm more than 1,000 times. We do not expect this time limitation to significantly affect the median and quartile statistics, as only 18.1% of genomes had any regions skipped. Those genomes were generally complex genomes with assembly quality in the lowest quartile, and additional running time did not improve their assembly quality significantly.</p><p>In addition to the aggregate statistics shown in Figure <figr fid="F1">1</figr>, we also individually compared the original and final N50 sizes produced by our algorithm for each genome using the medium error setting. In Figure <figr fid="F2">2</figr>, we plot a point for each genome at its original and final N50 size, after normalizing by genome length. We see that most genomes have substantial improvement in N50 size, with some genomes even having normalized N50 size starting below 20% and improving to nearly 100% after assembly. Some genomes with normalized N50 size starting below 20% did not improve much, mostly because the corresponding graphs contained many short edges without any restriction sites, making it difficult to place those edges on the optical map and rule out incorrect paths in the de Bruijn graph.</p><fig id="F2"><title><p>Figure 2</p></title><caption><p>Improvement in normalized N50 size after assembly</p></caption><text>
   <p><b>Improvement in normalized N50 size after assembly.</b> For each of our 369 bacterial genomes, we plot the initial normalized N50 size (x axis) relative to the normalized N50 size after assembly (y axis) when provided a simulated optical map with the medium error rate, as described in the Methods section. The N50 sizes are normalized by dividing by the genome length. Most genomes exhibit substantial improvement in the normalized N50 size with the exception of complex genomes (with low initial normalized N50 size), and some simple genomes (with initial N50 size already close to the entire genome size).</p>
</text><graphic file="1471-2105-13-189-2"/></fig><p>As the normalized N50 size did not seem to predict very well how accurately we could assemble the final genome, we performed further statistical analyses to test whether other factors were more correlated with the final assembly quality. We computed Spearman&#8217;s rank correlation coefficient between sequence correctness and various de Bruijn graph characteristics. We found that the sequence correctness obtained by AGORA had the highest correlation with the average edge length of the de Bruijn graph among all the characteristics we measured. The correlations are: genome size (&#8722;0.04), normalized N50 size (0.61), N50 size (0.69), number of edges (&#8722;0.75), average number of restriction sites per edge (0.76), and average edge length (0.83).</p><p>It is not surprising that average edge length, average number of restriction sites per edge, and N50 size have a very strong correlation with sequence correctness as this implies the corresponding de Bruijn graph has long edges which are likely to contain multiple restriction sites. These long edges are easier to place unambiguously along the map and can be used to rule out incorrect de Bruijn graph paths very effectively.</p><p>It is important to note that the average edge length and number of edges are strongly anti-correlated (&#8722;0.90 Spearman&#8217;s coefficient) due to the fact that the genome lengths in our dataset are within a fairly narrow range of 1&#8211;5 Mbp (mega base pairs). Given our data, we cannot fully distinguish between the impact of long edges versus fewer edges (lower complexity) on our ability to reconstruct a genome. 
Genome length also has very low correlation with the sequence correctness of the assembly, but more testing needs to be done on larger and more complex genomes in order to better determine the factors that most influence the quality of genome assembly.</p><p>We also directly plotted sequence correctness versus the average edge length in each de Bruijn graph over all three error rates (Figure <figr fid="F3">3</figr>) and observed that sequence correctness generally increases with average edge length. Almost all genomes with average edge length greater than 10 kbp can be assembled with accuracy over 98%, even when relying on maps with the highest error rate, while we have mixed results for genomes with shorter average edge length. Also, note the impact of error level in the optical map on the assembly accuracy is less than 2% for genomes with average edge length greater than 10 kbp (which includes 79.9% of our genomes), but has a greater impact on graphs with shorter edges, where the final sequence correctness differs by as much as 40 percentage points between the high and low error settings. Additionally, we find that most genomes with a starting N50 size larger than approximately 50 kbp also yield map-guided assemblies with greater than 98% accuracy. (See Additional file <supplr sid="S2">2</supplr>.)</p><suppl id="S2"><title><p>Additional file 2</p></title><text><p><b>Figure showing the impact of starting N50 size of the de Bruijn graph on sequence correctness.</b> A plot showing the sequence correctness of 369 bacterial genome assemblies by AGORA versus the starting N50 size of their de Bruijn graphs, under three different optical map error rates. Genomes with starting N50 size greater than 50 kbp are generally assembled with higher than 98% correctness over all three error rates, while the results are mixed for genomes with lower starting N50 size.</p></text><file name="1471-2105-13-189-S2.png">
   <p>Click here for file</p>
</file></suppl><fig id="F3"><title><p>Figure 3</p></title><caption><p>Impact of average edge length of de Bruijn graph on sequence correctness of assembly</p></caption><text>
   <p><b>Impact of average edge length of de Bruijn graph on sequence correctness of assembly.</b> A plot showing the sequence correctness of 369 bacterial genome assemblies versus the average edge length of their starting de Bruijn graphs, under three different optical map error rates. Genomes with average edge length greater than 10 kbp are generally assembled with near perfect correctness over all three error rates, while the results are mixed for genomes with shorter average edge lengths. For genomes with average edge length below 10 kbp, correctness may improve by as much as 40 percentage points when moving from the high error to low error setting, highlighting the potential benefits of more accurate mapping technologies.</p>
</text><graphic file="1471-2105-13-189-3"/></fig></sec><sec><st><p>Assembly of <it>Y. pestis</it> KIM with previously published optical map</p></st><p>We also evaluated the performance of AGORA on the assembly of the genome of <it>Y. pestis</it> KIM (NCBI accession NC_004088) with a PvuII (recognition sequence CAG^CTG) optical map experimentally determined in <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. In addition to the experimental optical map, we also ran AGORA on optical maps simulated for PvuII with the same three levels of noise mentioned previously. Since <it>Y. pestis</it> is a complex genome containing many repeats (primarily IS elements), we provided AGORA with a de Bruijn graph produced with a larger k-mer size of 500 in the initial experiment. Subsequent experiments for k-mer size 100 yield lower quality assemblies, as described in the next section. AGORA took under 5 minutes to find a path matching the genome optical map for each error rate (without having to skip any regions between landmarks due to the 1&#8201;minute timeout). The resulting reconstruction of the genome matched the correct sequence with accuracy between 86.74% and 99.13%, depending on the error rate used (see Table <tblr tid="T2">2</tblr>).</p><table id="T2"><title><p>Table 2</p></title><caption><p><b>Statistics on the assembly of</b><b><it>Y. 
pestis</it></b><b>KIM with optical maps of different error rates</b></p></caption><tgroup align="left" cols="6"><colspec align="left" colname="c1" colnum="1" colwidth="1*"/><colspec align="left" colname="c2" colnum="2" colwidth="1*"/><colspec align="left" colname="c3" colnum="3" colwidth="1*"/><colspec align="left" colname="c4" colnum="4" colwidth="1*"/><colspec align="left" colname="c5" colnum="5" colwidth="1*"/><colspec align="left" colname="c6" colnum="6" colwidth="1*"/><thead valign="top"><row rowsep="1"><entry colname="c1"/><entry colname="c2"><p><b>Sequence Correct</b></p></entry><entry colname="c3"><p><b>Edges Correct</b></p></entry><entry colname="c4"><p><b>Landmarks</b></p></entry><entry colname="c5"><p><b>Final Contigs</b></p></entry><entry colname="c6"><p><b>N50 Size</b></p></entry></row></thead><tfoot><p>A summary of the results of our algorithm on assembling <it>Y. pestis</it> KIM, when given a de Bruijn graph with k-mer size 500, and a simulated optical map or the experimental optical map from <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>.</p></tfoot><tbody valign="top"><row rowsep="1"><entry colname="c1"><p>Low Error</p></entry><entry colname="c2"><p>99.13%</p></entry><entry colname="c3"><p>192/199</p></entry><entry colname="c4"><p>64</p></entry><entry colname="c5"><p>12</p></entry><entry colname="c6"><p>1,190,834</p></entry></row><row rowsep="1"><entry colname="c1"><p>Med Error</p></entry><entry colname="c2"><p>97.57%</p></entry><entry colname="c3"><p>188/199</p></entry><entry colname="c4"><p>38</p></entry><entry colname="c5"><p>20</p></entry><entry colname="c6"><p>905,369</p></entry></row><row rowsep="1"><entry colname="c1"><p>High Error</p></entry><entry colname="c2"><p>90.52%</p></entry><entry colname="c3"><p>169/199</p></entry><entry colname="c4"><p>25</p></entry><entry colname="c5"><p>81</p></entry><entry colname="c6"><p>776,452</p></entry></row><row rowsep="1"><entry colname="c1"><p>Map from <abbrgrp><abbr 
bid="B13">13</abbr></abbrgrp></p></entry><entry colname="c2"><p>86.74%</p></entry><entry colname="c3"><p>149/199</p></entry><entry colname="c4"><p>26</p></entry><entry colname="c5"><p>80</p></entry><entry colname="c6"><p>405,321</p></entry></row></tbody></tgroup></table><p>Optical mapping information substantially improves the initial N50 size of 62,865&#8201;bp (computed from the de Bruijn graph of this genome with k-mer size 500) by a factor of between 6.4 and 18.9 depending on the quality of the optical map. The number of contigs is correspondingly reduced by a factor of between 2.45 and 16.6. While the maximum fragment sizing error and maximum size of small fragments lost in the high error optical map simulation match the values observed in the experimentally produced optical map <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> (10% and 2 kbp sizing error and loss of fragments smaller than 2 kbp), AGORA generates a slightly worse assembly when guided by the experimental map. This indicates that the simple heuristic procedure we used to simulate noise may not adequately match the precise characteristics of the noise seen in experimentally determined optical maps (see Methods for more details). Nonetheless, our results still show the potential impact of noise in the optical map on the quality of the final assembly.</p><p>To further assess the quality of the assemblies, we used Mummer <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> to compare the sequences produced by our algorithm to the known genome sequence. Figure <figr fid="F4">4</figr>a illustrates our previous analysis showing that the sequence generated by AGORA matches the original sequence with greater than 99% accuracy, when given an optical map with low noise. The line along the diagonal indicates sequence that correctly matches the true genome, while only 7 errors can be seen at the locations marked by small circles, which occur due to short misplaced edges. 
In Figure <figr fid="F4">4</figr>b, we see that there are more errors in the assembly built using the experimental optical map. The gaps in the line along the diagonal in Figure <figr fid="F4">4</figr>b indicate roughly 13% of the genome is not correctly assembled by our algorithm. The longest regions of incorrectly assembled sequence occur in portions of the genome where there are few restriction sites.</p><fig id="F4"><title><p>Figure 4</p></title><caption><p>Mummerplot comparison of assemblies produced with low error and experimental optical map</p></caption><text>
   <p><b>Mummerplot comparison of assemblies produced with low error and experimental optical map.</b> Two dot plots generated by Mummerplot comparing the known genome sequence of <it>Y. pestis</it> KIM to the sequence assembly produced by our algorithm, when given an optical map with low error added (<b>a</b>), and when using the experimental optical map from <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> (<b>b</b>).</p>
</text><graphic file="1471-2105-13-189-4"/></fig><p>In these regions, AGORA picks a single path among several possible paths which may match the optical map, possibly leading to errors in the reconstruction. For example, the largest erroneous gap shown in the lower left of Figure <figr fid="F4">4</figr>b occurs within a 110 kbp genomic region that contains only two restriction fragments of size 40 kbp and 70 kbp, respectively. Within the same region, genomic repeats lead to a fragmentation of the de Bruijn graph resulting in a collection of short edges without any restriction site information, and one edge which contains a single restriction site. The difference in performance on the low error optical map and the experimental optical map highlights the potential benefit of developing higher resolution and more accurate mapping technologies (such as nano-coding <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>). Alternatively, additional mate-pair information (providing short-range information) along with an optical map may also help resolve ambiguities in regions with few restriction sites.</p></sec><sec><st><p>Effect of restriction enzyme choice on assembly quality</p></st><p>We further examined the effect of using different restriction enzymes on the quality of the assembly that can be produced by our algorithm for the <it>Y. pestis</it> KIM genome. We generated a de Bruijn graph with k-mer size 100 for the sequence of <it>Y. pestis</it> KIM, and then computed the <it>in silico</it> map of the genome for each of 102 different restriction enzymes. We then simulated noisy optical maps by adding three different levels of noise to each <it>in silico</it> map, as described previously. 
In Figure <figr fid="F5">5</figr>, we plot the number of restriction sites versus the sequence correctness achieved by our algorithm using the optical map for each enzyme at the three different optical map error rates.</p><fig id="F5"><title><p>Figure 5</p></title><caption><p>Impact of restriction enzyme choice on assembly quality</p></caption><text>
   <p><b>Impact of restriction enzyme choice on assembly quality.</b> The choice of restriction enzymes can impact the correctness of the assembly. Each point represents the sequence correctness of an assembly of <it>Y. pestis</it> KIM when given a de Bruijn graph of k-mer size 100 and an optical map of low, medium, or high error rate. The vertical line in the picture indicates the number of restriction sites for the enzyme PvuII used to construct the experimental optical map of this genome, and the colored circles represent the correctness that can be achieved under the three error rates for the PvuII enzyme. The red, blue, and green filled squares to the right of the vertical line indicate an improvement of between 7.7% and 30.1% in the final sequence correctness that can be achieved when choosing a better restriction enzyme in the high, medium, and low error settings, respectively.</p>
</text><graphic file="1471-2105-13-189-5"/></fig><p>The vertical line in Figure <figr fid="F5">5</figr> corresponds to the number of restriction sites for the enzyme PvuII used to construct the experimental optical map of this genome <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. The circles drawn on the line represent the quality of the corresponding assemblies with a PvuII map under different mapping error rates. Although we are able to assemble the genome with 90.3% sequence correctness in the low error setting, the medium and high error settings yield only 68.3% and 58.6% sequence correctness, respectively (using the experimental map only yields 48.6% accuracy). Figure <figr fid="F5">5</figr> illustrates that restriction enzymes that cut more frequently can yield better assemblies. A HindIII (recognition sequence A^AGCTT) optical map with 1,566 restriction sites achieves 99.8% sequence correctness in the low error setting and 98.4% in the medium error setting, as indicated by the green and blue squares in Figure <figr fid="F5">5</figr>, respectively. In the high error setting, we can achieve 66.3% sequence correctness (shown as the red square) with a BsrGI (recognition sequence T^GTACA) optical map with 573 restriction sites. Over the three cases, we can improve the accuracy by between 7.7% and 30.1% by choosing an appropriate restriction enzyme.</p><p>Figure <figr fid="F5">5</figr> also shows the dependence between the frequency with which an enzyme cuts and the quality of the resulting assembly. In the low error setting, assembly accuracy generally increases with the density of restriction sites on the optical restriction map, although this is not true for the medium and high error rates where the performance of the algorithm starts decreasing beyond a certain cut frequency. 
This phenomenon can be explained by the loss of more small fragments as cut frequency increases, and the increased difficulty of finding landmark edges when there are many smaller fragments of roughly the same size. In the high error setting, we note that restriction enzymes with around 500 recognition sites yield assemblies with the highest sequence correctness for <it>Y. pestis</it> KIM.</p><p>The strong dependence of the quality of assembly on the restriction enzyme used highlights the need for choosing an appropriate enzyme. Preliminary lab experiments that digest the genome with different enzymes can be used to find an enzyme that cuts the genome at an appropriate frequency (in the case of <it>Y. pestis</it>, the ideal restriction enzyme yields an average fragment size of roughly 10 kbp). Alternatively, generating preliminary sequence data and building a corresponding de Bruijn graph can also help estimate the cut frequency of various restriction enzymes.</p></sec><sec><st><p>Optical maps versus mate-pairs</p></st><p>The use of mate-pairs to guide the assembly process was previously studied by Wetzel et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> using the same genomes used in our study. A direct comparison to the full results presented previously is difficult to perform as our goal here is the reconstruction of a single contig spanning an entire chromosome, while the work of Wetzel et al. is focused on the resolution of individual repeats (and the corresponding reduction in the complexity of the assembly graph) using mate-pair information. 
Furthermore, mate-pairs and optical maps provide complementary types of information: mate-pairs provide local information and are most effective in the short range (as shown, e.g., in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>) where the optical mapping resolution may be limited, while optical maps provide global information and are particularly effective in the long range (tens to hundreds of kbp, ranges for which mate-pair libraries are difficult to generate). To demonstrate the complementary strengths of these technologies, we highlight a couple of genomes analyzed both with mate-pairs in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> and with optical maps in our study.</p><p>First, <it>Rhodospirillum rubrum</it> ATCC 11170 (NCBI accession NC_007643) was completely and correctly resolved by AGORA in our study, but mate-pair based analyses were unable to fully resolve this genome even when trying different combinations of library sizes. We applied the mate-pair repeat resolution approach described in the work of Wetzel et al. <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> using both the tuned library mixture of sizes 477 and 6047 (see <abbrgrp><abbr bid="B4">4</abbr></abbrgrp> for details on how the library sizes were chosen), and the &#8216;standard&#8217; combinations of 2kbp&#8201;+&#8201;8kbp, or 2kbp&#8201;+&#8201;35kbp. Note that it is possible that some combination of two or more mate-pair libraries could have resolved this genome, as we have not exhaustively explored all possible combinations of mate-pair libraries. 
However, in practical terms, it is unlikely that a lab interested in solving the <it>Rhodospirillum</it> genome would attempt multiple library preparations in hopes of finding the perfect combination for this genome.</p><p>A second example is the genome of <it>Streptococcus agalactiae</it> NEM316 (NCBI accession NC_004368) which contains a 47 kbp-long plasmid-like repeat (pNEM316-1) occurring three times within the main chromosome <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>. Resolving this repeat would require mate-pairs longer than 47 kbp, which are beyond the sizes routinely generated, especially in the context of next generation sequencing technologies (fosmid libraries only extend to ~40 kbp).</p></sec><sec><st><p>Real assembly graphs</p></st><p>Our results have focused on running our proof-of-principle algorithm on ideal de Bruijn graphs obtained from error-free sequencing data. The application of AGORA to data from real sequencing experiments is the object of future work and beyond the scope of this paper. However, it is natural to ask whether our algorithms can feasibly be extended to real datasets. To address this question we focused on sequencing data available for the <it>Yersinia pestis</it> KIM genome, specifically a 454 dataset (SRA accession SRX012379). We assembled these reads using Newbler [Roche] and explored the structure of the resulting contig graph (available from the 454ContigGraph.txt file produced by Newbler).</p><p>We compared the Newbler graph to the ideal de Bruijn graphs of order 100 and 500, as the average length of the 454 reads falls between these values at 438&#8201;bp. The Newbler assembly resulted in 283 contigs with an N50 size of 38,282&#8201;bp, while the order 500 graph had 199 contigs with an N50 size of 62,865&#8201;bp and the order 100 graph had 648 contigs with an N50 size of 38,786&#8201;bp. 
Thus, in broad terms, the real assembly graph has similar characteristics to the perfect de Bruijn graphs in our experiments.</p><p>More relevant to our study is the question of whether landmark edges can be easily found in the Newbler graphs. The AGORA algorithm critically depends on our ability to find edges that have a unique placement along the optical map. According to this criterion, the Newbler contig graph is also roughly similar to the simulated graphs. Specifically we find 15 landmarks in the Newbler assembly, compared to 15 and 26 landmarks in the order 100 and 500 de Bruijn graphs, respectively. We also aligned the Newbler contigs using the more complex dynamic programming algorithm described in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> and identified 22 landmarks, indicating that the use of already existing optical map alignment algorithms will be effective in extending the AGORA algorithm to real sequencing data.</p></sec></sec><sec><st><p>Conclusions</p></st><p>We have presented a computational framework that allows optical mapping data to be used during the genome assembly process. Our work demonstrates the potential of this approach in improving the assembly of bacterial genomes. With optical maps, over &#190; of our bacterial genomes were assembled with over 98% accuracy, and even the complex genome of <it>Y. pestis</it> KIM could be assembled with sequence correctness between 86.74% and 99.13%, depending on the quality of the reference optical map. Moreover, for the bacterial genomes in our test data set, in the median case we could improve on the N50 size by a factor of between 6.4 and 18.9 and reduce the number of contigs by a factor of between 6.67 and 10.74 over what could be achieved with sequence data alone.</p><p>Our initial study also allowed us to explore the effect of experimental parameters on the usefulness of mapping data. 
We demonstrated substantially improved quality of assembly when using high quality optical maps, highlighting the value of continued improvements in this technology (such as the nano-coding approach <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>). In addition, we showed that the choice of restriction enzyme significantly affects assembly quality, indicating the benefits of preliminary analysis to determine a suitable restriction enzyme before constructing an optical map.</p><p>The results we have shown are only a first step towards developing a map-guided genome assembler. AGORA has only been tested on error-free assembly data and will need to be adapted to handle the characteristics of assembly graphs derived from real sequencing data. The heuristics used to speed up the alignment process may not be effective in the context of a combination of realistic sequencing and mapping error profiles. A practical implementation of our approach may need to rely on a variant of the dynamic programming alignment algorithm described in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> with additional heuristics or the use of parallel/high-performance architectures. Additionally, it may be useful to develop methods to detect regions of the assembly where multiple paths may match the optical map, and exclude those regions from the final assembly to avoid introducing errors.</p><p>Finally, a promising area of future research involves the combination of mapping and mate-pair data. 
These types of information offer complementary strengths &#8211; long-range structural information from optical maps, and short-range links from the mate-pair data &#8211; which can be leveraged to overcome our difficulty in resolving genomic regions that are sparsely sampled by the restriction map.</p></sec><sec><st><p>Methods</p></st><sec><st><p>Optical mapping error simulations</p></st><p>To describe our experiments precisely, we need to formally describe the various types of noise that add error to the optical map, and how we simulate noisy optical maps for use in our experiments. In general, optical maps may have three types of errors: fragment sizing error, small fragments missing, and restriction site errors. Fragment sizing error occurs because measuring the sizes of the Rmap fragments is performed using optical techniques that associate restriction fragment mass with fluorescence intensity. Small fragments can be missing from the optical map due to desorption. Restriction site errors refer to missing or added restriction sites on the genome optical map, which can be caused by errors in the physical process or in the image processing.</p><p>To simulate optical maps for our experiments, we start by computing an <it>in silico</it> map for each genome, and then add noise to the <it>in silico</it> map to simulate fragment sizing error and small fragments lost. In our experiments, we did not extensively test restriction site errors as they are fairly rare in a finished optical map, occurring at around 2% of restriction sites <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. However, we do simulate the loss of small fragments according to a small fragment threshold &#956;&#8201;&#8805;&#8201;0, as well as fragment sizing error according to two parameters &#945;&#8201;&#8805;&#8201;1 and &#946;&#8201;&#8805;&#8201;0. 
Using fluorescence intensity to estimate restriction fragment length leads to an error proportional to the length of the fragment, which we characterize with a multiplicative error parameter (&#945;). Smaller fragments, however, have different factors contributing to their error profile, which we characterize with an additive error parameter (&#946;). (For more details on optical mapping error models, see <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B44">44</abbr></abbrgrp>.)</p><p>Given an <it>in silico</it> map and the parameters described above, we start by deterministically removing all fragments of size less than &#956;, which is the worst case for small fragment loss, as some small fragments are retained in practice.</p><p>Next, to simulate fragment sizing error for parameters &#945;, &#946;, and &#956;, we add a random amount of noise to the remaining <it>in silico</it> fragments, so that an <it>in silico</it> fragment of size S may produce an optical map fragment of size between a lower bound L&#8201;=&#8201;max(S/&#945; &#8211; &#946;, &#956;) and upper bound U&#8201;=&#8201;&#945;S&#8201;+&#8201;&#946;. For each fragment of size S, we substitute a fragment of length S&#8201;+&#8201;&#1013;, where &#1013; is Gaussian noise with mean 0 and standard deviation (U &#8211; L)/4. If S&#8201;+&#8201;&#1013;&#8201;&lt;&#8201;L or S&#8201;+&#8201;&#1013;&#8201;&gt;&#8201;U, then we substitute L or U, respectively.</p><p>The three parameters &#945;, &#946;, and &#956; are then used to model the three levels of noise used for our experiments. In the <b>low error</b> setting, we set &#945;&#8201;=&#8201;1.01, &#946;&#8201;=&#8201;100&#8201;bp, and &#956;&#8201;=&#8201;0 (which did not allow for any small fragments to be lost). In the <b>medium error</b> setting, we set &#945;&#8201;=&#8201;1.05, &#946;&#8201;=&#8201;1000&#8201;bp, and &#956;&#8201;=&#8201;1000&#8201;bp. 
In the <b>high error</b> setting, we set &#945;&#8201;=&#8201;1.10, &#946;&#8201;=&#8201;2000&#8201;bp, and &#956;&#8201;=&#8201;2000&#8201;bp. The high error setting has bounds on the maximum sizing error and maximum size of small fragments lost, corresponding to the values observed in the published optical map of the <it>Y. pestis</it> KIM genome <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>: up to 10% (&#945;&#8201;=&#8201;1.10) multiplicative and 2000&#8201;bp additive fragment sizing error; in addition, small fragments up to size 2000&#8201;bp were lost (and no restriction site errors were observed). The low error rate setting, which did not allow for small fragments to be lost (&#956;&#8201;=&#8201;0), may eventually be achievable using the nano-coding system currently being developed <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>. Note that our error simulation does not fully capture all the factors that affect the quality of experimental optical maps, which causes differences between the performance of our algorithm when applied to simulated and experimental optical maps. We opted for a simplified model in order to enable the detailed simulations described in our paper; however, we plan to further investigate more realistic error models in future work.</p></sec><sec><st><p>AGORA algorithm</p></st><p>High level pseudocode illustrating the basic idea of the AGORA algorithm is provided below. A more detailed explanation describing additional improvements to the basic algorithm is given in the following sections. The source code for AGORA is provided as Additional file <supplr sid="S3">3</supplr>, along with code needed to run our experiments. 
AGORA takes as input two data structures: OpMap &#8211; an ordered list of fragment sizes representing the optical map; and Edges &#8211; a list of de Bruijn graph edges with their corresponding sequences.</p><suppl id="S3"><title><p>Additional file 3</p></title><text><p><b>AGORA source code.</b> A zip file containing the code used to generate the results presented in our paper.</p></text><file name="1471-2105-13-189-S3.zip">
   <p>Click here for file</p>
</file></suppl></sec><sec><st><p>AGORA(OpMap, Edges)</p></st><p>Set <it>LandmarkEdges</it> = FindLandmarkEdges(<it>OpMap</it>, <it>Edges</it>)</p><p>Sort <it>LandmarkEdges</it> in order of their position on the optical map</p><p>For circular genomes, add a copy of the first landmark edge to the end of <it>LandmarkEdges</it></p><p>Set <it>CurrentEdge</it> to be NULL_POINTER</p><p>Set <it>CurrentPath</it> to be the empty path</p><p>Push the first edge of <it>LandmarkEdges</it> onto the top of <it>EdgeStack</it>, a stack of edges to be explored in the DFS</p><p><b>For each pair of consecutive edges (E</b><sub><b>1</b></sub><b>, E</b><sub><b>2</b></sub><b>) in</b><b><it>LandmarkEdges</it></b></p><p indent="1">// Perform a depth first search from <it>E</it><sub><it>1</it></sub> until <it>E</it><sub><it>2</it></sub> is</p><p indent="1">// reached with a path matching the optical map</p><p indent="1"><b>While (</b><b><it>CurrentEdge</it></b><b>!=</b><b><it>E</it></b><sub><b><it>2</it></b></sub><b>)</b></p><p indent="2"><it>CurrentEdge</it> = Pop top element of <it>EdgeStack</it></p><p indent="2"><b>If</b> (<it>CurrentEdge</it> == NULL_POINTER) <b>then</b></p><p indent="3">Backtrack by removing last edge from <it>CurrentPath</it></p><p indent="2"><b>Else</b></p><p indent="3"><b>If</b> the <it>in silico</it> map of <it>CurrentPath</it> + <it>CurrentEdge</it> matches the optical map <b>then</b></p><p indent="4"><it>CurrentPath</it> = <it>CurrentPath</it> + <it>CurrentEdge</it></p><p indent="4">Push NULL_POINTER onto <it>EdgeStack</it> for backtracking</p><p indent="4">Push each edge outgoing from the end of <it>CurrentEdge</it> onto <it>EdgeStack</it></p><p indent="3"><b>EndIf</b></p><p indent="2"><b>EndIf</b></p><p indent="1"><b>EndWhile</b></p><p><b>EndFor</b></p><p><b>EndProgram</b></p><sec><st><p>Finding landmark edges</p></st><p>The first step of AGORA computes landmark edges, which are edges in the graph that have a unique placement within the reference optical map. 
These landmark edges are found by computing an <it>in silico</it> map from the sequence of each edge, and checking if the <it>in silico</it> map can be placed at exactly one location by attempting to align the <it>in silico</it> map starting from each fragment in the genome optical map. We implemented a simple greedy algorithm to align an <it>in silico</it> map to an optical map, although a more precise dynamic programming algorithm was described previously in <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>. We used a heuristic approach instead of the more accurate alignment algorithm in order to speed up landmark computation. The dynamic programming algorithm has run-time proportional to the fourth power of the number of fragments being aligned.</p><p>Our greedy alignment algorithm simply compares <it>in silico</it> fragments to optical map fragments in order, allowing for a size mismatch within the bounds specified by the (&#945;, &#946;) parameters, and allowing fragments of size smaller than &#956; to be missing. (In the experiments, the &#945;, &#946;, and &#956; parameters are set according to the values used to simulate the optical maps, or in the case of the <it>Y. pestis</it> KIM experimental optical map, we set &#945;&#8201;=&#8201;1.10, &#946;&#8201;=&#8201;2,000&#8201;bp, and &#956;&#8201;=&#8201;2,000&#8201;bp).</p><p>The greedy algorithm does not allow restriction site errors in the optical map alignment when determining landmark edges. Although restriction site errors may occur in practice, we do not allow restriction site errors when determining landmark edges to limit ambiguous placements and help ensure that all our landmark edges have correct placements on the genome optical map. 
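</p><p>The greedy placement test described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation distributed in Additional file 3. The function names (sizes_match, greedy_align, placements, is_landmark) are hypothetical. The sketch assumes the size bounds S/&#945; &#8211; &#946; and &#945;S&#8201;+&#8201;&#946; from the error simulation, takes only the complete interior fragments of an edge as input, treats the optical map as a linear list (no wrap-around for circular genomes), and requires at least one matched fragment for a placement to count.</p><p><![CDATA[
```python
from typing import List, Optional


def sizes_match(in_silico: float, optical: float, alpha: float, beta: float) -> bool:
    """Check an optical fragment against the sizing-error bounds of an in silico fragment.

    Assumption: same bounds as the error simulation, S/alpha - beta to alpha*S + beta.
    """
    lower = in_silico / alpha - beta
    upper = alpha * in_silico + beta
    return lower <= optical <= upper


def greedy_align(frags: List[float], op_map: List[float], start: int,
                 alpha: float, beta: float, mu: float) -> Optional[int]:
    """Try to place in silico fragments on op_map beginning at index `start`.

    Returns the index just past the last optical fragment used, or None
    if the fragments cannot be placed at this position.
    """
    j = start
    for size in frags:
        if j < len(op_map) and sizes_match(size, op_map[j], alpha, beta):
            j += 1       # fragment matched; advance along the optical map
        elif size < mu:
            continue     # small fragment assumed lost from the optical map
        else:
            return None  # size mismatch that the error model cannot explain
    return j if j > start else None  # require at least one matched fragment


def placements(frags: List[float], op_map: List[float],
               alpha: float, beta: float, mu: float) -> List[int]:
    """All start indices on the optical map where the fragments can be placed."""
    return [s for s in range(len(op_map))
            if greedy_align(frags, op_map, s, alpha, beta, mu) is not None]


def is_landmark(frags: List[float], op_map: List[float],
                alpha: float, beta: float, mu: float) -> bool:
    """An edge is a landmark if its in silico map has exactly one placement."""
    return len(placements(frags, op_map, alpha, beta, mu)) == 1


if __name__ == "__main__":
    # Toy optical map (fragment sizes in bp), medium error parameters.
    op_map = [12000, 5200, 30500, 8100, 5100, 20000]
    print(placements([5000, 30000], op_map, 1.05, 1000, 1000))  # [1]
    print(is_landmark([5000], op_map, 1.05, 1000, 1000))        # False (two placements)
```
]]></p><p>In this toy example, a single 5 kbp fragment fits at two positions and is therefore not a landmark, while the two-fragment map is placed uniquely; this mirrors why long edges with several restriction sites are the easiest to place. Each placement attempt makes one pass over the fragments of the edge.</p><p>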
This greedy alignment algorithm has a linear run-time, which was made even more efficient by saving the alignments of previous edges in the path and progressively aligning new edges added to the path during the depth-first search process.</p><p>If no landmark edges can be found, we search for a <it>landmark pair</it> &#8211; a pair of consecutive edges whose combined sequence yields an <it>in silico</it> map with exactly one valid alignment to the optical map. An alignment is valid if the sizes of consecutive restriction fragments within the optical map and the <it>in silico</it> map are approximately matched, modulo sizing errors and potential loss of small fragments. If a landmark pair is found, we use it to start our depth-first search instead (not shown in the pseudocode).</p></sec><sec><st><p>Landmark to landmark path search</p></st><p>After determining landmark edges, we search for paths connecting consecutive landmark edges, starting with the landmark with the earliest placement within the reference optical map. (Although our bacterial genomes are circular, the experimental and simulated optical maps are provided to the algorithm as an ordered list of fragment sizes, which is used to define the earliest placement in the optical map.) After the search reaches the last landmark edge, we search for a path connecting the last landmark edge to the first landmark edge to finish the path, since we were assembling circular bacterial genomes. If only one landmark edge can be found, we search for a path from that landmark edge back to itself. 
If no landmark edges can be found, but a landmark pair can be found, then we attempt to find a path from the landmark pair back to itself (otherwise we return with no path found).</p><p>To find a path between pairs of consecutive landmark edges (or from a landmark edge or landmark pair back to itself), we rely on a depth-first search algorithm, pruning the search space according to the following criteria. An edge can only be used to extend the current depth-first search path if the <it>in silico</it> map of the path matches the optical map, and three additional conditions hold (not shown in the pseudocode):</p><p indent="1">1. the edge is not already used in the current path (if multiple edges have been collapsed into a single edge, we ensure an edge is not used more times than its multiplicity);</p><p indent="1">2. the <it>in silico</it> map alignment of <it>CurrentPath</it> to the optical map does not extend past the first restriction site of the alignment of the next consecutive landmark edge;</p><p indent="1">3. the edge currently being added to the depth-first search has not been explored more than 500 times previously while being aligned at the same optical map location (this step avoids repeatedly exploring a very large number of similar paths within a highly complex region of the genome).</p><p>In AGORA&#8217;s depth-first search implementation, we explore edges in decreasing order of length (exploring edges with the longest sequence length first). Longer edges are often the easiest to place accurately along the optical map.</p></sec><sec><st><p>Modifications to improve efficiency</p></st><p>In preliminary tests, we found the depth-first search can incorrectly traverse an edge in the path between two landmark edges early in the search process, which prevents the correct path from being found between subsequent landmark edges without substantial backtracking. 
When no path can be found between two consecutive landmark edges without backtracking through previously explored landmark edges, we simply &#8216;restart&#8217; the search from the current landmark edge with the algorithm assuming that no edges have been traversed so far. The search should succeed this second time, since edges which may have been incorrectly used in the prior path are now available to be explored again.</p><p>Additionally, if the algorithm fails to find a path between landmarks within a preset amount of time (we used one minute in our simulations), we simply skip to the next landmark without attempting to reconstruct the region between the landmarks. In our experiments, we did not have to use this procedure often, but the additional check was useful for a small set of complex genomes to ensure completion within a reasonable amount of time.</p><p>It is important to note that the various heuristics described above, while dramatically improving the performance of our algorithm, lead to potential errors in the reconstruction, especially when using lower quality mapping data. We plan to explore the tradeoff between accuracy and performance in future work.</p></sec></sec><sec><st><p>Edge and sequence correctness metrics for measuring assembly quality</p></st><p>Before describing the edge correctness and sequence correctness metrics more precisely, it is important to note a significant difference between AGORA and typical genome assembly algorithms. Our algorithm seeks to construct a single contig representing the full genome of the organism being assembled, while accepting some errors, in contrast to most assemblers, which break the assembly into separate contigs to avoid assembly errors. As a result, traditional metrics of assembly quality do not directly apply in our case, and thus we propose the alternative metrics described below. 
In brief, we attempt to compare the traversal of the de Bruijn graph chosen by AGORA to the true traversal representing the correct genome sequence. We measure the concordance between these two paths in terms of both the number of concordant edges and the similarity between the reconstructed sequences. We term the two measures edge correctness and sequence correctness, respectively.</p><p>To compute the <it>edge correctness</it> measure, we start by matching the path found by AGORA to the correct path through the graph using a longest common subsequence algorithm. The edges not aligned by this algorithm correspond to errors in our reconstruction. The edge correctness metric tends to overestimate the amount of error in the reconstruction, because in many cases the errors correspond to short edges and thus do not significantly affect the overall correctness of the reconstructed sequence.</p><p>To account for this issue, we also computed a metric which we called <it>sequence correctness</it>, which weights the edge correctness metric by the actual length of the edges. More precisely, we implement a weighted longest common subsequence algorithm to identify the &#8216;heaviest&#8217; set of edges that match the correct path in the correct order. We then sum the length of these edges and divide by the total genome sequence length to obtain our sequence correctness metric.</p><p>One last caveat we should mention is that if we ever find two different edges between the same nodes in the de Bruijn graph with greater than 99% sequence similarity, then we treat them as if they were the same edge, as long as the sequence differences do not cause any change to their restriction sites. This procedure of collapsing similar edges is known as &#8220;bubble collapsing&#8221; and is useful for handling nearly equivalent edges within the de Bruijn graph. Such edges are impossible to disambiguate through optical mapping, and we ignore any errors we might make by swapping the order in which they are traversed. 
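</p><p>The weighted longest common subsequence used for sequence correctness can be illustrated with a short Python sketch. It is not taken from the AGORA code: the function names are hypothetical, edges are represented by arbitrary identifiers (assumed to already merge near-identical bubble edges), and the edge lengths and genome length in the example are invented for illustration.</p><p>
```python
from typing import Dict, Hashable, Sequence


def heaviest_common_subsequence(found: Sequence[Hashable],
                                truth: Sequence[Hashable],
                                weight: Dict[Hashable, float]) -> float:
    """Maximum total weight of edges shared, in the same order, by both paths."""
    # dp[i][j] holds the best weight for the prefixes found[:i] and truth[:j]
    dp = [[0.0] * (len(truth) + 1) for _ in range(len(found) + 1)]
    for i, f in enumerate(found, 1):
        for j, t in enumerate(truth, 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if f == t:
                best = max(best, dp[i - 1][j - 1] + weight[f])
            dp[i][j] = best
    return dp[-1][-1]


def sequence_correctness(found: Sequence[Hashable], truth: Sequence[Hashable],
                         length: Dict[Hashable, float], genome_length: float) -> float:
    """Summed length of the heaviest matching edges over the genome length."""
    return heaviest_common_subsequence(found, truth, length) / genome_length


def matching_edge_count(found: Sequence[Hashable], truth: Sequence[Hashable]) -> int:
    """Unweighted variant: number of edges aligned by a plain LCS."""
    unit = {edge: 1.0 for edge in found}
    return int(heaviest_common_subsequence(found, truth, unit))
```
</p><p>For example, if the true path is a, b, c, d, e and the reconstructed path swaps the short edges b (50 bp) and c (400 bp), the plain longest common subsequence aligns four of the five edges, while the weighted version keeps a, c, d, and e and discards only the 50 bp edge, which shows how weighting by length reduces the penalty for errors confined to short edges.</p><p>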
Note that even if we were to count the sequence differences introduced by bubble collapsing, which the sequence correctness score ignores, the overall decrease in the percentage of sequence matching the true genome would be at most 1%, since we only collapse bubbles that are at least 99% identical.</p></sec></sec><sec><st><p>Competing interests</p></st><p>The authors declare that they have no competing interests.</p></sec><sec><st><p>Authors&#8217; contributions</p></st><p>MP and HL designed the main idea for the algorithm with advice and suggestions from DCS and SG. HL implemented and ran the experiments with help from LM. SZ provided the optical map data. JW provided the de Bruijn graphs and helped compare the optical map results with the previous mate-pair experiments. MP and HL wrote the manuscript with edits and suggestions from DCS, SG, LM, and JW. All authors have read and approved the manuscript.</p></sec></bdy><bm><ack><sec><st><p>Acknowledgements</p></st><p>We would like to acknowledge Arthur Delcher for useful discussions. 
This work has been funded by grants from the NSF (DGE 1148900 to JW; IIS 081211 and IIS 1117247 to MP) and NIH (R01 HG000225 to DCS).</p></sec></ack><refgrp><bibl id="B1"><title><p>Nucleotide sequence of bacteriophage &#934;X174 DNA</p></title><aug><au><snm>Sanger</snm><fnm>F</fnm></au></aug><source>J Mol Biol</source><pubdate>1982</pubdate><volume>162</volume><fpage>729</fpage><lpage>773</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/0022-2836(82)90546-0</pubid><pubid idtype="pmpid" link="fulltext">6221115</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing</p></title><aug><au><snm>Staden</snm><fnm>R</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>1982</pubdate><volume>10</volume><fpage>4731</fpage><lpage>4751</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/10.15.4731</pubid><pubid idtype="pmcid">321125</pubid><pubid idtype="pmpid" link="fulltext">7133997</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Whole-genome random sequencing and assembly of Haemophilus influenzae Rd</p></title><aug><au><snm>Fleischmann</snm><fnm>R</fnm></au><au><snm>Adams</snm><fnm>M</fnm></au><au><snm>White</snm><fnm>O</fnm></au><au><snm>Clayton</snm><fnm>R</fnm></au><au><snm>Kirkness</snm><fnm>E</fnm></au><au><snm>Kerlavage</snm><fnm>A</fnm></au><au><snm>Bult</snm><fnm>C</fnm></au><au><snm>Tomb</snm><fnm>J</fnm></au><au><snm>Dougherty</snm><fnm>B</fnm></au><au><snm>Merrick</snm><fnm>J</fnm></au><etal/></aug><source>Science</source><pubdate>1995</pubdate><volume>269</volume><fpage>496</fpage><lpage>512</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.7542800</pubid><pubid idtype="pmpid" link="fulltext">7542800</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic 
assemblies</p></title><aug><au><snm>Wetzel</snm><fnm>J</fnm></au><au><snm>Kingsford</snm><fnm>C</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au></aug><source>BMC Bioinforma</source><pubdate>2011</pubdate><volume>12</volume><fpage>95</fpage><xrefbib><pubid idtype="doi">10.1186/1471-2105-12-95</pubid></xrefbib></bibl><bibl id="B5"><title><p>Ordered restriction maps of Saccharomyces cerevisiae chromosomes constructed by optical mapping</p></title><aug><au><snm>Schwartz</snm><fnm>D</fnm></au><au><snm>Li</snm><fnm>X</fnm></au><au><snm>Hernandez</snm><fnm>L</fnm></au><au><snm>Ramnarain</snm><fnm>S</fnm></au><au><snm>Huff</snm><fnm>E</fnm></au><au><snm>Wang</snm><fnm>Y</fnm></au></aug><source>Science</source><pubdate>1993</pubdate><volume>262</volume><fpage>110</fpage><lpage>114</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.8211116</pubid><pubid idtype="pmpid" link="fulltext">8211116</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>A microfluidic system for large DNA molecule arrays</p></title><aug><au><snm>Dimalanta</snm><fnm>ET</fnm></au><au><snm>Lim</snm><fnm>A</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Lamers</snm><fnm>C</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Forrest</snm><fnm>DK</fnm></au><au><snm>de Pablo</snm><fnm>JJ</fnm></au><au><snm>Graham</snm><fnm>MD</fnm></au><au><snm>Coppersmith</snm><fnm>SN</fnm></au><au><snm>Goldstein</snm><fnm>S</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Anal Chem</source><pubdate>2004</pubdate><volume>76</volume><fpage>5293</fpage><lpage>5301</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1021/ac0496401</pubid><pubid idtype="pmpid" link="fulltext">15362885</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Single-molecule approach to bacterial genomic comparisons via optical 
mapping</p></title><aug><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Kile</snm><fnm>A</fnm></au><au><snm>Bechner</snm><fnm>M</fnm></au><au><snm>Place</snm><fnm>M</fnm></au><au><snm>Kvikstad</snm><fnm>E</fnm></au><au><snm>Deng</snm><fnm>W</fnm></au><au><snm>Wei</snm><fnm>J</fnm></au><au><snm>Severin</snm><fnm>J</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Forrest</snm><fnm>D</fnm></au><au><snm>Dimalanta</snm><fnm>ET</fnm></au><au><snm>Lamers</snm><fnm>C</fnm></au><au><snm>Burland</snm><fnm>V</fnm></au><au><snm>Blattner</snm><fnm>FR</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>J Bacteriol</source><pubdate>2004</pubdate><volume>186</volume><fpage>7773</fpage><lpage>7782</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1128/JB.186.22.7773-7782.2004</pubid><pubid idtype="pmcid">524920</pubid><pubid idtype="pmpid" link="fulltext">15516592</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Refinement of optical map assemblies</p></title><aug><au><snm>Valouev</snm><fnm>A</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au></aug><source>Bioinformatics (Oxford, England)</source><pubdate>2006</pubdate><volume>22</volume><fpage>1217</fpage><lpage>1224</lpage><xrefbib><pubid idtype="doi">10.1093/bioinformatics/btl063</pubid></xrefbib></bibl><bibl id="B9"><title><p>An algorithm for assembly of ordered restriction maps from single DNA molecules</p></title><aug><au><snm>Valouev</snm><fnm>A</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2006</pubdate><volume>103</volume><fpage>15770</fpage><lpage>15775</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0604040103</pubid><pubid idtype="pmcid">1635078</pubid><pubid idtype="pmpid" 
link="fulltext">17043225</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>Alignment of optical maps</p></title><aug><au><snm>Valouev</snm><fnm>A</fnm></au><au><snm>Li</snm><fnm>L</fnm></au><au><snm>Liu</snm><fnm>Y-CC</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au><au><snm>Yang</snm><fnm>Y</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au></aug><source>J Comput Biol</source><pubdate>2006</pubdate><volume>13</volume><fpage>442</fpage><lpage>462</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/cmb.2006.13.442</pubid><pubid idtype="pmpid" link="fulltext">16597251</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>High-resolution human genome structure by single-molecule analysis</p></title><aug><au><snm>Teague</snm><fnm>B</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au><au><snm>Goldstein</snm><fnm>S</fnm></au><au><snm>Potamousis</snm><fnm>K</fnm></au><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Reslewic</snm><fnm>S</fnm></au><au><snm>Sarkar</snm><fnm>D</fnm></au><au><snm>Valouev</snm><fnm>A</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Kidd</snm><fnm>JM</fnm></au><au><snm>Kohn</snm><fnm>S</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Lamers</snm><fnm>C</fnm></au><au><snm>Forrest</snm><fnm>D</fnm></au><au><snm>Newton</snm><fnm>MA</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au><au><snm>Kent-First</snm><fnm>M</fnm></au><au><snm>Surti</snm><fnm>U</fnm></au><au><snm>Livny</snm><fnm>M</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2010</pubdate><volume>107</volume><fpage>10848</fpage><lpage>10853</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0914638107</pubid><pubid idtype="pmcid">2890719</pubid><pubid idtype="pmpid" link="fulltext">20534489</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Scaffolding and validation of bacterial genome assemblies using optical restriction 
maps</p></title><aug><au><snm>Nagarajan</snm><fnm>N</fnm></au><au><snm>Read</snm><fnm>TD</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><fpage>1229</fpage><lpage>1235</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn102</pubid><pubid idtype="pmcid">2373919</pubid><pubid idtype="pmpid" link="fulltext">18356192</pubid></pubidlist></xrefbib></bibl><bibl id="B13"><title><p>A whole-genome shotgun optical map of Yersinia pestis strain KIM</p></title><aug><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Deng</snm><fnm>W</fnm></au><au><snm>Anantharaman</snm><fnm>TS</fnm></au><au><snm>Lim</snm><fnm>A</fnm></au><au><snm>Dimalanta</snm><fnm>ET</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au><au><snm>Wu</snm><fnm>T</fnm></au><au><snm>Chunhong</snm><fnm>T</fnm></au><au><snm>Creighton</snm><fnm>R</fnm></au><au><snm>Kile</snm><fnm>A</fnm></au><au><snm>Kvikstad</snm><fnm>E</fnm></au><au><snm>Bechner</snm><fnm>M</fnm></au><au><snm>Yen</snm><fnm>G</fnm></au><au><snm>Garic-Stankovic</snm><fnm>A</fnm></au><au><snm>Severin</snm><fnm>J</fnm></au><au><snm>Forrest</snm><fnm>D</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Lamers</snm><fnm>C</fnm></au><au><snm>Perna</snm><fnm>NT</fnm></au><au><snm>Burland</snm><fnm>V</fnm></au><au><snm>Blattner</snm><fnm>FR</fnm></au><au><snm>Mishra</snm><fnm>B</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Appl Environ Microbiol</source><pubdate>2002</pubdate><volume>68</volume><fpage>6321</fpage><lpage>6331</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1128/AEM.68.12.6321-6331.2002</pubid><pubid idtype="pmcid">134435</pubid><pubid idtype="pmpid" link="fulltext">12450857</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><title><p>Optical mapping: a novel, single-molecule approach to genomic 
analysis</p></title><aug><au><snm>Samad</snm><fnm>A</fnm></au><au><snm>Huff</snm><fnm>EF</fnm></au><au><snm>Cai</snm><fnm>W</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Genome Res</source><pubdate>1995</pubdate><volume>5</volume><fpage>1</fpage><lpage>4</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.5.1.1</pubid><pubid idtype="pmpid" link="fulltext">8717049</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Optical mapping and its potential for large-scale sequencing projects</p></title><aug><au><snm>Aston</snm><fnm>C</fnm></au><au><snm>Mishra</snm><fnm>B</fnm></au><au><snm>Schwartz</snm><fnm>D</fnm></au></aug><source>Trends Biotechnol</source><pubdate>1999</pubdate><volume>17</volume><fpage>297</fpage><lpage>302</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/S0167-7799(99)01326-8</pubid><pubid idtype="pmpid" link="fulltext">10370237</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Whole-genome shotgun optical mapping of deinococcus radiodurans</p></title><aug><au><snm>Lin</snm><fnm>J</fnm></au></aug><source>Science</source><pubdate>1999</pubdate><volume>285</volume><fpage>1558</fpage><lpage>1562</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.285.5433.1558</pubid><pubid idtype="pmpid" link="fulltext">10477518</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Genomics via optical mapping. II: ordered restriction maps</p></title><aug><au><snm>Anantharaman</snm><fnm>TS</fnm></au><au><snm>Mishra</snm><fnm>B</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>J Comput Biol</source><pubdate>1997</pubdate><volume>4</volume><fpage>91</fpage><lpage>118</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/cmb.1997.4.91</pubid><pubid idtype="pmpid">9228610</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><aug><au><snm>Anantharaman</snm><fnm>T</fnm></au><au><snm>Mishra</snm><fnm>B</fnm></au><au><snm>Schwartz</snm><fnm>D</fnm></au></aug><source>Genomics via optical mapping. 
III: Contiging genomic DNA. International Conference on Intelligent Systems for Molecular Biology (ISMB)</source><pubdate>1999</pubdate><fpage>18</fpage><lpage>27</lpage></bibl><bibl id="B19"><aug><au><snm>Antoniotti</snm><fnm>M</fnm></au><au><snm>Anantharaman</snm><fnm>T</fnm></au><au><snm>Paxia</snm><fnm>S</fnm></au><au><snm>Mishra</snm><fnm>B</fnm></au></aug><source>Genomics via Optical Mapping IV: Sequence Validation via Optical Map Matching</source><pubdate>2001</pubdate></bibl><bibl id="B20"><title><p>Shotgun optical mapping of the entire Leishmania major Friedlin genome</p></title><aug><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Kile</snm><fnm>A</fnm></au><au><snm>Kvikstad</snm><fnm>E</fnm></au><au><snm>Bechner</snm><fnm>M</fnm></au><au><snm>Severin</snm><fnm>J</fnm></au><au><snm>Forrest</snm><fnm>D</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Anantharaman</snm><fnm>TS</fnm></au><au><snm>Myler</snm><fnm>P</fnm></au><au><snm>Vogt</snm><fnm>C</fnm></au><au><snm>Ivens</snm><fnm>A</fnm></au><au><snm>Stuart</snm><fnm>K</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Mol Biochem Parasitol</source><pubdate>2004</pubdate><volume>138</volume><fpage>97</fpage><lpage>106</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1016/j.molbiopara.2004.08.002</pubid><pubid idtype="pmpid" link="fulltext">15500921</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Whole-genome shotgun optical mapping of Rhodospirillum 
rubrum</p></title><aug><au><snm>Reslewic</snm><fnm>S</fnm></au><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Place</snm><fnm>M</fnm></au><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Briska</snm><fnm>A</fnm></au><au><snm>Goldstein</snm><fnm>S</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Forrest</snm><fnm>D</fnm></au><au><snm>Lim</snm><fnm>A</fnm></au><au><snm>Lapidus</snm><fnm>A</fnm></au><au><snm>Han</snm><fnm>CS</fnm></au><au><snm>Roberts</snm><fnm>GP</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Appl Environ Microbiol</source><pubdate>2005</pubdate><volume>71</volume><fpage>5511</fpage><lpage>5522</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1128/AEM.71.9.5511-5522.2005</pubid><pubid idtype="pmcid">1214604</pubid><pubid idtype="pmpid" link="fulltext">16151144</pubid></pubidlist></xrefbib></bibl><bibl id="B22"><title><p>Validation of rice genome sequence by optical mapping</p></title><aug><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Bechner</snm><fnm>MC</fnm></au><au><snm>Place</snm><fnm>M</fnm></au><au><snm>Churas</snm><fnm>CP</fnm></au><au><snm>Pape</snm><fnm>L</fnm></au><au><snm>Leong</snm><fnm>SA</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Forrest</snm><fnm>DK</fnm></au><au><snm>Goldstein</snm><fnm>S</fnm></au><au><snm>Livny</snm><fnm>M</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>BMC Genomics</source><pubdate>2007</pubdate><volume>8</volume><fpage>278</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2164-8-278</pubid><pubid idtype="pmcid">2048515</pubid><pubid idtype="pmpid" link="fulltext">17697381</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>A single molecule scaffold for the maize 
genome</p></title><aug><au><snm>Zhou</snm><fnm>S</fnm></au><au><snm>Wei</snm><fnm>F</fnm></au><au><snm>Nguyen</snm><fnm>J</fnm></au><au><snm>Bechner</snm><fnm>M</fnm></au><au><snm>Potamousis</snm><fnm>K</fnm></au><au><snm>Goldstein</snm><fnm>S</fnm></au><au><snm>Pape</snm><fnm>L</fnm></au><au><snm>Mehan</snm><fnm>MR</fnm></au><au><snm>Churas</snm><fnm>C</fnm></au><au><snm>Pasternak</snm><fnm>S</fnm></au><au><snm>Forrest</snm><fnm>DK</fnm></au><au><snm>Wise</snm><fnm>R</fnm></au><au><snm>Ware</snm><fnm>D</fnm></au><au><snm>Wing</snm><fnm>RA</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au><au><snm>Livny</snm><fnm>M</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>PLoS Genet</source><pubdate>2009</pubdate><volume>5</volume><fpage>e1000711</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1371/journal.pgen.1000711</pubid><pubid idtype="pmcid">2774507</pubid><pubid idtype="pmpid" link="fulltext">19936062</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>Dynamic programming algorithms for restriction map comparison</p></title><aug><au><snm>Huang</snm><fnm>X</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au></aug><source>Bioinformatics</source><pubdate>1992</pubdate><volume>8</volume><fpage>511</fpage><lpage>520</lpage><xrefbib><pubid idtype="doi">10.1093/bioinformatics/8.5.511</pubid></xrefbib></bibl><bibl id="B25"><title><p>An Eulerian path approach to DNA fragment assembly</p></title><aug><au><snm>Pevzner</snm><fnm>PA</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Waterman</snm><fnm>MS</fnm></au></aug><source>Proc Natl Acad Sci</source><pubdate>2001</pubdate><volume>98</volume><fpage>9748</fpage><lpage>9753</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.171285098</pubid><pubid idtype="pmcid">55524</pubid><pubid idtype="pmpid" link="fulltext">11504945</pubid></pubidlist></xrefbib></bibl><bibl id="B26"><title><p>Matching, Euler tours and the Chinese 
postman</p></title><aug><au><snm>Edmonds</snm><fnm>J</fnm></au><au><snm>Johnson</snm><fnm>EL</fnm></au></aug><source>Math Program</source><pubdate>1973</pubdate><volume>5</volume><fpage>88</fpage><lpage>124</lpage><xrefbib><pubid idtype="doi">10.1007/BF01580113</pubid></xrefbib></bibl><bibl id="B27"><title><p>Velvet: algorithms for de novo short read assembly using de Bruijn graphs</p></title><aug><au><snm>Zerbino</snm><fnm>DR</fnm></au><au><snm>Birney</snm><fnm>E</fnm></au></aug><source>Genome Res</source><pubdate>2008</pubdate><volume>18</volume><fpage>821</fpage><lpage>829</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.074492.107</pubid><pubid idtype="pmcid">2336801</pubid><pubid idtype="pmpid" link="fulltext">18349386</pubid></pubidlist></xrefbib></bibl><bibl id="B28"><title><p>ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads</p></title><aug><au><snm>Maccallum</snm><fnm>I</fnm></au><au><snm>Przybylski</snm><fnm>D</fnm></au><au><snm>Gnerre</snm><fnm>S</fnm></au><au><snm>Burton</snm><fnm>J</fnm></au><au><snm>Shlyakhter</snm><fnm>I</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Malek</snm><fnm>J</fnm></au><au><snm>McKernan</snm><fnm>K</fnm></au><au><snm>Ranade</snm><fnm>S</fnm></au><au><snm>Shea</snm><fnm>TP</fnm></au><au><snm>Williams</snm><fnm>L</fnm></au><au><snm>Young</snm><fnm>S</fnm></au><au><snm>Nusbaum</snm><fnm>C</fnm></au><au><snm>Jaffe</snm><fnm>DB</fnm></au></aug><source>Genome Biol</source><pubdate>2009</pubdate><volume>10</volume><fpage>R103</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2009-10-10-r103</pubid><pubid idtype="pmcid">2784318</pubid><pubid idtype="pmpid" link="fulltext">19796385</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>ABySS: a parallel assembler for short read sequence 
data</p></title><aug><au><snm>Simpson</snm><fnm>JT</fnm></au><au><snm>Wong</snm><fnm>K</fnm></au><au><snm>Jackman</snm><fnm>SD</fnm></au><au><snm>Schein</snm><fnm>JE</fnm></au><au><snm>Jones</snm><fnm>SJM</fnm></au><au><snm>Birol</snm><fnm>I</fnm></au></aug><source>Genome Res</source><pubdate>2009</pubdate><volume>19</volume><fpage>1117</fpage><lpage>1123</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.089532.108</pubid><pubid idtype="pmcid">2694472</pubid><pubid idtype="pmpid" link="fulltext">19251739</pubid></pubidlist></xrefbib></bibl><bibl id="B30"><title><p>SOAP: short oligonucleotide alignment program</p></title><aug><au><snm>Li</snm><fnm>R</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Kristiansen</snm><fnm>K</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au></aug><source>Bioinformatics (Oxford, England)</source><pubdate>2008</pubdate><volume>24</volume><fpage>713</fpage><lpage>714</lpage><xrefbib><pubid idtype="doi">10.1093/bioinformatics/btn025</pubid></xrefbib></bibl><bibl id="B31"><title><p>SOAP2: an improved ultrafast tool for short read alignment</p></title><aug><au><snm>Li</snm><fnm>R</fnm></au><au><snm>Yu</snm><fnm>C</fnm></au><au><snm>Li</snm><fnm>Y</fnm></au><au><snm>Lam</snm><fnm>T-W</fnm></au><au><snm>Yiu</snm><fnm>S-M</fnm></au><au><snm>Kristiansen</snm><fnm>K</fnm></au><au><snm>Wang</snm><fnm>J</fnm></au></aug><source>Bioinformatics (Oxford, England)</source><pubdate>2009</pubdate><volume>25</volume><fpage>1966</fpage><lpage>1967</lpage><xrefbib><pubid idtype="doi">10.1093/bioinformatics/btp336</pubid></xrefbib></bibl><bibl id="B32"><title><p>ALLPATHS: de novo assembly of whole-genome shotgun 
microreads</p></title><aug><au><snm>Butler</snm><fnm>J</fnm></au><au><snm>MacCallum</snm><fnm>I</fnm></au><au><snm>Kleber</snm><fnm>M</fnm></au><au><snm>Shlyakhter</snm><fnm>IA</fnm></au><au><snm>Belmonte</snm><fnm>MK</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au><au><snm>Nusbaum</snm><fnm>C</fnm></au><au><snm>Jaffe</snm><fnm>DB</fnm></au></aug><source>Genome Res</source><pubdate>2008</pubdate><volume>18</volume><fpage>810</fpage><lpage>820</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.7337908</pubid><pubid idtype="pmcid">2336810</pubid><pubid idtype="pmpid" link="fulltext">18340039</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>High-quality draft assemblies of mammalian genomes from massively parallel sequence data</p></title><aug><au><snm>Gnerre</snm><fnm>S</fnm></au><au><snm>Maccallum</snm><fnm>I</fnm></au><au><snm>Przybylski</snm><fnm>D</fnm></au><au><snm>Ribeiro</snm><fnm>FJ</fnm></au><au><snm>Burton</snm><fnm>JN</fnm></au><au><snm>Walker</snm><fnm>BJ</fnm></au><au><snm>Sharpe</snm><fnm>T</fnm></au><au><snm>Hall</snm><fnm>G</fnm></au><au><snm>Shea</snm><fnm>TP</fnm></au><au><snm>Sykes</snm><fnm>S</fnm></au><au><snm>Berlin</snm><fnm>AM</fnm></au><au><snm>Aird</snm><fnm>D</fnm></au><au><snm>Costello</snm><fnm>M</fnm></au><au><snm>Daza</snm><fnm>R</fnm></au><au><snm>Williams</snm><fnm>L</fnm></au><au><snm>Nicol</snm><fnm>R</fnm></au><au><snm>Gnirke</snm><fnm>A</fnm></au><au><snm>Nusbaum</snm><fnm>C</fnm></au><au><snm>Lander</snm><fnm>ES</fnm></au><au><snm>Jaffe</snm><fnm>DB</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2011</pubdate><volume>108</volume><fpage>1513</fpage><lpage>1518</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.1017351108</pubid><pubid idtype="pmcid">3029755</pubid><pubid idtype="pmpid" link="fulltext">21187386</pubid></pubidlist></xrefbib></bibl><bibl id="B34"><title><p>Assembly complexity of prokaryotic genomes using short 
reads</p></title><aug><au><snm>Kingsford</snm><fnm>C</fnm></au><au><snm>Schatz</snm><fnm>MC</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au></aug><source>BMC Bioinforma</source><pubdate>2010</pubdate><volume>11</volume></bibl><bibl id="B35"><title><p>Parametric complexity of sequence assembly: theory and applications to next generation sequencing</p></title><aug><au><snm>Nagarajan</snm><fnm>N</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au></aug><source>J Comput Biol</source><pubdate>2009</pubdate><volume>16</volume><fpage>897</fpage><lpage>908</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/cmb.2009.0005</pubid><pubid idtype="pmpid" link="fulltext">19580519</pubid></pubidlist></xrefbib></bibl><bibl id="B36"><title><p>Computability of models for sequence assembly</p></title><aug><au><snm>Medvedev</snm><fnm>P</fnm></au><au><snm>Georgiou</snm><fnm>K</fnm></au><au><snm>Myers</snm><fnm>G</fnm></au><au><snm>Brudno</snm><fnm>M</fnm></au></aug><source>WABI</source><pubdate>2007</pubdate><fpage>289</fpage><lpage>301</lpage></bibl><bibl id="B37"><title><p>Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons</p></title><aug><au><snm>Narzisi</snm><fnm>G</fnm></au><au><snm>Mishra</snm><fnm>B</fnm></au></aug><source>Bioinformatics</source><pubdate>2011</pubdate><volume>27</volume><fpage>153</fpage><lpage>160</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btq646</pubid><pubid idtype="pmpid" link="fulltext">21088026</pubid></pubidlist></xrefbib></bibl><bibl id="B38"><title><p>On approximating the longest path in a graph</p></title><aug><au><snm>Karger</snm><fnm>D</fnm></au><au><snm>Motwani</snm><fnm>R</fnm></au><au><snm>Ramkumar</snm><fnm>GD</fnm></au></aug><source>Algorithmica</source><pubdate>1993</pubdate><volume>18</volume><fpage>421</fpage><lpage>432</lpage></bibl><bibl id="B39"><title><p>NP-completeness of some edge-disjoint paths problems</p></title><aug><au><snm>Vygen</snm><fnm>J</fnm></au></aug><source>Discret Appl 
Math</source><pubdate>1995</pubdate><volume>61</volume><fpage>83</fpage><lpage>90</lpage><xrefbib><pubid idtype="doi">10.1016/0166-218X(93)E0177-Z</pubid></xrefbib></bibl><bibl id="B40"><title><p>A single-molecule barcoding system using nanoslits for DNA analysis</p></title><aug><au><snm>Jo</snm><fnm>K</fnm></au><au><snm>Dhingra</snm><fnm>DM</fnm></au><au><snm>Odijk</snm><fnm>T</fnm></au><au><snm>de Pablo</snm><fnm>JJ</fnm></au><au><snm>Graham</snm><fnm>MD</fnm></au><au><snm>Runnheim</snm><fnm>R</fnm></au><au><snm>Forrest</snm><fnm>D</fnm></au><au><snm>Schwartz</snm><fnm>DC</fnm></au></aug><source>Proc Natl Acad Sci U S A</source><pubdate>2007</pubdate><volume>104</volume><fpage>2673</fpage><lpage>2678</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.0611151104</pubid><pubid idtype="pmcid">1815240</pubid><pubid idtype="pmpid" link="fulltext">17296933</pubid></pubidlist></xrefbib></bibl><bibl id="B41"><title><p>GAGE: A critical evaluation of genome assemblies and assembly algorithms</p></title><aug><au><snm>Salzberg</snm><fnm>SL</fnm></au><au><snm>Phillippy</snm><fnm>AM</fnm></au><au><snm>Zimin</snm><fnm>AV</fnm></au><au><snm>Puiu</snm><fnm>D</fnm></au><au><snm>Magoc</snm><fnm>T</fnm></au><au><snm>Koren</snm><fnm>S</fnm></au><au><snm>Treangen</snm><fnm>T</fnm></au><au><snm>Schatz</snm><fnm>MC</fnm></au><au><snm>Delcher</snm><fnm>AL</fnm></au><au><snm>Roberts</snm><fnm>M</fnm></au><au><snm>Marcais</snm><fnm>G</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au><au><snm>Yorke</snm><fnm>JA</fnm></au></aug><source>Genome Res</source><pubdate>2011</pubdate><volume>22</volume><fpage>557</fpage><lpage>567</lpage></bibl><bibl id="B42"><title><p>Versatile and open software for comparing large 
genomes</p></title><aug><au><snm>Kurtz</snm><fnm>S</fnm></au><au><snm>Phillippy</snm><fnm>A</fnm></au><au><snm>Delcher</snm><fnm>AL</fnm></au><au><snm>Smoot</snm><fnm>M</fnm></au><au><snm>Shumway</snm><fnm>M</fnm></au><au><snm>Antonescu</snm><fnm>C</fnm></au><au><snm>Salzberg</snm><fnm>SL</fnm></au></aug><source>Genome Biol</source><pubdate>2004</pubdate><volume>5</volume><fpage>R12</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2004-5-2-r12</pubid><pubid idtype="pmcid">395750</pubid><pubid idtype="pmpid" link="fulltext">14759262</pubid></pubidlist></xrefbib></bibl><bibl id="B43"><title><p>Genome sequence of Streptococcus agalactiae, a pathogen causing invasive neonatal disease</p></title><aug><au><snm>Glaser</snm><fnm>P</fnm></au><au><snm>Rusniok</snm><fnm>C</fnm></au><au><snm>Buchrieser</snm><fnm>C</fnm></au><au><snm>Chevalier</snm><fnm>F</fnm></au><au><snm>Frangeul</snm><fnm>L</fnm></au><au><snm>Msadek</snm><fnm>T</fnm></au><au><snm>Zouine</snm><fnm>M</fnm></au><au><snm>Couv&#233;</snm><fnm>E</fnm></au><au><snm>Lalioui</snm><fnm>L</fnm></au><au><snm>Poyart</snm><fnm>C</fnm></au><au><snm>Trieu-Cuot</snm><fnm>P</fnm></au><au><snm>Kunst</snm><fnm>F</fnm></au></aug><source>Mol Microbiol</source><pubdate>2002</pubdate><volume>45</volume><fpage>1499</fpage><lpage>1513</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1046/j.1365-2958.2002.03126.x</pubid><pubid idtype="pmpid" link="fulltext">12354221</pubid></pubidlist></xrefbib></bibl><bibl id="B44"><title><p>On the Analysis of Optical Mapping Data</p></title><aug><au><cnm>Deepayan Sarkar</cnm></au></aug><source>Thesis</source><pubdate>2006</pubdate></bibl></refgrp></bm></art>