Abstract
Background
Nextgeneration sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as matepairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of matepair information in resolving repeated sequences in the DNA (a paramount issue to overcome). While it has been empirically accepted that matepairs improve assemblies, and a variety of assemblers use matepairs in the context of repeat resolution, the effectiveness of matepairs in this context has not been systematically evaluated in previous literature.
Results
We show that, in highcoverage prokaryotic assemblies, libraries of short matepairs (about 46 times the readlength) more effectively disambiguate repeat regions than the libraries that are commonly constructed in current genome projects. We also demonstrate that the best assemblies can be obtained by 'tuning' matepair libraries to accommodate the specific repeat structure of the genome being assembled  information that can be obtained through an initial assembly using unpaired reads. These results are shown across 360 simulations on 'ideal' prokaryotic data as well as assembly of 8 bacterial genomes using SOAPdenovo. The simulation results provide an upperbound on the potential value of matepairs for resolving repeated sequences in real prokaryotic data sets. The assembly results show that our method of tuning matepairs exploits fundamental properties of these genomes, leading to better assemblies even when using an off theshelf assembler in the presence of basecall errors.
Conclusions
Our results demonstrate that dramatic improvements in prokaryotic genome assembly quality can be achieved by tuning matepair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way sequencing projects are designed. We propose that a twotiered approach  first generate an assembly of the genome with unpaired reads in order to evaluate the repeat structure of the genome; then generate the matepair libraries that provide most information towards the resolution of repeats in the genome being assembled  is not only possible, but likely also more costeffective as it will significantly reduce downstream manual finishing costs. In future work we intend to address the question of whether this result can be extended to larger eukaryotic genomes, where repeat structure can be quite different.
Background
Nextgeneration sequencing platforms such as Illumina, ABI Solid, and 454 allow genomes to be sequenced more quickly and at a lower cost than ever before. However, as sequencing technology has improved, the wealth of available data has not made genome assembly easier. In fact, the level of difficulty has increased, as the cost savings provided by new technologies are accompanied by a reduction in achievable readlength. Assembly software is now faced with the task of assembling reads ranging from approximately 35 to 500 nucleotides, in contrast to the 8002000 nucleotide reads previously generated by the traditional Sanger technology. Additionally, the error characteristics of the newer technologies are not as well known as they were for traditional Sanger sequencing. De novo assemblies (even in the case of prokaryotic genomes) are highly fragmented [1]. In addition, advances in sequencing technologies have not been mirrored by corresponding improvements in finishing  a time, labor, and costintensive process aimed at reconstructing a complete, gapless, genome sequence from a fragmented assembly. As a result, the majority of genomes remain in an incomplete 'draft' state, hampering studies that rely on longrange genome structure information (e.g., analysis of operon/regulon structure, or largescale genomic variation).
Genomic repeats  DNA segments repeated in nearlyidentical form throughout a genome  are the main reason why genome assemblers cannot automatically reconstruct complete genome sequences from modern sequencing data. Repeats have a number of biological causes, including transposable sequence elements (which can often be highly abundant in a genome), prophages, highlyconserved gene clusters (e.g., the Ribosomal RNA operons in bacteria), or large segmental duplications, and can vary from short simple sequence repeats (e.g., AAAAA, ACACAC) to long stretches of DNA (thousands to tens of thousands of basepairs) that are highly or completely identical. In the context of inexpensive nextgeneration sequencing and assembly, some of the original pitfalls of genome assembly, such as coverage gaps and confusion caused by read errors, can often be obviated by simply sequencing the genome to very high coverage levels. Repeated sequences, on the other hand, cannot be avoided by simply oversequencing, and lead to considerable ambiguity in the reconstruction of a genome, providing limits on the length of the contiguous DNA segments (contigs) that can be correctly reconstructed using unpaired reads [2].
Graphtheoretic models of genome assembly provide an effective framework for analyzing the impact of repeats on the complexity of genome assembly [3]. The genome assembly problem is commonly formulated as finding a constrained path through an appropriatelydefined graph. Throughout this article we will rely on an Eulerian formulation [4] that reduces the assembly problem to finding a Chinese Postman path/tour (a minimum length path through the graph that covers all edges) within a de Bruijn graph (to be defined later). Under this formulation, repeats appear as forks in the graph that make it difficult to select the graph traversal that corresponds to the correct reconstruction of the genome from among an exponential (in the number of repeats) number of possible Chinese Postman paths. Without additional information, genome assemblers can only correctly reconstruct relatively short, unordered segments of the genome (corresponding to those sections of the graph which contain no forks).
In addition to the collection of reads, most sequencing technologies also produce pairwise constraints on the placement (approximate distance and relative orientation) of these reads along the genome  matepair information. This information can further constrain the possible traversals of the assembly graph, thereby allowing longer segments of the genome to be unambiguously reconstructed. Matepair information has been a critical component of most genome projects, starting with Haemophilus influenzae [5]  the first freeliving organism to be fully sequenced. Most genome assemblers now include modules that can use matepair information for scaffolding and repeat resolution (Celera Assembler [6], Velvet [7], Euler [3], Arachne [8], and ALLPATHS [9] to name just a few), and standalone scaffolding tools such as Bambus [10] allow the incorporation of matepair data into virtually any assembly [11]. Furthermore, matepairs are frequently used for the purpose of assembly validation in tools such as BACCardi [12], Consed [13], and Hawkeye [14].
Most of the research on the use of matepair information in genome projects has focused primarily on the use of this information to achieve longrange connectivity by spanning long repeats and gaps in the assembly due to insufficient sequencing coverage. As a result, substantial efforts have been focused on the development of robust protocols for constructing longrange (810 kbp or longer) matepair libraries. Here, we explore a complementary purpose for matepairs  the automatic resolution of repeats during assembly. We argue that, due to affordable highthroughput sequencing technologies, coverage gaps are far less frequent than they used to be, and assembly fragmentation is primarily caused by repeats. Thus improvements in repeat resolution algorithms will translate into substantial improvements in the quality of the resulting assemblies.
Resolving repeats with matepairs
All previously published techniques for repeat resolution rely on the same basic observation: If a unique path in the assembly graph can be found that connects the sequences at the ends of a matepair and the length of this path matches the approximately known matepair size, then this path can be inferred to be correct; i.e., the path represents a partial traversal of the graph that is consistent with the correct reconstruction of the genome being assembled. Since the problem of finding a path of predefined length through a graph is NPhard [15], the algorithms used during assembly rely on various heuristics for efficiently finding paths that support matepair information.
The EULER assembler [3] greedily finds paths whose lengths are consistent with the size of matepairs (the authors indicate that such paths can be easily found for a majority of matepairs), then converts these paths into artificial long reads that can be processed through the Eulerian superpath algorithm [4]. Conflicts between paths through the graph are resolved by prioritizing paths on the basis of their support (number of matepairs that confirm a given path) [16]. The Velvet [7,17] and ALLPATHS [9] assemblers take into account the uniqueness of nodes in the assembly graph. Specifically, matepair links are considered only if they are anchored in contigs that correspond to nonrepeated sequences in the genome (usually determined through depth of coverage statistics).
Analyses of the effectiveness of such algorithms have typically assumed the parameters of the sequencing experiment to be fixed; i.e., the goal is to build the best assembly possible given the types of data commonly generated in current sequencing projects. An exception is the study by Chaisson et al. [16] where the authors evaluate the effect of read length and read quality on the ability to reconstruct a genome. In other words, they assume one can tune the read length generated by a sequencing machine (which is a realistic assumption for the Illumina technology) and estimate whether the assembly improves, and by how much, if the length of reads is increased. Such analyses are critical for providing a scientific basis for picking the optimal tradeoff between sequencing cost (which increases with the read length) and quality of assembly.
In our work, we pose a complementary question: How useful are matepairs for resolving repeats in de novo assemblies created from shortreads? Which types of matepair libraries most effectively resolve repeats and minimize the amount of manual finishing needed to complete the genome (given that there is some restriction on the number of matepair libraries that can be costeffectively produced in a real sequencing experiment)? Similar questions have been addressed by recent publications in the context of sequence alignment. Chikhi et al. [18] evaluate how read length affects genome resequencing experiments, and Bashir et al. [19] developed a method to identify the matepair sizes that are optimal for detecting structural variation through mapping. However, the question of how effective matepairs are for the resolution of repeats and how this parameter of the sequencing experiment can be adjusted to improve assemblies has not previously been analyzed in a systematic fashion.
Measuring assembler performance
Metrics commonly used for comparing the quality of genome assemblies are primarily focused on statistics derived from the global distribution of contig sizes. Statistics such as the number of contigs, average contig size, and N50 contig size are frequently reported in the literature. (N50 contig size is the size c for which 50% of the bases in the genome are contained in contigs of size at least c.)
When the correct answer is known (e.g., reassembly of a known genome), one can also record the number of errors found in the assembly. An alternative approach is proposed by Chaisson et al. [16] where they compare the results of an assembly to a theoretical optimal: the best assembly that can be reconstructed without errors from a genome, given the repeat graph (see Methods) of that genome.
In our study we are specifically targeting this theoretical optimum: in an idealized setting (perfect sequencing data), what is the best possible assembly that can be obtained given the parameters of the sequencing process? While aiming for this theoretical optimum, we also maintain a dose of reality about certain aspects of sequencing projects, restricting ourselves to the use of only two matepair libraries (a fairly standard procedure due to costs) and use of readlengths that are reflective of inexpensive nextgeneration sequencing. Like Chaisson et al. [16] and Kingsford et al. [2], we start with the idealized repeat graph of a genome (see Methods), the structure of which is determined by the read length. Although we are attempting to maximize the size of contigs that can be unambiguously reconstructed from this graph, contig size statistics are difficult to compare across genomes and may not adequately describe the amount of repeat resolution that a set of matepairs provide. Therefore, we focus instead on a measure of assembly ambiguity that is directly related to unresolved repeats. Specifically, we measure the number of manual experiments that would be necessary to completely resolve the structure of a genome during finishing. Briefly, a path through the repeat graph can be uniquely determined by pairing up the edges adjacent to repeat nodes. This pairing is commonly determined during finishing through targeted PCR experiments.
We measure the usefulness of a matepair library in terms of the amount of manual finishing effort that can be saved through its use (see Methods for details). We show that, in the context of highdepth prokaryotic sequencing experiments, very short matepairs are more useful than long matepairs (sizes as high as 40,000 bp are often generated in sequencing projects) for resolving repeats, and that choosing matepair sizes based on the repeat structure (defined in Methods) of the assembly graph is a powerful approach for creating more complete assemblies. Our results hold across 360 'ideal' sequencing experiments (see following section) as well as when using an offtheshelf assembler in the presence of basecall errors (assembly of 8 bacterial genomes using SOAPdenovo [20]). These results can serve as a basis for developing new algorithms and sequencing strategies that will improve de novo assemblies. Due to computational efficiency considerations we limited our analysis to prokaryotic genomes. Whether or not our results can be extended to eukaryotes (whose repeat structure is often quite different from prokaryotes) remains an exercise for future work.
Results
We simulated ideal sequencing projects (no gaps, no errors, known matepair sizes) for 391 complete bacterial genomes using a range of read lengths and matepair sizes. Read lengths were chosen to be reflective of reads that can be affordably obtained using nextgeneration sequencing platforms. In addition, for 360 of these genomes, we performed a direct comparison of the amount of repeat resolution that can be obtained when applying information provided by long matepairs (libraries between several thousands to tens of thousands of basepairs are commonly generated in current sequencing experiments) vs. matepairs whose sizes are 'tuned' to the repeat structure of the genomes being assembled. Finally, we show that the results of our simulated comparison are recapitulated when assembling 8 bacterial genomes in the presence of basecall errors using an off theshelf assembler (SOAPdenovo).
We assume a refined version of the repeat graph for a genome; specifically, the graph that can be obtained through a series of lossless transformations on the original de Bruijn graph (see Methods). Since all of the transformations used simply convert unambiguous paths from the original graph into single nodes, the resulting simplified structure is equivalent (in terms of the spectrum of possible graph traversals) to the original graph. The simplified version of the graph is more compact and better highlights the complexity introduced by repeats, since every node in the simplified graph is either a repeat itself or is one of a set of possible nodes to traverse in between a pair of repeats. (For more information on this graph refinement process, see Methods and a more in depth description in an article by Kingsford et al. [2].)
Throughout most of our analysis, we consider kmer size (the size of sequences used in initial construction of the de Bruijn graph) and readlength to be interchangeable, denoting both with the letter k. Although readlength and kmer size are not typically the same in a real sequencing and assembly experiment (in reality longer reads are decomposed into shorter kmers), we consider the distinction to be irrelevant throughout much of the discussion of our simulations; the only exception being the discussion of SOAPdenovo assemblies. Since our 'idealized' graphs (see Methods) are constructed from complete genomes decomposed directly into kmers, we implicitly imagine that we have perfect reads of length k and that they can be used in their entirety during graph construction. Since a read of length k can provide a kmer of length at most k, ignoring the distinction still provides a valid (if perhaps lenient) upperbound on what one can expect to achieve if using reads of length k with a real assembler.
Our results begin with some theoretical analysis of the usefulness and limitations of matepairs and then proceed to an empirical investigation into the benefits of 'tuning' matepair libraries to target repeats of high complexity. It should be noted here that due to limitations of our current implementation to handle larger, eukaryotic de Bruijn graphs, we have restricted the scope of our results to cover only assembly of prokaryotic genomes. We intend to explore whether or not these results can be extended to eukaryotic assemblies in future work. A complete set of data files produced for this experiment is available at: http://www.cbcb.umd.edu/~wetzeljo/matePairs/ webcite.
Performance of shortestpath heuristic
Our study relies on a simple heuristic for the use of matepair information: matepairs are only used if an unambiguous shortest path through the assembly graph connects its endpoints and is consistent with the insert length. Medvedev et al. [21] use a similar shortestpath approach in their maximum likelihood genome assembler. While more elaborate approaches have been proposed (e.g. searching for the 'best' path from among multiple paths that connect the endpoints of a matepair [9]), such methods are computationally expensive. Furthermore, we find that the length of a shortest path connecting the two ends of a matepair exactly matches the insert length almost 90% of the time, even for relatively long (8 kbp) libraries and short reads (35 bp) (see Table 1). For longer reads and shorter matepairs, this statistic approaches 100%. The shortestpath heuristic is less effective only when using very long matepair inserts (35 kbp) with short reads (3550 bp). In this case, as few as 62% of the matepairs correspond to a shortest path.
Table 1. Matepair statistics
While a large fraction of matepairs could, in principle, be used during assembly through the shortestpath heuristic, the vast majority of matepairs do not span repeat nodes, and many cannot yield unambiguous paths (see Methods). Thus, most matepairs are not useful for repeat resolution. For example, on average only 0.5% of all matepairs from a 400 bp library could be used to resolve repeats (as outlined in Methods) in assemblies constructed from 100 bp reads (see Table 1). This result is unsurprising when the matepair length is short with respect to the readlength, as both ends of the matepair are usually found within the same node of the assembly graph. However, even in the case when matepairs are significantly longer than the read length, only a small fraction of the matepairs are ultimately usable (e.g., 6.77% when using 35 bp reads and matepair library of 8 kbp insert size). An intuitive explanation for this phenomenon is the observation that the genomic repeats that cause most of the complexity in a genome represent just a small fraction of the size of the genome. Only matepairs that span these, relatively rare, complexity hotspots are useful during assembly. While Table 1 highlights the matepair statistics discussed here, Table 2 provides a comprehensive set of statistics for all read/matepairlength combinations studied.
Table 2. More complete matepair statistics
As a corollary to the above observation, the value of matepairs is maximized only once sufficient coverage is achieved to ensure that virtually all resolvable repeats are adequately spanned. The necessary number of matepairs is a function of the genome size G and the kmer length k used for graph construction (roughly 10G/k; see Methods for details).
'Localized complexity' is common in shortread assembly graphs
A surprising result of our study has been the fact that, for the majority of the genomes studied, assembly ambiguity can only be decreased up to a certain point irrespective of depth of coverage and library size, i.e. matepair information appears to have limited value in resolving repeats. Closer inspection of the assembly graphs reveals a common motif that limits the applicability of matepair information. Specifically, we find pairs of repeat nodes (R_{1}, R_{2}), separated by two or more nondecision nodes of equal lengths. Frequently, chains of such patterns (commonly called 'bubbles') can be found in many genomes. Assembly bubbles are difficult to resolve with matepairs because multiple equallength paths can be found between the endpoints of matepairs that span the bubble structure. Thus matepairs which span bubbles provide no information on the order in which the intermediate nondecision nodes (between R_{1 }and R_{2}) need to be visited. An example is shown in Figure 1. These regions of 'localized complexity' can only be resolved by matepairs that tightly span exactly one of the two repeats. As we will show later, this phenomenon is common in bacterial genomes and highlights the need to tune matepair libraries to the specific repeat structure of each genome.
Figure 1. An assembly 'bubble'. An assembly 'bubble' that complicates repeat resolution with matepairs. Shaded nodes are nondecision nodes (in and outdegree equal to 1). The nodes R1 and R2 are decision nodes (repeats). There are two possible paths of the same length from one end of the matepair to the other (black nodes), leading to ambiguity in the graph traversal.
In order to estimate the extent of genomic complexity introduced by the bubble pattern described above, we define a measure of the localized complexity of a genome as follows. For each repeat node, v ∈ G, we define the node to be 'trivial' if all of its successor nodes have distinct lengths, and 'nontrivial' otherwise. The motivation for this nomenclature is that if v has multiple successors of the same length, it is likely that v fits the description of repeat R_{1 }in the above stated example of an assembly bubble, and will be difficult to resolve without targeting at least one matepair to barely span its length. On the other hand, if all successors of v are of different lengths, v cannot fit the description of R_{1 }in the above example, and the proper traversal order of v can likely be resolved using matepairs of arbitrary (longer) length. Thus we define the localized complexity of a genome, CStatistic, to be the percentage of the finishing complexity of the genome contained in nontrivial nodes (CStat(G) = 100 , where N is the set of nontrivial nodes in the assembly graph, S is the set of all nodes in the assembly graph, and C_{v }is the contribution of node v to the finishing complexity of the genome (see Methods for details on finishing complexity).
As seen in Figure 2, roughly 60% of the 35, 50, and 100mer graphs have a Cstatistic ranging between 60 and 90. In other words, in about 60% of the graphs created from short reads, 6090% of their total finishing complexity is contained within repeats that are difficult to resolve using matepair information. The amount of localized complexity appears to be significantly lower for graphs created from longer reads: for 250mer graphs, the average CStatistic was 48.58%, and for the 500mer graphs the average Cstatistic was 33.61%. This implies that longer reads may be critical if the goal of assembly is to reconstruct entire genomes in an automated fashion.
Figure 2. Cstatistic across 391 bacterial genomes. The percentage of shortread (35, 50, and 100mer) graphs with CStatistic in a particular range. Here we see that in about 60% of the graphs created from shortreads, 6090% of the finishing complexity is contained in repeats that are difficult to resolve using matepair information.
The existence of bubbles in assembly graphs has been noted before, and several approaches have been suggested for resolving such regions (see, e.g., Chaisson et al. [16]). Our results imply that any such methods will be of limited use unless the matepair libraries are tuned to match the repeat structure of the genome being assembled.
'Ideal' matepairs are short
We further explored the hypothesis that 'tuning' matepair libraries to accommodate a genome's specific repeat structure could lead to higher quality assemblies. All repeats in a genome were grouped according to size into bins of width 4k, where k is the read length. For each bin we calculated the fraction of the total finishing complexity (see Methods) of the genome that is due to the repeats from that bin. We then selected the two bins with the highest finishing complexities, and constructed matepair libraries that just span repeats in these bins by using inserts that were 3k longer than the average repeat size for each bin. We restrict our analysis to just two libraries for any given genome to match the setting commonly encountered in practice.
Matepairs selected in this fashion end up being roughly proportional in size to the read length, k, averaging from 4.5k to 6k. This is significantly shorter than commonly constructed longrange libraries (6,000 bp or longer).
As expected, the tuned insert length is greater for graphs with less localized complexity (lower CStatistic) than it is for graphs with greater amounts of localized complexity (higher CStatistic). For 35, 50 and 100mer graphs with a CStatistic of at least 50, the mean ideal insert size was consistently short (4.5k to 6k), while genomes with a lower CStatistic resulted in longer and more varied library sizes (see Figure 3). Increased variation in ideal matepair size for graphs with lower CStatistic can likely be attributed to there being fewer graphs in this category (compare Figures 2 and 3).
Figure 3. 'Ideal' matepair lengths across 391 bacterial genomes. Average 'tuned' insert sizes for 35, 50, and 100mer graphs, separated by CStatistic. As can be seen, those graphs with a CStatistic higher than 50 had a nearly uniform distribution and short insert sizes, while those with a lower CStatistic had longer inserts and higher variance. Error bars represent standard deviation.
These results are similar in spirit to, and complement, the EULERPCR algorithm proposed by Mulyukov et al. [22]. In EULERPCR, the assembler uses information about the repeat length (along with the known sequences on either side of the repeat) in order to minimize the number of primers and multiplex PCR experiments needed to resolve tangles in the graph. Essentially, the assembler is leveraging the repeat structure of the graph to find optimal sets of short matepairs to reduce manual finishing. In the context of producing matepair libraries via sequencing, we cannot target individual repeats, so we instead attempt to find the matepair sizes that are most likely to resolve a large fraction of a genome's repeats.
'Tuned' matepair libraries perform well in a simulated setting
To evaluate the effectiveness of the 'tuned' matepair libraries just described, we analyzed through simulations 360 bacterial genomes (each marked by an * in Additional file 1) across 5 different read lengths. (Several of the 391 genomes we originally analyzed required prohibitive computational resources and these genomes were excluded from the current analysis.) For each genome we compared the performance of the 'tuned' matepair libraries to a mixture of 2,000 and 8,000 bp libraries. The latter mixture deserves a brief explanation: the majority of genome projects (irrespective of sequencing technology, see e.g. [23,24] and jgi.doe.gov) rely on a mixture of library sizes, one of which is relatively long (possibly exceeding 10,000 bp and even as long as 40,00050,000 bp in the case of fosmid libraries). As we cannot feasibly test all possible combinations of library sizes, we chose a combination that is reasonable and roughly matches the 'typical' scenario used in Sanger sequencing projects. In both cases (tuned libraries or long libraries) the depth of coverage was the same, and was chosen to ensure that all repeats can be adequately spanned (see Methods). For each genome we recorded the finishing complexity (see Methods) before and after the incorporation of matepair information, as well as the localized complexity (CStatistic) of the genome.
Additional file 1. A list of the 391 genomes used in the study, along with their identifiers. The 360 genome subset of these that are discussed in the Results subsection regarding simulated comparison of 'tuned' vs standard matepair libraries are each marked with an asterisk (*).
Format: CSV Size: 15KB Download file
The tuned libraries, which generally consisted of very short matepairs (see previous section of Results), vastly outperformed the longrange libraries on graphs constructed from short reads (35, 50, and 100 bp). The average reduction in finishing complexity was 82.66% when using tuned libraries, in contrast to only 47.44% when using a combination of longrange libraries (see Figure 4).
Figure 4. Reduction in finishing complexity for 'tuned' vs. standard matepairs. Graphs on left (A) depict the mean percent reduction in finishing complexity on graphs constructed from a particular kmer size given a set of ideal ('tuned') matepair libraries (grouped according to the CStatistic). The ideal libraries are graphspecific, but the smaller of the two libraries averaged between 4.5 k and 6 k, where k is the original kmer size (read length) used to construct the graph. Graphs on right (B) depict the same statistics when using a mixture of two long libraries (2000 and 8000 bp).
For a combination of two long matepair libraries, one can see a clear inverse relationship between reduction in finishing complexity and CStatistic (two graphs on right in Figure 4). However, this relationship is greatly diminished for the tuned libraries (two graphs on left in Figure 4). This result is consistent with the observation that tuned library sizes are longer for the low complexity genomes, where chains of trivial repeats can often be resolved simultaneously by a single matepair.
While the results of simulations on 250mer graphs were very similar to the results of simulations on the shorterread graphs (tuned matepairs outperformed long matepairs), the performance improvement over our 'standard' library mixture decreases for 500mer graphs (see bottom two graphs in Figure 4). This can be attributed to the fact that the standard insert size of 2000 bp is short (exactly 4k) with respect to the original read length for the 500mer graphs, while it is long (exactly 8k) with respect to the original read length of the 250mer graphs. This result indicates that the need for tuning matepair lengths to the genome structure may be unique to the use of very short reads.
'Tuned' matepair libraries improve performance when using offtheshelf assembly software
In order to analyze the effectiveness of 'tuned' vs. long matepair libraries in a more realistic context, we created 30× coverage of 8 complete bacterial genomes (see Table 3) with 100 bp reads using MetaSim (see Methods). We then reassembled each genome given its set of reads and different combinations of long or tuned matepair libraries using SOAPdenovo v. 1.04 (see Methods). For each genome, one pair of tuned libraries was based on the 'ideal' lengths predicted by our 35mer graphs (since we used SOAPdenovo's maximum kmer length of 31 bp) and another pair was based on the ideal lengths predicted by our 100mer graphs (since our reads were 100 bp long). See Table 3 for ideal matepair lengths for each genome.
Table 3. Genomespecific matepair lengths used in SOAPdenovo assemblies
We found that in all but two of the cases, the pair of tuned libraries predicted by our 35mer graphs were most effective at improving assemblies in terms of increasing average and maximum contig size, producing fewer total contigs, and increasing N50 contig size (see Table 4). These libraries consisted of one library of length at most 165 bp and a second of less than 800 bp in all cases. For the two cases in which the tuned libraries did not perform best, a single 200 bp library performed best. However, when splitting readcoverage between libraries of size 200 and 6000 bp, performance always suffered in comparison to the 200 bp library alone. These findings highlight our previously stated result that shorter matepairs are generally more helpful in assembly of prokaryotes than are longer matepairs. Interestingly, the tuned libraries predicted by our 100mer graphs typically perform quite poorly despite the fact that the assembly is built from 100 bp reads. This indicates that despite the 'readthreading' procedure used by SOAPdenovo, information initially contained in a read is being lost when it is broken into shorter kmers (31 bp is the maximum kmer size for SOAPdenovo). There is little to no overlap between SOAPdenovo's assembly algorithms and the methods we have described for applying matepairs to ideal graphs in our simulations; yet tuned libraries perform better in both situations, indicating that they are capturing a fundamental property of the genome structure.
Table 4. Results of SOAPdenovo assemblies
Discussion
There are a few difficulties we encounter when attempting to apply the exact quantitative analysis we describe above (and in Methods) to real assembly graphs. First, while our ideal graphs are unambiguously directed, real assembly graphs are inherently bidirected since DNA is doublestranded. Therefore, it is not possible to directly transform graphs created from real data into unambiguously directed graphs so that the exact finishing complexity (by our definition) can be computed. Specifically, our quantitative analysis requires that we know the indegree and outdegree of a particular node, yet the direction in which the edges are traversed in a real graph is often not known until the assembly process. Additionally, since edge multiplicities in real sequencing experiments must be estimated based on various metrics such as depth of coverage and errors occur during sequencing, real assembly graphs are often disconnected and certainly not fully Eulerian, further limiting the exact analysis described.
Nonetheless, the strategy for choosing the 'tuned' matepair sizes should be applicable to real genome projects. Note that our strategy for tuning the library sizes requires information about the amount of genomic complexity implied by a particular set of repeats of similar size, i.e. it is not sufficient to simply estimate the number and size of repeats from a genome assembly. Instead, the assembly graph (commonly output by many modern assemblers) needs to be analyzed in more detail. As we already mentioned, real assembly graphs have a different structure than the graphs used in our simulation. In order to accommodate the nonEulerian nature of real assembly graphs, the methods we proposed can be converted to a Chinese Postman traversal setting, i.e., find a minimumlength tour of a nonEulerian graph that covers each edge at least once. An efficient algorithm for solving this problem on bidirected de Bruijn graphs has been recently published [25]. The resulting Chinese Postman tour implies an underlying Eulerian graph (which is implicitly constructed while solving the traversal problem) within which the algorithms described in our work can be applied. Specifically, within this implied graph we can compute, for each repeat, the corresponding finishing complexity which, together with the repeat size (already known from the original assembly graph), can be used as outlined in the Results (subsection titled "Ideal matepairs are short") to heuristically determine appropriate matepair sizes. In future work we plan to build a software package that performs this analysis for commonly used genome assemblers.
Conclusions
Our results demonstrate that dramatic improvements in the quality of prokaryotic genome assemblies can be achieved by tuning matepair sizes to the actual repeat structure of a genome, suggesting the possible need to change the way that such sequencing projects are designed. In many sequencing projects, library sizes are chosen based on the ease with which certain size libraries can be effectively constructed in the lab. Virtually always, the matepair libraries are constructed at the beginning of the project, before having an opportunity to evaluate the repeat structure of the genome being sequenced. This 'onestep' approach was unavoidable in the case of Sanger sequencing where matepairs were a byproduct of the sequencing process (sequencing would ultimately cost the same whether or not matepairs are generated).
In the case of nextgeneration sequencing technologies, however, the protocols were originally developed for creating unpaired reads, and the generation of matepairs generally increases the costs (in terms of prep time, reagents, and machine runtime) of sequencing. In this context, the twotiered approach we propose  first generate an assembly of the genome with unpaired reads and evaluate the repeat structure of the genome; then generate the matepair libraries that provide most information towards the resolution of repeats in the genome being assembled  is not only possible, but likely more costeffective in the long run. The tuned matepair libraries produced by this process will be more usable by the assembler, leading to better assemblies (longer correct contigs), which will dramatically reduce the cost of downstream manual finishing.
Across all of the genomes examined, as read lengths get shorter, the overall efficacy of matepairs in reducing finishing complexity is almost always diminished regardless of the amount of localized complexity present in the genome. This result is supported by the empirical observation that genome assemblies constructed from shortread data (e.g. ABI Solid or Illumina) are typically substantially more fragmented than those constructed with longer reads (454 or Sanger) [1], further underscoring the need to develop affordable sequencing technologies that generate long reads.
Many advances have been made in the field of automatic genome assembly in the past several years. Here we have shown that a novel, twotiered approach to sequencing projects (along with the development of technologies which can reliably and affordably produce matepairs of a desired size) can greatly improve automation. However, even under ideal circumstances (idealized graphs, errorfree reads, perfect coverage, and tuned matepair sizes), we still found ourselves unable to produce completely gapless assemblies in many cases. On average, approximately 17% of the original finishing complexity of a given genome still remained after applying matepairs. Thus, although the automation process continues to improve, the development of highthroughput and costeffective approaches for genome finishing will also be of great importance in the years to come.
Methods
Idealized repeat graphs
A commonly used formulation of the assembly problem, based on de Bruijn graphs, was introduced by Pevzner et al. [4]. Specifically, a de Bruijn graph of order k is a graph that contains a node for every (k  1)length string (referred to as a (k  1)mer) over a given alphabet, and that contains an edge for every pair of (k  1)mers that overlap by exactly (k  2) letters (the two nodes and the edge thus represent one of all possible kmers). In the context of assembly, we focus on the subgraph of the de Bruijn graph that corresponds to the kmers that are actually present in the genome being assembled (for simplicity we will refer to this subgraph as the de Bruijn graph as well).
We constructed de Bruijn graphs from the complete genome sequences of 391 prokaryotes (obtained from ftp://ftp.ncbi.nlm.nih.gov webcite; see Additional file 1 for genome identifiers) as follows. We first chose a kmer size, k (values 35, 50, 100, 250, 500 bp were used to be representative of readlengths achievable through a range of modern sequencing technologies). For each (k  1)mer present in the genome, a node was created (without repetition of nodes for repeated (k1)mers). For each kmer present in the genome, a directed edge was created connecting the node representing its first (k  1) letters to the node representing its last (k  1) letters. If a kmer was repeated in the genome, then a number of edges equal to the number of instances of the kmer were created. Thus we created 'ideal graphs' that are representative of perfect sequencing experiments (exactly one read for every klength substring in the genome and no sequencing errors). In a real sequencing experiment, due to the doublestranded nature of DNA, each read and its reverse complement must be considered, resulting in a bidirected graph. For the sake of simplicity, our graphs were constructed using only reads in the forward direction.
A series of lossless simplifications were applied to the resulting de Bruijn graphs as described by Kingsford et al. [2]. The first is a standard path compression in which if there exists a pair of nodes u and v, where u is the only predecessor of v, and v is the only successor of u, the nodes are combined into a single node with the same predecessors as u, and the same successors as v. The new node is labeled with the concatenated sequences of nodes u and v (accounting for the overlap between these sequences). A second type of compression is based on the idea that all areas of the de Bruijn graph whose cycle decomposition is a tree has a unique Eulerian tour, and can therefore be collapsed into a single node representing the sequence given by this traversal. The third technique iteratively 'unzips' halfdecision nodes (those nodes with only one predecessor but multiple successors, or vice versa) by duplicating them and then performing standard path compression on the resulting unambiguous areas of the graph traversal.
The simplified graphs are significantly smaller in terms of number of nodes and edges, yet the same set of Eulerian traversals exist before and after simplification since the transformations simply condense unambiguous paths into single nodes. Additionally, these graphs provide a very clear picture of the genome's 'repeat structure' (see Figure 5): Every node in the refined graph is either a repeat itself (has several pairs of in and out edges) or it is a nonrepeat (has exactly one pair of in and out edges) that resides sandwiched between two repeats. Thus the length of each repeat, as well as the number of times it appears in the genome, can be clearly seen in the refined graph.
Figure 5. A simplified de Bruijn graph. A small de Bruijn assembly graph after the simplification process is complete. The nodes R1, R2, R3, and R4 are repeats (decision nodes), and the shaded nodes are nondecision nodes. Note that nondecision nodes can only reside sandwiched between repeats.
Simulating matepairs
We simulated matepairs along the original genome sequence using the following procedure. Given a library size l, we chose a matepair specific size d at random from a Gaussian distribution with mean l and standard deviation 0.1l. A starting point s along the genome was selected uniformly at random, and two reads of length k (same as the value used in constructing the graph) were selected starting from positions s and s + d  k. The reads were identified within the graph through a simple lookup, and the corresponding nodes were paired. In simulations involving matepairs of varying lengths, an equal number of matepairs of each length was used.
Throughout the analysis we assume to know the exact length d for each matepair, while in a practical setting only an approximate distance between matedreads is known. Since there may be many paths of approximately the distance implied by a matepair, using our assumption can eliminate uncertainty as to whether a matepair implies a truly unambiguous path. Thus choosing not to model the uncertainty of matepair sizes allowed us to focus on understanding the effect of insert size on repeat resolution, rather than becoming stymied by ambiguity or resorting to heuristics aimed at choosing the 'mostlikely' of potentially many paths implied by the matepair. In practice, this approach also allows us to use a great many matepairs in our simulation that might not have been usable by a real assembler. Although we do not model uncertainty about matepair sizes, we do take into account the variability in matepair sizes within one library that occurs during real matepair creation by choosing the matepair size from a Gaussian distribution (as described above).
Repeat resolution through matepairs
We map each end of a matepair to a specific node in the graph. Because the graphs and matepairs are constructed from the same known sequence, the coordinates of each node in the assembly graph with respect to the reference genome is known, and this information is used to determine the placement of the mated reads within the graph. In a practical setting, this placement of mated reads can be determined by finding all good alignments (or some subset of the best of these alignments) of the mated sequences to nodes in the graph, then choosing a placement that most closely reflects the length of the insert based on some pathfinding heuristic.
We check the graph for a shortest path that both connects the endpoints of a matepair and is consistent with the exact insert length. If a matepair cannot be mapped to a shortest path in the graph it is eliminated from further consideration. Matepairs whose ends map to a same node were excluded from further analysis as they can clearly provide no additional connectivity information. Similarly, we excluded from further analysis those matepairs whose ends mapped to two adjacent nodes in the graph. Since such matepairs do not span repeats, we see no clear way to use them for repeat resolution. Additionally, if more than one shortest path connects the ends of a matepair, there is ambiguity as to which path is the correct one, so the matepair is not used. Essentially, only matepairs connected by a unique shortest path that is exactly consistent with the known insert length are used for repeat resolution.
Matepairs fitting the above criteria are considered useful for disambiguating repeats. Specifically, these matepairs indicate that the Eulerian tour through the graph that is consistent with the correct reconstruction of the genome must contain the shortest path connecting their endpoints, thereby eliminating the uncertainty introduced by decision nodes (forks) along these paths.
More sophisticated heuristics for using matepair information could be applied here (several have been previously proposed as outlined in the Background section). We rely on a shortest path heuristic in order to ensure computational tractability since, in the general case, finding a path of a given length within a graph is NPhard [15]. While this heuristic may fail in certain circumstances (i.e. a unique path, consistent with the matepair length, exists between the two ends yet it is longer than the shortest path), our results show that, in practice, the vast majority of matepairs have lengths consistent with a shortest path between the mated sequences (see Results). Thus more sophisticated heuristics should only have a limited impact. In addition, similar shortest path approaches have been previously used in genome assembly (see, e.g. Medvedev et al. [21]).
Finishing complexity of the graph
In order to quantitatively characterize the benefit of matepair information for repeat resolution, we define a new measure of assembly complexity. Our measure, which we call 'finishing complexity', estimates the number of manual fishing experiments that would be required in order to correctly resolve all repeats and reconstruct a complete and correct version of a genome given the current state of the assembly graph. By using such a measure we assume that the primary goal of the assembly process is to automatically reconstruct as complete a picture of the genome sequence as possible. This measure may not directly apply to other contexts; e.g., in a project that only attempts to correctly reconstruct the genes, rather than the full sequence, of an organism.
A straightforward strategy for manually resolving a repeat is to choose an arbitrary inedge of the repeat and try pairing it with each possible outedge using targeted experiments (e.g. through targeted PCR) until the correct pairing is found, at which point both the indegree and outdegree of the repeat are effectively reduced by 1. This process is then repeated for the remaining edges. The last pairing is implicit, as only one inedge and one outedge remain. Thus we define a node's finishing complexity to be , where a can be either the indegree or outdegree of the node. Since our graphs are Eulerian, the indegree and outdegree of any particular node are always equal. The total finishing complexity of the graph is simply the sum of the finishing complexities of all nodes.
Determining the optimal number of matepairs
Our results on 'localized complexity' (see Results) indicate that it is frequently necessary to target matepairs to tightly span repeats of a particular size if we wish to produce matepairs that can imply truly unambiguous paths. However, simply picking an insert size that can potentially span repeats of this size is not sufficient to ensure that (when matepairs are randomly generated) all such repeats will be disambiguated by the resulting library. It is also necessary to generate sufficient coverage such that each instance of every repeat of the targeted size is actually spanned by at least one matepair. This is especially important if very short matepairs are being used.
By applying Poisson statistics, we can see that a level of coverage of approximately 10 at the read level is sufficient to virtually ensure that at least one matepair will span every instance of each repeat in the genome. Here we assume that coverage of the genome is uniform, which is a reasonable assumption, although it may not hold true for all current data sets. Specifically, let X be a random variable denoting the number of matepair arrivals across a sequence of any given k nucleotides in the genome. Let λ be the mean arrival rate for matepairs (coverage), which can be achieved by creating λ(G/k) matepairs, where G is the length of the genome in nucleotides and k is the original kmer length used for construction of the graph. If we create 10(G/k) matepairs in total, then λ = 10, and the following holds:
Thus if we create 10(G/k) matepairs, we virtually guarantee at least one matepair arrival across any stretch of k nucleotides in the genome. Since no nodes in the graph are shorter than k, there should be at least one matepair starting in each node that directly leads into a repeat (with at least one additional matepair for each additional traversal of the node), and the determination of 'ideal' insert size (see Results) should guarantee that the other end of the matepair is just on the other side of the repeat (for the repeats of the size being targeted).
SOAPdenovo assemblies
For each of 8 bacterial genomes to be assembled (see Table 3), 30× coverage in 100 bp reads were created using MetaSim v. 0.9.1 with an empirical error model derived from the 80 bp model provided by MetaSim (assuming the last 21 bp have the same error characteristics). Several sets of matepairs were constructed for the various assemblies, chosen to be reflective of either standard libraries, 'ideal' libraries (see Results) for a 35mer graph, or ideal libraries for a 100mer graph. The matepair lengths were distributed normally around the mean with a standard deviation of 10%. SOAPdenovo v. 1.04 was run with option K31 (the largest kmer size allowed by SOAPdenovo) and configuration options reverse_seq = 0 (standard matepair orientation), asm_flags = 3 (try harder to build large contigs), pair_num_cutoff = 2 (minimum number of matepairs required to link contigs), map len = 60 (minimum length match between a read and a contig during scaffolding).
Authors' contributions
JW, CK, and MP all contributed ideas and participated in writing this article. MP conceived the study. JW and MP primarily designed and performed the experiments. CK contributed idealized graphs. All authors read and approved the final manuscript.
Acknowledgements
Funding This work was partially supported by NSF grant IIS0812111 to MP and CK, NSF grant 0849899 to CK, and NSF Award CCF0830569 to Rajiv Gandhi.
References

Alkan C, Sajjadian S, Eichler EE: Limitations of nextgeneration genome sequence assembly.
Nat Meth 2011, 8:6165. Publisher Full Text

Kingsford C, Schatz M, Pop M: Assembly complexity of prokaryotic genomes using short reads.
BMC Bioinformatics 2010, 11:21. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Pevzner P, Tang H: Fragment assembly with doublebarreled data.
Bioinformatics 2001, 17(suppl 1):S225233. PubMed Abstract  Publisher Full Text

Pevzner P, Tang H, Waterman M: An Eulerian Path Approach to DNA Fragment Assembly.
Proceedings of the National Academy of Sciences of the United States of America 2001, 98(17):97489753. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fleischmann R, Adams M, White O, Clayton R, Kirkness E, Kerlavage A, Bult C, Tomb J, Dougherty B, Merrick J, McKenney K, Sutton G, Fitzhugh W, Fields C, Gocyne J, Scott J, Shirley R, Liu L, Glodek A, Kelley J, Jenny M, Weidman J, Phillips C, Spriggs T, Hedblom E, Cotton M, Utterback T, Hanna M, Nguyen D, Saudek D, Brandon R, Fine L, Fritchman J, Fuhrmann J, Geoghagen N, Gnehm C, McDonald L, Small K, Fraser C, Smith H, Venter J: Wholegenome Random Sequencing and Assembly of Haemophilus influenzae Rd.
Science 1995, 269(5223):496512. PubMed Abstract  Publisher Full Text

Myers E, Sutton G, Delcher A, Dew I, Fasulo D, Flanigan M, Kravitz S, Mobarry C, Reinert K, Remington K, Anson E, Bolanos R, Chou H, Jordan C, Halpern A, Lonardi S, Beasley E, Brandon R, Chen L, Dunn P, Lai Z, Liang Y, Nusskern D, Zhan M, Zhang Q, Zheng X, Rubin G, Adams M, Venter J: A Whole Genome Assembly of Drosophila.
Science 2000, 287(5461):21962204. PubMed Abstract  Publisher Full Text

Zerbino D, Birney E: Velvet: Algorithms for de Novo short read assembly using de Bruijn graphs.
Genome Research 2008, 18(5):821829. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Batzoglou S, Jaffe D, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov J, Lander E: ARACHNE: a whole genome shotgun assembler.
Genome Research 2002, 12:177189. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Butler J, MacCallum I, Kleber M, Belmonte ISM, Lander E, Nusbaum C, Jaffe D: ALLPATHS: De Novo assembly of wholegenome shotgun microreads.
Genome Research 2008, 18(5):810820. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pop M, Kosack D, Salzberg S: Hierarchical scaffolding with Bambus.
Genome Research 2004, 14:149159. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pop M: Genome assembly reborn: recent computational challenges.
Briefings in Bioinformatics 2009, 10(4):354366. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bartels D, Kespohl S, Albaum S, Drüke T, Goesmann A, Herold J, Kaiser O, Püler A, Pfeiffer F, Raddatz G, Stoye J, Meyer F, Schuster S: BACCardia tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison.
Bioinformatics 2005, 21:853859. PubMed Abstract  Publisher Full Text

Gordon D, Abaijian C, Green P: Consed: A graphical tool for sequence finishing.
Genome Research 1998, 8:195202. PubMed Abstract  Publisher Full Text

Schatz M, Phillipy A, Schneiderman B, Salzberg S: Hawkeye: an interactive visual analytics tool for genome assemblies.
Genome Biology 2007, 8:R34. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Chaisson M, Brinza D, Pevzner P: De novo fragment assembly with short matepaired reads: Does the read length matter?
Genome Research 2009, 19:336346. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zerbino D, McEwen G, Margulies E, Birney E: Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet ShortRead de Novo Assembler.
PLoS one 2009, 4(12):e8407. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chikhi R, Lavenier D: Pairedend read length lower bounds for genome resequencing.
BMC Bioinformatics 2009, 10(Suppl 13):O2. BioMed Central Full Text

Bashir A, Bansal V, Bafna V: Designing deep sequencing experiments: structural variation, haplotype assembly, and transcript abundance.
BMC Genomics 2010, 11:385. PubMed Abstract  BioMed Central Full Text

Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J: De novo assembly of human genomes with massively parallel short read sequencing.
Genome Research 2010, 20(2):265272. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Medvedev P, Brudno M: Maximum likelihood genome assembly.
Journal of Computational Biology 2009, 16(8):11011116. PubMed Abstract  Publisher Full Text

Mulyukov Z, Pevzner P: EULERPCR: Finishing Experiments for Repeat Resolution.
Pacific Symposium on Biocomputing 2002, 199210. PubMed Abstract

Göker M, Held B, Lucas S, Nolan M, Yasawong M, Rio TD, Tice H, Cheng J, Bruce D, Detter J, Tapia R, Han C, Goodwin L, Pitluck S, Liolios K, Ivanova N, Mavromatis K, Mikhailova N, Pati A, Chen A, Palaniappan K, Land M, Hauser L, Chang Y, Jeffries C, Rohde M, Sikorski J, Pukall R, Woyke T, Bristow J, Eisen J, Markowitz V, Hugenholtz P, Kyrpides N, Klenk H, Lapidus A: Complete genome sequence of Olsenella uli type strain (VPI D76D27CT).

Wirth R, Sikorski J, Brambilla E, Misra M, Lapidus A, Copeland A, Nolan M, Lucas S, Chen F, Tice H, Cheng J, Han C, Detter J, Tapia R, Bruce D, Goodwin L, Pitluck S, Pati A, Anderson I, Ivanova N, Mavromatis K, Mikhailova N, Chen A, Palaniappan K, Bilek Y, Hader T, Land M, Hauser L, Chang Y, Jeffries C, Tindall B, Rohde M, Göker M, Bristow J, Eisen J, Markowitz V, Hugenholtz P, Kyrpides N, Klenk H: Complete genome sequence of Thermocrinis albus type strain (HI 11/12T).
Standards in Genomic Sciences 2010, 2(2):194. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kundeti V, Rajasekaran S, Dinh H: An Efficient Algorithm For Chinese Postman Walk on Bidirected de Bruijn Graphs.