Abstract
Background
The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features.
We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software’s general performance.
Results
We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular standalone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single standalone scaffolder outperforms the others on all datasets. However, BESST fares favorably to the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide.
Conclusion
We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Keywords:
Genome assembly; Scaffolding; Genome analysis; Mate pair nextgeneration sequencingBackground
Recent highthroughput sequencing (HTS) technologies are attractive for de novo assembly projects since they produce millions of short DNAsequences (referred to as reads) at low cost. However, these reads are only a couple of hundred base pairs long making it difficult for an assembler (e.g., [1,2]) to reconstruct the genome. As a result, the output of an assembly often consists of contigs, i.e., subsets of reads assembled into longer fragments of genomic sequence.
However, HTStechnologies provide protocols for creating read pairs that can be used to increase the contiguity of an assembly. We define a read pair as two reads that are sequenced at a known distance and orientation where the distance between the reads, is referred to as insert size. If the two reads within a read pair belong to different contigs c_{a} and c_{b}, a link is created between c_{a} and c_{b}, see Figure 1a. From this link, we can infer a relative order, orientation and distance between c_{a} and c_{b}.
Figure 1. Notation. a) A read pair with insert size x (unknown distance) aligns to two contigs c_{a} and c_{b}, thus creates a link between c_{a} and c_{b}. The read pair gives rise to observations o_{a},o_{b} and they are used to infer the unknown distance d. Distances for o_{a},o_{b},d and r are illustrated. b) Graph structure and notations of the scaffold graph . Two contigs c_{a} and c_{b} connected by an edge e created from alignments of read pairs.
The process of linking and ordering contigs is called scaffolding. In addition to paired reads, information such as reference sequences of related organisms [3], restriction maps [4] and RNAseq data [5], can be used for contig linking. However, reference based assembly is not applicable to most de novo sequencing projects, restriction maps are often not available, and RNAseq data only have coverage over genes and contains no information about distance between reads which makes contig placement ambiguous. This makes read pair information the most commonly used (and often also the only applicable) source of information for scaffolding.
Unfortunately, scaffolding with read pairs poses challenges: reads may create spurious links because of read errors, heterozygosity and the repeated nature of the genome, and these spurious links make ordering and orientations among the contigs ambiguous. Hence, the scaffolding problem can be summarized as detecting and utilizing the correct links in order to find a consistent ordering and orientation of the contigs. The existing formalizations of scaffolding have been proven to be NPcomplete, but it is still unclear if these formulations, even when finding the optimal solution with respect to the objective, solves the real (i.e. biological) problem. These approaches have focused on structural properties of the graph induced by contig links, with little emphasis on assessing correctness of individual links. Our approach focuses on removing incorrect links and employing sophisticated statistics to evaluate whether linking reads come from the underlying library distribution, or from misalignments. Only in a second step are structural properties used.
The following section discusses the formalization of scaffolding and related work, as well as gives an outline and motivation for our work. Our algorithm, realized in an implementation called BESST (Bias Estimating Stepwise Scaffolding Tool), is presented in detail in the Methods section. The algorithm scales well and is practical on very large and complex genomes, as proved by its use in the Picea abies genome project (20 Gbp) [6]. Furthermore, it excels at scaffolding with wider insert size distributions.
We present an evaluation of BESST against other popular standalone scaffolders on a large variety of datasets from GAGE [7]. Compared to previous assessments of novel scaffolding methodologies, the results obtained from our evaluation allows us to draw conclusions about the general performance of standalone scaffolders to a much higher extent. Another recent extensive evaluation of scaffolding tools is given in [8]. In our study we primarily compare standalone scaffolders because they have access to the same amount of information and are applicable in the same contexts (e.g. scaffolding with mate pair libraries that was not use in the original assembly). Nonetheless, we also include GAGE results on integrated scaffolders.
Our results indicate that no single scaffolder outperforms the others on all datasets although in total, BESST shows the most favorable results among standalone scaffolders. Furthermore, our algorithm outperforms other standalone scaffolders when the library insertsize distribution has a high standard deviation. Although there is wide performance variation around integrated scaffolders, overall, GAGE results demonstrate that AllpathsLG’s assemblies scaffolded with its integrated scaffolder have the highest quality.
The problem
Formalizing Scaffolding
As input for scaffolding we assume a set of contigs produced by an assembler and a number of read pairs from a read pair library that have been aligned to the contigs. These read pairs have an insert size distribution with mean μ and standard deviation σ. By aligning all reads in on we can define the graph as follows: Each contig gives rise to precisely two vertices c_{i,L} and c_{i,R} in where c_{i,L} denotes it’s 5’ end and c_{i,R} denotes it’s 3’ end (see Figure 1b). In a read pair, if aligns to precisely one contig c_{k} and aligns to precisely one contig c_{m}, with k ≠ m, this read pair induces a relative orientation and an approximate distance between c_{k} and c_{m}. This relationship is represented as an edge e, see Figure 1b. We let V and E denote the set of vertices and edges respectively in . Given , several formulations and methods have been proposed for scaffolding. We will discuss some of them below.
Problem formulations in related works
The scaffolding problem (SP) defined by Huson et al.[9] is a formulation that is commonly referred to. Using their notation, let be defined as above and let n links between two contigs induce a weight n on the edge e between these two contigs. Furthermore, let Φ : V → N be an ordering, orientation and distance map of , that is, an assignment of non negative integer coordinates to the vertices V in that preserves the contig lengths. Given such a mapping instance ϕ, [9] states that an edge e between c_{i} and c_{j} is consistent if c_{i} and c_{j} have the correct relative orientation (induced by aligned read pairs), and the distance between c_{i} and c_{j} is approximately correct. Here, approximately correct means that e suggests a distance between c_{i} and c_{j} that is less than μ + 3 σ, a heuristically chosen bound. If an edge does not satisfy these constraints, it is called inconsistent. Huson et al. [9] define SP to be the problem of finding a maximum weight consistent edge subset. SP has been used as foundation for a number of other works discussing scaffolding and proposed heuristics for solving it can generally be categorized as either “greedy” or “graphstructure” optimization algorithms.
Greedy algorithms proposed to solve SP include SSPACE and Bambus [10,11]. SSPACE extends scaffolds in a greedy fashion applying a heuristic stopping criterion. Bambus builds scaffolds greedily with heuristics to remove inconsistent link constraints.
Graphstructure optimization algorithms that have been proposed to solve the SP are for instance: SOPRA [12] formulates a global optimization problem for solving relative contig orientation (exact for simple regions while a simulated annealing approach is employed in more complex regions of the graph). In a second step, readpair distribution is used to determine the relative positions of contigs within a scaffold. If an inconsistency is found in the positioning step, the link causing the inconsistency is removed and the algorithm restarts at the orientation step. OPERA [13] builds scaffolds using the number of inconsistent edges p in a subgraph as a design criterion (the subgraph represents a potential scaffold). By treating p as fixed, they can obtain a polynomial time algorithm to find an optimal (with respect to a given p) solution to their slightly modified version of SP. The algorithm then tries all p starting from p = 0 and stops when a scaffold can be constructed. SLIQ [14] formulates a set of linear inequalities together with majority voting to predict placements of contigs. MIP Scaffolder [15] and GRASS [16] formulate SP as a mixed integer programming problem, but uses different techniques to find a solution. MIP Scaffolder resolves conflicting regions in the obtained MIP solution using heuristics such as removing edges that are stretched or contracted more than a given threshold. GRASS uses an ExpectationMaximization algorithm. The maximization step obtains degrees of penalties on contig links given fixed contig orientations. The penalties are set according to what magnitude the constraints for a link is violated. If a penalty is higher than a given threshold, the penalty of the link is “deactivated”, i.e., its constraints are not considered. The expectation step is used to obtain the expected contig orientation of links given (the “activated”) penalties set in the maximization step. Links that are activated in the final solution are used for scaffolding.
There are advantages and disadvantages with these two classes of methods. Algorithms that are solving a local problem using a greedy approach often have better runtime and scale well on larger genomes but use oversimplified methods to find a solution which may only work for some genomes. Graphstructure optimization methods are instead hindered by their time complexity for finding a solution. The runtime scales poorly and it is difficult to predict if such an algorithm will ever finish on a larger dataset (see section Results).
Additionally, current methods that use insert sizes of paired reads for contig placement are built on false assumptions as we have previously shown [17]. This can complicate scaffolding when libraries with large insertsize variation are used.
Link inconsistency detection
The methods previously described define SP similar to [9] with modifications on how to define a consistent edge. Different heuristics are used between the methods to obtain a solution to SP. Yet, a common denominator is that the number of links supporting an edge is used as an indicator of reliability; edges with many links are preferred and those with few links are avoided. This reasoning is intuitive, but fails to account for variation in link count due to contig features. Firstly, the number of links between two contigs depends on the real (i.e., biologically) distance between the two contigs and on their size [17]. Secondly, in SP we face structural features such as repeated regions, heterozygosity, and chimeric contigs. These features create clusters with reads being misaligned which cannot be seen as individual random events. It is our assumption that the number of random, nonstructural, misalignments caused by, e.g., sequencing errors are almost negligible compared to the structural misalignments. Link count is therefore a poor indicator of edge reliability.
We take a different approach to SP and, instead of link count, evaluate edges based on link statistics. When read pairs are mapped to contigs, are they placed on and connecting contigs in a reasonable way? In other words, we want to answer the question: given an edge e, is the cluster of readpairs forming e coming from the readpair library, or are they a consequence of a structural feature? If these reads together show similar properties as the read pair library we are scaffolding with (e.g., mean, standard deviation), the edge is more likely to be correct.
We propose an algorithm, BESST (BiasEstimating Stepwise Scaffolding Tool), that puts focus on analyzing the scaffold graph in local regions using statistics to filter out spurious edges created by structural errors. BESST starts scaffolding with contigs that meet a length criterion for the library (definition given in section Methods). It then continues with smaller contigs in an optional step. If several different pairedread libraries are used, BESST scaffolds with one library at a time in an increasing order of insert size of the library. Separating contigs with respect to size is mainly due to two reasons: (i) Links between large contigs make gap size estimation more stable (see [17]) giving a more robust statistical analysis. (ii) The gain in speed is significant since correct regions are simple path components in which are found by visiting each edge once, thus, the time complexity is O (E).
Results and discussion
De novo assembly validation is a task as difficult as de novo assembly itself. Recent evaluation efforts like GAGE [7] and Assemblathon [18] encountered several problems in identifying the best assembler. GAGE clearly demonstrated how the same assembler can have completely different performances (e.g., quality) even on similar datasets (e.g., bacterial genomes). This predicament was also supported in recent evaluation efforts [19,20]. Despite this, as noted by [21], all new published assemblers and scaffolders have been compared to the thenexisting tools highlighting better performances on a specific dataset using some specific metrics. We argue that evaluation of tools should be performed on multiple datasets and/or scenarios to avoid overgeneralization and confirmation bias. For standalone scaffolders without stated dependencies, it is advisable to test on output from several assemblers to investigate overall performance.
We have tried to address the above issues in our evaluation of BESST, using a wide range of different datasets and assemblers. BESST has been compared with three other stateoftheart scaffolders: OPERA, SOPRA, and SSPACE.
Datasets
Evaluation has been performed using the three GAGE datasets [7] which gave us the possibility to evaluate scaffolders on three highly different genomes: Staphylococcus aureus, Rhodobacter spaeroides, and Human chromosome 14 (hereafter referred to as Hs14). All three datasets have been sequenced with high coverage Illumina pairedend (i.e., PEreads) and matepairs (i.e., MPreads) libraries. Moreover each organism has been assembled with up to 8 different assemblers.
GAGE provides high quality MPlibraries with narrow insert size distributions with standard deviation lower than 10% of the mean. However, narrow insert size libraries cannot be obtained in assembly projects where only small amounts of DNA are available. The MP libraries obtained in these cases are wide and the standard deviation can be up to 50% of the mean. BESST uses a technique that works well for larger uncertainties in insert size as this was one of the design assumptions. Therefore we have included the MP library provided in [22] which is characterised by a large variation in insert size. We used picard [23] to estimate the mean and standard deviation of insert size to 2600 and 1250 base pairs respectively. This library will from now on be referred to as the “wide MP” library. An insert size histogram of this distribution is available in Additional file 1: Figure S2.
Additional file 1. Supplementary Material. Document with supplemental information. This document contains additional evaluation results and proof of Theorem 1.
Format: PDF Size: 545KB Download file
This file can be viewed with: Adobe Acrobat Reader
Evaluation
We scaffolded all 23 available (contig level) GAGEassemblies with BESST v1.0.4.2, and the standalone scaffolders OPERA v1.2, SOPRA v1.4.6, and SSPACEbasic v2.0 using both PE and MP libraries provided by GAGE. Results for assemblerintegrated scaffolders, as computed by GAGE, are also presented, but we primarily compare with the standalone scaffolders because they have access to the same amount of information as BESST and are applicable in similar situations. Note that in GAGE evaluation, Bambus2 was used both for contig and scaffold assembly (with unitigs provided by Celera Assembler).
All scaffolders were run with default parameters (see Additional file 1 for details) on a 1 TB RAM machine equipped with 24 CPUs. Read pairs were mapped to contigs using BWA v0.6.1 for BESST, OPERA, and SOPRA. SSPACEbasic is distributed with Bowtie, thus we used the included version of Bowtie (v0.12.5) for alignments with SSPACE. SSPACE also have a commercial version that supports alignments with BWA. The difference in read alignment method may have an impact on the scaffolding result but we did not investigate this. Out of the 124 scaffolding experiments, 117 successfully terminated within our runtime limit of 48 hours (OPERA and SOPRA were not able to scaffold the Hs14 dataset within this time limit in 3 and 4 cases respectively). Moreover, for the Rhodobacter genome, we also scaffolded the 8 available contiglevel assemblies employing the wide MP library. To summarize, a total of 156 scaffolding experiments have been run, and of these, 149 terminated within the runtime limit and were evaluated.
Each of the 149 results have been evaluated with GAGE validation scripts http://gage.cbcb.umd.edu/results/gagevalidation.tar.gz webcitefor scaffolds, using the available reference sequence. For each assembly, we used GAGE evaluation scripts to compute:
• Scaffold errors: number of indels, inversions, relocations, and translocations (as defined by [7]).
• Scaffold NG50: size of the longest scaffold such that the sum of the lengths of all scaffolds longer than it is at least half of the (known) reference genome size.
• Scaffold Esize: The expected scaffold size at a randomly chosen position on the genome (introduced and defined by [7]). The Esize is calculated as where L_{c} is the length of scaffold c and G is the genome length estimated by the sum of all scaffold lengths. Esize is computed similarly for contigs.
• Scaffold corrNG50: NG50 after scaffolds have been broken at every position a scaffold error is found.
• Scaffold corrEsize: Esize after scaffolds have been broken at every position a scaffold error is found.
Moreover, for each entry, we also compute:
• Number of initial contigs and number of produced scaffolds.
• Time used by the scaffolder (without considering time required to align reads).
NG50 is a common metric to evaluate an assembly, often offering a good indication of the connectivity as it gives the weighted median contig length. However, the size of one scaffold can be misleading as a measure of the general connectivity of an assembly (as discussed in [7]) Consider, for example, a simple case of two error free assemblies a and b of a 1000 bp genome. If assembly a has one contig of 499 bp and 5 contigs of 100 bp while assembly b has 10 contigs of 100 bp, both will have an NG50 of 100 bp. The measure will therefore not expose the difference in quality between a and b. However, the respective Esizes for assembly a and b are 299 and 100, and thus better capturing the average assembly fragmentation.
Results
Tables 1, 2 and 3 presents the scaffolding performances for high quality libraries provided by GAGE. With the evaluation metrics provided here, no standalone scaffolder is a clear winner (as expected [7,20]). In general, BESST produces favorable results on all of the organisms. Contrary to the results in [8], SOPRA does not perform well on the metrics provided by GAGE. The results for assemblerintegrated scaffolders, as computed by GAGE, are presented alongside the standalone scaffolders results. There is a large variation in performance of integrated scaffolders but in general, BESST fares well also here. We note that only AllpathsLG has better scaffolded assembly on all three GAGE datasets. Scaffolds from Bambus2 on S. aureus and SGA on Hs14 are two other instances where the integrated scaffolder outperforms the standalone ones.
Table 1. Staphylococcus aureus GAGE data
Table 2. Rhodobacter sphaeroides, GAGE data
Table 3. Hs14, GAGE data
High quality datasets where insert size variation does not deviate much from the mean are not always available (see Additional file 1 for a discussion of this). The wide MP library has higher variation, and thus, increases the difficulty in scaffolding by introducing more uncertainty for contig placement. Table 4 shows scaffolding results for the wide MP library. Considering this scenario, BESST is outperforming the other standalone scaffolders having the highest total connectivity whilst giving the fewest errors. OPERA shows a slightly higher connectivity in some cases yet produces 17 times more errors in total. Withstanding the SGA assembly, SOPRA shows few errors in all cases. Yet in most cases, SOPRA also shows extremely low connectivity close to the original contig assembly. Similarly, SSPACE is shown to produce few errors but also struggle with connectivity. As mentioned, larger variation in library insertsize introduces more uncertainty of distance estimates and placement of contigs. Thus, scaffolding becomes more error prone. We believe that our performance here is a consequence of BESST’s ability to infer correct linkstatistics despite wide library distribution.
Table 4. Rhodobacter sphaeroides on GAGE contig assemblies using the wide MP library
Tables 5, 6, 7 and 8 illustrates the types of errors that the standalone scaffolders make on the different data sets and Tables 9, 10, 11 and 12 shows runtime of the scaffolders excluding alignment time. An upper bound on runtime was set to 48 hours. BESST and SSPACE present good time scalability in contrast to SOPRA and OPERA which did not finish after 48 hours on 3 and 4 Hs14 instances respectively.
Table 5. Types of errors on S. aureus summed over all assemblies
Table 6. Types of errors on Rhodobacter sphaeroides summed over all assemblies
Table 7. Types of errors on Hs14 summed over all assemblies
Table 8. Types of errors on Rhodobacter sphaeroides with wide MP library summed over all assemblies
Table 9. Runtime for scaffolders on Staphylococcus aureus
Table 10. Runtime for scaffolders on Rhodobacter sphaeroides with GAGE data
Table 11. Runtime for scaffolders on Hs14 with GAGE data (upper bound time requirement was set to 48 hours)
Table 12. Runtime for scaffolders on Rhodobacter sphaeroides with wide mate pair library
Conclusion
We proposed a new algorithm, BESST, for the scaffolding problem. This algorithm works well on both small and large datasets. Moreover, we performed a large evaluation of our software against other popular standalone scaffolders. BESST places favorably compared to the other scaffolders on GAGE datasets and outperforms the other methods on libraries with a wide insertsize distribution.
Methods
Scaffolding of larger contigs
BESST works on a graph structure (as defined under Formalizing scaffolding in the Background section). We apply statistics to assess similarity of observed link distribution to the expected link distribution between contigs larger than μ+4σ: a value chosen so that it is very unlikely that a properly mapped read pair will span over a contig of such size. This means that a correctly assembled contig that is not a perfect repeat will have at most one true edge to a neighboring contig of this size. However, the graph structure created for contigs of this size is in practice often far from linear due to e.g. small repeated regions and chimeric regions and that is why we want a way to assess edge quality.
The assessment of edges are realized in a score designed to reflect how reads from a read pair library should be placed on contigs if they were in fact correctly assembled close to each other on the genome. It consists of two parts, a link variation score π_{σ} and a link dispersity score π_{ζ}, which we present in following subsections.
Link variation score (π_{σ}):
Let c_{i},c_{j} be two correctly assembled contigs at distance d away from each other on the genome (with d small enough for the readpair library to span). Reads linking c_{i} and c_{j} follow different distributions depending on the size of d, and a good estimate of d can be calculated using MLestimation with the tool GapEst introduced in [17]. Here, we go one step further and answer the question: Given μ,σ and obtained from links observed over c_{i} and c_{j}, what should the standard deviation of these links be? We denote this quantity with σ_{od} (standard deviation of observations given a gap size) to be consistent with the notation used for GapEst. Theorem 1 gives the theoretical expected value of σ_{od} which is dependent on the read length r and the length of the longer and shorter contig giving rise to the gap (denoted c_{min},c_{max}).
Theorem 1. Given μ,σ and d, σ_{od }is given by
where g (d) and q (d) are defined in Additional file 1.
Proof. Derivation shown in Additional file 1.
We now define π_{σ} as
This quantity is a measure of how far observed distances are from the theoretical distance. Note that 0≤π_{σ}≤1.
Link dispersity score (π_{ζ})
The other part of the scoring function is an indicator of how well dispersed links are over the contigs they are connecting, given an estimated gap d between them. This dispersity is scored by the two sample KolmogorovSmirnov statistic [24] that gives a measure of difference in distribution between two independent samples. Letting observations on a contig be a sample, the independence between samples comes from the fact that the aligner has no information about which contigs lie close together, thus links between contigs can be seen as two independent alignment events.By observing if link distribution is similar for two linked contigs, we can detect abnormal edges that might come from one or several smaller repeats residing within a contig (see Figure 2b). We call this score the dispersity score and it is calculated as follows. Before testing, translation and reflection of the observations are necessary (see Figure 2a).
Figure 2. Dispersity score. Illustration of dispersity measurement. Read pairs linking contigs c_{1} and c_{2} of lengths n and m respectively are transformed to data tested with the KStest. (a) Observations from contig c_{1} are translated and reflected on the xaxis while observations from contig c_{2} are translated. The two sample KS statistic will indicate high similarity in read distribution. (b) Strange placement of linked reads occur. Several explanations are possible. One possible explanation is that contig c_{2} is misassembled (chimeric) and another explanation is that c_{2} is a correctly assembled contig with small repeated regions solved on assembly level. The repeat might not be present in other contigs from the assembly and therefore, the alignments to these regions are reported as unique. Contig c_{2} is however not close to the to contig c_{1} on the genome and linked reads fail to place at the nonrepeated regions on c_{2}. The KS test will indicate low similarity
Let n read pairs link two contigs c_{1} and c_{2} where c_{1} is of length m. Recall that are the ith observations on c_{1} and c_{2} respectively. Let , where is the mean observation on c_{1} and similar for . Samples y^{i} are therefore the transformed observations as seen in Figure 2. The empirical distribution of samples is given by
where is the indicator function. Let and be the two empirical distributions for the samples on c_{1} respectively c_{2}, the twosample KS statistic of the observations is then obtained by
It follows that D_{n }∈ [0,1] since this quantity measures the largest distance between the two empirical cumulative distributions. The similarity score is defined as π_{ζ}= 1  D_{n}.
Scoring edges:
To sum up the two previous subsections, we first derived a score for the ratio of the expected to the observed standard deviation of distance for links spanning a gap. Secondly, we gave a similarity score of expected to observed dispersity of links spanning a gap using the twosample KS statistic. The total score of a gapedge is defined as
By definition, we have 0 ≤ π ≤ 2. We have used the heuristic cutoff 0.5 which means that if any of the two quantities deviates more than twice from the assumed value, the score is set to 0, i.e., the edge is discarded from the graph. π is only calculated on edges where c_{i},c_{j }> μ + 4 σ. That is, any vertex that has more than 2 neighbors in this subgraph is considered to be involved in a region with linking errors since by the constraints of the contig lengths, the library should not be able to link more than one such contig. The score is used to choose the edge with links that best resemble the library. If the two highest scores in such a region are close to each other (their ratio higher than 0.9 set heuristically), we chose to not make a decision. This can for instance occur from larger repeated contigs. This approach finds a mapping ϕ on , representing a scaffolding, in time.
Note about scoring edges
It might be inviting to start using pvalues that can be estimated from the distributions we have defined. This would lead to a statistical test for keeping or discarding edges in the scaffold graph. However, this is not suitable for the problem in hand. Fewer links between contigs gives more uncertainty (leading to volatile pvalues) and can lead to inability to discard many edges with low link support. Edges with many links would also be sensitive to smaller aberrations by increased sensitivity of statistical testing with larger sample sizes. In the case of multiple edges from a contig, we want to compare edges to see which observations have matching distributions. Comparing significance levels of pvalues to make this decision is bad practice since pvalues are nonlinear transforms of data that should only be interpreted under the null hypothesis. That is, the pvalue is a measure of evidence; it is not an estimate of effect size. Looking at the similarity ratio for π_{σ} and the KS statistic for π_{s} provides a measure that is robust to the number of links and can be used to measure the fit of data when a decision is needed.
Including smaller contigs
Small contigs are defined as having a length less than μ + 4 σ, i.e., all contigs not treated in the previous section. This limit varies with respect to the current library and is used to efficiently create linear scaffolds (as explained in previous section). There are limitations when scaffolding with contigs of size over a particular threshold. Firstly, one will have gaps in these scaffolds where shorter contigs could be placed. Secondly, several small contigs can occur between two large scaffolds making read pairs unable to link them together. We address this issue as follows.
In graph theory, a simple path in is a path without repeated vertices. Let a connection between two large contigs c_{a} and c_{b} in be defined as a simple path starting at c_{a} and ending at c_{b} with the rest of its vertices as small contigs. For a connection γ consisting of a sequence of n contigs {c_{1},…,c_{n}}, we define g(c_{i}) to be the number of links that goes from c_{i} to any other contig c_{j }∈ γ. Similarly, let b(c_{i}) be the number of links that go from c_{i} to any other contig c_{j }∉ γ. The notation of g and b are chosen to indicate “good” links and “bad” links respectively, for a given connection γ. The score of γ is defined as
i.e., the sum of differences of good and bad links for all contigs belonging in γ. An example region of connections is shown in Figure 3. To find connections between two large contigs c_{a},c_{b}, we use breadthfirst search in . If more than one connection is found, the highest scoring one is chosen if the score is positive.
Figure 3. Small contigs. An example of a region in containing smaller contigs. There are 5 possible paths to connect c_{1} and c_{6}. The highest scoring one is [c_{1},c_{4},c_{5},c_{6}], with , giving π_{p}= 34 and it is the selected path between c_{1} and c_{6}. In this path the good edges are represented as solid lines and bad edges are represented as dotted lines. A lowerscore alternative is [c_{1},c_{2},c_{4},c_{5},c_{6}] with (g(c_{i}),b(c_{i}))=(47,18), giving π_{p}= 29. c_{2} is a problematic contig that can be chimeric or consists of repeated sequence(s). The three remaining paths, all of them with negative score, are [c_{1},c_{2},c_{3},c_{8},c_{6}],[c_{1},c_{4},c_{6}],[c_{1},c_{2},c_{4},c_{6}].
If c_{a} and c_{b} is within an already created scaffold, the algorithm will look for paths with length less than d + 2 σ, where 2 σ allows for uncertainty of the estimate of d. If c_{a} and c_{b} are not within an already created scaffold, there is no distance constraint on the length of the connection. For dense regions in , there can be an exponential blowup in the number of possible connections. We have set a threshold limiting the breadthfirst search to 1000 iterations. This restricting threshold is motivated by two reasons. Firstly, dense regions are likely to be caused by spurious edges. Thus, paths created in these regions will often have a negative score. Secondly, we have seen from large datasets that correct connections tend to be short. This makes higher ordered layers in the breadth first search contain connections with negative score or paths that does not lead to a valid end vertex (to form a connection).
All connections with a positive score are used to improve the contiguity of the scaffolds. First, gaps within existing scaffolds are filled if a connection with a positive score has been found. In a second step, connections between scaffolds are considered. The extension starts with the highest scoring connection first and proceeds in a descending order of the score. If a contig is found in multiple connections with positive scores, it is only used in the one with the highest score.
Using multiple libraries
If given multiple libraries, BESST uses these libraries in an increasing order of library insert size. Scaffolds created in earlier steps are seen as contigs for the next library.
Implementation
BESST source code is available under the GNU GPL v3 license. It is implemented in Python using Networkx [25] graph library to represent the scaffolding graph and pysam [26] for parsing BAM files. As input BESST takes contigs in a FASTA file and the alignments of the paired reads to the contigs as sorted BAM files. BESST can use several pairedread libraries with different insert sizes. The main output consists of scaffolds in a FASTA file. If several libraries are used, a scaffold FASTA file is given as output in each scaffolding step. The output also consists of AGP and GFFfiles that contain information about the scaffolds, such as position of each contig in the scaffold and length of the gaps. The contigs that were classified as repeats are output in a separate FASTA file.
Preprocessing of
When initializing , BESST computes different statistics of the read library such as mean and standard deviation of insert size (μ,σ) and of the coverage (μ_{c},σ_{c}). Links with inconsistent insert size defined as o_{1}+ o_{2}> μ + lσ are not considered in the scaffolding since they are likely to be placed on chimeric contigs or misaligned. Here, l is a user defined constant which defaults to 6.
BESST removes the contigs that, based on coverage, behave as repeats. A contig c_{i} is classified as a repeat if coverage of c_{i} is larger than max {2 μ_{c},μ_{c}+ tσ_{c}}, where t is calculated with the same principle as computing k for π_{ζ}.
Data and optional parameters as input to BESST
BESST uses alignments of paired reads to contigs in format of sorted BAM files. A read aligner such as BWA or Bowtie [27,28] can be used to map the paired reads in forward reverse mode. We use those read pairs whose both ends map to a unique position in the collection of contigs.
Several parameters for BESST can optionally be set on the command line:
• μ and σ can be specified instead of being computed internally. This can be good if the assembly is very fragmented.
• Minimum number of links needed to create an edge (with 5 as default value).
• Coverage cutoff for repeat identification.
• Duplicate read remover (based on identical map positions of both fragments in a paired read).
• The inclusion of small contigs can be inactivated.
Availability
Implementation of BESST is provided at https://pypi.python.org/pypi/BESST webcite and https://github.com/ksahlin/BESST webcite.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
The contributors are listed in order of significance. Designing the algorithm: KS, BN and FV. Developed the program: KS. Performed the data analysis: FV, KS. Wrote the manuscript: KS, LA. Supervised the project: LA, BN, JL. All authors read and approved the final manuscript.
Acknowledgments
This work was in part funded by the Swedish Research Council (grant 20104634) and the Spruce Genome Project grant from the Knut and Alice Wallenberg Foundation.
We thank Lex Nederbragt and Ole Kristian Tørressen for comments and suggestions that greatly improved the manuscript and Carly Schott for help with editing.
References

Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

MacCallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J, McKernan K, Ranade S, Shea TP, Williams L, Young S, Nusbaum C, Jaffe DB: ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads.

Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies.

Nagarajan N, Read TD, Pop M: Scaffolding and validation of bacterial genome assemblies using optical scaffolding and validation of bacterial genome assemblies using optical restriction maps.

Mortazavi A, Schwarz E, Williams B, Schaeffer L, Antoshechkin I, Wold B, Sternberg P: Scaffolding a Caenorhabditis nematode genome with RNAseq.

Nystedt B, Street NR, Wetterbom A, Zuccolo A, Lin YC, Scofield DG, Vezzi F, Delhomme N, Giacomello S, Alexeyenko A, Vicedomini R, Sahlin K, Sherwood E, Elfstrand M, Gramzow L, Holmberg K, Hällman J, Keech O, Klasson L, Koriabine M, Kucukoglu M, Käller M, Luthman J, Lysholm F, Rilakovic N, Ritland C, Sena J, Niittylä T, et al.: The Norway spruce genome sequence and conifer genome evolution.

Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA: GAGE: a critical evaluation of genome assemblies and assembly algorithms.

Hunt M, Newbold C, Berriman M, Otto T: A comprehensive evaluation of assembly scaffolding tools.

Huson DH, Reinert K, Myers EW: The greedy pathmerging algorithm for contig scaffolding.

Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding preassembled contigs using SSPACE.

Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus.

Dayarian A, Michael TP, Sengupta AM: SOPRA: scaffolding algorithm for paired reads via statistical optimization.

Gao S, Sung WK, Nagarajan N: Opera: reconstructing optimal genomic scaffolds with highthroughput pairedend sequences.

Roy RS, Chen KC, Sengupta AM, Schliep A: SLIQ: Simple linear inequalities for efficient contig scaffolding.

Salmela L, Mäkinen V, Välimäki N, Ylinen J, Ukkonen E: Fast scaffolding with small independent mixed integer programs.

Gritsenko AA, Nijkamp JF, Reinders MJ, de Ridder D: GRASS: a generic algorithm for scaffolding nextgeneration sequencing assemblies.

Sahlin K, Street N, Lundeberg J, Arvestad L: Improved gap size estimation for scaffolding algorithms.

Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol I, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, et al.: Assemblathon 1: a competitive assessment of de novo short read assembly methods.

Vezzi F, Narzisi G, Mishra B: Featurebyfeature–evaluating de novo sequence assembly.

Vezzi F, Narzisi G, Mishra B: Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons.

Miller JR, Koren S, Sutton G: Assembly algorithms for nextgeneration sequencing data.

Ribeiro FJ, Przybylski D, Yin S, Sharpe T, Gnerre S, Abouelleil A, Berlin AM, Montmayeur A, Shea TP, Walker BJ, Young SK, Russ C, Nusbaum C, MacCallum I, Jaffe DB: Finished bacterial genomes from shotgun sequence data.

Picard [http://picard.sourceforge.net webcite]

Kolmogorov AN: Sulla determinazione empirica di una legge di distribuzione (On the empirical determination of a distribution law).
Giornale dell’Istituto Italiano degli Attuari 1933, 4:8391.

Networkx [http://networkx.lanl.gov/ webcite]

Li H, Durbin R: Fast and accurate longread alignment with BurrowsWheeler transform.

Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memoryefficient alignment of short DNA sequences to the human genome.