Skip to main content

SIS: a program to generate draft genome sequence scaffolds for prokaryotes

Abstract

Background

Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements that may exist between the query and reference genomes may result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of inversions given sufficient sequencing coverage and a sufficiently close reference genome.

Results

We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of sis, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that sis has overall better performance.

Conclusions

sis is a new easy-to-use tool to generate contig scaffolds, available both as stand-alone and as a web server. The good performance of sis in our tests adds evidence that large-scale inversions are widespread in prokaryotic genomes.

Background

With the decreasing costs of DNA sequencing it is now very common for prokaryotic genomes to be sequenced at “draft” status only. This means that the generated sequence will be a set of contigs (a contig is a substring of the string over the DNA alphabet that represents the genome sequence). The number of contigs depends on the sequencing fold coverage and DNA sequencing technology, and typically varies between half a dozen to a few hundred.

As of December 9, 2011, the number of draft microbial genome sequences in GenBanka is 2324, compared to 1814 complete sequences. This difference is growing (in favor of draft sequences) over time (on March 15, 2011 the numbers were 1821 and 1485, respectively). Therefore there is an increasing need for tools that can improve the sequencing and assembly results beyond a simple contig set.

One technique that can be used to improve automated assembly results is to generate contig scaffolds from the contig set. A scaffold is an ordered set of contigs, with the desired order being the correct genome order, and with each contig in the correct orientation. Scaffolds help in whole genome comparisons and they can guide the finishing process, showing where the gaps are.

Some references in the literature adopt the term scaffold only for the situation where the contig ordering and orientation is given by paired-end read information [1]. Here we adopt a more general meaning, assuming a scaffold is a contig ordering obtained by any technique and/or additional data, or combination thereof.

There are various techniques that can help to create a scaffold. In addition to paired-end reads, one can use data from physical [2] or optical maps [3, 4]. Information about the order of specific contigs sometimes can be obtained by searching for genes that have been split by the gaps between those contigs.

Another technique, which became popular in the last few years, is to use a reference genome. In this case we assume the query (draft) genome A has a close phylogenetic relative B that has been fully sequenced, and the genome of B can be used to guide the assembly of A and to generate a scaffold as well. This technique is used by programs such as ABACAS [5], fillScaffolds [6], Mauve Aligner [7], OSLay [8], Projector 2 [9], r2cat [10], PGA4genomics [11], and CONTIGuator [12].

When using a reference genome B to create a scaffold for A, one possible problem is the existence of rearrangements in A with respect to B. In prokaryotes, the most common rearrangement is the inversion [13], leading to the ‘X’ patterns that are frequently seen in dotplot graphs representing whole genome alignments of prokaryotic genomes [14] (when inversions are symmetric with respect to the origin of replication).

In this work we present an algorithm for obtaining contig scaffolds that explicitly takes into consideration the presence of inversions in A with respect to B. We do this by searching for inversion signatures in the input contigs. To our knowledge the only existing program that deals directly with inversions is fillScaffolds [6] (FillScaffolds deals with several other kinds of rearrangements besides inversions.). One other program seems to deal at least indirectly with rearrangements: Mauve Aligner [7], which is based on the multiple genome alignment program Mauve [15] and on GRIL [16].

We created program sis as an implementation of our algorithm, and tested it on real draft genomes that have been completed, comparing its performance to seven other programs. The criterion used to select genome sequences for testing was solely the availability of each genome in two formats: incomplete (several contigs) and complete (one contig per replicon). In these tests sis had the best overall performance.

Methods

Definitions

A replicon is a self-replication unit of a genome, such as a chromosome or a plasmid.

We represent a pair of single-replicon genomes by signed permutations, where numbers represent conserved genes or blocks between the two genomes, and the signs represent the plus or minus strand. We represent the reference genome by the identity permutation ι n =[ + 1, + 2, + 3,…, + n], and the query genome by the permutation Π=[Π 1,Π 2,Π 3,…, Π n ], where 1≤| Π i |≤n and | Π i |≠| Π j | if ij.

An inversion ρ(i,j) is a rearrangement event that reverses the order and the signs of a consecutive section of a genome: Πρ(i,j)=[Π 1,…,Π i−1,− Π j ,…,− Π i ,Π j + 1,…, Π n ], such that 1≤ijn. Two consecutive conserved blocks ( Π i ,Π i + 1) are a breakpoint if Π i Π i + 1−1; otherwise the consecutive blocks are an adjacency. A strip is a maximal substring [ Π i ,Π i + 1,…, Π j ] such that every pair ( Π k ,Π k + 1) is an adjacency, for ik<j. We say that an inversion ρ(r,s)acts on a strip [ Π i ,Π i + 1,…, Π j ] if i<rs<j.

An inversion signature (IS) is a breakpoint ( Π i Π i + 1) such that Π i and Π i + 1 have different signs. Two ISs ( Π i Π i + 1) and ( Π j Π j + 1) are an IS pair if Π i =− Π j −1 and Π i + 1=−Π j + 1 + 1 (see Figure 1); our concept of inversion signature is unrelated to a concept of similar name presented in [17].

Figure 1
figure 1

Inversion Signatures. The figure shows two regions with inversion signatures. The region a1a2 in the draft genome matches regions b1 and b3 in the reference genome. b1 and b3 are not consecutive in the reference genome and hence this defines a breakpoint; moreover, because the match between a1 and b1 has a diffferent orientation as compared to the match between a2 and b3 this breakpoint is an inversion signature. The same happens with region a3a4 of the draft genome and regions b2 and b4 of the reference genome. Both ISs are related because they are the result of the same inversion event. Therefore these two ISs constitute an IS pair.

An inversion ρ(i,j) is symmetric if n=i + j−1 (Figure 2). An inversion ρ(i 1,j 1) is nested with respect to an inversion ρ(i 2,j 2) if i 2<i 1j 1<j 2 (Figure 3). An inversion ρ(i,j) is safe with respect to Π if ρ(i,j) acts on a strip [ Π r ,Π r + 1,…, Π s ] of Π (Figure 4). An inversion ρ(i,j) without any restrictions on the values of i and j is a generic inversion (Figure 5).

Figure 2
figure 2

Symmetric Inversions. Three symmetric inversions ρ(3,6), ρ(4,5) and ρ(2,7) applied to the identity permutation ι 8. The final result does not depend on the order of the inversions. The colored circles represent IS pairs created by the inversions.

Figure 3
figure 3

Nested Inversions. Three nested inversions ρ(1,7), ρ(3,6) and ρ(5,5) applied to the identity permutation ι 8. The final result may change if the order of the inversions changes. The colored circles represent IS pairs created by the inversions.

Figure 4
figure 4

Safe Inversions. Four safe inversions ρ(8,8), ρ(1,2), ρ(4,6) and ρ(5,5) applied to the identity permutation ι8. The colored circles represent IS pairs created by the inversions.

Figure 5
figure 5

Generic Inversions. Five generic inversions ρ(1,5), ρ(1,4), ρ(2,3), ρ(7,7) and ρ(3,7) applied to the identity permutation ι8. The white circles represent ISs overwritten by nonsafe inversions.

Any sequence of symmetric inversions can be sorted so that the result is a series of nested inversions, with the same effect on a genome. Every series of nested inversions is a series of safe inversions, but the converse is not true.

Let P 1(n), P 2(n), P 3(n) and P 4(n) denote the sets of all permutations that can be obtained from ι n by applying a series of symmetric inversions, a series of nested inversions, a series of safe inversions, and a series of generic inversions, respectively. Clearly, P 1 (n) P 2 (n) P 3 (n) P 4 (n).

A collection of m contigs from a genome Π is represented by a collection of substrings C1, C2, …, Cm, such that, for 1≤km, C k =[ Π i k ,, Π j k ] or C k =[ Π j k ,, Π i k ], and the intervals in Π that correspond to any two contigs C k 1 and C k 2 cannot have an overlap, that is [ i k 1 .. j k 1 ][ i k 2 .. j k 2 ]=, for 1≤k 1<k 2m.

Algorithm

The algorithm we have developed generates correct scaffolds in the presence of symmetric, nested, or safe inversions (as long as at least one IS is present for every IS pair); in addition it can also produce scaffolds, not necessarily 100% correct, for generic inversions.

The algorithm depends on the following assumptions in order to generate one single correct scaffold: (1) There are no duplicated conserved blocks in either of the two genomes; (2) all contigs are correctly assembled; (3) both genomes contain only one chromosome; and (4) conserved block 01 is in the first position (with + sign) or in the last position (with - sign), in some contig.

Assumption (1) means that there is no conserved block that contains a sequence that is a repeat in any of the two genomes. This is somewhat unrealistic, since all genomes in different degrees have repeats. In this sense, assumption (2) also depends on the lack of repeats. But as the tests below show, these assumptions do not seem to be major obstacles. In particular, repeats account for no more than 13% of errors in contig adjacencies in our tests. We require assumption (3) primarily for simplicity of exposition; in a real setting it is generally possible to decompose scaffold creation for separate replicons as separate problems. Assumption (4) helps simplify the algorithm description. In case block 01 does not follow assumption (4), a remapping of conserved blocks can be made in time O(n) such that the result does follow assumption (4).

The algorithm can detect lack of information, caused by absence of an IS pair in the input or by the overwriting of an earlier (in evolution) IS by a later unsafe inversion, and still generate a scaffold. In such cases the algorithm chooses a contig that has not been positioned yet, and begins generation of a new scaffold (so the result will be a set of scaffolds). When IS pairs are missing there are no guarantees that the scaffolds generated are correct.

The input for the algorithm is a list of contigs, each contig represented by the conserved blocks it contains. Note that we deal only with conserved blocks; insertions and deletions of the query genome with respect to the reference genome are not considered.

We now give a detailed example to illustrate the various concepts as well as to present the algorithm. In this example, there are 40 conserved blocks distributed in 9 contigs as follows:

· [−23,−22, + 19, + 20]

· [−15,−14, + 12, + 13,−11]

· [ + 34, + 35]

· [ + 36, + 37, + 38, + 39, + 40]

· [−03,−02,−01]

· [ + 05, + 06, + 07, + 08]

· [ + 31, + 32, + 33,−30,−29]

· [−28,−27, + 26,−25,−24, + 04]

· [ + 21,−18,−17,−16, + 09, + 10]

The correct order for the blocks in the query genome is given as follows:

Π = [ + 0 1 , + 0 2 , + 0 3 , 2 3 , 2 2 , + 1 9 , + 2 0 , + 2 1 , 1 8 , 1 7 , 1 6 , + 0 9 , + 1 0 , + 1 1 , 1 3 , 1 2 , + 1 4 , + 1 5 , 0 8 , 0 7 , 0 6 , 0 5 , 0 4 , + 2 4 , + 2 5 , 2 6 , + 2 7 , + 2 8 , 3 5 , 3 4 , + 3 1 , + 3 2 , + 3 3 , 3 0 , 2 9 , + 3 6 , + 3 7 , + 3 8 , + 3 9 , + 4 0 ]
(1)

Recall that the reference genome is the identity permutation. The IS pairs are the following:

· [ + 03,−23] × [−04, + 24]

· [−22, + 19] × [ + 21,−18]

· [−16, + 09] × [ + 15,−08]

· [ + 11,−13] × [−12, + 14]

· [ + 25,−26] × [−26, + 27]

· [ + 28,−35] × [−29, + 36]

· [−34, + 31] × [ + 33,−30]

An inspection reveals that the query genome has 7 safe inversions with respect to the reference genome. However, note that only the following ISs are present in the contigs:

· × [−04, + 24]

· [−22, + 19] × [ + 21,−18]

· [−16, + 09] ×

· [ + 11,−13] × [−12, + 14]

· [ + 25,−26] × [−26, + 27]

· × [ + 33,−30]

Note that IS pair 6 is missing from the input contigs.

The algorithm builds the scaffold incrementally, choosing one contig following certain rules and placing it in the scaffold immediately after the previously chosen contig. The first contig chosen is the one that contains block +01 (or -01).

Execution of the algorithm on the example

In the example, block 01 is found inverted in the last position of contig 5 (see Figure 6). Thus, the algorithm reverse complements contig 5 and inserts it as the first component of the scaffold. The next conserved block searched is + 04, which should be in the first position of its contig on the plus strand or in the last position of its contig on the minus strand. Block 04 is found in the last position of contig 8 on the minus strand, which is interpreted by the algorithm to be the result of a safe inversion. The algorithm then uses the IS (−24, + 04) to search for block −23, which should be at the other end of the inversion that created the IS found. Block −23 is found in the first position of contig 1, so this contig becomes the second in the scaffold.

Figure 6
figure 6

Scaffold Generated in the Presence of Safe Inversions. Illustration of the example discussed in the text of a genome with 40 conserved blocks with respect to a reference genome. The query genome and the reference genome differ by seven safe inversions, denoted by the nested horizontal square brackets. The conserved blocks appear in nine contigs. The permutation on the top is the “real” query genome, and the permutation at the bottom is the scaffold reconstructed by the algorithm. The bars with alternating colors help visualize inversion signatures.

The next conserved block to be searched is + 21. It is found at the beginning of contig 9, on the plus strand as expected, so this contig is placed in the third scaffold position. The next block searched is + 11, which is found inverted in the last position of contig 2. This means that contig 2 will be inserted in the fourth scaffold position after it is reverse-complemented. The next block to be searched is + 16, which is found on the minus strand in the middle of the already placed contig 9. The algorithm then uses the IS (−16, + 09) to determine that the next block to be searched is −08. Block −08 is found on the plus strand in the last position of contig 6. The subsequent steps proceed along the same lines. The resconstructed scaffold is [−C 5,−C 1, + C 9,−C 2,−C 6,−C 8,−C 7, + C 3, + C 4]. This scaffold contains errors, because the correct scaffold given by permutation Π is [−C 5,−C 1, + C 9,−C 2,−C 6,−C 8,−C 3, + C 7, + C 4]. This was caused by the fact that from the given input it is impossible to tell that contig 3 should remain as it is and that contig 7 should be reverse complemented. This stems from the fact that the endpoints of IS pair 6 are not to be found in the input contigs. But the algorithm still managed to get 7 correct adjacencies out of 9.

Pseudocode for the algorithm is given in Figure 7. The algorithm assumes that the following data structures are available and filled in: contigs stores contig information and inContig contains conserved block information, storing for each conserved block the position in structure contigs where it can be found, as well as the contig number and the sign (strand). More formally, if contigs [c][p]=x, then inContig [|x|].contig=c, inContig [|x|].pos=p and inContig[|x|]. x=s, such that s×|x|=x. For example, contigs[7] =[ + 31, + 32, + 33,−30,−29], contigs[7][4] =−30, inContig[|-30|].contig = 7, inContig [|-30|].pos = 4 and inContig [|-30|].sign =−1.

Figure 7
figure 7

Algorithm to determine a genome scaffold based on inversion signatures. Data structures used are explained in the text.

In addition to the data structures described above the algorithm uses two auxiliary vectors: used indicates which contigs have been placed in the scaffold; and searched records conserved blocks that have been searched so far. Variable nextC indicates the contig with smallest ID not yet placed in the scaffold.

The main loop in lines 7–39 is executed until all contigs have been placed in the scaffold. Variable ok indicates whether the algorithm found a candidate contig to be placed in the scaffold. Line 9 ensures that the algorithm is not searching for an invalid conserved block. Lines 21–24 check whether the algorithm is searching an already searched conserved block.

If the algorithm decides to insert a new contig in the scaffold (test in line 25) lines 26–39 are executed. If no valid contig was found, the first contig not yet placed is determined in lines 26–28. Lines 29–30 place that contig as the first in a new scaffold. Vector used and variables nextC and i are updated in lines 31–34.

Recalling that n is the number of conserved blocks and m is the number of contigs, with nm, we now state the algorithm’s complexity. Data structure inContig can be built in O(n). Initialization takes O(n + m). The main loop is executed n + m times, since in each iteration either a new position in vector searched is marked or a new contig is placed in the scaffold. There is one operation inside the main loop that does not take constant time, which is the loop in lines 33–34. However, variable nextC can be incremented at most m times, so the amortized cost of each update is O(1). This leads to an overall complexity of O(n).

Testing Methodology

We have implemented the algorithm creating program sis (Scaffolds from Inversion Signatures). We stress that sis does not require the algorithm assumptions listed above in order to be used. We used real contigs available in GenBank for testing as described below.

The main quality measure for a scaffold is the number of correct contig adjacencies. A contig adjacency for contigs x and y is correct if they are consecutive in the (completed) query genome as well as in the generated scaffold set. In the case of circular replicons, all correct scaffolds with m contigs have exactly m correct adjacencies. Even though SIS in general outputs a set of scaffolds, for the purposes of this test we consider the result to be just one scaffold, with scaffolds being placed side by side as the algorithm creates them.

A second quality measure is genome coverage. In this measure we attempt to determine how much of the genome to which the contigs belong is actually covered by the scaffolds generated. The working definition we use is as follows. If both ends of a contig have correct adjacencies, then we count the entire length of that contig as contributing to the coverage. If only one end has a correct adjacency, then half of the contig length is used. If both ends have incorrect adjacencies, then the contig is not used. We define coverage as the ratio of the sum of contig lengths considered per the above rules and the sum of all contig lengths.

In this test we compared sis to seven other scaffold generating programs, namely ABACAS [5], fillScaffolds [6], Mauve Aligner [7], OSLay [8], Projector 2 [9], r2cat [10], and CONTIGuator [12]. Table 1 gives a summary of all programs tested.

Table 1 Programs Tested

It is important to mention that among programs tested fillScaffolds [6] was the only one designed specifically for eukaryote sequences, whereas all of our tests are done on prokaryote sequences. FillScaffolds is a sophisticated tool that takes into account many kinds of rearrangements in addition to inversions, such as transpositions, reciprocal translocations, and chromosome fusions and fissions. We have included fillScaffolds in the present study for completeness.

One important consideration is block conservation detection. In the case of sis this was done by programs nucmer [18] and promer [18], with default settings. We postprocessed the outputs using the MUMmer script delta-filter [18] with parameter −1 , which ensures that only nonrepeated blocks are processed.

Mauve Aligner uses progressiveMauve [15] for sequence comparison. OSLay can use nucmer and that is what we chose (with default settings). r2cat does sequence comparison internally. Projector 2 can use BLAT [19] (default) or BLAST [20], and we chose BLAT. CONTIGuator uses BLAST+ [21]. ABACAS can use nucmer or promer. We chose nucmer. (We observed that choosing promer made ABACAS run for more than two days without providing any scaffold.) For fillScaffolds [6] we have used both nucmer and promer.

Input genomes

We have tested the programs on real contigs from preliminary assemblies of 19 genomes. These genomes have been finished, so that we know the correct order for contigs in all cases. The list of genomes used is given in Table 2. In the table, column ‘cov.’ contains the fraction of each chromosome covered by the contigs selected. Column ‘tech.’ describes de sequencing technology used, when this information was available in the corresponding GenBank file. Four of the genomes in Table 2 have two chromosomes; each chromosome was dealt with separately, thus creating a total of 23 test cases. The genomes chosen were those available from GenBankb on December 7, 2011, for which complete genomes were also available (with at least 4 contigs for each genome). These genomes are quite diverse phylogenetically, representing seven different bacterial groups (where “group” comes from the GenBank table of completed genomes and generally corresponds to taxonomic phylum or class) and one archaeal group. No other filter was applied in this selection. The sequencing technologies represented in this table include at least one representative of a genome sequenced with 454, Solexa/Illumina, and/or ABI SOLiD. The genome sequences we used in our tests (draft and complete) are available at the website http://marte.ic.unicamp.br:8747.

Table 2 Chromosome Sequences Used in Tests

For each of the query chromosomes we obtained a list of the 20 closest genomes (excluding the query genome itself), among 1331 complete prokaryotic genomes available at GenBank on May 5, 2011. We used NUCMi [22] (a variation of MUMi [23]) to compute the distance between each of the 1331 genomes and the query genome. Our rationale for selecting 20 closest other genomes instead of only the closest was as follows. In practice, the choice of a reference genome should be guided by phylogenetic distance. The closest genome to the draft genome is the one most likely to yield best results. On the other hand, given the extremely small sample of the prokaryotic world that has been sequenced to date, it is perfectly possible that new strains will be sequenced in the future for which the closest reference genome will not be particularly close. Therefore it is important to understand how the performance of a scaffolding program changes with increased distance between the query and possible reference genomes.

Results and discussion

We ran each scaffold program on each of the 23 query chromosomes using as reference each of the 20 closest chromosomes to the query chromosome. This resulted in 460 scaffold sets for each program. For a given program and a given set of reference chromosomes, we computed average results over the 23 query chromosomes.

Table 3 shows average results for three cases: when the reference chromosome used was the closest available, then the average (mean) over the 10 closest chromosomes, then the average (mean) over all 20 closest chromosomes. Performance of each program on each test set given by each column in the table. The numbers are percentages of correct adjacencies, averaged over all instances in each column. In bold are the best results. The table is sorted from best to worst in the first column. In all cases the best performance was that of sis (promer). If the median is used rather than the mean, then sis (nucmer) is the best program for the closest case, Mauve is the best for the 10-closest case, and sis (promer) is the best for the 20-closest case (see Additional file 1).

Table 3 Average Performance (Correct Adjacencies)

In order to evaluate the performance of the programs tested with increased distance from the reference chromosome we created the graphs shown in Figures 8 and 9. These graphs show that the performance of all programs degrade with increased distance, but again here sis (promer) stays above the others in all cases

Figure 8
figure 8

Average of Correct Adjacencies. Variation in number of correct adjacencies for each program as the reference genome changes from the closest to the most distant, in a total of 20 reference genomes. fs is fillScaffolds.

Figure 9
figure 9

Average of Correct Adjacencies (Cumulative Version). Variation in number of correct adjacencies for each program as the reference genome changes from the closest to the most distant, in a total of 20 reference genomes. This is a cumulative version of Figure 9. The points plotted correspond to the cumulative averages as more distant genomes are included. fs is fillScaffolds.

The results presented so far are averages. In Table 4 we present results concerning the fraction of the instances in which each program was the winner (provider of the best scaffold). In the table columns do not add up to 100% because of ties. Best results are shown in bold. The table is sorted from best to worst in the first column. Using this measure sis (promer) superiority over the other programs is even more pronounced.

Table 4 Best Scaffolds (Correct Adjacencies)

Tables 5 and 6 are analogous to Tables 3 and 4, respectively, but show the results in terms of coverage (see Testing Methodology). Also under this measure SIS (promer) was the best program.

Table 5 Average Performance (Coverage)
Table 6 Best Scaffolds (Coverage)

It is important to note that the pairs of query-reference genomes tested were not selected for containing inversions. So any given pair query-reference could have any kind of rearrangement. The success of sis in the cases tested suggests that inversions are indeed widespread in prokaryotic genomes (to the extent that genomes in Table 2 are a representative sample).

Additional results

Here we briefly describe additional results for which tables and graphs can be found in the Additional file 1.

Another way of parsing results is to ask the question: which program yields best results for each query chromosome sequence, in terms of number of average correct adjacencies? sis (promer and nucmer) are the best programs under this measure.

Another question to ask is what kind of influence does the number of contigs has on a program’s performance? We have found that all programs obtain their best results for chromosome sequences that have few contigs (between 4 and 25). As for worst results, the outcome was less clear, since it is not always the case that the largest number of contigs results in poorest performance. We also determined that fillScaffolds and r2cat seem to be less sensitive to variation in number of contigs in the input than other programs. These results need to be taken with caution, because the bins we used had a relatively small number of instances in each (between five and seven).

Finally, we investigated the influence of duplications on the performance of SIS in terms of correct adjacencies. We found that on average no more than 13% of incorrect adjacencies are due to duplications.

Running times

Running times of the tests are dominated by the sequence comparison step. For example, in the case of sis, nucmer takes on average 1 minute for a genome pair, and promer takes on average 5 minutes. sis itself takes less than a second to compute a scaffold set. OSLay takes about 3 seconds. After obtaining the conserved blocks, fillScaffolds takes about 3 seconds. r2cat takes between 15 and 20 seconds to determine conserved blocks and about 2 seconds to generate a scaffold. ABACAS takes between 10 and 30 seconds total time. CONTIGuator takes about 2 minutes to determine conserved blocks using BLAST+ and to generate a scaffold. Mauve Aligner takes between 8 and 322 minutes per genome pair (on average, 46 minutes). All these tests were executed on the same machine, a standard desktop computer with Intel Core2 Duo processor, 3.0 GHz and 4 GB RAM. Projector 2 can only be run on a web server. We observed that the total computation time on the server was about 1 minute, when the server appeared to have a light load.

Details of these tests and other tests

In Additional file 1 we present more detailed results of the tests above as well as results using pairs of real genomes and simulated contigs. sis also came out as the best program in these other tests. These results are a refinement of preliminary tests presented in [24].

Availability of SIS

SIS is available as a web server at http://marte.ic.unicamp.br:8747. It can also be freely downloaded as a stand alone program from the same website. SIS generates its scaffolds both as ordered lists of contigs as well as in nucleotide sequence FASTA format.

Conclusions

We have presented a new linear-time algorithm for generating scaffolds for draft genomes, based on the concept of inversion signatures. We implemented this algorithm creating program sis, which is to our knowledge the first scaffold program that explicitly models the biological phenomenon of replicon inversion in prokaryotes. We compared sis to seven other programs, and demostrated that sis achieves better performance relative to these other programs using a real and diverse suite of test cases.

In a real world setting, the scaffolds generated by sis can help in the gap closure process. For example, sis output can easily be used as input to primer generation programs such as Primer3 [25]. If paired-end read data is available, it can be used to check the scaffolds provided and to connect separate scaffolds, thus generating more complete and reliable contig orderings. Ideally, sis should be able to use paired-end read information, but this will require a substantial change to the algorithm presented here, as evidenced by the sophistication of the recent algorithm for paired-end read scaffolding presented by Gao et al. [1]. Another possible improvement in the scaffold generating process is to use several reference genomes instead of only one, under the rationale that some inversions may be missed by using reference genome X but may be detected using some other reference genome Y . An idea similar to this is described for the program PGA4genomics [11].

Endnotes

a http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi b ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT

References

  1. Gao S, Sung WK, Nagarajan N: Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 2011, 18(11):1681–1691. 10.1089/cmb.2011.0170

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  2. Warren RL, Varabei D, Platt D, Huang X, et al.: Physical map-assisted whole-genome shotgun sequence assemblies. Genome Res 2006, 16: 768–775. 10.1101/gr.5090606

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  3. Nagarajan N, Read TD, Pop M: Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 2008, 24: 1229–1235. 10.1093/bioinformatics/btn102

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Valouev A, Zhang Y, Schwartz DC, Waterman MS: Refinement of optical map assemblies. Bioinformatics 2006, 22: 1217–1224. 10.1093/bioinformatics/btl063

    Article  CAS  PubMed  Google Scholar 

  5. Assefa S, Keane TM, Otto TD, Newbold C, Berriman M: ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics 2009, 25: 1968–1969. 10.1093/bioinformatics/btp347

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Munoz A, Zheng C, Zhu Q, Albert VA, Rounsley S, Sankoff D: Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinf 2010, 11: 304. 10.1186/1471-2105-11-304

    Article  Google Scholar 

  7. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT: Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 2009, 25: 2071–2073. 10.1093/bioinformatics/btp356

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  8. Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics 2007, 23: 1573–1579. 10.1093/bioinformatics/btm153

    Article  CAS  PubMed  Google Scholar 

  9. van Hijum, Zomer AL, Kuipers OP, Kok J: Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res 2005, 33: W560–566. 10.1093/nar/gki356

    Article  Google Scholar 

  10. Husemann P, Stoye J: r2cat: synteny plots and comparative assembly. Bioinformatics 2010, 26: 570–571. 10.1093/bioinformatics/btp690

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  11. Zhao F, Hou H, Bao Q, Wu J: PGA4genomics for comparative genome assembly based on genetic algorithm optimization. Genomics 2009, 94: 284–286. 10.1016/j.ygeno.2009.06.006

    Article  CAS  PubMed  Google Scholar 

  12. Galardini M, Biondi EG, Bazzicalupo M, Mengoni A: CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code Biol Med 2011, 6(11):.

    Google Scholar 

  13. Darling AE, Miklós I, Ragan MA: Dynamics of genome rearrangement in bacterial populations. PLoS Genet 2008, 4(7):e1000128. 10.1371/journal.pgen.1000128

    Article  PubMed Central  PubMed  Google Scholar 

  14. Eisen JA, Heidelberg JF, White O, Salzberg SL: Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol 2000, 1(6):research0011.1–0011.9. 10.1186/gb-2000-1-6-research0011

    Article  Google Scholar 

  15. Darling AE, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004, 14: 1394–1403. 10.1101/gr.2289704

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  16. Darling AE, Mau B, Blattner FR, Perna NT: GRIL: genome rearrangement and inversion locator. Bioinformatics 2004, 20: 122–124. 10.1093/bioinformatics/btg378

    Article  CAS  PubMed  Google Scholar 

  17. Swenson KM, Moret BM: Inversion-based genomic signatures. BMC Bioinformatics 2009, 10 Suppl 1: S7.

    Article  PubMed  Google Scholar 

  18. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12. 10.1186/gb-2004-5-2-r12

    Article  PubMed Central  PubMed  Google Scholar 

  19. Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res 2002, 12: 656–664.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  21. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinf 2009, 10: 421. 10.1186/1471-2105-10-421

    Article  Google Scholar 

  22. Dias U, Dias Z, Setubal JC: Two new whole-genome distance measures. In Proceedings of the 6th Brazilian Symposium on Bioinformatics (BSB’2011). , ; 2011:61–64.

    Google Scholar 

  23. Deloger M, El Karoui, Petit MA: A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 2009, 191: 91–99. 10.1128/JB.01202-08

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  24. Dias Z, Dias U, Setubal JC: Using Inversion Signatures to Generate Draft Genome Sequence Scaffolds. In Proceedings of the 2nd ACM International Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB 2011). , ; 2011:39–48.

    Chapter  Google Scholar 

  25. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology 2000, 132: 365–386.

    CAS  PubMed  Google Scholar 

Download references

Acknowledgements

This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (Brazil) [PhD fellowship 2007/05574-4 to UD]; and by Conselho Nacional de Desenvolvimento Científico e Tecnológico (Brazil) [200815/2010-5,483177/2009-1, 473867/2010-9 to ZD].

This work is based on the earlier work: Using Inversion Signatures to Generate Draft Genome Sequence Scaffolds, in Proceedings of the 2nd ACM International Conference on Bioinformatics, Computational Biology and Biomedicine. ACM New York, NY, USA 2011. ACM ISBN: 978-1-4503-0796-3.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to João C Setubal.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JCS had the idea of using inversion signatures and sketched the algorithm on which SIS is based. ZD and UD designed the algorithm and implemented SIS. JCS, ZD, and UD designed the tests. ZD and UD carried out the tests. UD created the webserver version. All authors contributed to the writing of the manuscript.

Electronic supplementary material

12859_2011_5307_MOESM1_ESM.pdf

Additional file 1: Figure S1. Variation in the Number of Correct Adjacencies (Top 1). Variation in the number of correct adjacencies determined by each scaffold program when the reference genome is the closest to the query genome. The diamond is the median. Figure S2. Variation in the Number of Correct Adjacencies (Top 10). Variation in the number of correct adjacencies determined by each scaffold program averaged over the 10 closest genomes to the query genome. The diamond is the median. Figure S3. Variation in the Number of Correct Adjacencies (Top 20). Variation in the number of correct adjacencies determined by each scaffold program averaged over the 20 closest genomes to the query genome. The diamond is the median. Figure S4. Test Cases X Correct Adjacencies (Top 1). Figure S5. Test Cases X Correct Adjacencies (Top 10). Figure S6. Test Cases X Correct Adjacencies (Top 20). Figure S7. Correct Adjacencies X Number of Contigs (Top 1). Figure S8. Correct Adjacencies X Number of Contigs (Top 10). Figure S9. Correct Adjacencies X Number of Contigs (Top 20). Figure S10. Example of Dotplot. Pairwise whole genome comparison of two Pseudomonas species. The comparison was done using nucmer [18]. Figure S11. Mycobacterium (All pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for the complete set (210 pairs) of Mycobacterium genomes. Figure S12. Pseudomonas (All Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for the complete set (153 pairs) of Pseudomonadaceae genomes. Figure S13. Shewanellas (All Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for the complete set (190 pairs) of Shewanella genomes. Figure S14. Xanthomonas (All Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for the complete set (36 pairs) of Xanthomonas genomes. Figure S15. Mycobacterium (Best Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for only those pairs of Mycobacterium genomes that are closest to each other in the second batch of tests. Figure S16. Pseudomonas (Best Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for only those pairs of Pseudomonas genomes that are closest to each other in the second batch of tests. Figure S17. Shewanellas (Best Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for only those pairs of Shewanella genomes that are closest to each other in the second batch of tests. Figure S18. Xanthomonas (Best Pairs). Variation of the distribution of the number of correct adjacencies in the scaffolds generated by the various programs for only those pairs of Xanthomonas genomes that are closest to each other in the second batch of tests. Figure S19. Average (All Pairs). Variation of the distribution of the average number of correct adjacencies in the scaffolds generated by the various programs for the complete set of test instances in the second batch. Figure S20. Average (Best Pairs). Variation of the distribution of the average number of correct adjacencies in the scaffolds generated by the various programs for only those pairs of genomes that are closest to each other in the second batch of tests. (PDF 3 MB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Dias, Z., Dias, U. & Setubal, J.C. SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics 13, 96 (2012). https://doi.org/10.1186/1471-2105-13-96

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-13-96

Keywords