Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146, Hamburg, Germany
Abstract
Background
Ongoing improvements in throughput of the next-generation sequencing technologies challenge the current generation of de novo sequence assemblers. Most recent sequence assemblers are based on the construction of a de Bruijn graph. An alternative framework of growing interest is the assembly string graph, not necessitating a division of the reads into
Results
Here we present efficient methods for the construction of a string graph from a set of sequencing reads. Our approach employs suffix sorting and scanning methods to compute suffix-prefix matches. Transitive edges are recognized and eliminated early in the process and the graph is efficiently constructed including irreducible edges only.
Conclusions
Our suffix-prefix match determination and string graph construction algorithms have been implemented in the software package Readjoiner. Comparison with existing string graph-based assemblers shows that Readjoiner is faster and more space-efficient. Readjoiner is available at
Background
The
The introduction of the massively parallel next-generation DNA sequencing technologies has led to a considerable increase in the amount of data typically generated by sequencing experiments. For example, as of January 2012, the HiSeq2000 sequencer of Illumina delivers sets of 100 bp reads with a total length of up to 600 Gbp
The computation of the overlap graph is the most time- and space-consuming of the three phases, and was considered a bottleneck in the computation. Therefore, alternative methods were developed avoiding an explicit overlap computation. An approach which proved to be effective is based on the enumeration of all
The de Bruijn graph describing the
Edena
A more space-efficient approach to the string graph construction has been presented in
Recently, a compact representation for exact-match overlap graphs has been described in
In this paper, we present new efficient algorithms for the computation of irreducible suffix-prefix matches and the construction of the assembly string graph. These are implemented in a new string graph-based sequence assembler
All string graph-based assemblers aim at constructing the same graph. However, the algorithms and data structures employed in Edena, LEAP, SGA and
Methods
Basic definitions
Let
A read
Computing suffix- and prefix-free read sets
The first step of our approach for assembling a collection of reads is to eliminate reads that are prefixes or suffixes of other reads. Here we describe a method to recognize these reads. Consider an ordered set
We define a binary relation ≺ on
To obtain a prefix- and suffix-free set of reads we lexicographically sort all reads using a modified radixsort for strings, as described in
During the sorting process, the length of the longest common prefix (lcp) of two lexicographically consecutive reads is calculated as a byproduct. For two lexicographically consecutive reads
To handle reverse complements and to mark reads which are suffixes of other reads, one simply applies this method to the multiset
In a final step of the algorithm one eliminates all reads from
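The sort-and-compare idea can be sketched in a few lines of Python. This is a minimal illustration, not the Readjoiner implementation: it uses Python's built-in sort instead of a modified radixsort, replaces the lcp comparison by an equivalent `startswith` test, and detects suffixes by plain string reversal rather than through reverse complements. The function names are hypothetical.

```python
def proper_prefix_reads(reads):
    """Return the reads that are a prefix of some other, different read."""
    ordered = sorted(set(reads))
    # In lexicographic order, a read that is a prefix of any other read is
    # also a prefix of its immediate successor, so one scan of neighbours
    # suffices (this mirrors the test "lcp equals the read length").
    return {cur for cur, nxt in zip(ordered, ordered[1:]) if nxt.startswith(cur)}


def prefix_and_suffix_free(reads):
    """Remove duplicates and reads that are prefixes or suffixes of other reads."""
    unique = set(reads)
    prefixes = proper_prefix_reads(unique)
    # Assumption for this sketch: a suffix of a read is a prefix of the
    # reversed read, so the same scan is reused on reversed strings.
    suffixes = {r[::-1] for r in proper_prefix_reads(r[::-1] for r in unique)}
    return sorted(unique - prefixes - suffixes)
```

For the reads `ACGT`, `AC`, `GT`, `CGTA`, `ACGT`, the duplicate, the prefix `AC` and the suffix `GT` are discarded, leaving `ACGT` and `CGTA`.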
Computing suffix-prefix matches
Suppose that
The method to solve the
Consider a suffix-prefix match 〈
To enumerate the set of all suffix-prefix matches of length at least ℓ_{min}, we preprocess all reads and determine all proper suffixes of the reads which may be involved in a suffix-prefix match. More precisely, for all reads
The set of all matching candidates and all reads forms the
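As a point of reference for the notion of suffix-prefix matches of length at least ℓ_{min}, the following brute-force sketch enumerates them directly from the definition. It takes quadratic time in the number of reads and is only meant for checking a faster implementation on small inputs; the function name and the tuple layout `(i, j, length)` are assumptions made for this illustration.

```python
def suffix_prefix_matches(reads, min_len):
    """Return all (i, j, length) such that the suffix of reads[i] of the given
    length equals the prefix of reads[j] of the same length, with i != j."""
    matches = []
    for i, r in enumerate(reads):
        for j, s in enumerate(reads):
            if i == j:
                continue
            # Only proper suffixes and proper prefixes are considered,
            # so the match length stays below both read lengths.
            for length in range(min_len, min(len(r), len(s))):
                if r[-length:] == s[:length]:
                    matches.append((i, j, length))
    return matches
```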
An efficient algorithm for identifying and sorting all SPM-relevant suffixes
The first two phases of our algorithm follow a strategy that is borrowed from the counting sort algorithm
In contrast to counting sort, our algorithm uses an extra sorting step to obtain the final order of elements presorted in the insertion phase. Under the assumption that the maximum read length is a constant (which does not imply that the reads are all of the same length), our algorithm runs in
We first give a description of our algorithm using string notation. In a separate section, we explain how to efficiently implement the algorithm. In the following, we only consider the reads in the forward direction. However, it is not difficult to extend our method to also incorporate the reverse complements of the reads and we comment on this issue at the end of the methods section.
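The counting-sort-like strategy with an extra sorting step can be sketched as follows. This is a simplified illustration under stated assumptions: it sorts all suffixes of length at least `min_len` (without the filtering to SPM-relevant suffixes), keys the buckets by the initial substring of length `k` with `k <= min_len`, and stores counts in a dictionary rather than in an array. All names are hypothetical.

```python
from itertools import accumulate


def bucket_sort_suffixes(reads, k, min_len):
    """Sort suffixes (read index, offset) of length >= min_len; needs k <= min_len."""
    suffixes = [(i, p) for i, r in enumerate(reads)
                for p in range(len(r) - min_len + 1)]

    def kmer(suffix):
        i, p = suffix
        return reads[i][p:p + k]

    # Counting phase: number of suffixes per initial k-mer.
    counts = {}
    for suffix in suffixes:
        counts[kmer(suffix)] = counts.get(kmer(suffix), 0) + 1

    # Partial sums: end position of each bucket in the final table.
    keys = sorted(counts)
    ends = dict(zip(keys, accumulate(counts[key] for key in keys)))

    # Insertion phase: fill every bucket from its end towards its start.
    table = [None] * len(suffixes)
    for suffix in suffixes:
        ends[kmer(suffix)] -= 1
        table[ends[kmer(suffix)]] = suffix

    # Extra sorting step: each bucket is sorted independently.
    for key in keys:
        start, stop = ends[key], ends[key] + counts[key]
        table[start:stop] = sorted(table[start:stop],
                                   key=lambda s: reads[s[0]][s[1]:])
    return table
```

Because buckets are sorted independently, only one bucket needs to be held in fast memory at a time, which is the property the partitioning strategy relies on.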
The
In the next step, a linear scan of the sorted
The counts for the elements in
Up until now, only the initial
The next task is to process a suffix, say
We propose an efficient method that works as follows: Store each
This simultaneous linear scan of
Once all reads have been processed, for any initial
After all
To sort the
Sorting all remaining suffixes and computing the lcp-table
Supplemental Material. This document describes implementation techniques for the methods and algorithms described in the main document. Moreover, it gives a lemma and a theorem (including proofs) characterizing transitive SPMs, and an algorithm to enumerate irreducible and non-redundant suffix-prefix matches. Furthermore, a method to recognize internally contained reads is given, as well as results for a benchmark set with reads of variable length. Finally, an example of SPM-relevant suffixes and their corresponding lcp-interval is presented.
All in all, our algorithm runs in
Implementation
We will now describe how to efficiently implement the algorithm described above. An essential technique used in our algorithm is the use of integer codes for
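One common way to realize integer codes in sequence processing is a 2-bit encoding of fixed-length substrings over the DNA alphabet, updated in constant time when the window shifts by one position. The sketch below assumes this encoding and the alphabet order A < C < G < T; it is an illustration, not necessarily the exact encoding used in Readjoiner.

```python
# Assumed 2-bit codes; the order A < C < G < T makes integer order agree
# with lexicographic order for substrings of equal length.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_codes(read, k):
    """Return the integer codes of all length-k substrings of read, left to right."""
    mask = (1 << (2 * k)) - 1
    code = 0
    codes = []
    for pos, base in enumerate(read):
        # Shift in the new base and drop the base that left the window.
        code = ((code << 2) | CODE[base]) & mask
        if pos >= k - 1:
            codes.append(code)
    return codes
```

With such codes, comparing two substrings of length `k` reduces to a single integer comparison, and a code can serve directly as an index into a table of counts.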
Besides
We implement
The sets
When determining the
We implement the counts by a byte array of size
The partial sums in table
For the insertion phase we need a representation of the read set (2
Although the data structures representing tables
The four tables that can be split over
An obvious disadvantage of the partitioning strategy (with, say
The expected size of a bucket to be sorted after the insertion phase is smaller than the average read length. The maximum bucket size (determining the space requirement for this phase) is 1 to 2 orders of magnitude smaller than
An efficient algorithm for computing suffix-prefix matches from buckets of sorted SPM-relevant suffixes
The input to the algorithm described next is a bucket of sorted SPM-relevant suffixes, with the corresponding table
Note that the bucket-wise computation does not deliver the lcp-values of pairs of SPM-relevant suffixes on the boundary of the buckets. That is, for all
The suffixes occurring in a bucket will be processed in nested intervals, called lcp-intervals, a notion introduced for enhanced suffix arrays by
•
•
•
•
We will also use the notation ℓ − [
An lcp-interval ℓ′ − [
This parent–child relationship of lcp-intervals with other lcp-intervals and singleton intervals constitutes a virtual tree which we call the
Abouelhoda et al. (
• A stack stores triples (ℓ,
•
•
•
• ⊥ stands for an undefined value.
•
•
•
Algorithm 1. Bottom-up traversal algorithm for arrays of SPM-relevant suffixes. This is an extension of [18, Algorithm 4.4] with the additional lines marked as new.
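The stack-based bottom-up traversal underlying this kind of algorithm can be sketched as follows. The sketch covers only the basic enumeration of lcp-intervals from an lcp-table, without the additional lines of Algorithm 1. It assumes that `lcp[i]` is the length of the longest common prefix of the sorted suffixes `i − 1` and `i`, with `lcp[0] = 0`; the function name is hypothetical.

```python
def lcp_intervals(lcp):
    """Return all lcp-intervals as (lcp value, left boundary, right boundary),
    children before their parents; the root interval with value 0 comes last."""
    n = len(lcp)
    stack = [(0, 0)]          # pairs (lcp value, left boundary); root at the bottom
    result = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1   # sentinel that flushes the stack at the end
        lb = i - 1
        while stack and cur < stack[-1][0]:
            # The interval on top of the stack ends at position i - 1.
            value, lb = stack.pop()
            result.append((value, lb, i - 1))
        if i < n and cur > stack[-1][0]:
            # A deeper interval starts at the left boundary inherited from
            # the last popped child, or at i - 1 if nothing was popped.
            stack.append((cur, lb))
    return result
```

Each interval is reported exactly once, after all intervals nested in it, so per-interval processing can rely on the results for its children being available.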
Depending on the application, we use different functions
Additional file
Consider a path in the lcp-interval tree from the root to a singleton interval [
Consider a suffix-prefix match 〈
Algorithm 2. Bottom-up traversal of the lcp-interval tree enumerating suffix-prefix matches.
Whenever a terminal edge for read
As soon as all edges outgoing from
The lcp-interval tree for
Handling reverse complements of reads
Reads may originate from both strands of a DNA molecule. For this reason, suffix-prefix matches must also be computed between reads and reverse complements of other reads. Handling the reverse complements of all reads is conceptually easy to integrate into our approach: One just has to process
The three steps which involve scanning the reads are extended to process both strands of all reads. This does not require doubling the size of the read set representation, as all information for the reverse complemented reads can efficiently be extracted from the forward reads. Additional file
The scan of the reverse complemented reads has a negligible impact on the runtime. Of course, the size of the table
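The reason why the reverse complemented reads need not be stored can be illustrated with a small sketch: every suffix of the reverse complement of a read is the reverse complement of a prefix of the forward read. The code below works on plain strings for clarity (a real implementation would operate on an encoded read set); the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return read.translate(COMPLEMENT)[::-1]


def rc_suffix(read, length):
    """Suffix of the given length of the reverse complement of read,
    derived from the forward read only."""
    # The last `length` characters of the reverse complement correspond to
    # the first `length` characters of the forward read.
    return reverse_complement(read[:length])
```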
When computing suffix-prefix matches some minor modifications are necessary: Applying Algorithm 2 to
•
•
•
For any
Recognition of transitive and irreducible suffix-prefix matches
For the construction of the string graph, we do not need transitive
Example of a transitive suffix-prefix match. A set of three reads with a transitive SPM 〈
The following theorem characterizes an
Illustration of transitivity of a suffix-prefix match. Schematic illustration of transitivity.
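The geometric notion of transitivity can be checked directly on a small set of exact suffix-prefix matches. The sketch below uses the usual layout argument, stated here as an assumption: an SPM from r to t of length ℓ is implied by an SPM from r to s of length ℓ′ and an SPM from s to t of length ℓ″ if ℓ′ + ℓ″ = |s| + ℓ. It is a brute-force check on an explicit list of SPMs, not the characterization of Theorem 1, and the names are hypothetical.

```python
def transitive_spms(spms, read_lengths):
    """Return the SPMs (r, t, length) that are implied by two other SPMs
    (r, s, l1) and (s, t, l2) with l1 + l2 == read_lengths[s] + length."""
    by_source = {}
    for r, s, length in spms:
        by_source.setdefault(r, []).append((s, length))
    spm_set = set(spms)
    transitive = set()
    for r, s, l1 in spms:
        for t, l2 in by_source.get(s, []):
            # Overlap of r and t in the layout where s bridges both reads.
            implied = l1 + l2 - read_lengths[s]
            if t != r and (r, t, implied) in spm_set:
                transitive.add((r, t, implied))
    return transitive
```

For the reads `ACGTAC`, `GTACGG` and `ACGGTT`, the match of length 2 between the first and the third read is implied by the two matches of length 4 through the second read, so only the latter two are needed as edges.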
Theorem 1. Let 〈
The proof of Theorem 1 can be found in Additional file
If the
Theorem 1 suggests a way to decide the transitivity of an
Due to the bottomup nature of the traversal in Algorithm 2, the
From Theorem 1 one can conclude that the first
Transitivity and left contexts. Let the SPM 〈
Recognition of internally contained reads
At the beginning of the methods section we have shown how to detect reads which are prefixes or suffixes of other reads. When constructing the string graph we also have to discard internally contained reads, which are contained in other reads without being a suffix or a prefix. More precisely,
Construction of the assembly string graph
Consider a read set
For each
For each nonredundant irreducible
1. if
•
•
2. if
•
•
3. if
•
•
In our implementation of the string graph, vertices are represented by integers from 0 to 2
To output the contigs, we first write references (such as read numbers and edge lengths) to a temporary file. Once this is completed, the memory for the string graph is deallocated, and the read sequences are mapped into memory. Finally, the sequences of the contigs are derived from the references and the contigs are output.
To verify the correctness of our string graph implementation and to allow comparison with other tools, we have implemented the graph cleaning algorithms described in
Results
The presented methods for constructing the string graph and the subsequent computation of contigs have been implemented in a sequence assembler named
•
•
•
Experimental setup
For our benchmarks, the 64-bit
All tests were performed on a computer with a 2.40 GHz Intel Xeon E5620 4-core processor, 64 GB RAM, under a 64-bit Linux operating system, using a single core only.
For memory usage measurements, we monitored the VmHWM (“high water mark”) value in the /proc file system
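On Linux, the VmHWM value of a process can be read from its `status` file in the /proc file system, where it is reported in kB. The following sketch shows one way to do this; the function names are hypothetical, and the parsing is separated from the file access so that it can be checked on any platform.

```python
def parse_vm_hwm(status_text):
    """Extract the VmHWM value (peak resident set size, in kB) from the
    contents of a /proc/<pid>/status file; return None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith("VmHWM:"):
            # The line has the form "VmHWM:     2048 kB".
            return int(line.split()[1])
    return None


def peak_memory_kb(pid="self"):
    """Read the VmHWM value of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_hwm(status_file.read())
```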
For all runs of
Human genome sequencing simulations
We tested our assembler on simulated error-free sequencing read sets based on human genomic sequences (latest available release of GRCh37). For each human chromosome we prepared a template sequence by deleting ambiguity symbols. Then we simulated reads by pseudo-random sampling of the template sequence and its reverse complement, until the desired number of reads was obtained. This was done using the
From each of the 24 human chromosome sequences, we generated a separate read set with 20× coverage and a constant read length of 100 bp. The read sets are called
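The sampling procedure can be sketched as follows. This is a hypothetical stand-in for the simulation tool used in the experiments: the function name, the seed parameter and the choice of a fair coin for the strand are assumptions. The number of reads follows from coverage × template length / read length, which for a 34.9 Mbp template, 20× coverage and 100 bp reads gives about 7.0 million reads, in line with the c22 read set.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def simulate_reads(template, read_len, coverage, seed=0):
    """Sample error-free reads of constant length from a template sequence
    and its reverse complement until the desired coverage is reached."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    n_reads = coverage * len(template) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(template) - read_len + 1)
        read = template[start:start + read_len]
        if rng.random() < 0.5:              # assumed: both strands equally likely
            read = read.translate(COMPLEMENT)[::-1]
        reads.append(read)
    return reads
```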
In a first computational experiment, we determined the time vs. space tradeoff of our partitioning strategy, by applying
Influence of the partitioning technique on space and time requirement. Running time and space peak of Readjoiner for the index construction of the c2 dataset with a varying number of parts (from 1 to 9, ℓ_{min} = 45).
The complete
Running time and space peak for Readjoiner for all 24 read sets derived from human chromosomes. Running time (A) and space peak (B) of Readjoiner for all 24 read sets c1, c2, …, c22, cX, cY derived from the human chromosomes (ℓ_{min} = 45). Each dot represents a human chromosome placed on the X-axis according to its length and on the Y-axis according to the running time (A) and the space peak (B) required by Readjoiner to process it. The line was fitted to the dots using the least squares regression command lm from the R-project
Comparison with other string graph-based assemblers
The 64-bit Linux binaries of Edena
As Edena is based on the original string graph construction method proposed by
Results of applying Readjoiner (RJ) and Edena to the datasets c22, c15, c7 (ℓ_{min} = 45).

| | RJ | Edena | Ratio | RJ | Edena | Ratio | RJ | Edena | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Read set | c22 | c22 | | c15 | c15 | | c7 | c7 | |
| Genome size (Mbp) | 34.9 | 34.9 | | 81.7 | 81.7 | | 155.4 | 155.4 | |
| Number of reads (M) | 7.0 | 7.0 | | 16.3 | 16.3 | | 31.1 | 31.1 | |
| Contained reads (K) | 686.4 | 686.4 | | 1665.7 | 1665.7 | | 3103.0 | 3103.0 | |
| Irreducible SPM (M) | 7.2 | 7.2 | | 17.2 | 17.2 | | 36.4 | 36.4 | |
| Overall time (s) | 360 | 4903 | 13.62× | 945 | 13609 | 14.40× | 2035 | 29404 | 14.45× |
| Overall space (MB) | 294 | 2753 | 9.35× | 703 | 6415 | 9.13× | 1331 | 12255 | 9.21× |
| Contigs | 120712 | 120462 | | 254830 | 254111 | | 503446 | 502706 | |
| Total contigs length (Mbp) | 45.7 | 44.7 | | 103.0 | 101.1 | | 198.8 | 195.0 | |
| Assembly N50 (Kbp) | 1.6 | 1.7 | | 2.4 | 2.5 | | 2.3 | 2.4 | |
| Assembly NG50 (Kbp) | 2.7 | 2.7 | | 3.7 | 3.7 | | 3.9 | 3.9 | |
| Longest contig (Kbp) | 41.4 | 41.4 | | 54.2 | 54.2 | | 44.9 | 44.9 | |
SGA
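Results of applying Readjoiner (RJ) and SGA to the datasets c22, c15, c7, c2 (ℓ_{min} = 45).

| | RJ | SGA | Ratio | RJ | SGA | Ratio | RJ | SGA | Ratio | RJ | SGA | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Read set | c22 | c22 | | c15 | c15 | | c7 | c7 | | c2 | c2 | |
| Genome size (Mbp) | 34.9 | 34.9 | | 81.7 | 81.7 | | 155.4 | 155.4 | | 238.2 | 238.2 | |
| Number of reads (M) | 7.0 | 7.0 | | 16.3 | 16.3 | | 31.1 | 31.1 | | 47.6 | 47.6 | |
| Sga index d (K) | | 300 | | | 700 | | | 1350 | | | 2300 | |
| Overall time (s) | 360 | 7508 | 20.86× | 945 | 19334 | 20.46× | 2035 | 39988 | 19.65× | 3185 | 65194 | 20.47× |
| Overall space (MB) | 294 | 383 | 1.30× | 703 | 842 | 1.20× | 1331 | 1568 | 1.18× | 2094 | 2436 | 1.16× |
| Contigs | 120712 | 231594 | | 254830 | 547217 | | 503446 | 1215816 | | 634403 | 1702714 | |
| Total contigs length (Mbp) | 45.7 | 55.9 | | 103.0 | 130.5 | | 198.8 | 266.4 | | 292.2 | 396.1 | |
| Assembly N50 (Kbp) | 1.6 | 0.8 | | 2.4 | 1.0 | | 2.3 | 0.5 | | 3.2 | 1.2 | |
| Assembly NG50 (Kbp) | 2.7 | 2.7 | | 3.7 | 3.7 | | 3.9 | 3.9 | | 4.5 | 4.5 | |
| Longest contig (Kbp) | 41.4 | 41.4 | | 54.2 | 54.2 | | 44.9 | 44.9 | | 52.9 | 52.9 | |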
LEAP implements the methods described in
Results of applying Readjoiner (RJ) and LEAP to the datasets c22, c2, hg20×, hg30×, hg40× (ℓ_{min} = 45). LEAP was not able to process hg30× and hg40× on the test machine with 64 GB RAM.

| | RJ | LEAP | Ratio | RJ | LEAP | Ratio | RJ | LEAP | Ratio | RJ | RJ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Read set | c22 | c22 | | c2 | c2 | | hg20× | hg20× | | hg30× | hg40× |
| Genome size (Mbp) | 34.9 | 34.9 | | 238.2 | 238.2 | | 2861.3 | 2861.3 | | 2861.3 | 2861.3 |
| Number of reads (M) | 7.0 | 7.0 | | 47.6 | 47.6 | | 579.5 | 579.5 | | 869.2 | 1155.3 |
| Overall time | 6 min | 9 min | 1.60× | 53 min | 1 h 36 min | 1.81× | 20 h 4 min | 35 h 56 min | 1.79× | 34 h 9 min | 51 h 16 min |
| Overall space (GB) | 0.3 | 0.9 | 2.99× | 2.0 | 4.0 | 1.98× | 27.9 | 45.6 | 1.63× | 39.8 | 52.0 |
| Contigs | 120712 | 113428 | | 634403 | 630408 | | 3239309 | 11662607 | | 13497497 | 16253905 |
| Total contigs length (Mbp) | 45.7 | 43.1 | | 292.2 | 280.6 | | 2833.1 | 3642.7 | | 4003.9 | 4281.1 |
| Assembly N50 (Kbp) | 1.6 | 1.6 | | 3.2 | 3.0 | | 3.0 | 1.4 | | 1.2 | 0.9 |
| Assembly NG50 (Kbp) | 2.7 | 2.4 | | 4.5 | 3.9 | | 3.0 | 2.5 | | 2.9 | 2.8 |
| Longest contig (Kbp) | 41.4 | 39.4 | | 52.9 | 48.9 | | 63.4 | 58.6 | | 63.4 | 63.4 |
Evaluation of assemblies
In order to assess the quality of the assemblies delivered by the different programs, we used the script assess_assembly.pl of the Plantagora project
Furthermore, assemblies were evaluated using the basic Assemblathon 1 statistics as defined in
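For readers unfamiliar with the N50 and NG50 statistics reported in the tables, the following sketch computes both from a list of contig lengths, using the common definition: the largest length L such that contigs of length at least L cover at least half of the total assembly length (N50) or half of the genome length (NG50). The function name and its interface are assumptions made for this illustration.

```python
def nx50(contig_lengths, reference_length=None):
    """Return N50 if reference_length is None, otherwise NG50 with respect
    to reference_length; return 0 if half of the target is never reached."""
    total = sum(contig_lengths) if reference_length is None else reference_length
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the sum to half of the target.
        if 2 * running >= total:
            return length
    return 0
```

Since NG50 is normalized by the genome size rather than by the assembly size, it is not inflated or deflated by the many short, partly redundant contigs that make the total contig length exceed the genome size, which explains why N50 and NG50 can differ substantially for the same assembly.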
Metrics of the assemblies of the dataset c22 as delivered by Readjoiner (RJ), SGA, Edena and LEAP (

| Metric | RJ | SGA | Edena | LEAP |
|---|---|---|---|---|
| Assemblathon metrics | | | | |
| Number of contigs | 120712 | 231594 | 120462 | 113428 |
| Genome size (bp) | 34894545 | 34894545 | 34894545 | 34894545 |
| Total contigs length | 45667531 | 55880641 | 44737441 | 43099113 |
| - as % of genome | 130.87 | 160.14 | 128.21 | 123.51 |
| Mean contig size | 378.32 | 241.29 | 371.38 | 379.97 |
| Median contig size | 132 | 101 | 120 | 117 |
| Longest contig | 41352 | 41352 | 41352 | 39379 |
| Shortest contig | 102 | 100 | 100 | 101 |
| Contigs > 500 bp | 13467 (11.16%) | 13416 (5.79%) | 13439 (11.16%) | 13430 (11.84%) |
| Contigs > 1 Kbp | 8700 (7.21%) | 8684 (3.75%) | 8696 (7.22%) | 8578 (7.56%) |
| Contigs > 10 Kbp | 264 (0.22%) | 264 (0.11%) | 264 (0.22%) | 228 (0.20%) |
| N50 | 1614 | 815 | 1699 | 1617 |
| L50 | 5684 | 10118 | 5416 | 5488 |
| NG50 | 2737 | 2739 | 2733 | 2461 |
| LG50 | 3120 | 3113 | 3121 | 3429 |
| Plantagora metrics | | | | |
| Covered Bases | 34343945 | 34357693 | 34300114 | 12968118 |
| Ambiguous Bases | 159997 | 154584 | 182952 | 696334 |
| Misassemblies | 4 | 4 | 4 | 3693 |
| Misassembled Contigs | 4 | 4 | 4 | 2344 |
| Misassembled Contig Bases | 1283 | 417 | 1245 | 2797710 |
| SNPs | 104 | 125 | 120 | 46270 |
| Insertions | 5 | 2 | 1 | 2403 |
| Deletions | 43 | 23 | 28 | 5187 |
| Positive Gaps | 2679 | 2471 | 2925 | 26495 |
| Internal Gaps | 0 | 0 | 0 | 21 |
| External Gaps | 2679 | 2471 | 2925 | 26474 |
| - total length | 547408 | 558921 | 589979 | 19064103 |
| - average length | 204 | 226 | 202 | 720 |
| Negative Gaps | 110888 | 218908 | 110811 | 18198 |
| Internal Overlaps | 0 | 0 | 0 | 17 |
| External Overlaps | 110888 | 218908 | 110811 | 18181 |
| - total length | −10247647 | −20078971 | −9424823 | −1859835 |
| - average length | −92 | −92 | −85 | −102 |
| Redundant Contigs | 864 | 1158 | 607 | 6329 |
| Unaligned Contigs | 3262 | 4686 | 3221 | 60563 |
| - partial | 18 | 57 | 21 | 3252 |
| - total length | 462668 | 599320 | 447922 | 27666823 |
| Ambiguous Contigs | 2631 | 3876 | 2619 | 799 |
| - total length | 369284 | 483895 | 366418 | 93102 |
Effect of sequencing errors
In order to assess the efficiency of
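N50 and NG50 values for the Readjoiner and SGA assemblies of Ecoliwithouterrors, Ecoliwitherrors, EcoliSGAcorrected and EcoliSGAcorrected+filtered.

| Read set | N50 (bp), RJ | N50 (bp), SGA | NG50 (bp), RJ | NG50 (bp), SGA |
|---|---|---|---|---|
| Ecoliwithouterrors | 54948 | 54936 | 57213 | 57210 |
| Ecoliwitherrors | 203 | 5110 | 245 | 8645 |
| EcoliSGAcorrected | 38178 | 40002 | 39999 | 40824 |
| EcoliSGAcorrected+filtered | 41872 | 41872 | 41905 | 41903 |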
Discussion and conclusion
In this paper, we presented methods and implementation techniques of a new string graph-based assembler, named
Although the different string graph-based assemblers aim at constructing the same graph, they apply different heuristics to compute a layout from the string graph. The quality of assemblies of simulated datasets was compared using metrics from the Plantagora project
Our main development is a new efficient algorithm to compute all irreducible suffix-prefix matches from which the string graph is constructed. While the basic techniques we use (e.g. integer encodings, suffix sorting, integer sorting, binary search, bottom-up traversal of lcp-interval trees) are mostly well-established in sequence processing, their combination is novel for the considered problem. The different techniques were chosen with the overall goal of performing as few random accesses to large data structures as possible, to obtain algorithms with excellent data locality, which in turn leads to short run times. For most parts of our method, this goal was achieved, mostly due to the partitioning of the set of SPM-relevant suffixes. There are still many random accesses to the representation of the reads, which, however, cannot fully be prevented in an index-based approach.
The problem of computing suffix-prefix matches has long been studied in the literature, mostly with the goal of finding, for each pair of reads
Ohlebusch and Gog
In Edena, suffix-prefix matches are computed using a suffix array. Details of the algorithm or the implementation are not published.
Like Simpson and Durbin
There are two main approaches to the construction of a string graph. The original approach of Myers
An alternative overlap graph representation for exact suffix-prefix matches was introduced in
It is worthwhile to note that the contigs output by LEAP contain many differences with respect to the target sequences they were sampled from. It is not clear to us whether this is an artifact of the method or an implementation issue.
Another efficient way to reduce the space peak for string graph constructions is to recognize transitive
Our comparative tests (Table
We see two reasons for the time advantage of
The minimum match length parameter ℓ_{min} is used to restrict the search to the exact
Among the string graph-based assemblers mentioned here, SGA is the only one that can distribute parts of the computation across multiple threads. Some of the algorithms employed in
Another important issue for future development is the improvement of the assembly quality for real-world data. Here, further preprocessing steps, in particular quality filtering and error detection, are required, as well as the handling of paired read information in the assembly phase.
The present manuscript focuses on the algorithmic approach and implementation of methods for the computation of irreducible suffix-prefix matches and the construction of the string graph. We report our results on error-free datasets: This is in analogy to the first papers describing the methods implemented in SGA
Several error correction strategies have been applied so far: The classical method was to consider approximate suffix-prefix matches of the reads and to correct the resulting contigs in a consensus phase. With large next-generation datasets, the method of choice consists in
Approximate suffix-prefix matching algorithms can be implemented to work on index structures, but the increased search space makes them significantly slower than exact matching algorithms. Among the string graph-based assemblers, only SGA implements an approximate suffix-prefix matching algorithm. Nevertheless, this is not used by default, and the authors recommend using their faster
The fact that
Paired-end reads and mate pairs provide short- and long-range positional information, which is critical for improving the quality of assemblies of eukaryotic genomes. The classical approach consists in using this information for connecting contigs into scaffolds either in a postprocessing phase, which may be integrated in the assembler software, or using a standalone tool, such as Bambus
Availability
The
Authors’ contributions
GG developed most of the methods, implemented
Acknowledgements
We would like to thank Gordon Gremme and Sascha Steinbiss for developing and maintaining the