Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Helsinki, 00014, Finland

Abstract

Background

For the development of genome assembly tools, comprehensive and efficiently computable validation measures are required to assess the quality of an assembly. The most widely used measure, N50, summarizes the assembly results by the length of the scaffold (or contig) overlapping the midpoint of the length-ordered concatenation of scaffolds (contigs). Especially for scaffold assemblies it is non-trivial to combine a correctness measure with the N50 values, and the current methods for doing this are rather involved.

Results

We propose a simple but rigorous

Conclusions

We propose and implement a comprehensive and efficient approach to compute a metric that summarizes scaffold assembly correctness and length. Our implementation can be downloaded from

Background

In

We propose a much simpler but still rigorous approach to compute a normalized N50 scaffold assembly metric that combines N50 with a correctness measure; in principle, the assembly is split into as many parts as necessary to align each part to the reference. For example, let reference be

In more detail, one needs to allow mismatches and indels in the alignment so that only the real structural errors in the assembly are measured. Moreover, the gaps between contigs in a scaffold may not be accurate due to variation in insert sizes of the mate pair reads used for the scaffold assembly. Taking these aspects into account, it would be easy to construct a dynamic programming recurrence to find the best scoring alignment for a scaffold, allowing gaps (

We propose a practical scheme of computing an approximation of the normalized N50 metric using the common seed-based strategy: First compute all maximal local approximate matches between scaffolds and reference, then chain those local alignments that retain the order both in reference and in each scaffold. This approach is called

In what follows, we assume that local alignments are given, and first concentrate on modifying co-linear chaining for the case of restricted gaps. We then explain our implementation of the normalized N50 computation, which combines the local alignment computation with gap-restricted co-linear chaining. After that we report an experiment demonstrating how normalized N50 can characterize good and bad scaffold assemblies. A discussion follows on other possible uses and variations of the proposed method.

Methods

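Let us assume that all local alignments between scaffold and reference genome have been computed, and we have a set of tuples $V = \{(x, y, c, d)\}$ such that $T[x..y]$ matches $P[c..d]$, where $T$ is the reference and $P$ is the scaffold. The co-linear chaining problem is to find a sequence of tuples $S = s_1 s_2 \cdots s_p \in V^p$ such that $s_j.y > s_{j-1}.y$ and $s_j.d > s_{j-1}.d$ for all $1 < j \le p$, and the ordered coverage $|\{i \mid i \in [s_j.c, s_j.d] \text{ for some } 1 \le j \le p\}|$ of $P$ is maximized. To solve it, first sort the tuples in $V$ by the coordinate $y$ into a sequence $v_1 v_2 \cdots v_N$. Then, fill a table $C[1..N]$ such that $C[j]$ gives the maximum ordered coverage of $P[1..v_j.d]$ using the tuple $v_j$ and any subset of the tuples $\{v_1, v_2, \ldots, v_j\}$. Hence $\max_j C[j]$ gives the maximum ordered coverage of $P$. To compute $C[j]$, consider as the predecessor of $v_j$ in the chain (a) tuples $v_{j'}$, $j' < j$, that end before $v_j$ starts in $P$, and (b) tuples $v_{j'}$, $j' < j$, that overlap $v_j$ in $P$.

For (a) it holds

$$C^a[j] = (v_j.d - v_j.c + 1) + \max\Bigl(0,\; \max_{j' < j,\; v_{j'}.d < v_j.c} C[j']\Bigr).$$

For (b) it holds

$$C^b[j] = \max_{j' < j,\; v_j.c \le v_{j'}.d < v_j.d} \bigl(C[j'] + v_j.d - v_{j'}.d\bigr),$$

where the maximum over an empty set is $-\infty$.

Then the final value is $C[j] = \max(C^a[j], C^b[j])$. Evaluating the maxima by scanning all $j' < j$ gives an $O(N^2)$ time algorithm, whereas the use of invariant and search tree gives $O(N \log N)$ time: since $C[j'] - v_{j'}.d$ does not depend on $j$, both maxima are range maximum queries over the keys $v_{j'}.d$. In the algorithm below, $\mathcal{T}$ and $\mathcal{I}$ are balanced search trees, Update$(k, w)$ stores value $w$ under key $k$ (keeping the larger value if the key is already present), and RangeMax$(l, r)$ returns the largest value stored under a key in $[l, r]$, or $-\infty$ if there is none.

**Algorithm** co-linear chaining; input: tuples sorted by $y$-coordinate, $v_1, v_2, \ldots, v_N$

(1) Initialize $\mathcal{T}$ and $\mathcal{I}$ as empty; $\mathcal{T}$.Update$(0, 0)$

(2) **for** $j \leftarrow 1$ **to** $N$ **do**

(3) $C^a[j] \leftarrow (v_j.d - v_j.c + 1) + \mathcal{T}$.RangeMax$(0, v_j.c - 1)$

(4) $C^b[j] \leftarrow v_j.d + \mathcal{I}$.RangeMax$(v_j.c, v_j.d - 1)$

(5) $C[j] \leftarrow \max(C^a[j], C^b[j])$

(6) $\mathcal{T}$.Update$(v_j.d, C[j])$

(7) $\mathcal{I}$.Update$(v_j.d, C[j] - v_j.d)$

(8) **return** $\max_j C[j]$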
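The chaining recurrence can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the plain quadratic scan over earlier tuples rather than search trees, and it assumes that each local alignment is a tuple `(x, y, c, d)` meaning that the reference interval `[x, y]` matches the scaffold interval `[c, d]` (inclusive coordinates). The names `Match` and `chain_coverage` are hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical local alignment: reference [x, y] matches scaffold [c, d]."""
    x: int
    y: int
    c: int
    d: int


def chain_coverage(matches: List[Match]) -> int:
    """Maximum ordered scaffold coverage of a co-linear chain (O(N^2) scan)."""
    v = sorted(matches, key=lambda m: m.y)  # process in reference end order
    best = [0] * len(v)                     # best[j]: best chain ending at v[j]
    for j, cur in enumerate(v):
        best[j] = cur.d - cur.c + 1         # chain consisting of cur alone
        for i in range(j):
            prev = v[i]
            # The chain must advance strictly in both reference and scaffold.
            if prev.y >= cur.y or prev.d >= cur.d:
                continue
            if prev.d < cur.c:
                # Case (a): no overlap in the scaffold, cur adds its full length.
                cand = best[i] + cur.d - cur.c + 1
            else:
                # Case (b): overlap in the scaffold, cur adds only the new part.
                cand = best[i] + cur.d - prev.d
            best[j] = max(best[j], cand)
    return max(best, default=0)
```

For instance, the matches `(1, 10, 1, 10)`, `(8, 20, 8, 20)` and `(50, 60, 30, 40)` chain together and cover scaffold positions 1 to 20 and 30 to 40, while a match whose scaffold interval lies out of order with respect to the reference is left out of the chain.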

The alignment given by applying the above algorithm allows arbitrarily long gaps, which is not a desirable feature. The gaps between consecutive contigs in scaffolds are restricted by the mate pair insert size, which also implies that in a correct alignment to the genome the gaps should not deviate much from this value. It is easy to modify co-linear chaining to restrict gaps: Replacing […] $j' = 1$:

(3’) **while** […] **do**

(3”) […] $j' \leftarrow j' + 1$

(For simplicity of exposition, this assumes values […] in […] from […]$[j']$ for all active tuples […]$[j']$ the maximum of its previous value and the value computed applying lines 3-5 in the algorithm above. The correctness now follows from the facts that (a) when […] is added to the active tuples […] and hence trigger the update of active tuple ([…],
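As a rough illustration of gap restriction, the following Python sketch adds a gap bound to a quadratic chaining scan and also returns the chain itself. It does not reproduce the search-tree and active-tuple bookkeeping described above. It rests on assumptions that are not taken from the text: alignments are tuples `(x, y, c, d)` (reference `[x, y]` matches scaffold `[c, d]`, inclusive), the gap between consecutive tuples is measured as `c - prev_d - 1` in the scaffold and `x - prev_y - 1` in the reference, overlaps count as zero gap, and a single hypothetical parameter `max_gap` bounds both.

```python
from typing import List, Tuple

Match = Tuple[int, int, int, int]  # (x, y, c, d), hypothetical layout


def gap_restricted_chain(matches: List[Match], max_gap: int) -> Tuple[int, List[Match]]:
    """Best co-linear chain whose consecutive gaps are at most max_gap (O(N^2))."""
    v = sorted(matches, key=lambda m: m[1])  # sort by reference end y
    n = len(v)
    if n == 0:
        return 0, []
    best = [0] * n    # best[j]: coverage of the best chain ending at v[j]
    back = [-1] * n   # back[j]: predecessor index in that chain, -1 if none
    for j in range(n):
        x, y, c, d = v[j]
        best[j] = d - c + 1
        for i in range(j):
            _, py, _, pd = v[i]
            if py >= y or pd >= d:
                continue  # must advance strictly in both sequences
            if c - pd - 1 > max_gap or x - py - 1 > max_gap:
                continue  # gap too long in the scaffold or in the reference
            gain = d - c + 1 if pd < c else d - pd
            if best[i] + gain > best[j]:
                best[j] = best[i] + gain
                back[j] = i
    # Trace back from the best end point to recover the chain.
    j = max(range(n), key=best.__getitem__)
    total = best[j]
    chain = []
    while j != -1:
        chain.append(v[j])
        j = back[j]
    chain.reverse()
    return total, chain
```

With a small `max_gap`, a match that lies far away in the reference is excluded from the chain even though it is in the correct order, which is the behaviour a scaffold validation needs.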

Results

We used

The rest of the process (co-linear chaining, extraction of alignments, computation of N50) was executed on a single machine. To compute the normalized N50 value, co-linear chaining was applied iteratively, each time extracting the best alignment and splitting the scaffold accordingly. The process was repeated until all pieces (that had a local alignment in the first place) found their matches. The N50 of the pieces obtained this way is then called the normalized N50. Reverse complements were taken into account appropriately; scaffolds were aligned to both strands and only contig alignments with the same orientation were combined to form a scaffold alignment.
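The final step, taking the N50 of the aligned pieces, can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: it assumes the splitting has already been done and that the piece lengths of each scaffold are available as lists of integers; the names `n50` and `normalized_n50` are hypothetical.

```python
from typing import Iterable, List


def n50(lengths: Iterable[int]) -> int:
    """Length of the sequence covering the midpoint of the length-ordered concatenation."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no sequences at all


def normalized_n50(pieces_per_scaffold: List[List[int]]) -> int:
    """N50 over the pieces into which the scaffolds were split."""
    return n50(length for pieces in pieces_per_scaffold for length in pieces)
```

An assembly with three scaffolds of length 100 has N50 equal to 100; if validation splits the second scaffold into pieces of 40 and 60 and the third into 30, 30 and 40, the N50 of the pieces drops to 60, reflecting the misassemblies.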

We have already used normalized N50 in

| | **Original** | **10%** | **20%** | **30%** | **50%** | **100%** |
| --- | --- | --- | --- | --- | --- | --- |
| Normalized N50 | 183891 | 92212 | 56964 | 43461 | 33533 | 30403 |
| Genome coverage | 0.9333 | 0.6410 | 0.4778 | 0.4153 | 0.3421 | 0.3311 |
| Scaffold coverage | 0.9859 | 0.6847 | 0.5071 | 0.4414 | 0.3642 | 0.3522 |

For the experiment we ran the validate_distributed.sh script of our tool with parameters

Discussion

The proposed method should also work for validating an RNA assembly against a DNA reference, simply by setting the maximum gap length to the maximum possible intron length. One could also use it for whole genome comparison between two species, by considering how many pieces one genome needs to be partitioned into in order to align to the other. Such a measure is not very accurate, as it does not model a sequence of evolutionary events explaining the transformation the way genome rearrangement distances do, but the approach gives the number of breakpoints, which can be used as a lower bound. However, much more elaborate tools for that purpose have been developed

We stress that our approach also has some conceptual value in avoiding unnecessary heuristics. The three main steps, (i) finding maximal local alignments, (ii) co-linear chaining, and (iii) splitting the scaffolds, each have an algorithmically correct solution. For (i) and (ii) one can refer to

**An example where greedy extraction of gap-restricted co-linear chains may result in more pieces than optimal.** Greedy selection would align blocks 2, 3, 4 with dashed edges, but then, with a suitable gap restriction, blocks 1 and 5 could not be aligned together, and the assembly would be split into 3 parts. An optimal algorithm can choose blocks 2 and 4 with dashed edges and then align blocks 1, 3, 5 together, resulting in only 2 parts. It is possible to construct such an example even without multiple mappings for the blocks.

Finally, the approach in

Conclusions

We proposed and implemented a comprehensive and efficient approach to compute a metric that summarizes scaffold assembly correctness and length. Our implementation can be downloaded from

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

VM and LS developed the gap-restricted version of co-linear chaining, which was implemented by VM. All authors contributed to the development of the normalized N50 framework, which was implemented and evaluated experimentally by JY. All authors contributed to the writing. All authors read and approved the final manuscript.

Acknowledgements

We wish to thank Juha Karjalainen for the initial implementation of co-linear chaining, and Rainer Lehtonen, Virpi Ahola, Ilkka Hanski, Panu Somervuo, Lars Paulin, Petri Auvinen, Liisa Holm, Patrik Koskinen, Pasi Rastas, Niko Välimäki, and Esko Ukkonen for insightful discussions about sequence assembly and scaffolding. We are also grateful to the anonymous reviewers for their constructive comments, which improved the article considerably.

This work was partially supported by Academy of Finland under grants 118653 (ALGODAN) and 250345 (CoECGR).