INRIA Rennes - Bretagne Atlantique, EPI Symbiose, Rennes, France

ENS Cachan/IRISA, EPI Symbiose, Rennes, France

Abstract

Background

The analysis of next-generation sequencing data from large genomes is a timely research topic. Sequencers are producing billions of short sequence fragments from newly sequenced organisms. Computational methods for reconstructing whole genomes/transcriptomes (

Results

We present

Conclusions

Background

Genomics underwent an unprecedented change a few years ago with the arrival of Next Generation Sequencing (NGS), also known as High Throughput Sequencing (HTS). These technologies enable sequencing of biological material (DNA and RNA) at much higher throughput and at a cost that is now affordable to most academic labs. They generate gigabyte- or terabyte-scale datasets. The size of these datasets is one of the two main bottlenecks for NGS; the other is the analysis of the generated data. Current technologies cannot output the entire sequence of a DNA molecule; instead, they return small sequence fragments (

With sequencing costs falling, sequencing efforts are no longer limited to the main species of interest (human and other primates, mouse, rat,

We seek to establish that many biological questions can be answered by analyzing unassembled reads. In particular, the user may possess

Another key aspect of

• For a known biological event, e.g. a SNP (*), a splicing event (*) or a gene fusion (*),

• Do these genes have close homologs in this set of reads (*)? Similarly, do these enzymes exist in this metagenomic set, or are these exons expressed in this [meta]transcriptomic set? Using the genes, enzymes or exons as starters,

• In the case of complex genomes, one may be interested in finding approximate repeated occurrences of known sequence fragments (*). Using such sequence fragments as starters, their occurrences within a fixed Hamming distance are found and their flanking regions are recovered as a graph. Note that this approach is limited to a small number of slightly differing occurrences. Indeed, graph-based

•

The symbol (*) indicates that an example of this use case is given in the Results section. Furthermore, it is important to note that

Methods

The

1. **Mapping**.

2. **De novo assembly**. Each read-coherent starter is extended in both directions. Depending on the user's choice:

(a) the extension process is stopped as soon as several divergent extensions are detected. In this case, the output is a FASTA file containing the consensus assembly around each starter;

(b) the extension process continues even in the case of several divergent possibilities. Extensions are represented as a directed graph. Each node stores a sequence fragment and its read coverage per position. This graph is output in

The mapping phase performs several tasks. A maximum number


**Algorithm overview.** Overview of the algorithm steps with reads of length 7, a minimal coverage of 2 and k-mers of length **a)** Representation of the sub-starter generation step. A set of reads is mapped to the starter _{1 } and _{2}) is computed from each perfect multiple read alignment. The Hamming distance between each sub-starter and **b)** Representation of an extension. Three reads have prefix of length at least

Definitions

We first introduce some notations and definitions used throughout the paper. A sequence $s \in \Sigma^{*}$ is a concatenation of zero or more characters from an alphabet $\Sigma$. The Hamming distance $d_H(s_1, s_2)$ between two sequences $s_1$ and $s_2$ of equal length is the number of positions at which the corresponding characters are different: $d_H(s_1, s_2) = |\{ i : s_1[i] \neq s_2[i] \}|$.

Definition 1 (**Hamming distance for overlapping sequences**)

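Given two sequences $s_1$ and $s_2 \in \Sigma^{*}$, and a position $p \in \mathbb{Z}$, we define $d_H(s_1, s_2, p)$ as the Hamming distance of the overlapping part between $s_1$ and $s_2$, considering the first character of $s_2$ aligned to position $p$ of $s_1$. Formally, $d_H(s_1, s_2, p) = |\{ j : 0 \le j < |s_2|,\ 0 \le p + j < |s_1|,\ s_1[p + j] \neq s_2[j] \}|$. This covers the cases where the first character of $s_2$ is not aligned with $s_1$ ($p < 0$) and where the last character of $s_2$ is not aligned with $s_1$ ($p + |s_2| > |s_1|$).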
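The two distances defined above can be illustrated with a minimal Python sketch. The function names `hamming` and `hamming_overlap`, and the choice to return `None` when the two sequences share no position, are assumptions made for illustration; they are not taken from the Mapsembler implementation.

```python
def hamming(s1: str, s2: str) -> int:
    """Hamming distance between two sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


def hamming_overlap(s1: str, s2: str, p: int):
    """Hamming distance of the overlapping part of s1 and s2, with the
    first character of s2 aligned to position p of s1.

    p may be negative (s2 starts before s1). Returns None when the two
    sequences share no position (an assumption of this sketch).
    """
    start = max(0, p)                # first overlapping position on s1
    end = min(len(s1), p + len(s2))  # one past the last overlapping position
    if start >= end:
        return None
    # Shift s1 coordinates by -p to obtain the matching slice of s2.
    return hamming(s1[start:end], s2[start - p:end - p])
```

For instance, `hamming_overlap("ACGT", "GATT", 2)` compares only the two shared positions (`GT` against `GA`) and ignores the part of the second sequence that extends past the end of the first.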

Definition 2 (**Mapped read**)

Given a sequence ^{∗}, a read ^{∗ } is said to be mapped to _{
H
}(

The notation

Example 1 (**Mapped read**)

Given _{
H
}(

Algorithm

An overview of the whole process is presented in Algorithm 1. In a few words, the algorithm is divided into two main phases: the **mapping phase** (Steps 1 to 4 of Algorithm 1). This first phase is similar to seed-based mapping algorithms such as **targeted** **
de novo
**

Algorithm 1: Mapsembler overview

**Requires:** Set of reads **Ensure:** For each starter in

1: Index the

2: Map reads

3: **for all**
**do**

4: Using reads mapped to

5: Add new sub-starters to _{0}.

6:

7: **while**
_{
i
}≠**do**

8: Free previous index, index _{
i
} with

9: Map reads _{
i
}, using the

10: **for all**
_{
i
}**do**

11: Using reads mapped to

12: Create nodes containing the extensions and manage the graph

13: Store all novel extensions in _{
i + 1}

14:

15: Simplify the created graphs

16: For each starter in

Explanation of Algorithm 1 steps

• Step 1: An index of all _{
id
} belonging to the indexed set, and for each _{
id
}, a list of couples (_{
id
},_{
id
}) is stored, with _{
id
} being a position where the _{
id
}. Note that, as a _{
id
}, several distinct couples may be stored for a given _{
id
}. All couples (_{
id
},_{
id
}) of a given

• Step 2: Input reads (and their reverse complements) are processed on the fly; only mapped reads are stored in memory. The mapping process is as follows. All

• Steps 8 and 9: Indexing of extensions _{
i
} and read mapping are performed similarly to Steps 1 and 2. During these steps, reads have to perfectly agree with the extensions, hence read mapping is done with distance threshold

• Step 11: For each sub-starter, extensions are always stored in a rooted directed string graph, each node containing a sequence fragment. A node storing a sequence _{
s
}. The node storing the sub-starter itself is the root of the graph. For each sequence _{
i
}, using all error-corrected mapped reads

1. An empty extension is found.

2. Exactly one extension _{
e
}, and link the node _{
s
} to the node _{
e
}. Store the fragment _{
i + 1}.

3. Several extensions {_{1},_{2},…,_{
n
}} are found, then:

For simple sequence output, the longest common prefix _{
i
} is stored in a new node _{
p
}. Link _{
s
} to _{
p
} for output purpose. As _{
i + 1}, its extension stops.

For graph output, link _{
s
} to _{1},_{2},…,_{
n
}} are stored in _{
i + 1}.

• Step 12: Generate enriched extensions by adding suffix of

• Step 13: Novel extensions are those corresponding to nodes which are not already present in the graph (see Section “Graph management”).

• Step 16: In the case of the simple sequence format, the extension graph of each sub-starter does not contain branching nodes. A simple traversal provides the consensus sequence of the contig containing the sub-starter.
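The indexing and mapping steps (Steps 1 and 2) can be sketched as follows. This is a minimal illustration of seed-based mapping, not the Mapsembler code: the function names, the in-memory dictionary index, and the parameters `k` (seed length) and `d` (maximal number of substitutions) are assumptions. Reverse complements and any minimal overlap requirement are left out.

```python
from collections import defaultdict


def build_kmer_index(starters, k):
    """Map each k-mer to the list of (starter_id, position) couples
    at which it occurs in the starters."""
    index = defaultdict(list)
    for starter_id, seq in enumerate(starters):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((starter_id, pos))
    return index


def mismatches(starter, read, p):
    """Number of substitutions on the overlap, with the first character
    of the read aligned to position p of the starter."""
    start, end = max(0, p), min(len(starter), p + len(read))
    return sum(starter[i] != read[i - p] for i in range(start, end))


def map_read(read, starters, index, k, d):
    """Return the set of (starter_id, position) couples at which the
    read maps with at most d substitutions."""
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        # .get() avoids creating empty entries in the defaultdict.
        for starter_id, pos in index.get(seed, ()):
            p = pos - offset  # implied position of the read start
            if mismatches(starters[starter_id], read, p) <= d:
                hits.add((starter_id, p))
    return hits
```

Each shared k-mer proposes one candidate position, which is then verified over the whole overlap; a read is kept only if at least one candidate passes, which matches the idea of storing only mapped reads in memory.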

Error correction

Actual sequencing reads are error-prone, therefore error correction mechanisms are implemented inside the mapping phase. At Steps 2 and 9, error-prone reads are mapped to starters. An error correction phase is performed immediately after both of these steps, by taking advantage of the multiple read alignments. This procedure is based on nucleotide votes, similarly to greedy assemblers
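A voting scheme of this kind can be sketched in a few lines of Python. The exact rule used by Mapsembler is not reproduced here: the function name `correct_by_votes` and the `min_votes` threshold are assumptions, and the sketch handles substitutions only.

```python
from collections import Counter, defaultdict


def correct_by_votes(alignments, min_votes=2):
    """alignments: list of (read, position) couples, all mapped to the
    same starter. Returns the list of (corrected_read, position).

    A base is replaced by the most frequent base of its column only if
    it is supported by fewer than min_votes reads while the most
    frequent base is supported by at least min_votes reads.
    """
    # Count the nucleotides observed at each starter position.
    columns = defaultdict(Counter)
    for read, pos in alignments:
        for i, base in enumerate(read):
            columns[pos + i][base] += 1

    corrected = []
    for read, pos in alignments:
        bases = []
        for i, base in enumerate(read):
            votes = columns[pos + i]
            best, best_count = votes.most_common(1)[0]
            if votes[base] < min_votes <= best_count:
                bases.append(best)   # weakly supported base: corrected
            else:
                bases.append(base)   # supported base: kept as is
        corrected.append(("".join(bases), pos))
    return corrected
```

With such a rule, a variant supported by at least `min_votes` reads is left untouched, so well-covered divergent sequences can still be told apart in later steps, whereas isolated substitutions are treated as sequencing errors.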

We now provide deeper algorithmic explanations for sub-starter generation (Step 4) and the graph management (Steps 12 and 15). The remaining steps (read mapping and greedy sequence extensions) are classically well known

Sub-starter generation and read coherence

The sub-starter generation and read coherence step takes place immediately after the mapping phase (Step 4). Given a starter _{
i
}) of sequences (called

originate from the reads, i.e. each _{
i
} is a consensus sequence of a subset of reads from

are coherent with the starter _{
i
} is at most

are significantly represented, i.e. each position of _{
i
} is covered by at least

A starter is

Problem 1 (**Multiple consensuses from read alignments**)

Given a starter _{
i
} is aligned to _{
i
} with at most _{
i
} of

1. each subset _{
i
} admits a perfect consensus _{
i
}, i.e. each read _{
i
} aligns to _{
i
} at position _{
i
} (relative to

2. the consensus _{
i
} aligns

3. each position of _{
i
}.

A trivial (exponential) solution is (i) to generate the power set (all possible subsets) of

The completeness proof that Algorithm 2 finds all maximal subsets corresponding to correct sub-starters is as follows. The proof is by contradiction: let _{1},…,_{
n
} be the maximal subset of reads which yields _{1},…,_{
k
}, for _{1},…,_{
k
} is part of a returned subset _{0}, we show that _{1},…,_{
k + 1} is also returned. Since _{
k + 1} is part of a subset which yields _{
k
}. However, _{
k + 1} does not necessarily belong to _{0}. Let
_{
k
} in _{0}. In the ordering of the reads by increasing position, if the read _{
k + 1}is seen before
_{
k + 1} perfectly overlaps with _{
k
}, a new subset is created from _{0}, which contains exactly _{1},…,_{
k + 1}. Eventually, from the induction, a subset which contains _{1},…,_{
n
} is constructed. Since _{1},…,_{
n
} is itself maximal, the subset found by the algorithm is exactly _{1},…,_{
n
}.

Note that Algorithm 2 may return subsets which do not satisfy all the three conditions (e.g. coverage of ^{
d
} maximal subsets, one for each combination of substitutions with ^{
d
}) intermediate subsets at any time. Assuming that the read length is bounded by a constant, the overlap detection steps 4 and 7 can be performed in ^{
d
}|^{2}), where in practice

Algorithm 2: Generating candidate subsets _{
i
} for solving the multiple consensuses from read alignments problem

**Requires:** Set of reads **Ensure:** Set

1:

2: **for** each read (**do**

3: **for** each subset _{
i
} in **do**

4: **if** r overlaps without substitutions with the last read of _{i} **then**

5: Add _{
i
}.

6: **else**

7: **if**
_{
i
} **then**

8: Let (^{
′
}, ^{
′
}) be the last read of _{
i
} overlapping with

9: Let _{
i
} of all reads up to(^{
′
},^{
′
}).

10: Create a new subset ^{
′
}=

11: Insert ^{
′
} into

12: **if**
**then**

13: Create a new subset with

14: Remove any subset from

15: **return **
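The candidate subset generation can be sketched in Python as follows. This is a hedged reading of Algorithm 2, not the authors' code: reads are assumed to be `(sequence, position)` couples of equal length, already aligned to the starter and processed by increasing position; the names `agree` and `candidate_subsets` are hypothetical; and the final clean-up is assumed to remove subsets included in another one. Coverage and distance filtering of the resulting consensuses are not shown.

```python
def agree(a, b):
    """True if reads a and b, given as (sequence, position), share at
    least one starter position and are identical on all shared positions."""
    (ra, pa), (rb, pb) = a, b
    start, end = max(pa, pb), min(pa + len(ra), pb + len(rb))
    return start < end and ra[start - pa:end - pa] == rb[start - pb:end - pb]


def candidate_subsets(reads):
    """Group aligned reads into maximal subsets admitting a perfect consensus."""
    subsets = []
    for read in sorted(reads, key=lambda r: r[1]):
        placed = False
        created = []
        for subset in subsets:
            if agree(subset[-1], read):
                subset.append(read)  # extends this subset
                placed = True
                continue
            # Otherwise, branch from the last read of the subset that
            # overlaps the new read without substitutions.
            for i in range(len(subset) - 1, -1, -1):
                if agree(subset[i], read):
                    created.append(subset[:i + 1] + [read])
                    placed = True
                    break
        subsets.extend(created)
        if not placed:
            subsets.append([read])  # the read starts a new subset
    # Drop duplicates and subsets included in a larger one.
    maximal = []
    for s in sorted(subsets, key=len, reverse=True):
        if not any(set(s) <= set(m) for m in maximal):
            maximal.append(s)
    return maximal
```

On five reads of length 4 carrying two alleles, `ACGT`, `CGTA`, `CGTC`, `GTAA` and `GTCA` at positions 0, 1, 1, 2 and 2, the sketch returns two subsets, one per allele, both sharing the first read.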

Graph management

Adding a node

Several biological events such as a SNP, an indel, or exon skipping, create two or more distinct paths in the extension graph. These paths eventually converge and continue with an identical sequence. Consequently, path convergence is checked during the iterative assembly phase (Algorithm 1, Step 12). When a sequence _{
s
} is linked to
^{
′
} is added between _{
s
} and
_{
s
} as it is already present in node

Graph simplification

Once extensions are finished, each graph is simplified as follows:

As presented in Figure

Two nodes _{
s
} and
_{
s
} has only
_{
s
} as predecessor. This is a classical concatenation of simple paths. See Figure

For all nodes successors of a node _{
s
} having only _{
s
} as predecessor, their longest common prefix _{
s
}, thus generating node _{
s.pre
}. Similarly, for all nodes predecessors of a node _{
s
} having only _{
s
} as successor, their longest common suffix _{
s
}, thus generating node _{
suf.s
}. This simplification relocates branching in the graph, to the exact position where sequences diverge and converge. See Figure
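Two of these simplifications, the concatenation of simple paths and the common prefix factorization, can be sketched as follows. The graph representation (a dictionary of node sequences and a dictionary of successor lists), the function names, and the assumption that consecutive nodes store non-overlapping fragments are illustrative choices, not a description of the Mapsembler data structures. The suffix factorization is symmetric and omitted.

```python
import os


def predecessors(succ, nodes):
    """Build predecessor lists from successor lists."""
    pred = {n: [] for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    return pred


def merge_simple_paths(seqs, succ):
    """Concatenate u and v whenever v is the only successor of u and
    u is the only predecessor of v. Modifies seqs and succ in place."""
    merged = True
    while merged:
        merged = False
        pred = predecessors(succ, seqs)
        for u in list(seqs):
            vs = succ.get(u, [])
            if len(vs) == 1 and vs[0] != u and pred[vs[0]] == [u]:
                v = vs[0]
                seqs[u] += seqs.pop(v)     # u now stores both fragments
                succ[u] = succ.pop(v, [])  # u inherits the successors of v
                merged = True
                break


def factor_common_prefix(seqs, succ, u):
    """Move the longest common prefix of the successors of u into u,
    provided each successor has u as its only predecessor."""
    pred = predecessors(succ, seqs)
    children = succ.get(u, [])
    if len(children) < 2 or any(pred[c] != [u] for c in children):
        return
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix([seqs[c] for c in children])
    seqs[u] += prefix
    for c in children:
        seqs[c] = seqs[c][len(prefix):]
```

After factorization, the successors start exactly at the first position where their sequences differ, which is the relocation of branching described above.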


**Graph simplification.** Graph simplification (Algorithm 1, Step 15). **a)** the graph before simplification. **b)** After removing the first **c)** After common prefix and suffix factorizations.

Availability and requirements

Results

All presented results were obtained on a 2.66 GHz dual-core laptop with 3 MB of cache and 4 GB of RAM.

For each experiment presented in this manuscript, details about datasets,

**Material and Mapsembler commands and results.**


In Figures representing graphs, the node size indicates average read coverage in the sequence and the node border size indicates the length of the sequence.

Note that

Mapsembler and the state of the art

Targeted assembly should not be confused with Sanger-generation, localized BAC-by-BAC assembly methods (e.g. Atlas

sub-starters retrieval;

multiple iterations to extend starters as far as possible. This is equivalent to re-running

graph output of the left and right neighborhood of starters.

We compared

Using a unique randomly selected read as starter,

The iterative mapping and assembly strategies are also used in the IMAGE approach

Assembly accuracy

The accuracy of

Dealing with large data sets

In this section, we focus exclusively on _{10K}, _{100K}, _{1M}, _{10M}, and _{100M}, were generated by random sampling of 10^{4}, 10^{5}, 10^{6}, 10^{7} and 10^{8} reads. A targeted assembly of 10 randomly selected reads as starters was performed using

Results summarized in Table
_{100M} data set) was analyzed using <1.5 MB of memory. These results also show that computation time is reasonable even on such large data sets, as time increases linearly with the number of reads. On average on the _{100M} data set, checking read coherence of all starters took 1813 seconds while one extension of all sequence fragments took 903 seconds.

Time and memory requirements for targeted assembly of 10 starters using increasingly large human genome read data sets. Mapping time corresponds to the mapping phase (Algorithm 1, Steps 1 to 5). Assembly time corresponds to the assembly phase (Steps 8 to 14) per iteration.

| **Reads data set** | **Mapping time (s)** | **Assembly time (s)** | **Total time (s)** | **Memory (MB)** |
|---|---|---|---|---|
| _{10K} | <1 | <1 | 1 | <1.5 |
| _{100K} | 1 | 2 | 5 | <1.5 |
| _{1M} | 14 | 6 | 40 | <1.5 |
| _{10M} | 170 | 95 | 442 | <1.5 |
| _{100M} | 1813 | 903 | 3983 | <1.5 |
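The scaling behaviour can be checked with a short calculation on the total times reported in the table. The sketch below assumes that the data set names denote the number of sampled reads (100K for 10^5 reads, and so on); the 10K row is left out because its timings are below the one-second resolution.

```python
# Number of reads per data set (assumed from the data set names).
reads = {"100K": 10**5, "1M": 10**6, "10M": 10**7, "100M": 10**8}

# Total times in seconds, copied from the table.
total_seconds = {"100K": 5, "1M": 40, "10M": 442, "100M": 3983}


def seconds_per_million_reads(name):
    """Total time normalized by the size of the data set."""
    return total_seconds[name] / reads[name] * 1e6


rates = {name: seconds_per_million_reads(name) for name in reads}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} s per million reads")
```

The normalized cost stays between roughly 40 and 50 seconds per million reads over three orders of magnitude, which is what a running time linear in the number of reads would produce.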

Recovering environments of repeat occurrences

We analyzed a dataset of 20.8 M raw Illumina reads (SRA: SRX000429) from


**Repeated starter.** Graph obtained using a repeat occurrence as starter. For readability, the prefixes of left extensions and the suffixes of right extensions, as well as the core of the starter, are truncated.

Detecting AluY sub-families in a personal genome

We downloaded a dataset of high-coverage, NA12878 chromosome 19 reads from the 1000 Genomes project. We selected bases 60-120 of the RepBase

A total of 58,656 reads mapped to the 60 bp starter and 8 sub-starters were constructed by

Several sub-starters (1, 2, 3, 4 and 6) did not exactly correspond to a known Alu consensus sequence. We manually verified that all these sub-starters are valid as follows. Sub-starters 2, 4, 5 and 6 (resp. 2, 4 and 5) align perfectly to the NA12878 maternal (resp. paternal) reference. Mutations of sub-starters 2 and 4 (bases 50 and 49 respectively) are also found in Alu Ya5

Gene detection in a different strain

The folA gene (dihydrofolate reductase) is present in several strains of

Detection of known biological events in

In this section, one Illumina HiSeq2000 RNA-Seq run of 22.5 million reads of length 70 nt from Drosophila melanogaster is analyzed (data not published). As presented in upcoming sections,

Exon skipping

We chose a starter located close to a known exon fragment (Chr4:488,592-488,620 BDGP R5/dm3). Using less than one megabyte of memory and in 33 minutes,


**Drosophila exon.** Visualization of


**Drosophila exon - blat result.** Visualization of Blat results on sequences obtained from the graph presented in Figure

Visualizing SNPs

On the same read data set, we used a fragment (chrX:17,783,737-17,783,812 BDGP R5/dm3) for which neighboring genes are known. We applied


**Drosophila SNPs.** Visualization of

Detection of fusion genes in breast cancer

Recent work from Edgren

It is of particular interest to notice that


**Gene fusion in human breast cancer.** Extension graph of an extremity of an exon from the VAPB human gene located on chromosome 20. **a)**: the raw graph produced by **b)**: the same graph manually curated by mapping the sequence of each node on the human genome. Nodes were moved in order to reflect their relative mapping position on the chromosomes. Nodes from the raw graph having sequences mapping at the same position were merged. For each node, the start and stop positions of the mapping are indicated. The presence of two start and stop positions reflects the presence of a central intron. Except for the purple node, which has multiple hits across the genome, 100% of the sequence of each node was mapped, either to an exon from gene VAPB on chromosome 20 or from gene IKZF3 on chromosome 17. The bold edge corresponds to the gene fusion found in


**Gene fusion in human breast cancer - Blat results.** Blat

Discussion

We presented

Homology/similarity distance

Furthermore, setting a large ^{
d
}) sub-starters having at most

Secondly,

Paired reads vs. single reads

The

Instead of injecting the paired read information in the algorithm, we believe that it is simpler to run

Micro assembly vs. full contigs assembly

One key aspect of

Does contig length matter?

If feasible on the data, using

We argue that short contigs provide sufficient biological information for our purpose. As presented in the Results section, we retrieved SNPs, different isoforms and gene fusions using short contigs. For instance, the graph presented in Figure

Sensitivity to SNPs

Similarly to greedy assemblers, in the simple sequence output mode,

Starter selection

The input starters are sequence fragments on which reads will be mapped. They can be of any length; however, very long starters (over 10^{4} nt) are discouraged, as the sub-starter generation step is quadratic in the number of aligned reads. Furthermore, Mapsembler verifies that starters are read-coherent, hence longer starters are more likely to contain regions where the coverage is too low. Also, as previously mentioned, long starters may lead to false positive sub-starters.

Mapsembler discards read alignments which contain an indel. Hence, it is advised to input small, well-conserved starters. However, indels in the extensions are retained in the graph structure. Starters are typically constructed from an external source of information, such as sequence information from a related species, a known conserved gene, or an existing collapsed assembly.

Full biological events calling versus

Conclusions

We presented the main

There is much room for future work. Currently the error correction is based on substitutions only. For opening

To finish, its simplicity and its power make

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PP initiated the work and PP and RC designed the algorithms. RC developed the sub-starter generation and read coherence algorithms, while PP developed the other parts. RC and PP performed the experiments and wrote the paper. Both authors read and approved the final manuscript.

Acknowledgements

The authors warmly thank Vincent Lacroix, Claire Lemaitre, Delphine Naquin, Hélène Falentin and Fabrice Legeai for their participation in discussions. This work was supported by the INRIA “action de recherche collaborative” ARC Alcovna and by the MAPPI ANR.