Department of Computer Science, University of California Los Angeles, Los Angeles, California, USA

Department of Mathematics, University of California Los Angeles, Los Angeles, California, USA

Abstract

Motivation

Recent studies in genomics have highlighted the significance of structural variation in determining individual variation. Current methods for identifying structural variation, however, are predominantly focused on either assembling whole genomes from scratch, or identifying the relatively small changes between a genome and a reference sequence. While significant progress has been made in recent years on both

Results

In this paper, we present a computational method for incorporating a reference sequence into an assembly algorithm. We propose a novel graph construction that builds upon the well-known de Bruijn graph to incorporate the reference, and describe a simple algorithm, based on iterative message passing, which uses this information to significantly improve assembly results. We validate our method by applying it to a series of 5 Mb simulated genomes derived from both mammalian and bacterial references. We present the results on this simulated data along with a discussion of the benefits and drawbacks of the technique.

Introduction

Within a species, individual genomes differ from one another by a certain amount of genetic variation. These variations exist at different scales, ranging from single nucleotide variants (SNVs), to small-scale insertions and deletions (indels), up to large structural variations (SVs) of kilo- to mega-base scale. Many studies in genomics are focused on characterizing the content of these variations and identifying associations with diseases or other phenotypes

In recent years, the development of high-throughput sequencing (HTS) technologies has made it possible to sequence an individual genome rapidly and at low cost. However, the problem of how to interpret this sequencing data remains. Traditionally, one of two approaches is taken. In

It is helpful to consider these two approaches,

To address this problem, a number of methods have been developed to both identify the loci of larger SVs

Because of this difficulty, many studies continue to rely on

A number of software packages have been developed in recent years with the aim of utilizing a set of reference genomes to produce a more optimized scaffolding, or layout, of the contigs produced in

Our aim in this paper is to propose a novel model for the assembly of a donor genome which uses the reference as a guide, and to show how this approach improves assembly results over pure

Results

Here we present the results of our work, beginning with a brief overview of our method. We follow this with a discussion of our simulation results and the implications for the feasibility of our method.

Method overview

As an example of how a reference sequence can aid in assembly, consider the de Bruijn graph of a donor genome "ATAGAGGCAATGAGCGTGGAGTTC" in Figure

**A motivating example demonstrating how the use of a reference can help discover the most parsimonious traversal of the de Bruijn graph**. (a) de Bruijn graph of the donor sequence "ATAGAGGCAATGAGCGTGGAGTTC". (b) de Bruijn graph of the reference sequence "ATAGCAATCGTGTTC", including edge index labels. (c) Graph combining the donor and reference sequences. (d) Graph stripped of red edges with no parallel blue edge.
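The superposition described in the caption can be sketched in a few lines of Python. This is only an illustration: the k-mer size `K = 4` and the helper `de_bruijn_edges` are assumptions made for the example, not values or names taken from the paper, and the graph here treats each k-mer as an edge between its (k-1)-mer prefix and suffix.

```python
from collections import Counter

def de_bruijn_edges(sequence, k):
    """Return a multiset of edges: each k-mer links its (k-1)-mer prefix to its suffix."""
    edges = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges

donor = "ATAGAGGCAATGAGCGTGGAGTTC"
reference = "ATAGCAATCGTGTTC"
K = 4  # assumed k-mer size, chosen only for illustration

donor_edges = de_bruijn_edges(donor, K)          # "blue" edges
reference_edges = de_bruijn_edges(reference, K)  # "red" edges

# Keep a reference edge only if a parallel donor edge exists, as in panel (d).
shared = {e: n for e, n in reference_edges.items() if e in donor_edges}
# Donor edges with no parallel reference edge carry sequence absent from the reference.
novel = {e: n for e, n in donor_edges.items() if e not in reference_edges}
```

With these inputs, the shared edges correspond to the k-mers that the donor and reference have in common, while the remaining donor edges mark the inserted material that a reference-guided traversal has to place.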

With this idea in mind, our method begins by building a graph of the contigs in the donor sequence. The construction of these contigs is flexible, and they may be derived from the sequencing reads through either

It is important to note that while we use the alignment information in our method, we never assume that any specific alignment is correct. We believe this is a strength compared to other methods that rely more heavily on read mapping and, as a result, may be more biased towards the reference.

Simulation results

In order to validate our method, we design a simulation framework using two reference genomes: the O157:H7 strain of the

For each reference genome, we generate simulated donor genomes by applying a series of mutations to the reference, including insertions, deletions, duplications, and translocations. We vary the average size of the mutation events from 5 Kb to 50 Kb, such that these events comprise roughly 15% of the donor genome. We further apply a set of SNV mutations at a rate of 0.1%.
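As a rough illustration of this kind of donor simulation, the sketch below applies SNVs at a fixed per-base rate and a single deletion event. The function names, the fixed random seed, and the uniform choice of substitute base are assumptions for the example; the paper does not describe its simulator at this level of detail.

```python
import random

def apply_snvs(genome, rate=0.001, seed=0):
    """Substitute each base with a different base with probability `rate` (0.1% by default)."""
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    mutated = []
    for base in genome:
        if rng.random() < rate:
            mutated.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)

def apply_deletion(genome, start, size):
    """Remove `size` bases beginning at position `start`."""
    return genome[:start] + genome[start + size:]

toy_reference = "ACGTTGCA" * 50
toy_donor = apply_snvs(apply_deletion(toy_reference, 100, 40))
```

Insertions, duplications, and translocations could be written in the same style as slice-and-concatenate operations on the reference string.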

We generate simulated paired-end sequencing data from each donor genome using a read length of 100 bp and fixed insert size of 500 bp. In all cases we assume error-free reads and uniform coverage (a read from every position). While these assumptions are unrealistic in practice, correcting for read errors and variable coverage are orthogonal problems which have been studied independently
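The read model above (error-free reads, one pair from every position) is simple enough to sketch directly. In this example the insert size is interpreted as the total fragment length and the second read is reported as the reverse complement of the fragment's far end; both are assumptions, since the paper does not spell out these conventions.

```python
def simulate_paired_reads(genome, read_length=100, insert_size=500):
    """Yield error-free read pairs, one pair starting at every valid position."""
    complement = str.maketrans("ACGT", "TGCA")
    for start in range(len(genome) - insert_size + 1):
        fragment = genome[start:start + insert_size]
        forward = fragment[:read_length]
        # Second read: reverse complement of the last `read_length` bases of the fragment.
        reverse = fragment[-read_length:].translate(complement)[::-1]
        yield forward, reverse

toy_genome = "ACGT" * 200  # 800 bp toy sequence, for illustration only
pairs = list(simulate_paired_reads(toy_genome))
```

A genome shorter than the insert size yields no pairs, which is why the toy sequence is longer than 500 bp.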

For each data set, we perform paired-end assembly using Velvet

We further evaluate our method by comparing two strains of the E. coli bacterium, using the O157:H7 strain as a reference and the K-12 strain as a donor. The K-12 strain is significantly shorter than the O157:H7 strain, indicating the presence of large-scale deletions. This is supported by our assembly results, which indicate mutation events up to 120 Kb in size. The full results of our simulations on the E. coli reference are reported in Table

Results of running both Velvet and our method on simulated mouse chromosomes.

| Donor genome | Velvet: # Contigs | Velvet: N50 | Velvet: Max contig | Our method: # Contigs | Our method: N50 | Our method: Max contig | Our method: Accuracy |
|---|---|---|---|---|---|---|---|
| Mouse, 5 Kb | 1014 | 14315 | 56677 | **352** | **73042** | **288172** | **99.7%** |
| Mouse, 25 Kb | 773 | 19038 | 102858 | **386** | **88473** | **227406** | **99.7%** |
| Mouse, 50 Kb | 705 | 21721 | 98684 | **410** | **117127** | **336208** | **99.2%** |

Each simulated chromosome is 5 Mb in length, containing roughly 15% mutated content, using mutation event sizes of 5, 25, and 50 Kb. In each case, contig statistics are given for the Velvet assembly and for our method, along with the accuracy of our computed contigs. Accuracy is calculated as the percentage of computed contigs that align back to the donor genome.
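For readers unfamiliar with the N50 statistic reported in the table, the sketch below computes it under the common definition: the largest length L such that contigs of length at least L account for at least half of the total assembled bases. The paper does not state which variant of the definition was used, so this is an assumption.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty assembly)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

example_n50 = n50([2, 3, 4, 5, 6])
```

In the example, the two longest contigs (6 and 5) already cover 11 of 20 bases, so the N50 is 5.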

Results of running Velvet and our method on

| Donor genome | Velvet: # Contigs | Velvet: N50 | Velvet: Max contig | Our method: # Contigs | Our method: N50 | Our method: Max contig | Our method: Accuracy |
|---|---|---|---|---|---|---|---|
| | 1034 | 25477 | 158013 | **422** | **56750** | **274293** | **99.5%** |
| | 870 | 71194 | 286061 | **727** | **96535** | **285958** | **99.6%** |
| | 166 | 125649 | 327149 | **33** | **429486** | **734812** | **97.0%** |

In the first three cases, simulated donor genomes are derived from the O157 reference by applying a series of mutations (insertions, deletions, and SNVs). In the final case, the K12

It is important to note that while comparisons against

Methods

Let _{R }_{D }_{R }_{D }

Reference/donor graphs

Given the multisets of _{R }_{D}_{R }_{R}, E_{R}_{D }_{D}, E_{D}_{R }_{D}_{R }_{D}_{R }_{D}_{R }_{D}

**Definition: **a **reference/donor graph **_{RD }_{R }_{D }_{R }_{D }_{RD}_{R }

Refer to Figure

**A most basic example of a reference/donor graph, constructed from the superposition of the two original graphs**. (a) Reference graph _{R}_{D}

**Definition:** a **donor tour** of a reference/donor graph is a complete tour that includes only blue (donor) edges.

In other words, a donor tour of _{RD }_{D}

We are now interested in a concise way to characterize the similarities between the reference and the donor. We start by considering those cases in which a red and blue edge are parallel in the graph (denoted by the || operator). The notation and terminology we will use in discussing these cases is defined as follows.

**Definition: **a donor edge **∈ **_{RD }**reference-parallel **if there exists a reference edge _{RD }_{1}, _{2}, . . ., _{n}_{i}, e_{i}_{+1 }∈ _{i}_{j}_{RD }_{j}_{i}_{1}, _{2}, . . ., _{n}**novel **if it is not reference-parallel.

**Definition: **the reference indexes of a reference-parallel sequence are the values _{1})._{2})._{n}_{1})._{n}**reference marker**. The beginning and end of the reference-parallel sequence are referred to as

Refer to Figure

**Definition: **given two reference markers _{i }_{j}_{i }**connects to **_{j }_{j}.start _{i}.end

With this graph construction, we can now define the genome reassembly problem.

**The genome reassembly problem**: given a reference/donor graph _{RD}

We note that this is an extremely large combinatorial problem, and solving it exactly is impractical. We therefore formulate a restricted problem that imposes an assumption on the size of any single variation event in the donor genome.

**The τ-gap genome reassembly problem**: given a reference/donor graph

Condensed reference/donor graph

In order to reduce the computation and storage demands of the algorithm, we first transform the reference/donor graph _{RD }_{CRD}

**Construction of a condensed reference/donor graph in a more complex case**. In the first pass, any reference edges with no parallel donor edge are removed. In the second pass, linear subpaths are condensed to single edges, and the parallel reference edges are summarized using reference markers. (a) Initial reference/donor graph. (b) Intermediate graph with isolated reference edges removed. (c) Final condensed reference/donor graph, with reference markers shown.
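The two passes described in the caption can be sketched as follows. The edge representation (pairs of node identifiers) and both function names are assumptions for illustration; the sketch returns each condensed edge as the list of nodes it spans and does not attach reference markers or contig sequence.

```python
from collections import defaultdict

def drop_isolated_reference_edges(reference_edges, donor_edges):
    """First pass: keep only reference edges that have a parallel donor edge."""
    donor_set = set(donor_edges)
    return [edge for edge in reference_edges if edge in donor_set]

def condense_linear_paths(edges):
    """Second pass: collapse maximal non-branching paths into single edges.

    `edges` is a list of (tail, head) pairs; the result is a list of node paths.
    Isolated cycles made only of internal nodes are not handled in this sketch.
    """
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for tail, head in edges:
        out_edges[tail].append(head)
        in_degree[head] += 1

    def is_internal(node):
        # Exactly one way in and one way out: the node can be absorbed into a path.
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths = []
    for tail in list(out_edges):
        if is_internal(tail):
            continue
        for head in out_edges[tail]:
            path = [tail, head]
            while is_internal(path[-1]):
                path.append(out_edges[path[-1]][0])
            paths.append(path)
    return paths

condensed = condense_linear_paths([(1, 2), (2, 3), (3, 4), (3, 5)])
```

In the toy graph, node 2 is the only internal node, so the edges (1,2) and (2,3) collapse into one condensed edge while the branch at node 3 is preserved.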

With the condensed reference/donor graph in mind, we may think of a valid traversal as one which touches a sequence of reference markers, one from each edge in the path. For this traversal to be valid, each adjacent pair of reference markers in this sequence must be separated by a distance of at most

**The τ-gap genome reassembly problem on a condensed reference/donor graph**: given a condensed reference/donor graph

Note, however, that this formulation requires at least one reference marker at

Message passing algorithm

Having constructed the reference/donor graph, our aim is to encode within the graph exactly those traversals which satisfy our initial assumptions on

**Showing the process of propagation and pruning, assuming τ = 15**. All edges shown have an edge length of 10. Initially, we have the graph in (a), in which the edges (6,7) and (5,7) have no reference markers. The marker attached to edge (7,8) is propagated along incoming edges up to a distance of

Propagation

As previously described, each edge in the condensed reference/donor graph stores a list of the reference markers associated with any reference-parallel subsequences within that edge's contig. The first phase of the message-passing algorithm propagates this information throughout the graph, such that each edge additionally stores a list of reference markers at edges that are

A message in this propagation phase consists of a set of pairs, each pair (

**Algorithm 1** Propagation_ReceiveMessage(edge, message)

1:

2: **for **(**∈ **message **do**

3:

4: **if ****then**

5: edge.

6:

7: **end if**

8: **end for**

9: **if ****then**

10:

11: **end if**
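Because parts of the listing above did not survive extraction, the following Python sketch shows one way the propagation phase could work, based only on the surrounding prose and the figure caption. It is not the authors' Algorithm 1. The assumptions are: markers travel backward along incoming edges; a message maps each marker to the gap between the receiving edge's end and the marker's edge; an edge's own markers are stored with gap 0; and all edge lengths are positive. The names `propagate_markers`, `lengths`, and `own_markers` are hypothetical.

```python
from collections import deque

def propagate_markers(edges, lengths, own_markers, tau):
    """Copy each marker backward along incoming edges while the gap stays <= tau.

    edges: list of (tail, head) pairs; lengths: edge -> positive contig length;
    own_markers: edge -> iterable of marker ids. Returns edge -> {marker: gap}.
    """
    incoming = {}
    for edge in edges:
        incoming.setdefault(edge[1], []).append(edge)  # edges ending at node edge[1]

    known = {edge: {} for edge in edges}
    queue = deque()
    for edge in edges:
        for marker in own_markers.get(edge, ()):
            known[edge][marker] = 0
        if known[edge]:
            # Predecessors see these markers at gap 0, directly after their own end.
            queue.append((edge, dict(known[edge])))

    while queue:
        sender, message = queue.popleft()
        for receiver in incoming.get(sender[0], []):
            forwarded = {}
            for marker, gap in message.items():
                # Accept only markers that are new or arrive with a smaller gap.
                if gap < known[receiver].get(marker, tau + 1):
                    known[receiver][marker] = gap
                    next_gap = gap + lengths[receiver]
                    if next_gap <= tau:
                        forwarded[marker] = next_gap
            if forwarded:
                queue.append((receiver, forwarded))
    return known

toy_edges = [(2, 3), (3, 5), (5, 7), (6, 7), (7, 8)]
toy_known = propagate_markers(
    toy_edges, {e: 10 for e in toy_edges}, {(7, 8): ["m"]}, tau=15)
```

The toy graph mirrors the setting of the figure (edge length 10, τ = 15): the marker on edge (7,8) reaches (5,7) and (6,7) with gap 0 and (3,5) with gap 10, but not (2,3), where the gap would be 20.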

Pruning

Following the propagation phase, each edge in the graph must have at least one reference marker in its list (if any edge does not, then our assumption on

In the second phase, each edge on receiving a message inspects the markers in its list, categorizing each as either "connected" or "orphaned." A connected marker _{in }_{out}

**Algorithm 2** Pruning_ReceiveMessage(edge)

1: r

2: **for **_{e }**do**

3: **false**

4: **for ****∈ ****do**

5: **for **_{o }**do**

6: **if **_{e }_{o }**then**

7: **true**

8: **end if**

9: **end for**

10: **end for**

11: **if **¬ **then**

12: edge._{e}

13: _{e}

14: **end if**

15: **end for**

16: **if ****then**

17:

18: **end if**
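The pruning idea can be illustrated with a simplified Python sketch. It is not a reconstruction of Algorithm 2: it uses repeated full sweeps over the graph until nothing changes, rather than message-driven scheduling. It also assumes that a marker is a reference interval with `start` and `end` fields, that one marker connects to the next when the next begins between 0 and τ reference positions after the first ends, and that edges with no predecessors (or successors) are exempt from the check on that side. All names are hypothetical.

```python
from collections import namedtuple

Marker = namedtuple("Marker", ["start", "end"])  # reference interval, assumed layout

def connects(cur, nxt, tau):
    """Assumed predicate: nxt begins at most tau reference positions after cur ends."""
    return 0 <= nxt.start - cur.end <= tau

def prune_markers(edges, markers, tau):
    """Repeatedly drop orphaned markers until a full sweep removes nothing.

    edges: list of (tail, head) pairs; markers: edge -> set of Marker (modified in place).
    """
    preds = {e: [p for p in edges if p[1] == e[0]] for e in edges}
    succs = {e: [s for s in edges if s[0] == e[1]] for e in edges}
    changed = True
    while changed:
        changed = False
        for edge in edges:
            for marker in list(markers[edge]):
                has_in = not preds[edge] or any(
                    connects(m, marker, tau) for p in preds[edge] for m in markers[p])
                has_out = not succs[edge] or any(
                    connects(marker, m, tau) for s in succs[edge] for m in markers[s])
                if not (has_in and has_out):
                    markers[edge].discard(marker)  # orphaned: no compatible neighbor
                    changed = True
    return markers

toy_markers = {
    (1, 2): {Marker(0, 10), Marker(500, 510)},
    (2, 3): {Marker(12, 22)},
}
prune_markers([(1, 2), (2, 3)], toy_markers, tau=15)
```

In the toy input, the marker at reference positions 500 to 510 has no compatible successor on edge (2,3) and is removed, while the marker at 0 to 10 survives because the next marker begins 2 positions after it ends.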

Merging and iteration

During the pruning phase, we will observe many cases in which there is only one possible path for a traversal to follow. In these cases, we can merge the adjacent edges and recompute their respective markers. Each such merge operation reduces the complexity of the graph, which in turn allows further reference markers to be eliminated. Refer to Figure

**Showing the process of merging edges when there is only one possible path**. In (a), edge (1,3) could be followed by either (3,4) or (3,5) based on the reference markers. Edge (2,3) however can only be followed by (3,5). In (b), (2,3) and (3,5) are merged, which makes it possible to eliminate a marker from edge (1,3), and to merge it with (3,4). (c) shows the final state.

Implementation notes

The condensed reference/donor graph is an annotated version of the condensed de Bruijn graph, and can be easily constructed as such. In recent years, a number of methods have been proposed to construct the condensed de Bruijn (contig) graph. The simplest method was given in

Our simulations were performed using a single-threaded implementation running on a 3.2 GHz processor with 16 GB of memory, and demonstrated a worst-case running time of approximately 1 hour. The time complexity of the algorithm is ^{3}) in the worst case, where

Discussion

The goal of any genome sequencing project is to characterize the full genomic content of an individual organism. With the steeply declining cost of genome sequencing in recent years, there has been significant focus on new and improved methods for both

One such challenge is that as input to our method we require an estimate of the maximum mutation event size. In practice, this value is not known, and this is currently a significant drawback of our method. It is possible, however, that methods could be developed to estimate this parameter. For example, iterative application of our method with successively larger or smaller values could help discover the true maximum size. Alternatively, the parameter could be estimated directly from the alignment data. Notably, we have also not discussed the application of our method in the presence of read errors. The effect of these data imperfections can be mitigated to an extent by the application of preprocessing methods to correct the errors prior to assembly. Recent studies have shown that read errors can be significantly reduced even in the presence of non-uniform coverage

Despite these remaining challenges, we believe our method presents a novel approach to the challenge of genome assembly that takes advantage of the increasing availability of reference sequences. It is our hope that this work can help motivate future research into unified reassembly methods.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

N.P. developed and implemented the algorithm, generated and analyzed results, and authored this manuscript. B.S. and E.E. provided the initial problem framework and algorithmic approach, insights and suggestions on the development of the algorithm, and critiques of the manuscript.

Declarations

The publication costs for this article were funded by the corresponding author's institution.

This article has been published as part of

Acknowledgements

N.P. and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676 and 1065276, and National Institutes of Health grants K25-HL080079, U01-DA024417, P01-HL30568 and PO1-HL28481. B.S. was supported by NSF grant DMS-1101185, by AFOSR MURI grant FA9550-10-1-0569 and by a USA-Israel BSF grant.