Abstract
Background
Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference.
Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment.
Results
We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures.
Conclusion
We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools.
Background
Sequence comparison through multiple alignment is an indispensable tool for understanding genomes and their shared histories [1]. Even though the foundation for genomic sequence alignment was already laid in the 1980s [2], the interest is still ongoing [1,3,4], one reason being that it has critical relevance [5] for many bioinformatics analyses. The aim of sequence alignment is to uncover homologies by assigning sequence positions to each other, which implies that these positions derived from a common ancestor.
Evolutionary events that change genomic sequences are often classified into small changes and large structural changes [6]. Small changes affect only one or few sequence positions and include substitutions, insertions, and deletions. They do not influence the order of sequence positions, and thus can be captured by colinear alignment. Structural changes involve longer genomic segments, thereby affecting the structure and order of genomic sequences. They include noncolinear changes like inversions, translocations and duplications in addition to insertions and deletions of longer segments.
While colinear multiple sequence alignment has been studied extensively for a long time [7-16], the problem of noncolinear alignment has come into focus only within the last decade [17-22], after more and more whole genomes started to become available. Noncolinear alignments, as opposed to colinear alignments, model all kinds of evolutionary changes and thereby enable correct homology prediction for whole genomes with noncolinear changes. This is comparable to the way global alignments integrate more information than local alignments by assigning all parts of sequences to each other, and the way multiple alignments take information from more than two sequences into account for homology prediction. Taken together, noncolinear multiple global alignments of whole genomes (genome alignments for short) integrate as much sequence similarity information as is available.
Together with the prediction of homology, genome alignments provide a segmentation of the genomes originating from large structural changes. Depending on the similarity of genomes, segments can be shorter or span several genes and reveal local colinearity. Rearrangement studies [23] explore the order of such segments and infer genomic distances based on the number of breakpoints [24,25] or predict scenarios of evolutionary changes [26-28]. These studies often employ graphs, e. g., breakpoint graphs [29-31], that resemble graph data structures used for genome alignment. Despite this similarity in the approach, genome alignments pursue a slightly different goal than rearrangement studies. The goal is homology prediction instead of reconstruction of evolutionary histories. Genome alignments, which are the focus of this article, integrate more information than rearrangement studies by combining segmentation and sequence similarity.
Considering the large search space, genome alignment is an ambitious task and is usually accomplished using heuristic approaches. The first step in genome alignment is commonly the computation of a set of local alignments. It is essential for most methods that the set of local alignments covers all main genomic similarities, whereas additional spurious similarities have a smaller impact. In colinear alignment, such a set usually constitutes a superposition of several alignment possibilities with some local alignments in conflict regarding the colinearity constraint (see Figure 1). The task is then to select the best conflict-free subset according to a given optimization function. In genome alignment, as opposed to colinear alignment, any set of local alignments can be viewed as a valid solution, one that induces a segmentation. However, the induced segmentation can be improved by selecting a subset of local alignments. The subset should contain those local alignments that are most likely to represent homologies when viewed in the context of the whole set of local alignments. The final step is then to find the best segmentation according to the set of local alignments, possibly followed by a realignment of segments with a colinear alignment method.
Figure 1. Alternative alignments of the sequences CATCGA and CCGATA. The alignment on the left is colinear if the dinucleotides AT (red) are interpreted as insertion or deletion. Alternatively, the AT dinucleotides can be aligned and the CG dinucleotides interpreted as insertion or deletion. Noncolinear aligners that allow for translocations may align the AT dinucleotides in addition to the CG dinucleotides. The alignment on the right shows a noncolinear alternative that interprets the four nucleotides ATCG as inversion (reverse complement). In this example, we expect noncolinear aligners to prefer the inversion (right) over the translocation (left) since it creates fewer segments.
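The inversion interpretation in Figure 1 can be checked with a few lines of code. The following is a minimal sketch, not part of any aligner discussed here; the helper name reverse_complement is introduced only for illustration. It confirms that the four nucleotides ATCG of the first sequence occur as their reverse complement in the second sequence.

```python
# Translation table for complementing DNA characters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


first, second = "CATCGA", "CCGATA"

segment = first[1:5]                     # the four nucleotides ATCG
inverted = reverse_complement(segment)   # the same segment read on the other strand
position = second.find(inverted)         # start of the inverted copy in the second sequence

print(segment, inverted, position)
```

Both sequences then share the flanking characters C and A in colinear order, with a single inverted segment in between.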
For the step of selecting subsets of local alignments and for inducing a segmentation, graphs serve as a convenient tool. The idea is that graphs show substructures indicating errors in the alignment, e. g., specific cycles. Once identified in a graph, we can eliminate these substructures, e. g., by removing local alignments, which is a modification of the genome alignment. Thus, graphs can assist in improving genome alignments. In addition, graphs provide an intuitive representation of similarities and changes between genomes, and so visualize alignment structures. In comparison to tabular alignments, genome alignment graphs are more versatile in that they can model colinear and noncolinear changes without the need to choose a reference genome.
Several graphs have been proposed, each in the context of a specific application such as synteny detection, segmentation, or simply colinear alignment. The earliest graph is the alignment graph, formally defined for colinear multiple alignment by Kececioglu in 1993 [32]. In his definition, the graph contains a vertex for each sequence character and edges for aligned characters. The alignment graph has since been used in various versions, e. g., with additional sequence edges [33] and with genes [34] or segments [15] instead of single characters. In all versions, a colinear alignment can be obtained from the alignment graph by solving the maximum weight trace problem [32], but its structure also allows noncolinear changes to be modeled (see below).
Pevzner et al. introduced ABruijn graphs [35] as a generalization of de Bruijn graphs [36,37]. The structure of ABruijn graphs revisits an idea briefly mentioned by Kececioglu [32], the idea of merging aligned vertices. Consequently, ABruijn graphs have one vertex for sets of aligned positions, and edges represent sequence adjacencies. For the purpose of genome alignment with ABruijn graphs, the maximum subgraph with large girth (MSLG) problem [19] and the sequence modification problem (SMP) [38] were proposed, both targeting types of short cycles in ABruijn graphs in order to eliminate local alignments that hide local colinearity.
In the context of a pipeline for genome alignment that consists of the programs Enredo and Pecan [39], another graph has been published, the Enredo graph. The program Enredo applies Enredo graphs to partition genomes into segments. Subsequently, the program Pecan provides nucleotide-level colinear alignments of segments. Enredo graphs have two vertices per set of aligned segments, a head and a tail vertex, resembling breakpoint graphs from rearrangement studies. The Enredo method iteratively eliminates various substructures from the Enredo graph before deriving a final genome segmentation.
A recent and slightly dissimilar graph is the cactus graph [22,40]. Cactus graphs have vertices for adjacencies and edges for genome segments. Their structure has two valuable properties. The cactus property subdivides the graph (and genomes) into independent units by ensuring that any edge is part of at most one simple cycle [41]. These units assist in computing genome alignments with the cactus alignment filter (CAF) algorithm [22]. The second property is the existence of an Eulerian circuit. This circuit traverses all genome segments exactly once, even duplicated segments, conveniently providing a consensus genome.
In this paper, we compare the mentioned graph-based genome alignment approaches with an emphasis on the structures of the underlying graphs. Our aim is to clarify similarities of the approaches and the underlying graphs but also to work out differences and highlight limitations. We realize our comparison using the same terminology for all graphs and by describing transformations among the graphs (see Figure 2). We assess the graphs in terms of their capabilities to display alignment information in their structure alone. For all graphs, substructures and modifications constitute key aspects of corresponding genome alignment approaches. We carefully examine substructures as well as modifications independently from the particular graphs they were first described for. Based on our comparison, we derive a generic framework for graph-based genome alignment. The framework gives an overview of the general graph-based approach to genome alignment and, hence, may assist in the development of future genome alignment tools.
Figure 2. Overview of transformations among four graph representations for genome alignments. Some transformations require information from labels (red boxes), which is not present in the graph structures (see text for details). The Cactus method keeps an Enredo-like graph in addition to the cactus graph.
Results
Terminology
The biological term homologous denotes two or more genomic positions that derived from a single position in an ancestral genome, or two or more segments that derived from a single segment in an ancestral genome. An alignment of genomes is an assignment of positions from the aligned genomes. Usually, the goal is to align only homologous positions to each other, but since the ancestral genome is unknown, an alignment can only be a prediction of homology.
In the following, we formally define a genomic position and give a very general definition of an alignment. Next, we define a genomic segment and constrain the alignment definition to colinearity. Since colinearity is often too strict for predicting homology in whole genomes, genome aligners use so-called blocks, which are colinear alignments of genomic segments. Blocks can be arbitrarily combined to noncolinear genome alignments. We give a general definition of blocks as the basic entities that underlie graph-based genome aligners. Finally, we define the terms adjacency and breakpoint.
Let be a set of genomes. Each genome
An ordered pair of two positions p=(g,i) and q=(g,j) from the same genome g defines a segment s=(p,q) of length |i−j|, where min{p,q} is the smallest position and max{p,q} the position directly following the largest position in the segment. If p<q, the segment is in the forward orientation, and if p>q, the segment is in the reverse complemented orientation (see Figure 3). As an alternative to an ordered pair (p,q), a segment could equivalently be represented by a start position, a length, and an additional orientation bit. Two segments s_{1}=(p_{1},q_{1}) and s_{2}=(p_{2},q_{2}), where without loss of generality min{p_{1},q_{1}}≤ min{p_{2},q_{2}}, are nonoverlapping if max{p_{1},q_{1}}≤ min{p_{2},q_{2}}. If max{p_{1},q_{1}}= min{p_{2},q_{2}}, s_{1} and s_{2} are adjacent and define the adjacency at position a= max{p_{1},q_{1}} (see Figure 3). Two segments fully overlap if both min{p_{1},q_{1}}= min{p_{2},q_{2}} and max{p_{1},q_{1}}= max{p_{2},q_{2}}.
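The segment definitions translate directly into a small data structure. The sketch below is an illustration under assumptions: the class and function names (Position, Segment, nonoverlapping, adjacent, fully_overlap) and all coordinates are invented for this example and do not come from any of the discussed tools.

```python
from typing import NamedTuple


class Position(NamedTuple):
    genome: str   # identifier of the genome g
    offset: int   # index i within the genome


class Segment(NamedTuple):
    p: Position
    q: Position

    @property
    def low(self):
        """min{p,q}: the smallest position of the segment."""
        return min(self.p.offset, self.q.offset)

    @property
    def high(self):
        """max{p,q}: the position directly following the largest one."""
        return max(self.p.offset, self.q.offset)

    @property
    def length(self):
        return self.high - self.low

    @property
    def forward(self):
        """True for the forward orientation (p < q)."""
        return self.p.offset < self.q.offset


def _ordered(s1, s2):
    """Sort two segments of one genome by their smallest position."""
    if s1.p.genome != s2.p.genome:
        raise ValueError("segments must come from the same genome")
    return sorted((s1, s2), key=lambda s: s.low)


def nonoverlapping(s1, s2):
    first, second = _ordered(s1, s2)
    return first.high <= second.low


def adjacent(s1, s2):
    first, second = _ordered(s1, s2)
    return first.high == second.low


def fully_overlap(s1, s2):
    return s1.p.genome == s2.p.genome and (s1.low, s1.high) == (s2.low, s2.high)


def seg(genome, i, j):
    return Segment(Position(genome, i), Position(genome, j))


# Hypothetical coordinates, chosen only so that two segments meet at offset 140.
s1 = seg("g", 120, 124)   # forward, length 4
s2 = seg("g", 140, 135)   # reverse complemented, length 5
s3 = seg("g", 140, 145)   # forward, length 5
```

Storing the ordered pair (p,q) keeps the orientation implicit in the order of the two positions, so no separate orientation bit is needed.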
Figure 3. Three segments of a genome g. Segments s_{1}=(p_{1},q_{1}) and s_{3}=(p_{3},q_{3}) are in the forward orientation and their sequences read TTGC and TCACG, respectively. Segment s_{2}=(p_{2},q_{2}) is in the reverse complemented orientation and reads CCTGC. s_{1} and s_{3} are nonoverlapping but not adjacent, s_{2} and s_{3} are nonoverlapping and adjacent at position a=(g,140).
An alignment of a set of segments S is colinear if each alignment component contains at most one position from each segment s∈S and if it is possible to impose a strict total ordering ≺ on the alignment components
Noncolinear operations divide an alignment into units that are colinear in themselves but not with respect to each other. We call these units blocks and define a block as a colinear alignment of a set of segments. Note that a block may contain multiple segments of the same genome if duplications are present. We refer to the number of segments in a block as the size of a block (not to be confused with the length of segments). In Figure 1, areas shaded in blue and red indicate blocks. For example in the left alignment, the two dinucleotides CG form a block and the two dinucleotides AT form another block. In the right alignment of Figure 1, the segment ATCG and its reverse complement in the second sequence form a block.
A block always has two equivalent representations. In the first block representation, some segments are in the forward orientation and some may be in the reverse complemented orientation. The second block representation reverses every segment: all segments that are in the forward orientation in the first representation are in the reverse complemented orientation in the second, and vice versa. The essential information about possible inversions is the orientation of segments with respect to each other and not the orientation of the block representation. Once we choose one of the two representations, we implicitly assign a tail and a head to a block b. The tail is the set of positions {p} of all segments s∈b with s=(p,q), and the head is the set of positions {q}. We refer to the two sets as the ends of b in cases where the orientation of a block is not given.
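The equivalence of the two block representations can be illustrated in code. This is a minimal sketch under assumptions: a block is modeled as a list of segments ((g,i),(g,j)), and the function names and coordinates are hypothetical. Flipping a block reverses every segment, which swaps the two ends of the block but leaves the orientation of the segments relative to each other unchanged.

```python
def flip(block):
    """Return the other representation of a block: every segment (p, q) becomes (q, p)."""
    return [(q, p) for (p, q) in block]


def orientation(segment):
    """'+' for the forward orientation (p < q), '-' for reverse complemented."""
    (_, i), (_, j) = segment
    return "+" if i < j else "-"


def relative_orientations(block):
    """For every pair of segments, record whether both have the same orientation."""
    signs = [orientation(s) for s in block]
    return [a == b for k, a in enumerate(signs) for b in signs[k + 1:]]


def ends(block):
    """The two ends of a block: the set of first and the set of second positions."""
    return ({p for p, _ in block}, {q for _, q in block})


# Hypothetical block of size 2: one forward and one reverse complemented segment.
block = [(("g1", 10), ("g1", 14)), (("g2", 34), ("g2", 30))]
flipped = flip(block)

print([orientation(s) for s in block], [orientation(s) for s in flipped])
```

Only the output of relative_orientations is independent of the chosen representation, which is why it carries the information about inversions.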
A set of blocks constitutes a genome alignment and is input for building a genome
alignment graph. To simplify the exposition of the graphs below, we define
Two blocks
An adjacency of two blocks
Most commonly, genome alignment programs use pairwise local alignment methods to generate blocks. Pairwise local alignments are blocks of size two. These blocks can be combined with each other to form blocks of a larger size (multiple local alignments) if a segment from one block fully overlaps with a segment from another block. We briefly address this preprocessing of blocks in the Discussion and conclusions section, and assume that a set
In the literature, blocks are often referred to as synteny blocks or locally colinear blocks. The definitions of blocks differ, usually depending on the specific type of local alignment method being used for generating blocks. For example, blocks can be defined as gapped or ungapped colinear alignments with or without mismatches, or simply as single alignment components. The graph representations are independent of the precise assignment of positions to alignment components within blocks. Only the set of segments including their relative orientation within the block is relevant. For this reason, the different block definitions can be used interchangeably except for preprocessing the set of blocks to obtain
Within the graphs described in the following sections, blocks and adjacencies are represented by vertices or edges or a combination of both. For each graph, every genome is a (not necessarily simple) path through the graph. We use the term to thread for following the path of a genome through the graph [17,35].
Graphs for genome alignment
We limit our comparison to alignment graphs, ABruijn graphs, Enredo graphs, and cactus graphs. The original publications of these graphs use varying terminology. We describe all four graphs using the same terminology, namely the above defined terms segment, block and adjacency. Figure 4 displays an example alignment with eleven blocks as alignment graph, ABruijn graph, Enredo graph, and cactus graph.
Figure 4. An alignment of three genomes with eleven blocks in all four graph representations. The example covers multiple noncolinear events: Blocks A, E, I, K are conserved in all three genomes without large structural changes. Blocks B, C, and D, as well as G, and H appear in different orders and orientations. Block F is missing in the red genome and J occurs twice. Colors denote the three genomes. In the alignment graph, dashed edges indicate the alignment of a segment with its reverse complement. We consider the information provided by line styles not to be part of the graph structures. In the Enredo graph, components connected by adjacency edges are shaded in gray. For the cactus graph, the figure additionally shows a precursor. Furthermore, enlarged vertices for the precursor and the final cactus graph show adjacencies in vertices.
The input for building a graph is a set of nonoverlapping blocks
Along with the definitions of G and M_{G} for each of the four graphs, we describe how it is possible to transform the different
graph structures into each other (e. g., an alignment graph structure into an ABruijn
graph structure). A transformation is an operation that has as input one graph structure G and outputs another graph structure G^{′}, where both G and G^{′} represent the same genome alignment. If it is possible to obtain a graph structure
G^{′} from another graph structure G without the help of
We examine the transformations that are depicted as arrows in Figure 2. Straight arrows indicate a possible transformation; the other arrows indicate that a transformation among the structures is impossible, which we prove below by providing examples for ambiguity. Nevertheless, we describe all transformations depicted as arrows in Figure 2, using additional information from graph labels if necessary to resolve ambiguity.
We define the sparse labeling functions ℓ^{dup}, ℓ^{inv}, or ℓ^{adj} for this purpose. The sparse labeling functions provide sufficient information for
the transformation but less information than ℓ in the graph models. Note that a transformation among graph models is trivial given
that
Alignment graphs
In the following section, let G=(V,E) be an alignment graph structure and M_{G}=(G,ℓ) be an alignment graph model. We define ℓ as a labeling function of the vertices V of G. The set of edges E=E_{A}∪E_{B} decomposes into a set of directed adjacency edges E_{A} and a set of undirected block edges E_{B}. With both directed and undirected edges, G is a mixed graph.
The vertices V of G represent segments of the genomes. There is a vertex in V for every segment in the set of all segments (
Directed adjacency edges E_{A} (colored edges in Figure 4) represent adjacencies of segments. Given any pair of vertices v_{1},v_{2}∈V and their labels ℓ(v_{1})=(p_{1},q_{1}) and ℓ(v_{2})=(p_{2},q_{2}), there is a directed edge e∈E_{A} from v_{1} to v_{2} if max{p_{1},q_{1}}= min{p_{2},q_{2}}, i. e., the segment ℓ(v_{2}) is adjacent to the segment ℓ(v_{1}) in . Adjacency edges thread the genomes through the alignment graph.
Finally, undirected block edges E_{B} (gray edges in Figure 4) connect vertices labeled with segments from the same block
The formation of connected components is important for recovering
Our definition of the alignment graph structure G models noncolinear changes among the input genomes, in particular translocations and duplications. Translocations appear in G as mixed cycles. A mixed cycle is a cycle in a mixed graph formed by both directed and undirected edges. Duplications appear as block edges within the set of vertices of one genome. Because of these edges our alignment graph is not n-partite as in its original definition [32].
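The construction of an alignment graph from a set of nonoverlapping blocks can be sketched as follows. This is an illustration under assumptions, not the definition used by any specific tool: segments are tuples ((g,i),(g,j)), block edges are added between all pairs of segments of a block (any edge set that connects the segments of a block would serve the same purpose), and all names and coordinates are hypothetical.

```python
from itertools import combinations


def genome(seg):
    return seg[0][0]


def low(seg):
    return min(seg[0][1], seg[1][1])


def high(seg):
    return max(seg[0][1], seg[1][1])


def build_alignment_graph(blocks):
    """Return (vertices, adjacency_edges, block_edges) for a list of blocks."""
    # One vertex per segment; the segment itself serves as the vertex label.
    vertices = [seg for block in blocks for seg in block]
    # Directed adjacency edge u -> v if v directly follows u in the same genome.
    adjacency_edges = [
        (u, v) for u in vertices for v in vertices
        if u != v and genome(u) == genome(v) and high(u) == low(v)
    ]
    # Undirected block edges between segments of the same block (assumed: all pairs).
    block_edges = [pair for block in blocks for pair in combinations(block, 2)]
    return vertices, adjacency_edges, block_edges


def duplication_edges(block_edges):
    """Block edges whose two endpoints lie in the same genome indicate duplications."""
    return [(u, v) for u, v in block_edges if genome(u) == genome(v)]


# Hypothetical input: blocks A and B occur in swapped order in g1 and g2,
# and block J consists of a tandem duplication in g1.
a1, b1 = (("g1", 0), ("g1", 4)), (("g1", 4), ("g1", 8))
b2, a2 = (("g2", 0), ("g2", 4)), (("g2", 4), ("g2", 8))
j1, j2 = (("g1", 8), ("g1", 12)), (("g1", 12), ("g1", 16))

vertices, adjacency_edges, block_edges = build_alignment_graph([[a1, a2], [b1, b2], [j1, j2]])
```

In this example the edges a1 → b1, b1 – b2, b2 → a2, a2 – a1 form a mixed cycle, reflecting the changed order of blocks A and B, and the block edge between j1 and j2 connects two vertices of the same genome.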
Inversions are not visible in the alignment graph structure G; the orientation of segments remains unclear (see also Figure 5). We define the sparse labeling function ℓ^{inv}:V→{+,−} as
ℓ^{inv}(v)=+ if p<q, and ℓ^{inv}(v)=− if p>q,
where ℓ(v)=(p,q). The function ℓ^{inv} assigns bits to the vertices that indicate the orientation of the represented segments. As an alternative to vertex labels, it is possible to label block edges with bits that indicate equal or opposite orientation of the segments in the endpoints (visualized as dashed and solid lines in Figure 4 or red and black edges in [42]).
Figure 5. The structure of alignment graphs and ABruijn graphs does not display inversions. Let
ABruijn graphs
Let now G=(V,E) be an ABruijn graph structure and M_{G}=(G,ℓ) be an ABruijn graph model. ABruijn graphs have only one type of edge E. We define ℓ as a labeling function of the vertices V. In contrast, the functions ℓ^{inv} and ℓ^{dup} described below provide labels for the edges E.
The vertices V of G represent blocks. For every block in
The edges E of G represent adjacencies just like adjacency edges in alignment graphs. Given any pair of vertices v_{1},v_{2}∈V and their labels b_{1}=ℓ(v_{1}) and b_{2}=ℓ(v_{2}), there is a directed edge e∈E from v_{1} to v_{2} for every two adjacent segments s_{1}=(p_{1},q_{1}) and s_{2}=(p_{2},q_{2}) with max{p_{1},q_{1}}= min{p_{2},q_{2}} where s_{1}∈b_{1} and s_{2}∈b_{2}. If multiple adjacent pairs of segments exist in b_{1} and b_{2}, E contains multiple edges from v_{1} to v_{2}. Thus, G is a multigraph. In the present paper, we prefer the multigraph representation with multiple separate edges between two vertices over the multigraph representation with multiplicity labels on edges.
Adjacency edges are essential for threading genomes through G. However, the path from threading one genome is not necessarily simple. It traverses vertices multiple times if duplications are present (see block J in Figure 4) making the path ambiguous. Thus, threading requires label information that allows incoming and outgoing edges of a vertex to be paired. Such information is not required in the alignment graph structure, where each vertex has at most one incoming and one outgoing edge. Without duplications it is sufficient to color edges of G by genome (red, blue, and green in Figure 4) instead of providing the full labels ℓ. In the presence of duplications, G can be ambiguous even with color labels (see Figure 6 and block J in Figure 4).
Figure 6. Duplications may create ambiguity in the structure of ABruijn graphs and Enredo graphs. In this example, the structure of the ABruijn graph and the Enredo graph represents both the genomes ABDABCEBC and ABCEBDABC. Thus, the order of blocks is ambiguous.
To resolve ambiguity of G for threading, we define the sparse labeling function
where a_{1} is the adjacency position of e_{1}, and a_{2} is the adjacency position of e_{2}.
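The ambiguity of Figure 6 and its resolution by ordering labels can be reproduced in a few lines. The sketch below makes simplifying assumptions: each block is a single character, a genome is a string of blocks, and the labels are consecutive integers assigned along the genome, which is one labeling that orders edges by adjacency position. The function names are hypothetical.

```python
from collections import Counter


def adjacency_multiset(order):
    """Multiset of directed edges (block, next block) of a genome given as a string."""
    return Counter(zip(order, order[1:]))


def dup_labels(order):
    """Assumed labeling: number the adjacency edges 1, 2, ... along the genome."""
    return {k: edge for k, edge in enumerate(zip(order, order[1:]), start=1)}


def thread(labels):
    """Recover the order of blocks by following edges in increasing label order."""
    edges = [labels[k] for k in sorted(labels)]
    return edges[0][0] + "".join(target for _, target in edges)


genome_a, genome_b = "ABDABCEBC", "ABCEBDABC"

# Both genomes induce exactly the same edges, so the structure alone is ambiguous.
same_structure = adjacency_multiset(genome_a) == adjacency_multiset(genome_b)

# With labels, each genome is recovered unambiguously.
recovered_a = thread(dup_labels(genome_a))
recovered_b = thread(dup_labels(genome_b))
```

The labels carry far less information than the full vertex labels ℓ, since they only fix the order in which the edges of one genome are traversed.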
Furthermore, inversions create ambiguity in G (see Figure 5). Just like the alignment graph structure, G provides no information about the orientation of segments represented in a vertex. We define the sparse labeling function ℓ^{inv}:E→{+,−}×{+,−} for ABruijn graph edges E. For each pair of adjacent segments s_{1}=(p_{1},q_{1}) and s_{2}=(p_{2},q_{2}) from the labels of two vertices s_{1}∈ℓ(v_{1}) and s_{2}∈ℓ(v_{2}), we label an edge e=(v_{1},v_{2}) with ℓ^{inv}(e)=(o_{1},o_{2}), where o_{k}=+ if p_{k}<q_{k} and o_{k}=− if p_{k}>q_{k} for k∈{1,2}.
The first bit in the label ℓ^{inv}(e) indicates the orientation of the segment in the source vertex of e, and the second bit the orientation of the segment in the target vertex. It is not sufficient to solely label vertices of G with one orientation bit per segment of the represented block. Figure 7 provides an example where this leads to ambiguity.
Figure 7. The labeling of ABruijn graph vertices with one orientation bit per segment does not resolve ambiguity. In this example, both blocks occur three times, twice in the forward orientation and once in the reverse complemented orientation. Combining the orientations of the segments in the two blocks is ambiguous as the two alternative Enredo graph structures prove. In the left Enredo graph structure, the segment in the reverse complemented orientation of one block is combined with a segment in the forward orientation of the other block. In the right Enredo graph structure, the two segments in the reverse complemented orientation occur consecutively.
Below, we describe transformations between ABruijn graphs and alignment graphs. As stated above, the transformation of the graph models is trivial, but the example in Figure 6 proves that in some cases it is impossible to transform an ABruijn graph structure into an alignment graph structure. We describe the transformation with the help of the sparse labeling function ℓ^{dup} to resolve ambiguity.
ABruijn graphs from alignment graphs.
To transform an alignment graph structure
As a first step, compute all block edge connected components of G^{′}. As described above, each component represents exactly one block, and each vertex
v^{′}∈V^{′} is part of exactly one component. Now, add for every component
The remaining task is to transfer adjacency edges from the alignment graph to the
ABruijn graph. Using the mapping, add an edge e=(u,v) to the set of ABruijn graph edges E for each edge e^{′}=(u^{′},v^{′}) from the set of alignment graph adjacency edges
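The two steps of this transformation, computing block edge connected components and transferring adjacency edges through the mapping, can be sketched as follows. The union-find helper and the integer vertex identifiers of the ABruijn graph are implementation choices assumed for this illustration; the text does not prescribe them.

```python
def alignment_to_abruijn(vertices, block_edges, adjacency_edges):
    """Return (abruijn_vertices, abruijn_edges, mapping) for an alignment graph structure."""
    # Union-find over alignment graph vertices, merged along block edges.
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in block_edges:
        parent[find(u)] = find(v)

    # One ABruijn vertex per connected component; map every alignment
    # graph vertex to the ABruijn vertex of its component.
    component_id = {}
    mapping = {}
    for v in vertices:
        root = find(v)
        mapping[v] = component_id.setdefault(root, len(component_id))

    # Transfer every adjacency edge (u', v') to an ABruijn edge (u, v).
    abruijn_edges = [(mapping[u], mapping[v]) for u, v in adjacency_edges]
    return list(range(len(component_id))), abruijn_edges, mapping


# Hypothetical alignment graph: blocks {a1, a2} and {b1, b2} in swapped order.
abruijn_vertices, abruijn_edges, mapping = alignment_to_abruijn(
    vertices=["a1", "b1", "a2", "b2"],
    block_edges=[("a1", "a2"), ("b1", "b2")],
    adjacency_edges=[("a1", "b1"), ("b2", "a2")],
)
```

Parallel adjacency edges are kept as separate list entries, in line with the multigraph representation preferred in this paper.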
ABruijn graphs to alignment graphs.
We describe the transformation of an ABruijn graph structure G=(V,E) into an alignment graph structure
We transform the ABruijn graph by following each genome separately and assume that the edges are given in increasing order of labels: ℓ^{dup}(e_{1})<ℓ^{dup}(e_{2})<⋯<ℓ^{dup}(e_{|E|}). Initially, add for each genome a new vertex u^{′} to the set of alignment graph vertices V^{′}. If the source vertex u of the ABruijn graph edge e_{1}=(u,v) is labeled with a nonempty set of vertices
Iterate over the ABruijn graph edges in increasing order of labels and repeatedly add for the target vertex a new vertex, add block edges, add the new vertex to the mapping, and add an adjacency edge from the previous to the new vertex. This way, the genomes are threaded through the ABruijn graph and the alignment graph structure G^{′} is successively built up.
Enredo graphs
In this section, let G=(V,E) be an Enredo graph structure and M_{G}=(G,ℓ) be an Enredo graph model. In an Enredo graph, the set of edges E=E_{A}∪E_{B} decomposes again into a set of directed adjacency edges E_{A} and a set of undirected block edges E_{B}. We define ℓ as a labeling function of the block edges E_{B}.
The block edges E_{B} of G represent blocks, and vertices V of G represent the ends of blocks. In contrast to alignment graphs, a single block edge
represents an entire block. For every block
Directed adjacency edges E_{A} of G (colored edges in Figure 4) represent adjacencies. Given any pair of block edges e_{1}={u_{t},u_{h}} and e_{2}={v_{t},v_{h}} and their labels ℓ(e_{1})=b_{1} and ℓ(e_{2})=b_{2}, there is a directed edge e∈E_{A} from an endpoint of e_{1} to an endpoint of e_{2} for every two adjacent segments s_{1}=(p_{1},q_{1}) and s_{2}=(p_{2},q_{2}) with s_{1}∈b_{1} and s_{2}∈b_{2}. In contrast to alignment graphs and ABruijn graphs, the endpoints of adjacency edges in G indicate in relation to other adjacency edges the orientation of segments in a block. Given labels, one endpoint of each block edge is a head vertex and the other a tail vertex. If p_{1}<q_{1}, then the adjacency edge e starts at the head vertex u_{h}, and if p_{1}>q_{1}, e starts at the tail vertex u_{t}. If p_{2}<q_{2}, then e points to the tail vertex v_{t}, and if p_{2}>q_{2}, e points to the head vertex v_{h}. In other words, e=(u_{h},v_{t}) if q_{1}=p_{2}, e=(u_{h},v_{h}) if q_{1}=q_{2}, e=(u_{t},v_{t}) if p_{1}=p_{2}, and e=(u_{t},v_{h}) if p_{1}=q_{2}. Again, there may be several adjacency edges connecting the same two vertices. Thus, the Enredo graph is also a multigraph.
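The rule for choosing the endpoints of an adjacency edge can be written as a short function. This is a sketch under assumptions: segments are tuples ((g,i),(g,j)), block names stand in for block edges, and the function name is hypothetical. The function expects that the second segment directly follows the first one in the genome.

```python
def enredo_adjacency_edge(block1, s1, block2, s2):
    """Return the directed adjacency edge between the ends of two blocks.

    s1 = (p1, q1) belongs to block1, s2 = (p2, q2) to block2, and
    max{p1, q1} = min{p2, q2} must hold (s2 directly follows s1).
    """
    (g1, p1), (_, q1) = s1
    (g2, p2), (_, q2) = s2
    if g1 != g2 or max(p1, q1) != min(p2, q2):
        raise ValueError("s2 must directly follow s1 in the same genome")
    source_end = "head" if p1 < q1 else "tail"   # forward s1 is left at its head
    target_end = "tail" if p2 < q2 else "head"   # forward s2 is entered at its tail
    return (block1, source_end), (block2, target_end)


forward_1, reverse_1 = (("g", 0), ("g", 4)), (("g", 4), ("g", 0))
forward_2, reverse_2 = (("g", 4), ("g", 8)), (("g", 8), ("g", 4))

for s1 in (forward_1, reverse_1):
    for s2 in (forward_2, reverse_2):
        print(enredo_adjacency_edge("U", s1, "V", s2))
```

The four combinations correspond to the four cases q_{1}=p_{2}, q_{1}=q_{2}, p_{1}=p_{2}, and p_{1}=q_{2} listed above.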
Due to its two-vertex concept, the structure of an Enredo graph G reflects the relative orientation of blocks as opposed to the alignment graph structure and the ABruijn graph structure (see Figure 5). G is capable of displaying inversions. But just like ABruijn graphs, threading a genome with duplications through G can be ambiguous (see Figure 6). The path from threading a genome through G alternates between block and adjacency edges. Therefore, only multiple occurrences of a block in the same orientation create ambiguity in G.
To resolve ambiguity of G, we define the sparse labeling function
where a_{1} is the adjacency position of e_{1}, and a_{2} is the adjacency position of e_{2}. As an example, we use again Figure 6 with labels 1 through 8: One of the edges from the head of A to the tail of B would be labeled with 1; for genome ABDABCEBC, the edge from the head of B to the tail of D would be labeled with 2, and for genome ABCEBDABC, the edge from the head of B to the tail of C would be labeled with 2; and so on.
We generalize the Enredo graph compared to its original definition [39] in some aspects. Enredo graphs originally consider blocks of size 1 as adjacencies: Instead of a block edge with two end vertices that are connected to the rest of the graph by two adjacency edges, the Enredo method only adds a single adjacency edge labeled with a segment. This requires another function ℓ_{A}:E_{A}→S that labels adjacency edges E_{A} with segments S. In addition, in the initial phase of the Enredo method segments on adjacency edges between the same two blocks are assumed to be homologous. Because of this assumption and to distinguish nonhomologous multiedges later on, the Enredo method prefers the multigraph representation with multiplicity labels on one adjacency edge over multiple separate edges. We argue that all segments that are assumed to be homologous should be defined as blocks. Consequently, our description with blocks of size 1 is valid and even simplifies the exposition of the method.
Furthermore, the Enredo method only adds edges for adjacencies that are shorter than a predefined threshold. This already results in a partial segmentation of the genomes, with several segments per genome in the graph. Parts of the genomes may not be represented. We add all adjacencies to the graph and leave it to later stages to modify the graph.
In the transformations below, we include the replacement of labeled adjacencies by blocks of size 1. The transformation from an Enredo graph structure to an ABruijn graph structure is possible without additional labels. The other direction, from ABruijn graphs to Enredo graphs, requires additional information about inversions as shown by the example in Figure 5.
Enredo graphs from ABruijn graphs
First, we describe the transformation of an ABruijn graph structure G^{′}=(V^{′},E^{′}) into an Enredo graph structure G=(V,E_{B}∪E_{A}) using the labeling function ℓ^{inv}:E^{′}→{+,−}×{+,−}. Then, we describe the transformation of blocks of size 1 to labeled adjacency edges in the Enredo graph given full block information by the function ℓ for transferring labels.
To transform G^{′} into G, add for each ABruijn graph vertex v^{′}∈V^{′} a tail vertex v_{t} and a head vertex v_{h} to the set of Enredo graph vertices V. Additionally, add an undirected edge e_{b} between v_{t} and v_{h} to the set of Enredo graph block edges E_{B}. We obtain a onetoone mapping of ABruijn graph vertices and Enredo graph block edges, which we keep as separate labels m:V^{′}→E_{B} on ABruijn graph vertices such that m[ v^{′}]=e_{b}.
Using the labeling function ℓ^{inv} and the mapping m, we can unambiguously transfer adjacency edges to the Enredo graph. For each edge e^{′}=(u^{′},v^{′}) in the set of ABruijn graph edges E^{′} where m[ u^{′}]=e_{u} and m[ v^{′}]=e_{v}, add an edge e=(u_{x},v_{y}) to the set of Enredo graph adjacency edges E_{A} where u_{x} is an endpoint of e_{u} and v_{y} is an endpoint of e_{v}. The vertex u_{x} is the head vertex of e_{u} if the first component of ℓ^{inv}(e^{′}) is +, and otherwise the tail vertex. The vertex v_{y} is the tail vertex of e_{v} if the second component of ℓ^{inv}(e^{′}) is +, and otherwise the head vertex.
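The transfer of adjacency edges can be sketched in a few lines of Python. This is a minimal illustration and not code from any of the cited tools: the function name abruijn_to_enredo, the encoding of tail and head vertices as pairs (v, "t") and (v, "h"), and the dictionary inv standing in for ℓ^{inv} are assumptions made for the example. The sketch reads the second component of the label as selecting an endpoint of the block edge of v^{′}.

```python
def abruijn_to_enredo(vertices, edges, inv):
    """Build an Enredo graph structure from an A-Bruijn graph structure.

    vertices: list of A-Bruijn vertex ids (one per block)
    edges:    list of directed A-Bruijn edges (u, v)
    inv:      dict, edge index -> (sign_u, sign_v), each "+" or "-"
              (hypothetical stand-in for the labeling function l^inv)
    """
    enredo_vertices = []
    block_edges = []
    m = {}  # mapping m: A-Bruijn vertex -> block edge (tail, head)
    for v in vertices:
        tail, head = (v, "t"), (v, "h")
        enredo_vertices.extend([tail, head])
        block_edges.append((tail, head))
        m[v] = (tail, head)

    adjacency_edges = []
    for i, (u, v) in enumerate(edges):
        sign_u, sign_v = inv[i]
        u_tail, u_head = m[u]
        v_tail, v_head = m[v]
        # Leave u at its head if u is traversed forward, else at its tail.
        u_x = u_head if sign_u == "+" else u_tail
        # Enter v at its tail if v is traversed forward, else at its head.
        v_y = v_tail if sign_v == "+" else v_head
        adjacency_edges.append((u_x, v_y))
    return enredo_vertices, block_edges, adjacency_edges


# Two parallel A-Bruijn edges A -> B; in the second one, B is inverted.
V, E_B, E_A = abruijn_to_enredo(
    ["A", "B"],
    [("A", "B"), ("A", "B")],
    {0: ("+", "+"), 1: ("+", "-")},
)
```

In this toy input the two parallel ABruijn edges end at different endpoints of the block edge of B, which is exactly the distinction that the labels contribute.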
In another step, we can transform all block edges e_{b}={v_{t},v_{h}} representing blocks of size 1 into adjacency edges. Since the size of ℓ(e_{b}) is 1, the corresponding vertices v_{t} and v_{h} are incident to exactly one adjacency edge each, e_{1}=(u_{x},v_{t}) and e_{2}=(v_{h},w_{y}). Replace such sets of two vertices v_{t}, v_{h} and three edges e_{1}, e_{2}, e_{b} by a new adjacency edge e=(u_{x},w_{y}). Finally, transfer the label of the block edge ℓ(e_{b})={s} to the adjacency edge such that ℓ_{A}(e)=s.
Enredo graphs to ABruijn graphs
We start by describing how to recover block edges for blocks of size 1 from adjacency edges that are labeled with segments by ℓ_{A} in an Enredo graph structure G=(V,E_{B}∪E_{A}). Afterwards, we describe the transformation from G to an ABruijn graph structure G^{′}=(V^{′},E^{′}), which is possible without additional labels.
Replace each edge e=(u_{x},v_{y}) from the set of Enredo graph adjacency edges E_{A}, where ℓ_{A}(e)=s, by two vertices v_{t} and v_{h} and a block edge e_{b}=(v_{t},v_{h}), and set ℓ(e_{b})={s}. Further, add e_{1}=(u_{x},v_{t}) and e_{2}=(v_{h},v_{y}) to the set of Enredo graph adjacency edges E_{A}.
For the transformation to an ABruijn graph, add for each edge e_{b}=(v_{t},v_{h}) in the set of Enredo graph block edges E_{B}, a vertex v^{′} to the set of ABruijn graph vertices V^{′}. Again, we obtain a onetoone mapping of Enredo graph block edges and ABruijn graph vertices, which we keep this time as labels m:E_{B}→V^{′} on Enredo graph block edges such that m[ e_{b}]=v^{′}. Finally, add for each edge e=(u_{x},v_{y}) in the set of Enredo graph adjacency edges E_{A} where u_{x} is incident to the block edge e_{u} and v_{y} is incident to the block edge e_{v}, an edge e^{′}=(u^{′},v^{′}) to the set of ABruijn graph edges E^{′}, where m[ e_{u}]=u^{′} and m[ e_{v}]=v^{′}. In this last step, we lose inversion information in the graph’s structure.
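A sketch of this direction in Python makes the loss of inversion information concrete. The function name enredo_to_abruijn and the representation of block edges as pairs of endpoint identifiers are assumptions for illustration only; labeled adjacency edges are assumed to have been expanded into blocks of size 1 beforehand.

```python
def enredo_to_abruijn(block_edges, adjacency_edges):
    """Collapse every block edge to one A-Bruijn vertex.

    block_edges:     list of (tail, head) endpoint pairs
    adjacency_edges: list of (u_x, v_y) pairs of block-edge endpoints
    """
    m = {}                  # mapping m: block edge -> A-Bruijn vertex
    endpoint_to_vertex = {}  # helper: endpoint -> A-Bruijn vertex
    for i, (tail, head) in enumerate(block_edges):
        m[(tail, head)] = i
        endpoint_to_vertex[tail] = i
        endpoint_to_vertex[head] = i

    abruijn_vertices = list(range(len(block_edges)))
    # Which endpoint an adjacency edge touches is dropped here.
    abruijn_edges = [
        (endpoint_to_vertex[u_x], endpoint_to_vertex[v_y])
        for u_x, v_y in adjacency_edges
    ]
    return abruijn_vertices, abruijn_edges


block_edges = [(("A", "t"), ("A", "h")), (("B", "t"), ("B", "h"))]
adjacency_edges = [
    (("A", "h"), ("B", "t")),  # B entered at its tail
    (("A", "h"), ("B", "h")),  # B entered at its head (inverted)
]
vertices, edges = enredo_to_abruijn(block_edges, adjacency_edges)
```

Both adjacency edges map to the same ABruijn edge between the vertices of A and B, so the two orientations of B can no longer be told apart from the structure alone.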
Cactus graphs
In this section, let G=(V,E) be a cactus graph structure and M_{G}=(G,ℓ) be a cactus graph model. Cactus graphs have only one type of edge. We define ℓ as a labeling function of the edges E. The cactus graph structure G stands out from the other graph structures by fulfilling welldefined structural properties: Every edge e∈E is part of at most one simple cycle, which makes G a cactus graph [41], and G has an Eulerian circuit [43]. A number of construction steps guarantee these properties.
Let A be the set of all adjacencies of segments. The vertices V of G partition A into a set of pairwise disjoint subsets Ω: Each element ν∈Ω is a subset of A, μ∩ν=∅ for any two distinct sets μ,ν∈Ω, and ⋃_{ν∈Ω}ν=A.
The edges E of G represent blocks just like block edges in Enredo graphs. For each block
The cactus graph has no directed edges as found in other graphs. Since vertices of G represent segment adjacencies in sets, the size of blocks and the number and precise set of adjacencies remain unclear in the structure. Recovering this information from G is impossible as the following examples from Figure 4 demonstrate: The cactus graph structure does not tell how many genomes traverse block F and whether block I and K are adjacent in one of the genomes or not.
Still, each genome corresponds to a (not necessarily simple) path through G. With the help of labels we can recover this path. The colored lines in the enlarged vertices in Figure 4 provide the equivalent information as colored adjacency edges in Enredo graphs and would resolve ambiguity for threading if no duplications were present. More information is necessary to resolve all ambiguity. We suggest
Cactus graphs are not as independently used as the other genome alignment graphs. The cactus method operates on two graphs, the cactus graph and another graph called the adjacency graph [40]. Interestingly, the latter has the same structure as an Enredo graph. We view the cactus graph, which enables the characterization and detection of new substructures, as a supergraph on top of the Enredo graph. The transformation of Enredo graph structures to cactus graph structures conforms with the construction of a cactus graph [22,40] and does not require additional labels. The transformation back to Enredo graphs is ambiguous as the above mentioned examples from Figure 4 show. For this reason, our description of this transformation uses the sparse labeling ℓ^{adj} in addition to the graph structure.
Cactus graphs from Enredo graphs
To transform an Enredo graph structure
First, compute all adjacencyedge connected components
In the second step, remove sets of vertices from V that are 3edgeconnected and add instead a single vertex v to V (vertices α and β in Figure 4). Correct the mapping m and redirect block edges that were incident to any vertex in the 3edge connected component, to be incident to v.
Finally, replace connected components formed only by edges whose removal disconnects the graph (not present in Figure 4). Each such component is a tree with leaf and branching vertices v_{1},…,v_{c}. Remove v_{1},…,v_{c} and add instead a new vertex v to V. Just as before, correct the mapping m and redirect incident block edges to v.
Cactus graphs to Enredo graphs
In the transformation from a cactus graph structure G=(V,E) to an Enredo graph structure
Initially, identify among all cactus graph edges incident to the origin vertex u∈V the edge e_{0}={u,v} whose label contains the smallest number n_{0}∈ℓ^{adj}(e_{0}) where n_{0}<n and n∈ℓ^{adj}(e) with e={u,x}. Add two vertices u^{′} and v^{′} to the set of Enredo graph vertices V^{′} and an edge
Among all edges incident to v, identify the next edge e_{1}=(v,w) whose label contains the next larger number n_{1}∈ℓ^{adj}(e_{1}) such that n_{1}>n_{0} but n_{1}<n where n≠n_{0} and n∈ℓ^{adj}(e) with e={v,x}. If the mapping for e_{1} is undefined, add two vertices v^{′′} and w^{′} to the set of Enredo graph vertices V^{′} and an edge
Next, repeat the same for incident edges of w. Proceed like this until reaching the end of all genomes to obtain the full Enredo graph structure G^{′}.
All in all, the need for labels shows that the four graphs markedly differ in the information represented in their structures. Complete duplication information (dup) is only present in alignment graph structures, and only the structure of Enredo graphs reveals inversion information (inv). ABruijn graphs are a compact and intuitive representation but lack both inversion and duplication information. Finally, cactus graph structures do not represent parts of the adjacency information (adj). Despite these structural differences, all graph models, which include labels, can be transformed into each other.
Based on these observations, some advantages and disadvantages of the graph structures become apparent. For example, for a genome aligner intended to reveal inversions, an Enredo graph structure appears to be more suitable, whereas a more general analysis of the genetic content of genomes will work well with the more compact ABruijn graph structure. Duplications are best visible in an alignment graph structure. The advantage of cactus graphs is that they subdivide genomes into independent regions, revealing specific and unique substructures as described in the following.
Graph substructures
We collected substructures from graphbased genome alignment approaches and classify them here into four types: colinear paths, visiting blocks, short cycles, and cactus groups. Substructures are useful for deriving a meaningful genome segmentation or they indicate misalignment, i. e., the alignment of nonhomologous segments. Furthermore, they pinpoint parts of genome alignments that can be improved through modification.
Some substructures have been described for several graphbased approaches, while others are unique to only one approach. We conjecture that it is possible to identify all substructures in all graph models. If the time complexity for detecting occurrences of the substructures were the same in all graph models, they could be used interchangeably. Here, our aim is to analyze the abilities of the graph structures to reveal potential misalignments without additional information from labels.
Colinear paths
We refer to the first type of substructures as colinear paths. Colinear paths are sets of blocks that appear in one or more genomes consecutively in the same orientation and without breakpoints in between. A sequence of blocks b_{1},…,b_{k} forms a colinear path if there is an adjacency but no breakpoint between b_{i} and b_{i+1} for all i=1,…,k−1. Consequently, all blocks along a colinear path have the same size and there are segments s_{1}∈b_{1} and s_{k}∈b_{k} with s_{1}=(p_{1},q_{1}) and s_{k}=(p_{k},q_{k}) such that s=(p_{1},q_{k}) is a consecutive genome segment that concatenates one segment from each block b_{1},…,b_{k}. We also consider a single block as a colinear path.
A colinear path is maximal if it cannot be further extended by other adjacent blocks, but is bounded by breakpoints. Note that alignment modifications often remove bounding breakpoints such that a colinear path can again be further extended. The set of maximal colinear paths of a genome alignment determines the final genome segmentation. Independent from the underlying graph structure, all graphbased genome alignment methods have the common aim to maximize colinear paths both in terms of size (number of genome segments) and length (total segment lengths).
In Enredo graphs, simple nonbranching paths are colinear paths. Similarly, colinear paths appear as nonbranching paths in the ABruijn graph structure, but here a nonbranching path is not necessarily a colinear path. Along a nonbranching path in ABruijn graphs, one or more blocks can be inverted in a subset of the genomes. The structure of ABruijn graphs provides no information about inversions (see Figure 5). Thus, to detect colinear paths in ABruijn graphs, information from labels is necessary. Only a single vertex is detectable as (not necessarily maximal) colinear path in the structure of ABruijn graphs. The same holds for the alignment graph structure: The detection of consecutive blocks is straightforward, but in order to avoid the inclusion of inverted blocks that break colinearity, additional information about inversions is necessary. And finally, colinear paths appear in the cactus graph structure as nonbranching paths although nonbranching paths are not necessarily colinear paths.
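The nonbranching criterion for Enredo graphs can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the procedure of any cited tool: genomes are given as lists of (block, orientation) pairs, a forward traversal is assumed to enter a block at its tail and leave at its head, genome ends are treated as breakpoints via a hypothetical TELOMERE marker, and the function names are invented for the example.

```python
from collections import defaultdict

TELOMERE = "TELOMERE"  # hypothetical marker for genome ends


def block_end_neighbors(genomes):
    """Collect, for every block end, the set of block ends adjacent to it."""
    nbrs = defaultdict(set)
    for genome in genomes:
        if not genome:
            continue
        prev = TELOMERE
        for block, orient in genome:
            entry = (block, "t" if orient == "+" else "h")
            exit_ = (block, "h" if orient == "+" else "t")
            nbrs[entry].add(prev)
            if prev != TELOMERE:
                nbrs[prev].add(entry)
            prev = exit_
        nbrs[prev].add(TELOMERE)
    return dict(nbrs)


def colinear_adjacencies(genomes):
    """Return adjacencies whose two block ends have no other neighbor."""
    nbrs = block_end_neighbors(genomes)
    result = set()
    for end, others in nbrs.items():
        if len(others) != 1:
            continue  # branching: a breakpoint at this block end
        (other,) = others
        if other != TELOMERE and nbrs.get(other) == {end}:
            result.add(frozenset((end, other)))
    return result


g1 = [("A", "+"), ("B", "+"), ("C", "+")]
g2 = [("D", "-"), ("B", "-"), ("A", "-")]  # reads A B D on the other strand
colinear = colinear_adjacencies([g1, g2])
```

In this toy input only the adjacency between the head of A and the tail of B is free of breakpoints; the other end of B branches towards C and D. Because the sketch works on block ends, it does not depend on the strand from which each genome is read.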
Visiting blocks
We name the second type of substructure visiting block, which conceptually is a special type of a maximal colinear path. A maximal colinear path {b_{1},…,b_{k}} is a visiting block if there is a block b_{0} adjacent to b_{1} and a block b_{k+1} adjacent to b_{k} with the following two symmetric conditions (without loss of generality, we assume that the tail of b_{0} is adjacent to the head of b_{1} and the tail of b_{k} is adjacent to the head of b_{k+1}): For all segments s=(p_{1},q_{k}) of the colinear path that are adjacent at position p_{1} to a segment s_{0}∈b_{0}, there is a segment s_{k+1}∈b_{k+1} adjacent to s at position q_{k}; and for all segments s=(p_{1},q_{k}) that are adjacent at position q_{k} to a segment s_{k+1}∈b_{k+1}, there is a segment s_{0}∈b_{0} adjacent to s at position p_{1}. The important property is that all segments from block b_{0} that are adjacent to segments of the colinear path, continue in the same block b_{k+1} at the other end of the colinear path and vice versa.
A visiting block arises from merging blocks from within a colinear path with other blocks. If the merged blocks are short, they often only have spurious similarity. Hence, they break colinearity at two positions without providing much evidence for a large structural change. This is a reason why genome aligners address visiting blocks and separate the otherwise colinear paths.
In ABruijn and Enredo graphs, visiting blocks appear as simple nonbranching paths bounded by branching vertices. In Enredo graphs, the path always starts and ends with a block edge. In both graphs, at least one branch which enters the visiting block at one end must be formed by a set of segments that leaves the visiting block as its own separate branch at the other end (see Figure 8). This condition makes it impossible to identify visiting blocks in the ABruijn and Enredo graph structures. Likewise, the structure of cactus graphs alone does not reveal visiting blocks. Only in the structure of alignment graphs, it is possible to determine whether a given colinear path is a visiting block or not.
Figure 8. Visiting blocks are not distinguishable in the structure of ABruijn graphs and in the structure of Enredo graphs. In this example, only the colors of adjacency edges reflect a difference between the substructure at the top (visiting block) and at the bottom (no visiting block). We consider colors as edge labels, which are not present in the graph structures. Thus, visiting blocks do not form unique substructures in the structures of ABruijn and Enredo graphs.
Visiting blocks have been described for ABruijn graphs as microblocks [38] and also for Enredo graphs both implicitly in the “joining” operation and explicitly as a first type of “aberrant homologies” [39]. Furthermore, we view another type of “aberrant homologies” from Enredo graphs as a special case of this substructure: retrotransposed pseudogenes that cause a series of successive visiting blocks.
Short cycles
Cycles in genome alignments are indicators for rearrangement. A change in one of two identical genomes often introduces a cycle in the corresponding genome alignment. In the same way, spurious similarity causes cycles and breaks colinearity. If there are many cycles, they often hide significant colinearity. For this reason, many genome aligners eliminate short cycles.
Specific types of cycles also play a role for colinear sequence alignment. For example, alignment graphs without mixed cycles are colinear alignments [33]. Thus, we can compute colinear alignments by eliminating mixed cycles from alignment graphs. Similarly, the partial order alignment (POA) program [44] uses directed acyclic graphs (DAGs) for alignment representation, essentially ABruijn graphs without directed cycles.
We define a genome alignment cycle as a sequence of blocks b_{1},…,b_{k} where block b_{i} is adjacent to block b_{i+1} for all i=1,…,k−1 and b_{k} is adjacent to b_{1}. Further, we require all sets of positions that define adjacencies between two blocks b_{i} and b_{i+1} along the cycle to be disjoint. Thereby we exclude pairs of adjacent blocks from the set of genome alignment cycles. A cycle is short if the total length of segments along the cycle is below a given length threshold.
The definition of genome alignment cycles corresponds to simple mixed cycles in the Enredo graph structure. They mostly appear in the ABruijn graph structure and alignment graph structure as (mixed) simple cycles, too, but there is no onetoone correspondence: The alignment graph structure can have more than one cycle for a single genome alignment cycle (see Figure 9A); and genome alignment cycles that are caused by inversions are not visible in the alignment graph structure and ABruijn graph structure. Figure 9B shows an example for two genome alignment cycles that appear as a single cycle in the ABruijn graph structure. Despite these essential differences, cycles in the alignment graph structure, ABruijn graph structure, and Enredo graph structure always correspond to genome alignment cycles as opposed to cycles in the cactus graph structure. Subgraphs in the structures of alignment graphs, ABruijn graphs, and Enredo graphs that correspond to cycles in the cactus graph structure are not even necessarily connected (see below).
Figure 9. Genome alignment cycles have a onetoone correspondence only in the structure of Enredo graphs.(A) A cycle in the Enredo graph structure may correspond to several overlapping cycles in the alignment graph structure. In this example, two cycles in the alignment graph structure are shaded in red and blue. (B) Cycles caused by inversions appear only in the Enredo graph structure. In this example, the upper cycle in the Enredo graph structure is due to an inversion in block A, hence, does not appear in the ABruijn graph structure.
In the following, we discuss two characteristics for discriminating between different types of cycles, the orientation of adjacencies and the number of maximal colinear paths. Next, we briefly address the special case of palindromes. In addition, we describe how simple cycles in cactus graphs are used as characteristic substructures although they differ from genome alignment cycles.
Orientation of adjacencies
ABruijn graphs represent adjacencies as directed edges. This allows classifying cycles into those that follow the direction of edges and those that ignore the direction of edges. Pevzner and colleagues refer to the two types of cycles as whirls and bulges [19,35]. Whirls are directed, and bulges ignore the direction of edges. The graphbased genome aligner ABA addresses whirls and bulges in ABruijn graphs [19].
The classification of cycles into whirls and bulges becomes ambiguous when the graph represents multiple genomes. It depends on the initially chosen relative orientations of the genomes. If we invert a subset of the genomes, some whirls become bulges and some bulges become whirls (see Figure 10). Note that whirls and bulges were first introduced for repeat resolution within one genome [35], where the classification into whirls and bulges is unambiguous.
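The dependence on genome orientation can be reproduced with a toy example in Python. The sketch assumes that genomes are lists of block names in reading order, that directed ABruijn edges simply follow this order, and that inverting a genome reverses the list; the function names and the two example genomes are hypothetical and not taken from Figure 10.

```python
def edges_from_genomes(genomes):
    """Directed edges between consecutive blocks of each genome."""
    edges = set()
    for genome in genomes:
        edges.update(zip(genome, genome[1:]))
    return edges


def classify_cycle(cycle, directed_edges):
    """Return "whirl" if all edges follow one direction around the cycle,
    otherwise "bulge". Assumes at most one edge per pair of vertices."""
    forward = backward = 0
    k = len(cycle)
    for i in range(k):
        u, v = cycle[i], cycle[(i + 1) % k]
        if (u, v) in directed_edges:
            forward += 1
        elif (v, u) in directed_edges:
            backward += 1
        else:
            raise ValueError(f"no edge between {u} and {v}")
    return "whirl" if forward == k or backward == k else "bulge"


blue = ["A", "B", "C"]
red = ["A", "C"]
cycle = ["A", "B", "C"]

before = classify_cycle(cycle, edges_from_genomes([blue, red]))
after = classify_cycle(cycle, edges_from_genomes([blue, red[::-1]]))
```

With the red genome read as A C the cycle is a bulge; reading the same genome from the opposite strand turns the very same cycle into a whirl.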
Figure 10. The classification of cycles into whirls and bulges depends on the orientation of genomes. In this example, inversion of the red genome transforms the cycle from bulge to whirl or vice versa.
Number of maximal colinear paths
A genome alignment cycle is formed by complete maximal colinear paths and possibly single additional blocks. For example, the cycle in Figure 10 is formed by the maximal colinear path consisting of the single block B and two additional blocks A and C. A and C may be part of longer maximal colinear paths. In contrast to the orientation of adjacencies, the number of maximal colinear paths classifies the cycles unambiguously [38].
The ABruijn graph based approach DRIMMSynteny [38] uses a classification of cycles into oneway, twoway, and composite cycles, which is similar but not equivalent to a classification according to the number of maximal colinear paths. DRIMMSynteny focuses only on oneway and twoway cycles even though there can be cycles formed by more than two paths. The “annealing” operation in Enredo [39] places special emphasis on cycles formed by two maximal colinear paths after each of these paths has been joined to a single adjacency edge. In addition, Enredo addresses all other cycles as the third type of “aberrant homologies”.
Palindromes
Palindromes in genomes are inverted tandem duplications. Hence, they traverse a duplicated block twice and in both directions. Palindromes create a special type of cycles in genome alignments formed by only one adjacency at one end of a block. For the detection of palindromes and their distinction from tandem repeats, inversion information is necessary. Thus, the structure of alignment graphs and ABruijn graphs alone cannot reveal palindromes. In Enredo graphs, we recognize palindromes by an adjacency edge loop (see Figure 11). Palindromes are separately addressed as “thorns” in ABruijn graphs [38] and mentioned as “aberrant homologies” in Enredo [39].
Figure 11. Palindromes and tandem repeats are not distinguishable in the structure of alignment and ABruijn graphs. Depending on the relative orientation of the segments in block A, the displayed alignment graph structure and ABruijn graph structure represent a palindrome or a tandem repeat. In Enredo graphs, palindromes and tandem repeats form distinct substructures.
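A short Python sketch illustrates why orientation is needed for this distinction. It assumes a genome given as a list of (block, orientation) pairs; the function names are invented for the example.

```python
def classify_adjacent_duplicates(genome):
    """Classify immediate repetitions of a block in one genome.

    Same orientation twice in a row  -> tandem repeat
    Opposite orientations in a row   -> palindrome
    """
    found = []
    for (b1, o1), (b2, o2) in zip(genome, genome[1:]):
        if b1 == b2:
            found.append((b1, "tandem repeat" if o1 == o2 else "palindrome"))
    return found


def unoriented_edges(genome):
    """Edges as an orientation-free structure would see them."""
    return {(b1, b2) for (b1, _), (b2, _) in zip(genome, genome[1:])}


palindrome = [("A", "+"), ("A", "-"), ("B", "+")]
tandem = [("A", "+"), ("A", "+"), ("B", "+")]
```

Both toy genomes yield the same orientation-free edges, a loop at A followed by an edge from A to B, while the oriented view separates the palindrome from the tandem repeat.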
Cactus chains
In cactus graphs, simple cycles are named chains [22]. The corresponding subgraphs of cactus chains in Enredo graphs, ABruijn graphs, and alignment graphs are not necessarily connected (see blocks A, E, I, K in Figure 4). But even though chains do not correspond to continuous segments of genomes, they represent conserved orders of blocks (e. g., blocks A, E, I, K in Figure 4 appear in this order in all genomes). Cactus chains are addressed by the Cactus method.
It is possible to identify the subset of blocks forming a cactus chain in the Enredo graph structure, for the simple reason that an Enredo graph can be transformed into a cactus graph. However, it appears impossible to characterize chains in Enredo graphs without computing e. g., 3edge connected components. In the structure of alignment and ABruijn graphs, information about the orientation of adjacent blocks is missing for identifying cactus chains.
Cactus groups
Paten et al. refer to adjacency edge connected components, which are computed for constructing a cactus graph, as groups [22]. A cactus group is a set of adjacencies that forms an adjacencyedge connected component in the Enredo graph. All adjacencies of one group are represented by one vertex in the cactus graph structure, but a cactus graph vertex can represent several groups. Strictly speaking, cactus groups are visible in the structure of Enredo graphs but not in the structure of cactus graphs. Similarly, it is not possible to recognize groups in the alignment graph and ABruijn graph structures because this requires information about the orientation of adjacent blocks.
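Computing groups from an Enredo graph structure amounts to finding connected components over adjacency edges only. The following Python sketch uses a small union-find structure; the function name cactus_groups and the string identifiers for block ends are assumptions for illustration and do not reflect the implementation of the Cactus method.

```python
def cactus_groups(adjacency_edges):
    """Partition adjacency edges into adjacency-edge connected components.

    adjacency_edges: list of (u, v) pairs of block-edge endpoints
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Block edges are ignored on purpose: only adjacency edges connect.
    for u, v in adjacency_edges:
        parent[find(u)] = find(v)

    groups = {}
    for u, v in adjacency_edges:
        groups.setdefault(find(u), []).append((u, v))
    return list(groups.values())


example_edges = [("A_h", "B_t"), ("A_h", "C_t"), ("D_h", "E_t")]
example_groups = cactus_groups(example_edges)
```

In the toy input, the two adjacencies leaving the head of A share a vertex and form one group, while the adjacency between D and E forms a second group.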
In summary, inversion and duplication information is necessary for the complete detection of all substructures. Visiting blocks require duplication information, and all other substructures require inversion information. Hence, none of the four graphs reveals all substructures solely by its structure.
This concludes our classification of substructures on the basis of a not necessarily exhaustive list of substructures. Identification of further substructures or an assessment of their relevance for the accuracy of genome alignments may possibly point towards another way of classifying them.
Modifications
Graphbased genome aligners modify the genome alignments by eliminating substructures from the graphs. The aim is to reveal long conserved homologies, i. e., blocks of large size and length. As mentioned in the introduction, genome alignment comprises selection of local alignments and segmentation. Here, we describe modifications that eliminate substructures either by modifying the set of local alignments represented in blocks (“splitting blocks” and “merging parallel blocks”) or by determining breakpoint positions that will be part of the final segmentation (“merging consecutive blocks” and “cutting adjacencies”).
These four modifications derive from the mentioned graphbased genome alignment approaches, but they match the operations described for the approaches only in part. Some genome alignment approaches clearly separate block modification and segmentation, other approaches do both tasks together. Similarly, some approaches apply compound operations consisting of several of the modifications described here. Our intention is to provide small modification entities from which it is possible to assemble more complex operations.
We describe every modification on the set of blocks (not on the level of alignment components but on the level of segments). Furthermore, we mention effects of the modifications in the graph structures although they can be applied to a genome alignment independently from a graph structure. We explain how these modifications correspond to operations in the graphbased genome alignment approaches, especially if the correspondence is not obvious. For example, this is the case for DRIMMSynteny [38], which solves the sequence modification problem (SMP) on ABruijn graphs. The method modifies the sequences and determines the segmentation on the modified sequences before transforming the sequences back. We transfer the effects directly to the original sequences and set of blocks, and refer to the modifications accordingly. The four modifications cover all operations described in the programs ABA, DRIMMSynteny, Enredo, and Cactus.
Splitting blocks
The most prevalent modification is splitting a block by dividing its set of segments into subsets that form new smaller blocks. Formally, the modification replaces a block b={s_{1},…,s_{n}}, where n≥2 is the size of b, by two blocks b_{1}={s_{1},…,s_{k}} and b_{2}={s_{k+1},…,s_{n}}, 1≤k<n. The new blocks may have size 1, thus may consist of a single segment. Transferred to the original set of local alignments from which the blocks were formed, this modification corresponds to removing local alignments. In some cases, it is enough to remove a single pairwise local alignment to split a block into two blocks. In other cases, a particular subset of the local alignments needs to be removed simultaneously.
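On the level of segments, the modification is a plain partition of a list. A minimal Python sketch, assuming segments are encoded as hypothetical (genome, start, end) tuples and using an invented function name:

```python
def split_block(block, k):
    """Split block {s_1..s_n} into {s_1..s_k} and {s_k+1..s_n}, 1 <= k < n."""
    n = len(block)
    if n < 2 or not 1 <= k < n:
        raise ValueError("need n >= 2 and 1 <= k < n")
    return block[:k], block[k:]


b = [("g1", 10, 30), ("g2", 5, 25), ("g3", 40, 60)]
b1, b2 = split_block(b, 1)  # b1 is a block of size 1
```

Which segments end up in which new block, and which local alignments have to be removed to justify the split, is decided by the respective genome aligner and is not modeled here.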
Splitting blocks has different effects on the genome alignment graphs (see also Figure 12). In the alignment graph structure, the splitting corresponds to removing all edges between two vertex subsets of a block edge connected component. In ABruijn graph structures, where vertices represent blocks, the modification replaces a vertex by two new vertices; incoming and outgoing edges are connected to the respective new vertex. The effect of splitting blocks in Enredo graph structures is very similar: The modification duplicates a pair of head vertex and tail vertex connected by a block edge, and reconnects incoming and outgoing adjacency edges accordingly. In cactus graph structures, the splitting of block edges can lead to complex rearrangements with both splitting and merging of vertices.
Figure 12. The effect of splitting a block on the graphs. In alignment graphs, removal of block edges splits a block if the removal disconnects a block edge connected component. In ABruijn graphs and Enredo graphs, vertices (and block edges) need to be multiplied. In cactus graphs, the effect depends on the context. We show only the simplest possibility, where an edge is multiplied.
Graphbased genome aligners eliminate many substructures using this modification. By splitting blocks, we can clearly modify a graph so as to eliminate visiting blocks (see Figure 8), which is done in the programs DRIMMSynteny and Enredo. While Enredo splits a block into two blocks of arbitrary size, DRIMMSynteny splits single segments from a block, thus creating blocks of size 1. Additionally, this modification can eliminate small cycles, e. g., whirls in ABruijn graphs [35] and mixed cycles in alignment graphs. Further, the sequence modifications in DRIMMSynteny for onepath cycles and palindromes result in the splitting of segments from blocks. Finally, the “melting” operation in the Cactus method splits all blocks along a cactus chain into blocks of size 1.
Merging parallel blocks
The opposite to block splitting is a modification that merges blocks by adding local alignments between segments of the blocks. To merge two blocks b_{1}={s_{1},…,s_{k}} and b_{2}={s_{k+1},…,s_{n}} of size k and size n−k, respectively, into a new block b={s_{1},…,s_{n}} of size n, it is sufficient to add a local alignment of two segments s_{i}∈b_{1} and s_{j}∈b_{2}. Such local alignments can be new or previously removed by splitting blocks. Note that merging of parallel blocks implicitly aligns all segments of the two blocks.
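That a single local alignment suffices follows from viewing blocks as connected components of segments linked by local alignments. The Python sketch below assumes exactly this view; the function name and the segment identifiers are hypothetical.

```python
def blocks_from_alignments(segments, alignments):
    """Blocks as connected components of segments under local alignments.

    segments:   iterable of hashable segment ids
    alignments: iterable of (segment, segment) pairs
    """
    adj = {s: set() for s in segments}
    for a, b in alignments:
        adj[a].add(b)
        adj[b].add(a)

    seen, blocks = set(), set()
    for s in adj:
        if s in seen:
            continue
        stack, component = [s], set()
        seen.add(s)
        while stack:
            x = stack.pop()
            component.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        blocks.add(frozenset(component))
    return blocks


segments = ["s1", "s2", "s3", "s4"]
alignments = [("s1", "s2"), ("s3", "s4")]
two_blocks = blocks_from_alignments(segments, alignments)
one_block = blocks_from_alignments(segments, alignments + [("s2", "s3")])
```

Adding the single alignment between s2 and s3 joins the two blocks, and all four segments become implicitly aligned to each other.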
The effect on the graph structures is the reverse of block splitting. In the alignment graph structure, it corresponds to adding block edges. In the ABruijn graph structure two vertices are replaced by a single vertex. In the Enredo graph structure, two block edges with head and tail vertices are being replaced by a single block edge with one head and one tail vertex. In the cactus graph structure, merging of parallel blocks can lead to complex rearrangements just as splitting of blocks. The result is typically a longer chain or a new subcactus.
Graphbased genome alignment approaches usually merge blocks based on the structure of surrounding blocks. Twoway cycles and bulges in ABruijn graphs and Enredo graphs are substructures that suggest merging parallel blocks [38,39]. Furthermore, the genome segments within cactus groups are more likely to be homologous than others, hence, subject to merging [22]. Both in Enredo and in cactus graphs, the modification is termed “annealing”.
Merging consecutive blocks
The preceding two modifications often generate new or longer colinear paths. It is possible to replace the consecutive blocks of a colinear path by a new longer block that rules out the possibility of a breakpoint between the merged blocks. The modification replaces two adjacent blocks b_{1}={s_{1},…,s_{n}} and
The effects on the graphs are straightforward. In the alignment graph structure, a single vertex replaces each pair of vertices in two adjacent block edge connected components. In the ABruijn graph structure, one vertex replaces two consecutive vertices. In the Enredo graph structure, one block edge replaces a path consisting of a block edge, adjacency edge, and another block edge. And similarly in the cactus graph structure, one block edge replaces a path of a block edge, a vertex, and another block edge, thereby reducing the number of vertices in a chain.
Merging consecutive blocks is part of the “joining” operation in the Enredo method [39]. The other approaches do not apply this modification.
Cutting adjacencies
As opposed to merging consecutive blocks, the last modification fixes a breakpoint in the genome alignment by cutting genomes into several segments. For example, given a block b={s_{1},…,s_{n}} with s_{i}=(p_{i},q_{i}) where i=1,…,n and with a breakpoint at the tail of b, the modification cuts the genomes at all positions q_{i}. The modification does not affect the set of blocks but rather the set of genomes. Thus, it is part of the genome segmentation process.
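The example can be written as a small Python sketch. It assumes half-open coordinates, segments encoded as hypothetical (genome, p, q) tuples, and genome lengths given in a dictionary; the function name is invented for the example.

```python
def cut_genomes(genome_lengths, block):
    """Cut every genome at the end positions q_i of the block's segments.

    genome_lengths: dict genome id -> length
    block:          list of segments (genome, p, q)
    Returns a dict genome id -> list of (start, end) pieces.
    """
    cuts = {}
    for genome, _p, q in block:
        cuts.setdefault(genome, set()).add(q)

    pieces = {}
    for genome, length in genome_lengths.items():
        inner = sorted(c for c in cuts.get(genome, ()) if 0 < c < length)
        bounds = [0] + inner + [length]
        pieces[genome] = list(zip(bounds, bounds[1:]))
    return pieces


lengths = {"g1": 100, "g2": 80, "g3": 50}
b = [("g1", 10, 30), ("g2", 5, 25)]
pieces = cut_genomes(lengths, b)
```

The block itself is left untouched; only the genomes that contribute a segment to it are divided, here g1 at position 30 and g2 at position 25.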
Cutting adjacencies corresponds to removing a single edge from an ABruijn graph structure, a single adjacency edge from an Enredo graph structure, or a set of adjacency edges from an alignment graph structure. Again, there are multiple possible effects in a cactus graph structure. In the simplest case, the cactus graph structure remains unchanged. In all graphs, the removal of edges can disconnect the graph structures, generating several components that correspond to disjoint sets of genome segments. Thus, it can become impossible to thread the genomes through the graphs without additional effort [35,38].
Cutting adjacencies is used in various ways by genome alignment approaches. The ABA method cuts adjacencies for eliminating bulges from A-Bruijn graphs and the Enredo method for eliminating small cycles in general. In addition, the segmentation processes in A-Bruijn and Enredo graphs implicitly use this modification: in DRIMM-Synteny, segmentation is realized by coloring the graph. In Enredo, it is realized by excluding adjacencies shorter than a given length threshold. Genome segmentation in alignment graphs and cactus graphs has not been described explicitly.
Discussion and conclusions
We compared four graph data structures and their usage for genome alignment. Our comparison identified that essential pieces of information about duplication and inversion are only present in the structures of some graphs. In addition, we examined substructures in the graph structures that are subject to elimination in various genome alignment approaches, and determined four classes of substructures. We found that information about duplications or information about inversions or even both are necessary for distinguishing any type of substructure in the graphs. Thus, it is indeed essential to keep additional information in labels of the vertices or edges, though the different graphs depend on the labels to a lesser or greater extent. Finally, we reduced the set of operations applied for eliminating substructures from the graphs to four elementary modifications. Overall, it became apparent that many ideas are shared by all graph-based approaches.
These shared ideas allow us to derive a framework for graph-based genome alignment (see also Figure 13), an ABC to genome alignment. It begins with the computation of local colinear alignments among the input genomes (A). The choice of the local alignment method is mostly independent of the following steps, though it influences the resulting genome alignments. By combining local alignments into blocks, we can build a graph (B). Which graph to choose depends on the respective importance of different substructures for an application. Next, a graph-based genome alignment approach always characterizes a set of graph substructures (C). Substructures sometimes have equivalents in other graphs, but may also be distinguishable in the structure of only one graph. Detection of all substructure occurrences (D) is a requirement for their subsequent elimination (E). Elimination is accomplished by modifying the underlying set of blocks and sometimes also by introducing breakpoints in the genomes. The breakpoints already determine parts of a genome segmentation, which is finished in a last step (F). The segmentation together with the modified set of blocks defines the genome alignment (G).
Figure 13. An ABC to graph-based genome alignment. These six steps lead to the genome alignment in all of the compared graph-based genome alignment approaches.
This framework describes the main procedure of graph-based genome alignment. Still, it has limitations and there are additional problems to be solved. One such problem concerns blocks and occurs before building a graph. If we do not break up the local colinear alignments into alignment components, blocks may in general partially overlap. It is possible to resolve overlapping blocks by trimming [45] or to avoid overlaps by requiring local alignments to be sparse [39]. A good alternative, used for example by the genome aligner Mugsy [21], is to obtain a set of mutually disjoint blocks by refining segment matches [46]. A segment match refinement resolves overlaps through modest computation without losing any alignment information.
In addition, the generation of blocks (multiple alignments) from pairwise alignments may pose a problem. Only a few genome aligners avoid the problem by directly computing local multiple alignments [45,47]. If we assume transitivity of the alignment relation, it is straightforward to go from local pairwise alignments to alignment components or to multiple ungapped alignments. In the case of gapped alignments, however, pairwise alignments can have conflicting gap patterns. This complicates the task of combining them into a single block. Heuristic methods such as progressive alignment [7] or transitive alignment [9] carry out this task, but are time-consuming. Having said that, a colinear realignment for each block carried out after finishing segmentation has proven to significantly improve alignment accuracy [45,48]. This suggests the alternative of ignoring gaps in blocks while improving the genome alignment on the level of blocks.
Further, we have not covered all aspects of the framework in this paper and left out details on the detection of substructures. For example, ABA and DRIMM-Synteny detect small cycles by efficiently computing a maximum spanning tree before heuristically inspecting the remaining edges that create cycles. Different detection methods clearly have an influence on the time complexity of an approach and, depending on their sensitivity, also on the accuracy of a genome aligner. Thus, a thorough analysis of detection methods is certainly interesting but beyond the scope of this work.
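The general idea of spanning-tree-based cycle detection can be sketched as follows. This is not the implementation of ABA or DRIMM-Synteny; it is a minimal illustration that assumes a simple undirected graph with numeric edge weights (what the weights mean is left open here), builds a maximum spanning forest with Kruskal's algorithm, and then inspects each remaining edge together with the tree path between its endpoints. The function names and the parameter max_len are hypothetical.

```python
from collections import deque

def maximum_spanning_forest(vertices, weighted_edges):
    """Kruskal's algorithm on edges sorted by decreasing weight.

    Returns (tree_edges, remaining_edges); every remaining edge closes a cycle.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree, rest = [], []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            rest.append((u, v, w))
        else:
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree, rest

def tree_path_length(tree, source, target):
    """Number of tree edges on the path from source to target (BFS)."""
    adjacent = {}
    for u, v, _w in tree:
        adjacent.setdefault(u, []).append(v)
        adjacent.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for x in adjacent.get(u, []):
            if x not in dist:
                dist[x] = dist[u] + 1
                queue.append(x)
    return None

def small_cycles(vertices, weighted_edges, max_len):
    """Report non-tree edges that close a cycle of at most max_len edges."""
    tree, rest = maximum_spanning_forest(vertices, weighted_edges)
    found = []
    for u, v, _w in rest:
        d = tree_path_length(tree, u, v)
        if d is not None and d + 1 <= max_len:
            found.append((u, v, d + 1))
    return found

vertices = ["a", "b", "c", "d"]
edges = [("a", "b", 5), ("b", "c", 4), ("c", "a", 1), ("c", "d", 3)]
cycles = small_cycles(vertices, edges, max_len=3)
```

In the toy graph, the lightest edge (c, a) is left out of the tree and closes a cycle of three edges. A cycle that uses several non-tree edges is not found this way, which is one reason why such inspection is heuristic.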
Similarly, we have not addressed algorithms for eliminating substructures. These algorithms determine the order in which modifications are applied. The elimination of one type of substructures can create other substructures, which again can create the first type of substructures upon elimination. For this reason, iterative elimination strategies are prevalent in graph-based genome aligners. End criteria for iteration are typically given as parameters of the method, e.g., a maximal length of cycles or an explicit number of iterations.
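A generic iteration scheme of this kind might look like the following sketch. It is an assumption-laden illustration rather than the strategy of any particular tool: the detect and eliminate callbacks, the state object, and the max_iterations parameter are all hypothetical, and the toy example uses integers in place of graph substructures only to show two substructure types that create each other.

```python
def eliminate_iteratively(state, handlers, max_iterations):
    """Apply (detect, eliminate) pairs until nothing is found or the
    iteration limit is reached. Returns (state, completed_iterations)."""
    for iteration in range(max_iterations):
        changed = False
        for detect, eliminate in handlers:
            hits = detect(state)
            if hits:
                state = eliminate(state, hits)
                changed = True
        if not changed:
            return state, iteration
    return state, max_iterations

# Toy stand-ins: eliminating an "even" item may create an "odd" one and
# vice versa, mimicking substructure types that give rise to each other.
def detect_even(s):
    return [n for n in s if n > 0 and n % 2 == 0]

def halve(s, hits):
    return (s - set(hits)) | {n // 2 for n in hits}

def detect_odd(s):
    return [n for n in s if n > 1 and n % 2 == 1]

def decrement(s, hits):
    return (s - set(hits)) | {n - 1 for n in hits}

handlers = [(detect_even, halve), (detect_odd, decrement)]
final_state, rounds = eliminate_iteratively({6}, handlers, max_iterations=10)
```

Starting from {6}, the first round turns 6 into 3 and then 3 into 2, the second round turns 2 into 1, and the third round finds nothing, so the loop stops before the iteration limit.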
The parameters usually require customized values for every new input set of genomes. This inhibits broad usage of a tool unless it offers automatic parameter selection. A genome aligner has to find a trade-off between size and length of blocks. Very similar genomes will have long blocks conserved across many genomes, whereas more diverged genomes show fewer long blocks and conservation across fewer genomes. Hence, a factor to consider for parameter selection is genome divergence in addition to genome lengths. Given the initial set of local alignments, automatic parameter selection seems possible. It will be necessary to carefully study the influence of all factors to be able to automate the selection, but consequently it will enable a larger community to benefit from graph-based genome aligners.
Finally, graph-based genome aligners, just like other genome aligners, have to decide between positional homology alignment [49] and alignment of all repeats. More precisely, they have to decide, for segments with multiple copies in several genomes, whether to align them in one or in multiple blocks. Not only do repeats lead to a quadratic explosion in the number of pairwise alignments, but they also hide larger regions of colinearity. For this reason, several genome aligners aim at aligning less and predict positional homology [17,21,45]. Graph-based genome aligners compute positional homology to a certain degree. They do not forbid duplications, but separate blocks into positional homologs when splitting blocks.
In conclusion, our framework demonstrates shared aspects of graph-based genome aligners. It contributes to developing a common view on graph-based genome alignment, an active field of research with currently at least two graph-based tools for genome alignment being actively developed [50,51]. In the future, we might identify the steps that have the greatest influence on alignment accuracy. Already now, we believe that the framework provides assistance for the development of new and improved genome aligners.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
BK and KT participated in the design of the study and drafted the manuscript. MH participated in the design of the study and helped to edit the manuscript. KR conceived of the study and participated in its design. All authors read and approved the final manuscript.
Acknowledgements
The authors thank Falk Hüffner for his comments on the manuscript and René Rahn for proofreading. KT and MH were supported by the DFG SPP1307 grants RE1712/31 and RE1712/32.
References

Zerbino DR, Paten B, Haussler D: Integrating genomes.
Science 2012, 336(6078):179-182.

Smith TF, Waterman MS: Identification of common molecular subsequences.
J Mol Biol 1981, 147:195-197.

Dewey CN: Whole-genome alignment.
Methods Mol Biol 2012, 855:237-257.

Kemena C, Notredame C: Upcoming challenges for multiple sequence alignment methods in the high-throughput era.
Bioinformatics 2009, 25(19):2455-2465.

Blackburne BP, Whelan S: Class of multiple sequence alignment algorithm affects genomic analysis.
Mol Biol Evol 2012, 30(3):642-653.

Feuk L, Carson AR, Scherer SW: Structural variation in the human genome.
Nat Rev Genet 2006, 7(2):85-97.

Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees.
J Mol Evol 1987, 25(4):351-360.

Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res 1994, 22(22):4673-4680.

Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment.
J Mol Biol 2000, 302:205-217.

Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucleic Acids Res 2002, 30(14):3059-3066.

Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S, NISC Comparative Sequencing Program: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.
Genome Res 2003, 13(4):721-731.

Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res 2004, 32(5):1792-1797.

Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S: ProbCons: probabilistic consistency-based multiple sequence alignment.
Genome Res 2005, 15(2):330-340.

Löytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions.
Proc Natl Acad Sci USA 2005, 102(30):10557-10562.

Rausch T, Emde AK, Weese D, Döring A, Notredame C, Reinert K: Segment-based multiple sequence alignment.
Bioinformatics 2008, 24(16):i187-i192.

Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L: Fast statistical alignment.
PLoS Comput Biol 2009, 5(5):e1000392.

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner.
Genome Res 2004, 14(4):708-715.

Darling ACE, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements.
Genome Res 2004, 14(7):1394-1403.

Raphael B, Zhi D, Tang H, Pevzner P: A novel method for multiple alignment of sequences with repeated and shuffled elements.
Genome Res 2004, 14(11):2336-2346.

Dubchak I, Poliakov A, Kislyuk A, Brudno M: Multiple whole-genome alignments without a reference organism.
Genome Res 2009, 19(4):682-689.

Angiuoli SV, Salzberg SL: Mugsy: fast multiple alignment of closely related whole genomes.
Bioinformatics 2011, 27(3):334-342.

Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D: Cactus: algorithms for genome multiple sequence alignment.
Genome Res 2011, 21(9):1512-1528.

El-Mabrouk N, Sankoff D: Analysis of gene order evolution beyond single-copy genes.
Methods Mol Biol 2012, 855:397-429.

Sankoff D, Blanchette M: The median problem for breakpoints in comparative genomics. In Computing and Combinatorics, Volume 1276 of Lecture Notes in Computer Science. Edited by Jiang T, Lee D. Heidelberg: Springer Berlin; 1997:251-263.

Kehr B, Reinert K, Darling AE: Hidden breakpoints in genome alignments. In Algorithms in Bioinformatics, Volume 7534 of Lecture Notes in Computer Science. Edited by Raphael B, Tang J. Berlin Heidelberg: Springer; 2012:391-403.

Hannenhalli S, Pevzner PA: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals.
J ACM 1999, 46:1-27.

Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange.
Bioinformatics 2005, 21(16):3340-3346.

Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. In Algorithms in Bioinformatics, Volume 4175 of Lecture Notes in Computer Science. Edited by Bücher P, Moret BM. Berlin Heidelberg: Springer; 2006:163-173.

Alekseyev MA, Pevzner PA: Breakpoint graphs and ancestral genome reconstructions.
Genome Res 2009, 19(5):943-957.

Bafna V, Pevzner PA: Genome rearrangements and sorting by reversals.
1993.

Kececioglu JD, Sankoff D: Efficient bounds for oriented chromosome inversion distance. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, CPM ’94. Berlin Heidelberg: Springer; 1994:307-325.

Kececioglu J: The maximum weight trace problem in multiple sequence alignment. In Proceedings of the 4th Symposium on Combinatorial Pattern Matching (CPM), Volume 684 of Lecture Notes in Computer Science. Berlin Heidelberg: Springer; 1993:106-119.

Reinert K, Lenhof HP, Mutzel P, Mehlhorn K, Kececioglu JD: A branch-and-cut algorithm for multiple sequence alignment. In Proceedings of the first annual international conference on Computational molecular biology. RECOMB ’97, New York, NY, USA: ACM; 1997:241-250.

Fostier J, Proost S, Dhoedt B, Saeys Y, Demeester P, de Peer YV, Vandepoele K: A greedy, graph-based algorithm for the alignment of multiple homologous gene lists.
Bioinformatics 2011, 27(6):749-756.

Pevzner PA, Tang H, Tesler G: De novo repeat classification and fragment assembly.
Genome Res 2004, 14(9):1786-1796.

Compeau PEC, Pevzner PA, Tesler G: How to apply de Bruijn graphs to genome assembly.
Nat Biotechnol 2011, 29(11):987-991.

Pham SK, Pevzner PA: DRIMM-Synteny: decomposing genomes into evolutionary conserved segments.
Bioinformatics 2010, 26(20):2509-2516.

Paten B, Herrero J, Beal K, Fitzgerald S, Birney E: Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs.
Genome Res 2008, 18(11):1814-1828.

Paten B, Diekhans M, Earl D, John JS, Ma J, Suh B, Haussler D: Cactus graphs for genome comparisons.
J Comput Biol 2011, 18(3):469-481.

Harary F, Uhlenbeck GE: On the number of husimi trees: I.
Proc Natl Acad Sci USA 1953, 39(4):315-322.

Belal NA, Heath LS: A theoretical model for whole genome alignment.
J Comput Biol 2011, 18(5):705-728.

Cormen TH, Stein C, Rivest RL, Leiserson CE: Introduction to Algorithms. Cambridge, MA, USA: The MIT Press; 2001.

Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs.
Bioinformatics 2002, 18(3):452-464.

Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement.
PLoS One 2010, 5(6):e11147.

Halpern AL, Huson DH, Reinert K: Segment match refinement and applications. In Proceedings of the Second International Workshop on Algorithms in Bioinformatics, WABI ’02. Berlin Heidelberg: Springer; 2002:126-139.

Höhl M, Kurtz S, Ohlebusch E: Efficient multiple genome alignment.
Bioinformatics 2002, 18 Suppl 1:S312-S320.

Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments.
J Mol Biol 1996, 264(4):823-838.

Dewey CN: Positional orthology: putting genomic evolutionary relationships into context.
Brief Bioinform 2011, 12(5):401-412.

Hickey G, Paten B: Progressive Cactus. [https://github.com/glennhickey/progressiveCactus]

Minkin I, Patel A, Kolmogorov M, Vyahhi N, Pham S: Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. In Algorithms in Bioinformatics, Volume 8126 of Lecture Notes in Computer Science. Edited by Darling A, Stoye J. Berlin Heidelberg: Springer; 2013:215-229.