Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA

College of Computer Science and Technology, Jilin University, Changchun 130012, PR China

Abstract

Background

Because of the advent of high-throughput sequencing and the consequent reduction in the cost of sequencing, many organisms have been completely sequenced and most of their genes identified. It has thus become possible to represent whole genomes as ordered lists of gene identifiers and to study the rearrangement of these entities through computational means. As a result, genome rearrangement data has attracted increasing attention from both biologists and computer scientists as a new type of data for phylogenetic analysis. The main events of genome rearrangements include inversions, transpositions and transversions. To date, GRAPPA and MGR are the most accurate methods for rearrangement phylogeny, both assuming inversion as the only event. However, due to the complexity of computing transposition distance, it is very difficult to analyze datasets when transpositions are dominant.

Results

We extend GRAPPA to handle transpositions. The new method is named GRAPPA-TP, with two major extensions: a heuristic method to estimate transposition distance, and a new transposition median solver for three genomes. Although GRAPPA-TP uses a greedy approach to compute the transposition distance, it is very accurate when genomes are relatively close. The new GRAPPA-TP is available from

Conclusion

Our extensive testing using simulated datasets shows that GRAPPA-TP is very accurate in terms of ancestor genome inference and phylogenetic reconstruction. Simulation results also suggest that model match is critical in genome rearrangement analysis: it is not accurate to simulate transpositions with other events including inversions.

Background

While phylogenetic studies in the pre-genome era primarily focused on DNA or protein sequence differences among organisms, informative comparisons can in fact be made at various organizational levels. Higher-level evolutionary events of relevance to phylogenetics include inversion, transposition, deletion, insertion and duplication. Phylogenetic analyses of whole genomes that model these types of events are proving to be extremely useful in elucidating the evolutionary relationships among organisms

During the past several years, computer scientists have been able to make substantial progress in genome rearrangement research. With solutions for inversion distance

Much of the research on genome rearrangement has focused on organellar genomes, such as mitochondrial

Existing methods can still be applied when transposition is the dominant event. For example, given genome 1, 2,⋯,

Genome rearrangements

We represent a genome as a signed ordering of

A

An

There are additional events for multiple-chromosome genomes, such as

Distance computation

Given two genomes _{1 }and _{2}, we define the _{1}, _{2}) as the minimum number of events required to transform one genome into the other.

The _{1 }is defined as an ordered pair of genes (_{1 }but not in _{2}. The breakpoint distance is simply the number of breakpoints in _{1 }relative to _{2}.
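The breakpoint distance defined above can be illustrated with a minimal sketch. It assumes signed, linear genomes given as lists of non-zero integers, and it treats an adjacency (a, b) as conserved when either (a, b) or its inverse (-b, -a) appears in the other genome. The function names and the optional `circular` flag are illustrative choices, not part of GRAPPA-TP.

```python
def adjacencies(genome, circular=False):
    """Return the ordered gene pairs (adjacencies) of a genome."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        # Assumption: a circular genome also joins its last gene to its first.
        pairs.append((genome[-1], genome[0]))
    return pairs


def breakpoint_distance(g1, g2, circular=False):
    """Count adjacencies of g1 that do not occur in g2 in either orientation."""
    present = set()
    for a, b in adjacencies(g2, circular):
        present.add((a, b))
        present.add((-b, -a))  # the same adjacency read on the other strand
    return sum(1 for pair in adjacencies(g1, circular) if pair not in present)


identity = [1, 2, 3, 4]
inverted = [1, -3, -2, 4]  # genes 2 and 3 inverted as one segment
print(breakpoint_distance(identity, inverted))  # 2
```

In this example the inversion breaks the adjacencies (1, 2) and (3, 4) but preserves (2, 3), which appears in the second genome as (-3, -2).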

When only inversions are allowed, the edit distance is the

The

Yancopoulos et al.

Median problem of three

The median problem on three genomes is to find a single genome that minimizes the sum of pairwise distances between itself and each of the three given genomes. This problem is computationally very hard even for the simplest breakpoint distance
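To make the definition concrete, the following sketch computes the median score of a candidate genome and finds an optimal median by exhaustive enumeration. It is only feasible for a handful of genes and is not the solver used in this paper. For brevity it assumes unsigned linear genomes and an unsigned breakpoint distance; `bp_distance`, `median_score`, `median_lower_bound` and `brute_force_median` are hypothetical names. The lower bound shown is the standard consequence of the triangle inequality and holds only when the distance is a metric.

```python
from itertools import combinations, permutations


def bp_distance(g1, g2):
    """Unsigned breakpoint distance: adjacencies of g1 missing from g2."""
    adj2 = {frozenset(p) for p in zip(g2, g2[1:])}
    return sum(1 for p in zip(g1, g1[1:]) if frozenset(p) not in adj2)


def median_score(candidate, genomes, dist=bp_distance):
    """Sum of distances from the candidate to each input genome."""
    return sum(dist(candidate, g) for g in genomes)


def median_lower_bound(genomes, dist=bp_distance):
    """Half the sum of pairwise distances (valid when dist is a metric)."""
    return sum(dist(a, b) for a, b in combinations(genomes, 2)) / 2


def brute_force_median(genomes, dist=bp_distance):
    """Try every gene order; only practical for very small genomes."""
    genes = sorted(genomes[0])
    return min(permutations(genes),
               key=lambda c: median_score(c, genomes, dist))


genomes = [(1, 2, 3, 4), (1, 2, 4, 3), (2, 1, 3, 4)]
best = brute_force_median(genomes)
print(median_score(best, genomes), median_lower_bound(genomes))  # 2 1.5
```

Because the search space grows factorially with the number of genes, practical solvers prune it with bounds such as the one above rather than enumerating every order.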

The

Phylogenetic reconstruction from genome rearrangements

Reconstructing phylogenies from genome rearrangement data is computationally much harder than from sequence data. For example, finding the minimum number of evolutionary events given a fixed tree can be done in linear time if the leaves are labeled with DNA or protein sequences, whereas the same task for genome rearrangement data is NP-hard even when the tree has only three leaves.

Methods for reconstructing trees based on genome rearrangement data include distance-based methods (for example, neighbor-joining

Results and discussion

We examine the performance of the new GRAPPA-TP through two simulation studies: the first measures the accuracy of the inferred median genome (estimated ancestor) compared to the true ancestor, using datasets of three input genomes; the second measures the accuracy of the inferred phylogeny compared to the true tree, using datasets of 10 genomes. All experiments are conducted on a Linux cluster with 152 Intel Xeon CPUs; each test task runs independently on a single CPU.

Accuracy of ancestor inference for three genomes

We first examine the quality of GRAPPA-TP in inferring ancestor genomes. In our simulation study, each genome has 37 or 100 genes, spanning the range from mitochondrial to chloroplast genomes.

We create each dataset by first generating a tree with three leaves and assigning its three edges with different lengths. The length (number of events) of each edge is sampled from a uniform distribution on the set {0.5

Given an estimated ancestor gene order _{M}, we can use the breakpoint distance between _{M }and _{0 }as a measurement of how close the inferred ancestor is to the true ancestor. For each dataset, we obtain the estimated ancestors by using the following five methods: GRAPPA-TP (TP), DCJ median solver (DCJ), MGR, breakpoint median solver (BP) and inversion median solver (INV). We repeat 100 times for each setting and the averages of the results are reported.

Figure

**Breakpoint distance from the inferred median to the true ancestor (37 genes)**. TP indicates the result obtained from GRAPPA-TP, INV the result obtained by using Caprara's inversion median solver, BP the result obtained by using the breakpoint median solver, MGR the result obtained by using MGR, and DCJ the result obtained by using the DCJ median solver.

**Breakpoint distance from the inferred median to the true ancestor (100 genes)**. TP indicates the result obtained from GRAPPA-TP, INV the result obtained by using Caprara's inversion median solver, BP the result obtained by using the breakpoint median solver, MGR the result obtained by using MGR, and DCJ the result obtained by using the DCJ median solver.

As described in the Methods section, GRAPPA-TP uses a simple distance estimator that conducts a randomized search, and the search may need to be repeated several times to obtain the smallest distance; the number of repeats may therefore affect its performance. To assess this effect, we compare GRAPPA-TP using two numbers of repeats, 1 and 10, and report the results in Figure

**Breakpoint distance from the inferred median to the true ancestor**. In this experiment, 1 and 10 repeats are used for the distance computation.

Accuracy of phylogeny inference

We also test the performance of GRAPPA-TP on phylogeny analysis. We first define our measure for the accuracy of reconstructed trees. Given an inferred tree, we compare its topological accuracy by computing
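As an illustration of a topological accuracy measure, the sketch below computes a Robinson-Foulds (RF) style error rate between a true tree and an inferred tree. It assumes that the RF error named in the figure caption is the intended measure, that trees are given as nested tuples of leaf labels, and that the error rate is the average of the false negative and false positive rates over non-trivial bipartitions. The names `splits` and `rf_error` are hypothetical.

```python
def splits(tree, all_leaves, ref):
    """Collect the non-trivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the side that does not contain `ref`.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        side = all_leaves - below if ref in below else below
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    leaves(tree)
    return found


def leaf_set(tree):
    """Return all leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def rf_error(true_tree, inferred_tree):
    """Average of false negative and false positive bipartition rates."""
    all_leaves = leaf_set(true_tree)
    ref = min(all_leaves)
    t = splits(true_tree, all_leaves, ref)
    i = splits(inferred_tree, all_leaves, ref)
    fn = len(t - i) / max(1, len(t))  # true edges missing from the inference
    fp = len(i - t) / max(1, len(i))  # inferred edges absent from the truth
    return (fn + fp) / 2


true_tree = (("A", "B"), ("C", ("D", "E")))
inferred = (("A", "C"), ("B", ("D", "E")))
print(rf_error(true_tree, inferred))  # 0.5
```

Here the two trees share the bipartition {D, E} but disagree on the other internal edge, so half of the edges are wrong in each direction.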

In our experiments, each dataset is tested using seven methods: GRAPPA-TP (TP), GRAPPA using inversion median (INV), GRAPPA using breakpoint median (BP), MGR, NJ using transposition distances (TP-NJ), NJ using inversion distances (INV-NJ) and NJ using breakpoint distances (BP-NJ). We cannot test our DCJ median here because the scoring procedure of GRAPPA-DCJ generates some median problem instances that are too difficult for it to run. Figure

**RF errors for seven methods under different expected numbers of events r**. The horizontal line indicates the acceptance threshold of 5% error rate.

We make the following two observations.

First, NJ has remarkably good performance when the genomes are close (

Second, GRAPPA-TP always returns highly accurate trees, although its performance is slightly worse than TP-NJ for

Although the number of genomes is relatively small in this test, the high accuracy of GRAPPA-TP makes it ideal as a base method for the DCM-GRAPPA developed by Tang et al.

Conclusion

In this paper, we present our new method to handle transpositions and report experimental results on simulated datasets. Although GRAPPA-TP uses a simple greedy distance estimator, it remains very accurate for transposition phylogeny. Our studies suggest that model match is very important in both ancestor inference and phylogenetic reconstruction. The main limitation of GRAPPA-TP is the accuracy and running time of its distance estimator; a fast and exact method to compute transposition distance remains highly desirable.

Methods

We extend GRAPPA to handle transpositions. The new method is named GRAPPA-TP, with two major extensions: a heuristic method to estimate transposition distance, and a new transposition median solver for three genomes.

Transposition distance estimation

Although no polynomial-time algorithm for transposition distance has been reported, researchers are able to estimate the distance using the 1.375-approximation by Hartman

The only existing software that can compute transposition distance is derange2 developed by Blanchette

The new distance estimator is based on the following observation: given two genomes _{1 }and _{2}, a transposition applied on _{2 }can reduce the number of breakpoints by 3, 2, 1 or 0, as shown in Figure

Number of breakpoint changes by applying different transpositions, compared to the identity permutation (1 2 3 4 5 6).

This observation suggests that computing the transposition distance can be recast as finding the smallest number of transpositions that brings the number of breakpoints to zero.

We develop a greedy method to quickly reduce the number of breakpoints to zero. The algorithm starts from _{2 }and moves towards _{1}. At each step, it enumerates all transpositions and applies to _{2 }the one that removes the most breakpoints; when several transpositions tie, it chooses one of them at random. The process continues until the number of breakpoints reaches 0 (i.e. _{2 }has been transformed into _{1}). The estimated transposition distance is the total number of steps used to transform _{2 }into _{1}.

The above algorithm is heuristic because, in some cases, a transposition that does not remove the most breakpoints at the current step may lead to better choices later. Thus, to obtain a more accurate distance, we can repeat the above process several times and report the smallest value as the distance. In our experiments, we found that no more than 10 repeats are needed. Because every run produces a valid sequence of transpositions, this algorithm always returns a distance that is greater than or equal to the edit distance.
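The greedy estimator with random tie-breaking and repeats can be sketched as follows. This is a simplified illustration rather than the GRAPPA-TP implementation: it assumes unsigned linear permutations of 1..n compared against the identity, framed by 0 and n+1, and all function names and the `seed` parameter are hypothetical. Under these assumptions some transposition always removes at least one breakpoint, so the loop terminates.

```python
import random


def breakpoints(perm):
    """Count breakpoints of a permutation of 1..n relative to the identity."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b != a + 1)


def transpositions(perm):
    """Yield every permutation obtained by swapping two adjacent blocks."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def greedy_distance(perm, rng):
    """One greedy run: always apply a transposition removing most breakpoints."""
    perm = tuple(perm)
    steps = 0
    while breakpoints(perm) > 0:
        best, candidates = None, []
        for nxt in transpositions(perm):
            b = breakpoints(nxt)
            if best is None or b < best:
                best, candidates = b, [nxt]
            elif b == best:
                candidates.append(nxt)
        perm = rng.choice(candidates)  # random choice among the ties
        steps += 1
    return steps


def estimate_distance(perm, repeats=10, seed=0):
    """Repeat the randomized greedy run and keep the smallest step count."""
    rng = random.Random(seed)
    return min(greedy_distance(perm, rng) for _ in range(repeats))


print(estimate_distance((2, 3, 1)))  # 1
print(estimate_distance((3, 2, 1)))  # 2
```

Each run returns the length of an actual sequence of transpositions, so the estimate is an upper bound on the true transposition distance; taking the minimum over repeats can only tighten it.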

Figure

Distance estimation results for 37 genes (left) and 100 genes (right).

One should note that this estimator will fail badly for some cases. For example, it only needs four steps to transform the reverse identity genome (7 6 5 4 3 2 1) into (1 2 3 4 5 6 7), while our estimator needs seven steps. However, such cases are very rare, as indicated by Figure

Transposition median solver

The next step is to develop a transposition median solver to handle the smallest binary trees of three edges. We develop a new median solver that is based on the branch-and-bound method proposed by Siepel and Moret

Given three genomes _{1}, _{2 }and _{3}, and a median genome

In general, the branch-and-bound approach works as follows:

• Given the three genomes _{1}, _{2 }and _{3}, compute the lower bound on the median score, _{i }and _{j}.

• Pick one genome as the start and push it into a queue; its median score is the initial best-so-far.

• Iteratively remove a genome

- If the median score of

- If the median score of

- create all

Clearly, since there are

(Bound 1) If _{1 }to the median

(Bound 2) If _{1 }to the median _{1 }to

_{1}, _{2}, _{3}, _{1}, _{2}, _{3},

In other words, we will ignore those neighbors that can take the search back more than one step.

When the genomes are relatively close, our distance estimation is near optimal, hence the above bounds are still effective. However, these bounds become loose when the genomes are distant, and a new and more effective set of lower bounds should be developed in the future.

The speed of our median solver is governed by two factors: the distance from the median to its closest leaf and the number of genes present. To make the genome length relatively unimportant, we condense the genomes using the concept of conserved adjacency: a gene pair (x, y) is a conserved adjacency if (x, y) or its inverse (-y, -x) is present in all genomes as consecutive elements
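The detection of conserved adjacencies can be sketched as follows. This is a minimal illustration assuming signed linear genomes given as lists of integers; the function names are hypothetical, and the actual condensation step (merging each conserved pair into a single gene) is not shown.

```python
def signed_adjacencies(genome):
    """Return all adjacencies of a genome, in both reading directions."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the inverse form of the same adjacency
    return adj


def conserved_adjacencies(genomes):
    """Gene pairs that are consecutive, up to inversion, in every genome."""
    common = signed_adjacencies(genomes[0])
    for genome in genomes[1:]:
        common &= signed_adjacencies(genome)
    # Report each adjacency once, choosing one orientation as canonical.
    return {max(p, (-p[1], -p[0])) for p in common}


genomes = [
    [1, 2, 3, 4, 5],
    [-3, -2, -1, 4, 5],
    [4, 5, 1, 2, 3],
]
print(sorted(conserved_adjacencies(genomes)))  # [(1, 2), (2, 3), (4, 5)]
```

In this example genes 1, 2, 3 and genes 4, 5 stay together in all three genomes, so they could be condensed into two blocks, which shortens the genomes the median solver has to search over.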

Phylogenetic analysis

For more than three genomes, computing phylogenies requires two main components: scoring a given tree, and searching for the best tree based on these scores. The scoring procedure we use is based on the iterative approach implemented in the original GRAPPA, shown as function

Algorithm overview for GRAPPA-TP.

The scoring procedure depends on the initial assignment of gene orders to internal nodes, which have no gene orders assigned when the scoring starts. Internal genomes can be initialized trivially, by giving each internal node a random gene order. However, since the initialization has a large impact on the convergence of the scoring procedure, more sophisticated methods have been developed, and all of them yield better results. The most commonly used initialization method is the

To search through the large tree space, we will enumerate all trees and use the tightened circular-ordering lower bounds to discard bad trees before scoring them _{ij }is the path between _{1,2 }+ _{2,3 }+ ⋯ + _{N,1}.

This triangular inequality immediately gives us a (circular ordering) lower bound for the tree score, i.e. the tree score
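The basic form of the circular-ordering lower bound can be sketched as follows; the tightened variant mentioned above is not shown. The sketch assumes a symmetric pairwise distance matrix that satisfies the triangle inequality and a leaf order obtained from a depth-first traversal of the tree. Walking around the tree in that order crosses every edge exactly twice, so half the sum of distances between consecutive leaves cannot exceed the tree score. The function name and the example matrix are hypothetical.

```python
def circular_lower_bound(dist, order):
    """Half the sum of distances between consecutive leaves in circular order.

    dist  -- symmetric matrix of pairwise genome distances
    order -- leaf indices in the order a depth-first traversal visits them
    """
    n = len(order)
    total = sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))
    return total / 2


# Hypothetical distances among four genomes.
dist = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
print(circular_lower_bound(dist, [0, 1, 2, 3]))  # 6.0
```

A tree whose bound already exceeds the best score found so far can be discarded without running the more expensive scoring procedure.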

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed to the development and implementation of the algorithms. FY and JT were in charge of conducting the simulations and analyzing the results.

Acknowledgements

FY and JT are supported by US National Institutes of Health (NIH grant number R01 GM078991-01) and by the University of South Carolina. MZ is supported by NSF of China No.60473099.

This article has been published as part of ^{th }International Conference on Bioinformatics and Bioengineering at Harvard Medical School. The full contents of the supplement are available online at