Center for Computational Biology, Beijing Forestry University, Beijing 100083, China

Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA

Abstract

Background

Inferring the gene orders of ancestral genomes has the potential to provide detailed information about the recent evolution of the species descended from them. Current popular tools for inferring ancestral genome data (such as GRAPPA and MGR) are parsimony-based direct optimization methods that aim to minimize the number of evolutionary events. Recently, a new method based on maximum likelihood was proposed. The current implementations of the direct optimization methods are all based on solving median problems and achieve more accurate results than the maximum likelihood method. However, both GRAPPA and MGR are extremely time consuming under high rearrangement rates. The maximum likelihood method, on the contrary, runs much faster but with less accurate results.

Results

We propose a mixture method to optimize the inference of ancestral gene orders. This method first uses the maximum likelihood approach to identify gene adjacencies that are likely to be present in the ancestral genomes; these adjacencies are then fixed in the branch-and-bound search of the median calculations. This hybrid approach not only greatly speeds up the direct optimization methods, but also retains high accuracy even when the genomes are evolutionarily very distant.

Conclusions

Our mixture method produces more accurate ancestral genomes than the maximum likelihood method while requiring far less computation time than the parsimony-based direct optimization methods. It can effectively deal with genome data of relatively high rearrangement rates, which the direct optimization methods find hard to solve in a reasonable amount of time, and thus extends the range of data that can be analyzed by existing methods.

Background

Inferring the gene orders and gene content of ancestral genomes has a wide range of applications. High-level rearrangement events such as inversions and transpositions that change gene order are important because they are "rare genomic events"

The fundamental question in building phylogenies is how far apart two species are from each other. Hannenhalli and Pevzner

In this paper, we propose a mixture method to enhance the inference of ancestral gene orders. We first use the maximum likelihood method to reconstruct an initial ancestral genome. Then we randomly select a number of gene adjacencies from the ancestral genome and fix them to run the median calculation. By analyzing the results of each median calculation, we try to infer the correct gene adjacencies generated by the maximum likelihood method. Finally, we fix these adjacencies and perform the median calculation to obtain the ancestral genome. We have conducted extensive experiments on simulated datasets, and the results show that this mixture approach is more accurate than the maximum likelihood method. Our method is also much faster than solely using the median calculation when the rearrangement rate is high. Using this hybrid approach, datasets that were previously too difficult for existing methods can be analyzed within a reasonable time frame with very high accuracy.

Genome rearrangements

Given a set

There are some additional events for multiple-chromosome genomes, such as

We use a breakpoint graph to represent gene orders. For example, for a genome 1, 4, 2, -3, 5, the vertex set is
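To make the construction concrete, here is a small sketch of how the adjacency edges of a breakpoint graph can be listed for a signed genome; the `t`/`h` extremity labels and the circularity assumption are our illustration choices, not fixed by the text.

```python
def breakpoint_graph_edges(genome):
    """Adjacency edges of the breakpoint graph of a signed, circular genome.
    Each gene x contributes two vertices: its tail 'xt' and head 'xh'.
    A positive gene is traversed tail->head, a negative gene head->tail."""
    def ends(g):
        # (entering vertex, leaving vertex) for a signed gene
        return (f"{abs(g)}t", f"{abs(g)}h") if g > 0 else (f"{abs(g)}h", f"{abs(g)}t")

    edges = []
    n = len(genome)
    for i in range(n):
        _, right = ends(genome[i])            # extremity leaving gene i
        left, _ = ends(genome[(i + 1) % n])   # extremity entering the next gene
        edges.append((right, left))
    return edges

# The example genome from the text, taken as circular:
print(breakpoint_graph_edges([1, 4, 2, -3, 5]))
```

Each listed edge joins the extremities of two genes that are adjacent in the genome; a genome with n genes always yields n such edges.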

An example of breakpoint graph

**An example of breakpoint graph.** Breakpoint graph of genome

Inferring ancestral gene order by median calculation

One popular approach to reconstructing the ancestral genome is through direct optimization, where the task is to seek the tree with the minimum number of events; this is similar in spirit to the maximum parsimony approach used in sequence phylogeny. The core of these methods is to solve the median problem, defined as follows: given a set of N genomes {G_i}_{1≤i≤N} and a distance measure d, find a genome G_M that minimizes the median score, defined as ∑_{1≤i≤N} d(G_M, G_i).
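As an illustration of the median score, the following sketch uses the breakpoint distance between signed circular genomes as the distance measure; the helper names and the choice of breakpoint distance are ours (GRAPPA-style solvers typically use rearrangement distances).

```python
def adjacency_set(genome):
    """Signed adjacencies of a circular genome; (a, b) and (-b, -a)
    denote the same adjacency, so we store a canonical form."""
    n = len(genome)
    return {min((genome[i], genome[(i + 1) % n]),
                (-genome[(i + 1) % n], -genome[i]))
            for i in range(n)}

def breakpoint_distance(g1, g2):
    """Adjacencies of g1 not shared with g2 (equal gene content assumed)."""
    return len(g1) - len(adjacency_set(g1) & adjacency_set(g2))

def median_score(candidate, genomes):
    """Median score: the sum of distances from the candidate to all inputs."""
    return sum(breakpoint_distance(candidate, g) for g in genomes)

# The three circular genomes from the adequate-subgraph figure:
print(median_score([1, 2, 3], [[1, 2, 3], [1, 2, -3], [1, 3, 2]]))  # prints 5
```

The median problem asks for the candidate that minimizes this sum over all possible gene orders, which is what makes it computationally hard.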

Widely used methods like GRAPPA

The median problem is known to be NP-hard

Figure

An example of adequate subgraph

**An example of adequate subgraph.** (a) An MBG based on three circular genomes {1, 2, 3} (thick lines), {1, 2, -3} (double lines), {1, 3, 2} (thin lines). (b) The reduced MBG with (2, 3) shrunk, because (2, 3) is an adequate subgraph. The dashed line indicates the median solution 1, 2, 3, which combines the median solutions of the two subgraphs.

By searching for existing adequate subgraphs up to a certain size and incorporating them into an exhaustive algorithm for the median problem, the ASMedian solver can significantly reduce the computation time of the median calculation, though it still runs very slowly at high gene rearrangement rates. Given

Reconstruct the ancestral genomes based on maximum likelihood

Another new approach to reconstructing the ancestral genome is based on maximum likelihood. Ma proposed a probabilistic framework for inferring ancestral gene orders with rearrangements including inversion, translocation and fusion

Because the gene order of any ancestral node is unknown, the probabilities of all possible gene adjacencies within the genome need to be calculated. Since each internal node of the phylogenetic tree has two children, and we assume that both children evolved from the parent node, the posterior probability of any gene adjacency within the parent node can be computed by treating the gene orders of the children as observed data. The calculation can be performed recursively from the leaf level upward until the root is reached, at which point the probabilities of all possible gene adjacencies within the target ancestral genome have been computed. Finally, an approximation algorithm is used to select and connect adjacencies so as to maximize the overall probability, and these selected adjacencies form the final inferred ancestral genome.
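The per-adjacency posterior can be illustrated with a deliberately simplified two-state model; the parameters `p_keep` and `p_gain` and the branch-independence assumption are ours, and Ma's actual model is considerably richer.

```python
def adjacency_posterior(adj, child_adjacency_sets,
                        p_keep=0.9, p_gain=0.05, prior=0.5):
    """Toy posterior probability that `adj` is present in the parent node,
    given the adjacency sets of its two children (treated as observed data).
    p_keep: chance the adjacency survives a branch if present in the parent.
    p_gain: chance it arises along a branch if absent from the parent."""
    like_present = like_absent = 1.0
    for child in child_adjacency_sets:
        seen = adj in child
        like_present *= p_keep if seen else 1.0 - p_keep
        like_absent *= p_gain if seen else 1.0 - p_gain
    evidence = prior * like_present + (1.0 - prior) * like_absent
    return prior * like_present / evidence

# An adjacency observed in both children gets a posterior near 1:
both = adjacency_posterior(("2h", "3t"), [{("2h", "3t")}, {("2h", "3t")}])
print(round(both, 3))
```

Applying this bottom-up at every internal node yields the per-adjacency probabilities that the approximation algorithm then assembles into a gene order.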

Methods

We propose a framework that combines the maximum likelihood and direct optimization approaches to infer ancestral gene orders. The maximum likelihood method runs much faster than the direct optimization methods at high genome rearrangement rates, but the latter infer more accurate gene orders. To speed up the latter, we need to reduce the computation time of the median calculation, as it is clearly the bottleneck. Our method tries to pick out the correctly inferred adjacencies from the maximum likelihood result and fix them in the median calculation. As a result, the median search space can be significantly reduced and the median calculation finishes much faster, while the accuracy of the inferred gene order is also improved.

Reduce median search space by fixing adjacencies

We use Xu's ASMedian solver, which is based on the adequate subgraph theory. Given genomes {G_i}_{1≤i≤N}, let the

Fix an adjacency (i, j) in the MBG

**Fix an adjacency (i, j) in the MBG**

If we can fix a number of adjacencies, an equal number of adequate subgraphs can be created. We can then shrink these subgraphs from the original multiple breakpoint graph, and the original median problem is greatly reduced. If the adjacencies that we fix are actual adjacencies present in the final solution, this fixing procedure does not affect the accuracy of the median solution.
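A back-of-the-envelope count shows why fixing adjacencies helps: candidate medians correspond to perfect matchings on the 2n gene extremities, of which there are (2n - 1)!!, and each fixed adjacency removes two vertices from the graph. This counting argument is our illustration, not taken from the text.

```python
from math import prod

def matchings(num_vertices):
    """Number of perfect matchings on an even number of vertices:
    (m - 1)!! for m vertices, i.e. the product of odd numbers below m."""
    return prod(range(1, num_vertices, 2))

def median_search_space(n_genes, n_fixed):
    """Each fixed adjacency removes two of the 2*n_genes extremity vertices,
    shrinking the space of candidate medians accordingly."""
    return matchings(2 * (n_genes - n_fixed))

print(median_search_space(20, 0))    # matchings on 40 vertices: astronomically many
print(median_search_space(20, 15))   # matchings on 10 vertices: 945
```

Fixing 15 of 20 adjacencies thus collapses the candidate space by many orders of magnitude, which is what makes the branch-and-bound search tractable at high rearrangement rates.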

Infer the correct adjacencies

Based on our observations and simulations, only some of the adjacencies inferred by the maximum likelihood method are correct adjacencies present in the ancestral genome. Ideally, we would pick out all the correct adjacencies and fix them for the median computation. However, it is very difficult to identify all the correct adjacencies inferred by the current maximum likelihood method: only a small percentage of adjacencies are reported with high probability, while the others usually have similarly small probabilities, which cannot be used to judge whether an adjacency is correct. Our method for identifying correct adjacencies is therefore a randomized approach. First, we keep adjacencies with higher probabilities as "correct". Then we randomly select a number of adjacencies from those with lower probabilities and fix them in the median computation using ASMedian. These randomly selected adjacencies may contain both correct and incorrect adjacencies, but the correct ones are more likely to lead the median computation to infer other correct adjacencies in the ancestral genome. Thus, after the median is found, we examine the non-fixed adjacencies in the "ancestral genome" generated by the median solver and record those that also appear in the "ancestral genome" generated by the maximum likelihood method. We repeat this procedure many times. Finally, for each adjacency generated by the maximum likelihood method, we obtain the number of times it appears in the resulting median genomes: the more often it appears, the more likely it is a correct adjacency.
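The randomized voting procedure described above can be sketched as follows, with `solve_median_with_fixed` standing in for one run of the fixed-adjacency median solver; the function names, the 72% default, and the voting details are our illustrative choices.

```python
import random
from collections import Counter

def vote_adjacencies(ml_adjacencies, solve_median_with_fixed,
                     select_frac=0.72, replicates=100, seed=0):
    """Randomly fix a fraction of the ML adjacencies, run the median solver,
    and count how often each non-fixed ML adjacency is rediscovered."""
    rng = random.Random(seed)
    pool = sorted(ml_adjacencies)      # deterministic order for sampling
    k = int(select_frac * len(pool))
    votes = Counter()
    for _ in range(replicates):
        fixed = set(rng.sample(pool, k))
        median_adjs = solve_median_with_fixed(fixed)
        # Only non-fixed adjacencies that the median rediscovers get a vote.
        for adj in median_adjs - fixed:
            if adj in ml_adjacencies:
                votes[adj] += 1
    return votes.most_common()         # most frequently rediscovered first
```

Adjacencies that collect many votes are then kept and fixed for the final median computation.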

The number of randomly selected adjacencies should not be too large: too many fixed adjacencies may include too many incorrect ones, polluting the median calculation and leading to poor results. On the other hand, the number of fixed adjacencies should not be too small, otherwise the time consumption of the median calculation is unacceptable because the search space remains too large. In practice, we randomly choose 70-75% of the adjacencies and fix them in the median computations. We also conducted extensive testing to verify that 70-75% is indeed a good choice; the results are shown later.

When more than three genomes are involved and a phylogeny is given, we first infer part of the correct adjacencies for each ancestral genome corresponding to an internal node. We then use our modified median solver to iteratively update the gene orders of these internal nodes with the fixed adjacencies. Each tree edge length is the genomic distance between its endpoints, and the score of the tree is defined as the sum of all tree edge lengths. The iteration stops when the score of the tree does not change, suggesting that convergence is reached. At that point, the gene orders contained in the internal nodes are reported as the final ancestral genomes.
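The iterative update can be organized around a tree-score function such as the following sketch; the function names and the toy distance used in the example are ours.

```python
def tree_score(edges, genomes, dist):
    """Score of a phylogeny: the sum of genomic distances over all tree edges.
    `edges` lists (u, v) node labels; `genomes` maps label -> gene order."""
    return sum(dist(genomes[u], genomes[v]) for u, v in edges)

def iterate_until_converged(update_internal_nodes, score):
    """Re-run the fixed-adjacency median updates until the tree score
    stops changing, then return the converged score."""
    prev = score()
    while True:
        update_internal_nodes()
        cur = score()
        if cur == prev:        # no change: convergence reached
            return cur
        prev = cur
```

In the full method, `update_internal_nodes` would recompute each internal node as a fixed-adjacency median of its neighbors, and `dist` would be a genomic distance.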

Results and discussion

Experimental methods

We use simulated data to assess the quality of our method, as we know the "true" ancestors in simulations. In our experiments, we use two models to generate the data. First, we generate model tree topologies from uniformly distributed binary trees, each with 10 leaves representing 10 genomes. We use different evolutionary rates of 20%, 24%, ..., 36% expected rearrangements per tree edge, with 50% relative fluctuation. We also use the model proposed by Lin and Moret

Once the simulated genomes are generated, we use our mixture method to infer the ancestral gene order data. Because we are handling genomes with equal gene content, accuracy can be measured as the number of correctly inferred adjacencies (inferred adjacencies that also appear in the original ancestral genome) divided by the total number of adjacencies for each genome.
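This accuracy measure can be computed directly from adjacency sets; here is a minimal sketch for signed circular genomes with equal gene content (helper names are ours).

```python
def adjacency_set(genome):
    """Signed adjacencies of a circular genome; (a, b) and (-b, -a)
    denote the same adjacency, so we store a canonical form."""
    n = len(genome)
    return {min((genome[i], genome[(i + 1) % n]),
                (-genome[(i + 1) % n], -genome[i]))
            for i in range(n)}

def adjacency_accuracy(inferred, true_ancestor):
    """Correctly inferred adjacencies divided by the total number of
    adjacencies (equal for both genomes under equal gene content)."""
    shared = adjacency_set(inferred) & adjacency_set(true_ancestor)
    return len(shared) / len(true_ancestor)

print(adjacency_accuracy([1, 2, -3], [1, 2, 3]))  # 1 of 3 adjacencies is shared
```

An accuracy of 1.0 means every inferred adjacency is present in the true ancestor.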

Number of replicates required

As discussed before, we use the maximum likelihood method to reconstruct the initial ancestral genome and then randomly select a number of gene adjacencies from the initial ancestral gene order data and fix them for the median computation. This procedure is repeated until each adjacency has been selected enough times. Finally, we record the number of occurrences of each gene adjacency in the median results (excluding the cases in which it was fixed in the median computation) and sort the adjacencies by number of occurrences. If the number of repeats is large enough, the ordering of the gene adjacencies remains almost unchanged.

According to our tests, when the number of replicates exceeds 100, the ordering of the gene adjacencies reaches a steady state and changes little as more repeated tests are conducted. Since we pick a certain percentage of adjacencies with higher numbers of occurrences, very small changes in the adjacency ordering hardly affect the result.

Adjacency selection percentage

When we sort the gene adjacencies by the number of their occurrences in the median calculation results, it is clear that the more often an adjacency appears, the more likely it is to be a correct adjacency. Thus we can simply pick the adjacencies with large numbers of occurrences and fix them in the median calculation. However, we need to determine the threshold for how many adjacencies should be fixed. We define the

To determine the selection percentage, we use various percentages, from 65% to 85%, to select the gene adjacencies and fix them to reconstruct ancestral genomes through median computation. We compare the average accuracy of the reconstructed gene order under different selection percentages. Table

Accuracy of reconstructed gene order (%) under various selection percentages and rearrangement rates with the uniformly distributed tree model

| Events per edge | **65% selected** | **70% selected** | **75% selected** | **80% selected** | **85% selected** |
| --- | --- | --- | --- | --- | --- |
| 20% | 89.73 | 91.93 | 90.47 | 89.53 | 89.24 |
| 24% | 80.28 | 81.92 | 81.07 | 80.56 | 79.81 |
| 28% | 71.59 | 73.12 | 73.28 | 71.10 | 71.63 |
| 32% | 63.33 | 64.52 | 65.72 | 65.25 | 64.81 |
| 36% | 59.45 | 60.87 | 62.43 | 60.27 | 60.15 |

Accuracy of reconstructed gene order (%) under various selection percentages and diameters with Lin's random tree model

| Tree diameter | **65% selected** | **70% selected** | **75% selected** | **80% selected** | **85% selected** |
| --- | --- | --- | --- | --- | --- |
| 2.0 | 93.49 | 94.37 | 93.97 | 93.01 | 92.89 |
| 2.5 | 89.02 | 89.93 | 89.02 | 88.31 | 88.18 |
| 3.0 | 84.75 | 86.51 | 85.25 | 84.58 | 84.45 |
| 3.5 | 80.21 | 82.04 | 82.93 | 80.58 | 80.50 |
| 4.0 | 79.14 | 79.84 | 80.96 | 79.37 | 78.75 |

From these tables, we can see that when the rearrangement rate or the tree diameter is relatively small (meaning the genomes are closely related), the accuracy reaches its maximum when the selection percentage is 70%. When the rearrangement rate or tree diameter grows larger, it is better to use 75% as the selection percentage. This is because with larger rearrangement rates, some correct adjacencies inferred by the maximum likelihood method are hard for the median calculation to recover, so we should use a larger selection percentage to include more adjacencies inferred by the maximum likelihood method.

Accuracy on closely related genomes

First, we compare the accuracy of the ancestral genomes reconstructed by our method with the results from both Ma's maximum likelihood approach and the pure ASMedian solver. Because the ASMedian solver is very time consuming, we only perform tests under lower rearrangement rates and diameters. We perform simulation tests using the two tree models described above with various rearrangement rates and tree diameters. For each configuration we run 50 datasets and average the results. Table

Reconstructed gene order accuracy comparison with low rearrangement rates under the uniformly distributed tree model

| Events per edge | **Ma's method** | **Pure median method** | **Our mixture method** |
| --- | --- | --- | --- |
| 12% | 92.28 | 98.53 | 98.49 |
| 16% | 86.37 | 96.25 | 96.12 |
| 20% | 81.54 | 91.76 | 92.13 |

Reconstructed gene order accuracy comparison with small diameters under Lin's random tree model

| Tree diameter | **Ma's method** | **Pure median method** | **Our mixture method** |
| --- | --- | --- | --- |
| 0.6 | 96.75 | 99.37 | 99.12 |
| 0.9 | 95.48 | 97.62 | 97.31 |
| 1.2 | 92.39 | 95.81 | 95.89 |

Accuracy on distant genomes

The ASMedian solver cannot finish on datasets with higher rates or diameters; thus, for distant genomes, we only compare the accuracy of the ancestral genomes reconstructed by our method with the results of Ma's maximum likelihood approach. To gauge the effectiveness of our new approach, we also compare against the optimal solution in which all correct adjacencies in the maximum likelihood results are identified, which is impossible in real data analysis but trivial in simulation since the truth is known. This optimal solution is also the upper bound that our mixture framework can achieve.

Figure

Comparison of the reconstructed gene adjacency accuracy with high rearrangement rates under uniform distribution binary trees

**Comparison of the reconstructed gene adjacency accuracy with high rearrangement rates under uniform distribution binary trees.**

Figure

Comparison of the reconstructed gene adjacency accuracy with high rearrangement rates under Lin's random tree model

**Comparison of the reconstructed gene adjacency accuracy with high rearrangement rates under Lin's random tree model.**

In conclusion, our method of randomly picking adjacencies from the maximum likelihood result achieves accuracy comparable to the optimal result obtained by picking all the correct adjacencies. For distant genomes, the original median-based parsimony methods cannot produce results in reasonable time, and we can expect that their results would not exceed the optimal result achieved by fixing part of the correct adjacencies. We therefore expect our method to achieve accuracy similar to that of the parsimony methods for distant genomes. The results also show that, although accuracy decreases as the genomes become more distant, our method is still more accurate than the original maximum likelihood method. However, for very distant genomes (rearrangement rate over 40% or diameter over 4.5), the maximum likelihood method can only produce results with very low accuracy, leaving too few correct adjacencies for our mixture method. If we keep the original selection percentage (70% to 75%), the accuracy of the inferred ancestral genome becomes very low. On the other hand, if we use a lower selection percentage, it becomes very difficult for the median calculation to finish in reasonable time, as too few adjacencies are fixed. Thus, inferring ancestral genomes for very distant genomes remains a challenging problem for our method.

Time consumption

We also compare the time consumption of our mixture method and the original parsimony method. The time cost of our mixture method includes the time spent on the maximum likelihood method and the time for the median calculations with multiple fixed adjacencies. As stated above, the median calculations with randomly selected adjacencies are independent of each other, so they can be performed in parallel. The tests were performed on a server with 2.4 GHz, 4-core CPUs. The average running time of the two methods is reported in seconds.

Table

Comparison of average time cost under uniform distribution binary trees

| **Rearrangement rate per edge** | **16%** | **20%** | **24%** | **28%** | **32%** |
| --- | --- | --- | --- | --- | --- |
| Pure median method | 315 | 15800 | 1.4 × 10^5 | N/A | N/A |
| Our mixture method | 180 | 760 | 1270 | 1620 | 1850 |
| Speed-up | 1.75 | 20.78 | 110.23 | N/A | N/A |

Comparison of average time cost under Lin's random tree model

| **Tree diameter** | **1.5** | **2.0** | **2.5** | **3.0** | **3.5** |
| --- | --- | --- | --- | --- | --- |
| Pure median method | 8500 | 76400 | 2 × 10^5 | N/A | N/A |
| Our mixture method | 430 | 840 | 1430 | 1920 | 2230 |
| Speed-up | 19.76 | 90.95 | 139.86 | N/A | N/A |

Conclusions

Ancestral gene order data can be inferred either by median computations or by the maximum likelihood method. Median calculation gives more accurate results but is extremely time consuming under high rearrangement rates; the maximum likelihood method runs much faster but infers less accurate gene orders. We propose a mixture method that builds on both approaches to improve both time cost and accuracy. The main idea of our method is to pick out correct gene adjacencies through a randomized procedure and to reduce the median computation cost by fixing these adjacencies. Experiments show that our mixture method produces more accurate ancestral genomes than the maximum likelihood method, while its computation time is far less than that of the pure median method.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YZ and JT contributed to the development and implementation of the algorithms, and FH was in charge of conducting simulations.

Acknowledgements

This work is partly supported by grant NSF 0904179.

This article has been published as part of