Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Canada K1N 6N5

Abstract

Background

Median construction is at the heart of several approaches to gene-order phylogeny. It has been observed that the solution to a median problem is generally not unique, and that alternate solutions may be quite different. Another concern has to do with a tendency for medians to fall on or near one of the three input orders, and hence to contain no information about the other two.

Results

We conjecture that as gene orders become more random with respect to each other, and as the number of genes increases, the breakpoint median for circular unichromosomal genomes, in both the unsigned and signed cases, tends to approach one of the input genomes (the "corners"), where closeness is measured by the distance normalized by the number of genes. Moreover, there are alternate solutions that approach each of the other inputs, so that the average distance between solutions is very large. We confirm these claims through simulations, and extend the results to medians of more than three genomes.

Conclusions

This effect also introduces serious biases into the medians of less scrambled genomes. It prompts a reconsideration of the role of the median in gene order phylogeny. Fortunately, for triples of finite length genomes, a small proportion of the median solutions escape the tendency towards the corners, and these are relatively close to each other. This suggests that a focused search for these solutions, though they represent a decreasing minority as genome length increases, is a way out of the pathological tendency we have described.

Background

The median problem, namely to construct a genome that minimizes the sum of its distances from three given genomes, is of biological interest because it is at the heart of several approaches to phylogenetic inference based on gene order. It is also of computational interest since it represents one of the major axes of generalization of simple pairwise gene order comparison, and most but not all versions are NP-hard.

One concern about the median problem, perhaps of more pertinence to applications than to theory, is that the solution is generally not unique, and that different solutions may be of considerable distance from each other (e.g.,

In this study, based on a series of simulations, we investigate the simplest median problems, that of unsigned genes under the breakpoint distance and that of signed genes under the breakpoint distance. We make use of a reduction of the problems into the Traveling Salesman Problem (TSP)

We generalize this conjecture to the case of the median of four or more genomes. We also conjecture that the phenomenon of medians "seeking corners" carries over to other distances often applied to gene orders. Finally we discuss how it fits in with more general ideas of loss of evolutionary signal as gene orders become increasingly rearranged.

The breakpoint median problem for circular chromosomes

For the unsigned case, we consider genomes modeled as (single) circular permutations on genes 1, …, _{1}, …,_{n }_{i}_{i}_{n }_{1 }

Consider two unsigned genomes _{1}, ..., _{n }_{1}, ..., _{n }

For a signed genome, each gene is assigned a positive or negative orientation. If gene

Given three genomes

More generally, for _{1}, …, _{k }
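The definitions in this section can be illustrated with a minimal Python sketch. It assumes the usual reading of the breakpoint distance: the number of adjacencies of one circular genome that are absent from the other, where a signed adjacency a b is treated as identical to -b -a. The function names (`unsigned_adjacencies`, `signed_adjacencies`, `breakpoint_distance`, `median_sum`) are hypothetical and chosen only for this illustration, and genomes are assumed to have at least three genes.

```python
from typing import FrozenSet, Sequence, Set, Tuple


def unsigned_adjacencies(genome: Sequence[int]) -> Set[FrozenSet[int]]:
    """Unordered pairs of neighbouring genes in a circular unsigned genome."""
    n = len(genome)
    return {frozenset((genome[i], genome[(i + 1) % n])) for i in range(n)}


def signed_adjacencies(genome: Sequence[int]) -> Set[Tuple[int, int]]:
    """Neighbouring gene pairs of a circular signed genome.

    Assumption: the adjacency (a, b) is the same as (-b, -a), i.e. the same
    pair read on the opposite strand, so each pair is stored in one
    canonical form.
    """
    n = len(genome)
    pairs = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        pairs.add(min((a, b), (-b, -a)))
    return pairs


def breakpoint_distance(g: Sequence[int], h: Sequence[int],
                        signed: bool = False) -> int:
    """Number of adjacencies of g that do not occur in h."""
    adj = signed_adjacencies if signed else unsigned_adjacencies
    return len(adj(g) - adj(h))


def median_sum(candidate: Sequence[int], inputs: Sequence[Sequence[int]],
               signed: bool = False) -> int:
    """Sum of breakpoint distances from a candidate genome to all inputs."""
    return sum(breakpoint_distance(candidate, g, signed) for g in inputs)


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6]
    b = [1, 3, 5, 2, 4, 6]
    c = [1, 4, 2, 6, 3, 5]
    print(breakpoint_distance(a, b), median_sum(a, [a, b, c]))
```

A median is then any genome that minimizes `median_sum` over all genomes on the same gene set; this sketch only scores a given candidate and does not search for one.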

The unichromosomal breakpoint median problems are known to be NP-hard (

Nevertheless, by reducing the

Given _{1}, …, _{k }_{1}, …, _{k}

A similar strategy transforms the median problem for the signed genome problem to the TSP.
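For the unsigned case, the idea behind the reduction can be sketched as follows. This is an illustrative construction under stated assumptions, not necessarily the exact formulation used in the study: each gene is a city, and the edge between genes a and b is given weight k minus the number of input genomes in which a and b are adjacent. Under this weighting the weight of a tour equals the median sum of the corresponding circular genome, so a minimum-weight tour is a median. The helper names are hypothetical, and the exhaustive search below is only a stand-in for a real TSP solver; it is feasible only for very small n.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def tsp_weights(inputs: Sequence[Sequence[int]], n: int) -> List[List[int]]:
    """Edge weights w[a][b] = k - (number of inputs where a, b are adjacent).

    Genes are labelled 1..n, so row and column 0 are unused.
    """
    k = len(inputs)
    count = [[0] * (n + 1) for _ in range(n + 1)]
    for g in inputs:
        for i in range(n):
            a, b = g[i], g[(i + 1) % n]
            count[a][b] += 1
            count[b][a] += 1
    return [[k - count[a][b] for b in range(n + 1)] for a in range(n + 1)]


def tour_weight(tour: Sequence[int], w: List[List[int]]) -> int:
    """Total weight of a closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(w[tour[i]][tour[(i + 1) % n]] for i in range(n))


def brute_force_medians(inputs: Sequence[Sequence[int]],
                        n: int) -> Tuple[int, List[Tuple[int, ...]]]:
    """Enumerate all circular genomes (n >= 3) and keep the lightest tours."""
    w = tsp_weights(inputs, n)
    best_cost = None
    best_tours: List[Tuple[int, ...]] = []
    for rest in permutations(range(2, n + 1)):
        if rest[0] > rest[-1]:
            continue  # skip the mirror image of a tour already examined
        tour = (1,) + rest
        cost = tour_weight(tour, w)
        if best_cost is None or cost < best_cost:
            best_cost, best_tours = cost, [tour]
        elif cost == best_cost:
            best_tours.append(tour)
    return best_cost, best_tours


if __name__ == "__main__":
    genomes = [[1, 2, 3, 4, 5, 6], [1, 3, 5, 2, 4, 6], [1, 4, 2, 6, 3, 5]]
    cost, tours = brute_force_medians(genomes, 6)
    print(cost, len(tours))
```

Returning every optimal tour, rather than a single one, makes the non-uniqueness of the median directly visible on small instances.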

The conjectures

We start with the unsigned case. For a given

Let

in other words, the set of genomes that are close to

We note that for all

because there can be no more than

We impose a uniform measure _{n }_{n}

as

We propose the following:

**Conjecture 1 "Medians Seek the Corners" **_{1}, …, _{k }are k genomes drawn at random from

It is important to note that not only would a median tend to be close to one of the input genomes _{1}, …, _{k}

**Corollary 1 **For _{1}, …, _{k }are k genomes drawn at random from

**Corollary 2 **_{1}, …, _{k }are k genomes drawn at random from _{1 }_{2 }

We now turn to the case of signed genomes. Here, not only are there (^{n }ways of assigning orientations to the genes. Thus the set ^{n}_{n }

Then Conjecture 1 and Corollaries 1 and 2 are also proposed for the signed case, where _{n }_{n}

The conjecture, and its corollaries, might seem counterintuitive, especially if the median is conceived of as being "in the middle" of the input genomes. For example we could imagine constructing a genome containing a proportion 1/

Results

While awaiting formal proof of the conjecture, or its disproof, we can offer some observations based on simulations.

To generate a random genome we applied a series of rearrangements to the identity permutation 1, …,

For signed genomes, we also randomized the orientation of each swapped gene.
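A minimal sketch of such a generator is given below. It assumes that each rearrangement is a swap of the genes at two randomly chosen positions, and that in the signed case each swapped gene independently receives a random orientation; the function name `random_genome` and its parameters are hypothetical and introduced only for illustration.

```python
import random
from typing import List


def random_genome(n: int, swaps: int, rng: random.Random,
                  signed: bool = False) -> List[int]:
    """Scramble the identity permutation 1..n with random position swaps.

    Assumption: one rearrangement exchanges the genes at two distinct
    positions. For signed genomes, each swapped gene is given a random
    orientation (sign) with probability 1/2 for each direction.
    """
    genome = list(range(1, n + 1))
    for _ in range(swaps):
        i, j = rng.sample(range(n), 2)
        genome[i], genome[j] = genome[j], genome[i]
        if signed:
            for pos in (i, j):
                sign = 1 if rng.random() < 0.5 else -1
                genome[pos] = sign * abs(genome[pos])
    return genome


if __name__ == "__main__":
    rng = random.Random(2012)  # arbitrary seed, chosen only for repeatability
    print(random_genome(10, 30, rng))
    print(random_genome(10, 30, rng, signed=True))
```

Because a swap of two positions disturbs at most four adjacencies, the number of swaps gives direct control over how scrambled the genome is relative to the identity.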

To get a sample of many alternative solutions to the median problem, we varied the seed used by Concorde to initialize its solution to the TSP. For our purposes it is desirable to sample uniformly from the entire set of medians for any one instance. Lacking an analysis of the internal workings of Concorde, we simply noted that the solutions seemed maximally diverse, as predicted by Corollary 2, and that they showed symmetric tendencies with respect to the presentation order of the input genomes; i.e., there was no tendency for more solutions to be close to one input genome than to another.
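The diversity of a sample of solutions can be summarized by the average pairwise breakpoint distance between them, optionally normalized by the number of genes. The following sketch handles the unsigned case only; the function names are hypothetical, and the solutions are assumed to have been produced elsewhere (for instance by repeated solver runs).

```python
from itertools import combinations
from typing import Sequence


def breakpoint_distance(g: Sequence[int], h: Sequence[int]) -> int:
    """Adjacencies of circular unsigned genome g that are absent from h."""
    def adjacencies(x: Sequence[int]) -> set:
        n = len(x)
        return {frozenset((x[i], x[(i + 1) % n])) for i in range(n)}
    return len(adjacencies(g) - adjacencies(h))


def average_pairwise_distance(solutions: Sequence[Sequence[int]],
                              normalize: bool = True) -> float:
    """Mean breakpoint distance over all unordered pairs of solutions."""
    if len(solutions) < 2:
        raise ValueError("need at least two solutions to compare")
    pairs = list(combinations(solutions, 2))
    mean = sum(breakpoint_distance(g, h) for g, h in pairs) / len(pairs)
    # Dividing by the number of genes puts the result on a 0..1 scale.
    return mean / len(solutions[0]) if normalize else mean


if __name__ == "__main__":
    sample = [[1, 2, 3, 4, 5, 6], [1, 3, 5, 2, 4, 6], [1, 4, 2, 6, 3, 5]]
    print(average_pairwise_distance(sample))
```

A normalized value near 1 indicates that the sampled solutions share almost no adjacencies with each other.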

The first set of simulations for unsigned genomes depicted in Figure

**Evolution of the average distance between median solutions as the input genomes become randomized**.

For these same simulations, Figure

**Evolution of the median sum as the input genomes become randomized**.

Simulations involving signed genomes gave very similar results to those depicted in Figures

Medians at the middle

In the simulations, most of the solutions to the median problem were distributed evenly among the neighborhoods of the three input genomes. But a few were approximately equidistant from all three of them, that is, at roughly the same breakpoint distance from each input genome. This did not affect the median sum trends since, of course, as medians, these have the same sum as the ones near the input genomes. They do, however, affect the average distance between solutions, since they are closer together and, more important, closer to all of the input medians than the latter are to each other.

To further investigate the role of these "medians in the middle" we measured the average distance of median solutions from the closest input genome, and counted the number of centrally located medians out of 50 for each simulation. To ensure randomness, the inputs were generated with 300 random swaps (each swap involving up to four new breakpoints) per 100 genes in a genome, so that there will remain very few adjacencies in common with the identity permutation and, especially, with the other input genomes. The results are depicted in Figure
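The two measurements described here can be sketched as below for the unsigned case. The criterion for calling a median "centrally located" is not reproduced in this text, so the rule used in the sketch (the distances to the input genomes differ by at most a fraction `tolerance` of the genome length) is purely an assumption for illustration, as are the function names and the default tolerance.

```python
from typing import Dict, List, Sequence


def breakpoint_distance(g: Sequence[int], h: Sequence[int]) -> int:
    """Adjacencies of circular unsigned genome g that are absent from h."""
    def adjacencies(x: Sequence[int]) -> set:
        n = len(x)
        return {frozenset((x[i], x[(i + 1) % n])) for i in range(n)}
    return len(adjacencies(g) - adjacencies(h))


def corner_distances(median: Sequence[int],
                     inputs: Sequence[Sequence[int]]) -> List[int]:
    """Distance from one median solution to each input genome."""
    return [breakpoint_distance(median, g) for g in inputs]


def is_central(median: Sequence[int], inputs: Sequence[Sequence[int]],
               tolerance: float = 0.1) -> bool:
    """Hypothetical test: all distances to the inputs are nearly equal."""
    d = corner_distances(median, inputs)
    return max(d) - min(d) <= tolerance * len(median)


def summarize(medians: Sequence[Sequence[int]],
              inputs: Sequence[Sequence[int]],
              tolerance: float = 0.1) -> Dict[str, float]:
    """Mean distance to the closest input, and number of central medians."""
    closest = [min(corner_distances(m, inputs)) for m in medians]
    central = sum(is_central(m, inputs, tolerance) for m in medians)
    return {"mean_closest": sum(closest) / len(closest),
            "central_count": central}


if __name__ == "__main__":
    genomes = [[1, 2, 3, 4, 5, 6], [1, 3, 5, 2, 4, 6], [1, 4, 2, 6, 3, 5]]
    print(summarize(genomes, genomes))
```

Applied to a batch of sampled medians for one instance, `summarize` gives the two quantities tracked in this experiment; any other definition of centrality could be substituted in `is_central` without changing the rest.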

**Median drift towards the corners and transience of the middle solution**. Left: unsigned genomes. Right: signed genomes.

In sum, while there are four types of median solution to each instance of the median problem with random input, three in the neighbourhoods of the input genomes, and one in the middle, the latter is of diminishing frequency; its measure goes to zero as

Generalization to higher k

Simulations with

**Evolution of the average distance between median solutions as k input genomes become randomized**.

Figure

**Evolution of the median sum as k input genomes become randomized**.

Taken together, Figures

Discussion

Although it would of course be good to have a proof of our conjecture and its corollaries, the simulations allow us a degree of confidence that they are true. There is a remote possibility that varying the seed used by Concorde does not lead to a uniform sample of median solutions, but this seems unlikely. One indication that there is no presentation-order artifact is that all three corners accumulate solutions to the same extent.

The solutions, of course, pertain only to random genomes. The gradual increase seen in Figure

These results imply that an unreflecting use of the median in comparing three even moderately scrambled genomes, and as the inner optimization step of a small phylogeny analysis, with ancestral gene reconstruction, is methodologically dangerous. A median at a corner contains no compromise information from the other two genomes. The tendency for the medians to seek a corner is a mathematical artifact of the notion of breakpoint or of some more general concept in the comparison of permutations, and should certainly not be attributed any biological significance.

All is not lost, however! Recall that we have actually identified four median tendencies, not three. (Or

**Median solutions increasingly concentrated (shaded regions) around corners and shrinking at compromise positions**.

As a consequence, we suggest that applications of median methods should entail the comparison of many alternative medians, the identification and discarding of those contaminated by the drift towards the corner, and the search for the rare median that genuinely reflects a compromise among the input genomes. This may be done in an objective way since the set of medians will have four regions of high probability in the space of genomes, separated by large regions of low probability. Most of the probability will be concentrated on the neighborhoods of the input genomes. Finding the "poor cousin" in the middle may require the generation of large numbers of candidate solutions, but given the computing resources, this seems imperative if we want to make biological sense.

The computational difficulty traditionally ascribed to the median problem, especially when the input genomes are highly rearranged with respect to each other, would seem to preclude this approach. With breakpoint medians, however, computing time need not be a problem. Use of an efficient TSP solver allowed us to find medians when

Finally, we offer a further conjecture, which seems compelling to us, but for which we have only rough justification, and which moreover is unlikely to win many believers. We conjecture that breakpoint medians for the minimum reversals metric or the double-cut-and-join metric will also seek the corners as genomes become longer and more rearranged, although this effect may require relatively large

Authors' contributions

MH and DS did the research and wrote the paper.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). DS holds the Canada Research Chair in Mathematical Genomics.

This article has been published as part of