Open Access Methodology article

Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf

Gabriel Cardona, Arnau Mir, Francesc Rosselló*, Lucía Rotger and David Sánchez

Author Affiliations

Department of Mathematics and Computer Science, University of the Balearic Islands, E-07122 Palma de Mallorca, Spain

BMC Bioinformatics 2013, 14:3  doi:10.1186/1471-2105-14-3

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/3


Received: 17 July 2012
Accepted: 18 December 2012
Published: 16 January 2013

© 2013 Cardona et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Phylogenetic tree comparison metrics are an important tool in the study of evolution, and hence the definition of such metrics is an interesting problem in phylogenetics. In a paper in Taxon fifty years ago, Sokal and Rohlf proposed to measure quantitatively the difference between a pair of phylogenetic trees by first encoding them by means of their half-matrices of cophenetic values, and then comparing these matrices. This idea has been used several times since then to define dissimilarity measures between phylogenetic trees but, to our knowledge, no proper metric on weighted phylogenetic trees with nested taxa based on this idea has been formally defined and studied yet. Actually, the cophenetic values of pairs of different taxa alone are not enough to single out phylogenetic trees with weighted arcs or nested taxa.

Results

For every (rooted) phylogenetic tree T, let its cophenetic vector φ(T) consist of the cophenetic values of all pairs of taxa in T together with the depths of all taxa in T. It turns out that these cophenetic vectors single out weighted phylogenetic trees with nested taxa. We then define a family of cophenetic metrics d_{φ,p} by comparing these cophenetic vectors by means of L^p norms, and we study, either analytically or numerically, some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics.

Conclusions

The cophenetic metrics can be safely used on weighted phylogenetic trees with nested taxa and no restriction on degrees, and they can be computed in O(n^2) time, where n stands for the number of taxa. The metrics d_{φ,1} and d_{φ,2} have positively skewed distributions, and they show a low rank correlation with the Robinson-Foulds metric and the nodal metrics, and a very high correlation with each other and with the splitted nodal metrics. The diameter of d_{φ,p}, for p ≥ 1, is in O(n^{(p+2)/p}), and thus for low p they are more discriminative, having a wider range of values.

Background

Many phylogenetic trees published in the literature or included in phylogenetic databases are actually alternative phylogenies for the same sets of organisms, obtained from different datasets or using different evolutionary models or different phylogenetic reconstruction algorithms [1]. This variety of phylogenetic trees makes it necessary to develop methods for measuring their differences [2, Chapter 30]. The comparison of phylogenetic trees is also used to compare phylogenetic trees obtained through numerical algorithms with other types of hierarchical classifications [3,4], to assess the stability of reconstruction methods [5], and in the comparative analysis of dendrograms and other hierarchical cluster structures [6,7]. Hence, and since the safest way to quantify the differences between a pair of trees is through a metric, “tree comparison metrics are an important tool in the study of evolution” [8].

Many metrics for the comparison of phylogenetic trees have been proposed so far [2, Chapter 30]. Some of these metrics are edit distances that count how many operations of a given type are necessary to transform one tree into the other. These metrics include the nearest-neighbor interchange metric [9] and the subtree prune-and-regrafting distance [10]. Other metrics compare a pair of phylogenetic trees through some consensus subtree. This is the case for instance of the MAST distances defined in [11-13]. Finally, many metrics for phylogenetic trees are based on the comparison of encodings of the phylogenetic trees, like for instance the Robinson-Foulds metric [14,15] (which can also be understood as an edit distance), the triples metric [16], the classical nodal metrics for binary phylogenetic trees [5,8,17-19], and the splitted nodal metrics for arbitrary phylogenetic trees [20]. The advantage of this last kind of metrics is that, unlike the edit and the consensus distances, they are usually computed in low polynomial time.

In a paper published fifty years ago [4], Sokal and Rohlf proposed a technique to compare dendrograms (which, in their paper, were equivalent to weighted phylogenetic trees without nested taxa) on the same set of taxa, by encoding them by means of their half-matrices of cophenetic values, and then comparing these structures. Their method runs as follows. To begin with, they divide the range of depths of internal nodes in the tree into a suitable number of equal intervals and number these intervals in increasing order. Then, for each pair of taxa i, j in the tree, they compute their cophenetic value as the class mark of the interval where the depth of their lowest common ancestor lies. Then, to compare two phylogenetic trees, they compare their corresponding half-matrices of cophenetic values. In that paper, they do it specifically by calculating a correlation coefficient between their entries. Sokal and Rohlf’s paper [4] is widely cited (612 citations according to Google Scholar on July 1, 2012) and their method has often been used to compare hierarchical classifications (see, for instance, [21-23]).

Since Sokal and Rohlf’s paper, other papers have compared the half-matrices of cophenetic values to define dissimilarity measures between phylogenetic trees (see, for instance, [3,24]), and such half-matrices have also been used in the so-called “comparative method”, the statistical methods used to make inferences on the evolution of a trait among species from the distribution of other traits: see [25,26] and [2, Chapter 25]. But, to our knowledge, no proper metric for phylogenetic trees based on cophenetic values has been formally defined and studied in the literature. In this paper we define a new family of metrics for weighted phylogenetic trees with nested taxa based on Sokal and Rohlf’s idea and we study some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics.

Our approach differs from Sokal and Rohlf’s in some minor points. For instance, we use as the cophenetic value φ(i, j) of a pair of taxa i, j the actual depth of the lowest common ancestor of i and j, instead of class marks, which Sokal and Rohlf used because of practical limitations. Moreover, instead of using a correlation coefficient, we define metrics by using L^p norms. Finally, we do not restrict ourselves to dendrograms, which have no internal labeled nodes, but we also allow nested taxa.

There is, however, one major difference between our approach and Sokal and Rohlf’s. We consider not only the cophenetic values of pairs of taxa, but also the depths of the taxa. We must do so because we want to define a metric, where zero distance means isomorphism, and the cophenetic values of pairs of different taxa alone do not single out even the dendrograms considered by Sokal and Rohlf. That is, two non-isomorphic weighted phylogenetic trees without nested taxa on the same set of taxa can have the same vectors of cophenetic values; see Figure 1.

Figure 1. An unweighted phylogenetic tree on 7 taxa.

It turns out that the cophenetic vector consisting of all cophenetic values of pairs of taxa and the depths of all taxa characterizes a weighted phylogenetic tree with nested taxa. This fact comes from the well known relationship between cophenetic values and patristic distances. If we denote by δ(i) the depth of a taxon i, by φ(i,j) the cophenetic value of a pair of taxa i,j and by d(i,j) the distance between i and j, then [27]

\[ d(i,j) = \delta(i) + \delta(j) - 2\varphi(i,j). \]

So, if the depths of the taxa are known, the knowledge of the cophenetic values of pairs of taxa is equivalent to the knowledge of the additive distance defined by the tree. In turn, the depths and the additive distance single out the unrooted semi-labelled weighted tree associated to the phylogenetic tree with the former root labeled with a specific label “root”, and hence the phylogenetic tree itself: cf. Theorem 1.

The fact that cophenetic vectors single out weighted phylogenetic trees with nested taxa can also be deduced from their relationship with splitted path lengths [20]. Recall that the splitted path length ℓ(i, j) is the distance from the lowest common ancestor of i and j to i. It is known [20, Thm. 10] that the matrix (ℓ(i, j))_{i,j} characterizes a weighted phylogenetic tree with nested taxa. Since, obviously,

\[ \ell(i,j) = \delta(i) - \varphi(i,j), \]

the cophenetic vector uniquely determines the matrix of splitted path lengths, and hence the tree.a

The vector of cophenetic values of pairs of different taxa is also related to the notion of ultrametric [28,29]. Indeed, notice that -φ satisfies the three-point condition of ultrametrics: for all taxa i, j, k,

\[ -\varphi(i,j) \le \max\{-\varphi(i,k),\, -\varphi(j,k)\}. \]

But -φ is not an ultrametric, as φ(i, i) = δ(i) ≠ 0. Actually, φ can only be used to define an ultrametric precisely on ultrametric trees, where the depths of all leaves are the same, say Δ. In this case, Δ - φ is the ultrametric defined by the tree. In particular, ultrametric trees can be compared by comparing their vectors of cophenetic values of pairs of different taxa. A similar idea is used in [30] to induce an average genetic distance between populations from the average coancestry coefficient.
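As a small numerical illustration of this paragraph, the following sketch checks the three-point condition on a hand-built example. The tree, the table `PHI` and all function names are hypothetical and chosen only for illustration; the check ranges over triples of distinct taxa.

```python
from itertools import permutations

# Cophenetic values of a small ultrametric tree on taxa 1, 2, 3 (hypothetical example):
# root -> x (weight 1), root -> 3 (weight 2), x -> 1 (weight 1), x -> 2 (weight 1).
# Every leaf has depth DELTA = 2; PHI[(i, i)] stores the depth of taxon i.
DELTA = 2
TAXA = [1, 2, 3]
PHI = {(1, 1): 2, (2, 2): 2, (3, 3): 2, (1, 2): 1, (1, 3): 0, (2, 3): 0}

def phi(i, j):
    """Cophenetic value of the unordered pair i, j."""
    return PHI[(min(i, j), max(i, j))]

def three_point(u):
    """Check u(i, j) <= max(u(i, k), u(j, k)) for all triples of distinct taxa."""
    return all(u(i, j) <= max(u(i, k), u(j, k))
               for i, j, k in permutations(TAXA, 3))

def neg_phi(i, j):
    return -phi(i, j)

def delta_minus_phi(i, j):
    return DELTA - phi(i, j)

neg_phi_three_point = three_point(neg_phi)
neg_phi_zero_diagonal = all(neg_phi(i, i) == 0 for i in TAXA)
ultra_three_point = three_point(delta_minus_phi)
ultra_zero_diagonal = all(delta_minus_phi(i, i) == 0 for i in TAXA)
```

In this example both -φ and Δ - φ pass the three-point check, but only Δ - φ vanishes on the diagonal, which is the property that -φ lacks.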

We would like to dedicate this paper to the memory of Robert R. Sokal, father of the field of numerical taxonomy, who passed away last April. His ideas permeate biostatistics and computational phylogenetics.

Notations

A rooted tree is a directed finite graph that contains a distinguished node, called the root, from which every node can be reached through exactly one path. A weighted rooted tree is a pair (T, ω) consisting of a rooted tree T = (V, E) and a weight function ω : E → ℝ_{>0} that associates to every arc e ∈ E a positive real number ω(e) > 0. We identify every unweighted (that is, where no weight function has been explicitly defined) rooted tree T with the weighted rooted tree (T, ω) where ω is the constant function with value 1.

Let T = (V, E) be a rooted tree. Whenever (u, v) ∈ E, we say that v is a child of u and that u is the parent of v. Two nodes with the same parent are siblings. The nodes without children are the leaves of the tree, and the other nodes (including the root) are called internal. A pendant arc is an arc ending in a leaf. The nodes with exactly one child are called elementary. A tree is binary, or fully resolved, when every internal node has exactly two children.

Whenever there exists a path from a node u to a node v, we shall say that v is a descendant of u and also that u is an ancestor of v, and we shall denote it by v ≼ u; if, moreover, u ≠ v, we shall write v ≺ u. The lowest common ancestor (LCA) of a pair of nodes u, v of a rooted tree T, in symbols [u, v]T, is the unique common ancestor of them that is a descendant of every other common ancestor of them. Given a node v of a rooted tree T, the subtree of T rooted at v is the subgraph of T induced on the set of descendants of v (including v itself). A rooted subtree is a cherry when it has 2 leaves, a triplet, when it has 3 leaves, and a quartet, when it has 4 leaves.

The distance from a node u to a descendant v of it in a weighted rooted tree T is the sum of the weights of the arcs in the unique path from u to v. In an unweighted rooted tree, this distance is simply the number of arcs in this path. The depth of a node v, in symbols δT(v), is the distance from the root to v.

Let S be a non-empty finite set of labels, or taxa. A (weighted) phylogenetic tree on S is a (weighted) rooted tree with some of its nodes bijectively labeled in the set S, including all its leaves and all its elementary nodes except possibly the root (which can be elementary but unlabeled). The reasons why we allow unlabeled elementary roots are that our results are still valid for phylogenetic trees containing them, and that even if we forbid them, we would need in some proofs to use that Theorem 1 below is true for phylogenetic trees containing them. Moreover, it is not uncommon to add an unlabeled elementary root to a phylogenetic tree in some contexts: see, for instance, the phylogenetic trees depicted in Wikipedia’s entry “Phylogenetic tree”.b

In a phylogenetic tree, we shall always identify a labeled node with its taxon. The internal labeled nodes of a phylogenetic tree are called nested taxa. Notice in particular that a phylogenetic tree without nested taxa cannot have elementary nodes other than the root. Although in practice S may be any set of taxa, to fix ideas we shall usually take S = {1, …, n}, with n the number of labeled nodes of the tree, and we shall use the term phylogenetic tree with n taxa to refer to a phylogenetic tree on this set.

Given a set S of taxa, we shall consider the following spaces of phylogenetic trees:

• $\mathcal{WT}(S)$, of all weighted phylogenetic trees on S

• $\mathcal{UT}(S)$, of all unweighted phylogenetic trees on S

• $\mathcal{T}(S)$, of all unweighted phylogenetic trees on S without nested taxa

• $\mathcal{BT}(S)$, of all binary unweighted phylogenetic trees on S without nested taxa

When S = {1, …, n}, we shall simply write $\mathcal{WT}_n$, $\mathcal{UT}_n$, $\mathcal{T}_n$, and $\mathcal{BT}_n$, respectively.

Two phylogenetic trees T and T′ on the same set S of taxa are isomorphic when they are isomorphic as directed graphs and the isomorphism sends each labeled node of T to the labeled node with the same label in T′. An isomorphism of weighted phylogenetic trees is also required to preserve arc weights. We shall make the abuse of notation of saying that two isomorphic trees are actually the same, and hence of denoting that two trees T, T′ are isomorphic by simply writing T = T′.

Methods

Cophenetic vectors

Let S be henceforth a non-empty set of taxa with |S| = n, which without any loss of generality we identify with {1, …, n}. Let T be a weighted phylogenetic tree on S. For every pair of different taxa i, j in T, their cophenetic value is the depth of their LCA:

\[ \varphi_T(i,j) = \delta_T([i,j]_T). \]

To simplify the notations, we shall often write φT(i, i) to denote the depth δT(i) of a taxon i.

The cophenetic vector of T is

\[ \varphi(T) = \big(\varphi_T(i,j)\big)_{1 \le i \le j \le n} \in \mathbb{R}^{n(n+1)/2}, \]

with its elements lexicographically ordered in (i, j).

Example 1

If T is the unweighted phylogenetic tree in Figure 2, then φ(T) is the vector obtained by lexicographically ordering in (i, j) the elements of Table 1.

Figure 2. Three non-isomorphic trees with the same vector of cophenetic values of pairs of different taxa.

Table 1. Cophenetic values of the pairs of taxa in the phylogenetic tree T in Figure 2

The cophenetic vectors single out weighted phylogenetic trees with nested taxa.

Theorem 1

For every pair of weighted phylogenetic trees T, T′ on S, if φ(T) = φ(T′), then T = T′.

Proof

Let r be a symbol not belonging to S and let X = S ∪ {r}. Recall that a weighted X-tree is an undirected weighted tree T with set of nodes V endowed with a (not necessarily injective) node-labeling mapping f : X → V such that f(X) contains all the leaves and all the degree-2 nodes in T [31].

For every weighted phylogenetic tree T on S, consider the weighted X-tree obtained by regarding T as undirected and adding the label r to its former root. Then, the distance d on this X-tree between pairs of labels in X is uniquely determined by φ(T) in the following way: for every i, j ∈ S,

\[ d(i,j) = \varphi_T(i,i) + \varphi_T(j,j) - 2\varphi_T(i,j), \qquad d(r,i) = \varphi_T(i,i). \]

Now, this X-tree is singled out by d [31, Thm. 7.1.8]. Since T is uniquely determined from its X-tree and the knowledge of the root (that is, the node labeled with r), we deduce that φ(T) singles out T. □
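The first step of this proof, recovering the distances between all labels in X = S ∪ {r} from the cophenetic vector alone, can be sketched as follows. This is only an illustrative sketch: the function name `x_tree_distances`, the dictionary layout of `phi` (keys (i, j) with i ≤ j, and `phi[(i, i)]` holding the depth of i) and the example tree are assumptions made here.

```python
def x_tree_distances(phi, taxa, root_label="r"):
    """Distances between all labels in X = taxa + [root_label], computed from phi only.

    phi maps (i, j) with i <= j to the cophenetic value; phi[(i, i)] is the depth of i.
    """
    dist = {(root_label, root_label): 0}
    for i in taxa:
        # The distance from the root to a taxon is the depth of the taxon.
        dist[(root_label, i)] = dist[(i, root_label)] = phi[(i, i)]
        for j in taxa:
            lca_depth = phi[(min(i, j), max(i, j))]
            # Path from i up to the LCA, then down to j.
            dist[(i, j)] = phi[(i, i)] + phi[(j, j)] - 2 * lca_depth
    return dist

# Hypothetical unweighted tree: root -> x, root -> 3, x -> 1, x -> 2.
phi = {(1, 1): 2, (1, 2): 1, (1, 3): 0, (2, 2): 2, (2, 3): 0, (3, 3): 1}
dist = x_tree_distances(phi, [1, 2, 3])
```

Here `dist[(1, 2)]` is 2 and `dist[(1, 3)]` is 3, which are the path lengths in the example tree. Reconstructing the tree itself from these distances is the part of the argument delegated to [31, Thm. 7.1.8] and is not attempted in this sketch.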

This result implies that the vectors of cophenetic values of pairs of different taxa single out unweighted phylogenetic trees without nested taxa.

Corollary 1

For every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M24">View MathML</a>, let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M25">View MathML</a><a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M26">View MathML</a>, with its elements lexicographically ordered in (i, j). Then, for every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M27">View MathML</a>, if <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M28">View MathML</a>, then T = T′.

Proof

If T is unweighted and without nested taxa, then, for every taxon i,

\[ \delta_T(i) = \max\{\varphi_T(i,j) \mid j \neq i\} + 1, \]

and therefore, in this case, φ(T) is uniquely determined by <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M30">View MathML</a>. □

But in order to single out phylogenetic trees with non-constant weights in the arcs or with nested taxa, it is necessary to take into account also the depths of the leaves. For example, there is no way to reconstruct the weights of the pendant arcs from the cophenetic values of pairs of different taxa alone: the depths of the leaves are needed. Likewise, without being able to compare depths with cophenetic values, there is no way to tell whether a taxon is nested or not. More specifically, the three trees in Figure 1 have the same value of φ(1, 2), and hence the same vector of cophenetic values of pairs of different taxa, but they are not isomorphic as weighted phylogenetic trees.

The cophenetic vector φ(T) of a weighted phylogenetic tree T on S can be computed in optimal O(n^2) time (assuming a constant cost for the addition of real numbers) by computing, for each internal node v, its depth δ_T(v) through a preorder traversal of T, and the pairs of taxa of which v is the LCA through a postorder traversal of the tree. Both preorder and postorder traversals are performed in linear time on the usual tree data structures.
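The two-pass computation just described can be sketched as follows. This is a minimal sketch under assumptions of our own, not a reference implementation: the tree is given as a hypothetical dictionary `children` mapping each node to a list of (child, arc weight) pairs, taxa are integers, unlabeled nodes are strings, and the top-down pass visits nodes level by level rather than in strict preorder (any order that handles parents before children serves the same purpose).

```python
from itertools import combinations_with_replacement

def cophenetic_vector(children, root, taxa):
    """Return the cophenetic vector as a list ordered lexicographically in (i, j), i <= j.

    children: dict mapping a node to a list of (child, arc_weight) pairs.
    taxa: labeled nodes (leaves or nested taxa), assumed mutually comparable.
    """
    # Top-down pass: depth of every node.
    depth = {root: 0.0}
    order = [root]
    for u in order:                          # 'order' grows while we iterate
        for v, w in children.get(u, []):
            depth[v] = depth[u] + w
            order.append(v)

    # Bottom-up pass: u is the LCA of taxa coming from different child subtrees,
    # and of u itself (when labeled) with every taxon below it.
    labeled = set(taxa)
    below = {}                               # node -> taxa in its subtree
    phi = {}
    for u in reversed(order):                # children are handled before parents
        groups = [below[v] for v, _ in children.get(u, [])]
        if u in labeled:
            groups.append([u])
        seen = []
        for group in groups:
            for i in group:
                for j in seen:
                    phi[(min(i, j), max(i, j))] = depth[u]
            seen.extend(group)
        below[u] = seen
    for i in taxa:
        phi[(i, i)] = depth[i]               # phi(i, i) denotes the depth of i

    return [phi[pair] for pair in combinations_with_replacement(sorted(taxa), 2)]

# Hypothetical unweighted example: root -> x, root -> 3, x -> 1, x -> 2.
children = {"root": [("x", 1.0), (3, 1.0)], "x": [(1, 1.0), (2, 1.0)]}
vector = cophenetic_vector(children, "root", [1, 2, 3])
```

For this example the entries, ordered as (1,1), (1,2), (1,3), (2,2), (2,3), (3,3), are 2, 1, 0, 2, 0, 1. Each pair of taxa is assigned exactly once, at its LCA, which is where the quadratic cost comes from.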

Cophenetic metrics

As we have seen in Theorem 1, the mapping

\[ \varphi : T \mapsto \varphi(T) \in \mathbb{R}^{n(n+1)/2} \]

that sends each weighted phylogenetic tree T on S to its cophenetic vector φ(T), is injective up to isomorphism. As is well known, this allows one to induce metrics on the space of weighted phylogenetic trees on S from metrics defined on powers of ℝ. In particular, every L^p norm ∥ · ∥_p on ℝ^{n(n+1)/2}, p ≥ 1, induces a cophenetic metric d_{φ,p} on this space by means of

\[ d_{\varphi,p}(T_1,T_2) = \|\varphi(T_1) - \varphi(T_2)\|_p. \]

Recall that

\[ \|(x_1,\ldots,x_m)\|_p = \Big(\sum_{i=1}^{m} |x_i|^p\Big)^{1/p}, \]

and so, for instance,

\[ d_{\varphi,1}(T_1,T_2) = \sum_{1 \le i \le j \le n} |\varphi_{T_1}(i,j) - \varphi_{T_2}(i,j)|, \qquad d_{\varphi,2}(T_1,T_2) = \sqrt{\sum_{1 \le i \le j \le n} \big(\varphi_{T_1}(i,j) - \varphi_{T_2}(i,j)\big)^2} \]

are the cophenetic metrics on the space of weighted phylogenetic trees on S induced by the Manhattan L^1 and the Euclidean L^2 norms. One can also use Donoho’s L^0 “norm” (which, actually, is not a proper norm)

\[ \|(x_1,\ldots,x_m)\|_0 = |\{\, i \mid x_i \neq 0 \,\}| \]

to induce a metric d_{φ,0}(T1, T2) on this space, which turns out to be simply the Hamming distance between φ(T1) and φ(T2).
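Once two cophenetic vectors are available, the metrics above reduce to a few lines of arithmetic. The sketch below assumes the vectors are plain sequences of equal length listed in the same (i, j) order; the function name `cophenetic_distance` and the two example vectors are hypothetical.

```python
def cophenetic_distance(phi1, phi2, p):
    """Distance between two cophenetic vectors given as equal-length sequences.

    p = 0 gives the Hamming distance; p >= 1 gives the L^p distance.
    """
    if len(phi1) != len(phi2):
        raise ValueError("cophenetic vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(phi1, phi2)]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    if p < 1:
        raise ValueError("p must be 0 or at least 1")
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Hypothetical vectors for two unweighted trees on taxa 1, 2, 3, ordered as
# (1,1), (1,2), (1,3), (2,2), (2,3), (3,3):
phi_cherry = [2, 1, 0, 2, 0, 1]   # root -> x, root -> 3, x -> 1, x -> 2
phi_star = [1, 0, 0, 1, 0, 1]     # root -> 1, root -> 2, root -> 3

d0 = cophenetic_distance(phi_cherry, phi_star, 0)
d1 = cophenetic_distance(phi_cherry, phi_star, 1)
d2 = cophenetic_distance(phi_cherry, phi_star, 2)
```

The two vectors differ in three entries, each by one unit, so the Hamming and Manhattan distances are both 3 and the Euclidean distance is the square root of 3.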

As we have seen in the previous subsection, the cophenetic vector of a weighted phylogenetic tree on S can be computed in O(n^2) time. For every pair of weighted phylogenetic trees T1, T2 on S, and assuming a constant cost for the addition and product of real numbers, the cost of computing d_{φ,0}(T1, T2) (as the number of non-zero entries of φ(T1) - φ(T2)) is O(n^2), and the cost of computing d_{φ,p}(T1, T2)^p, for p ≥ 1 (as the sum of the p-th powers of the absolute values of the entries of the difference φ(T1) - φ(T2)) is O(n^2 + log_2(p) n^2), which is again O(n^2) if we understand log_2(p) as part of the constant factor. Finally, the cost of computing d_{φ,p}(T1, T2), p ≥ 1, as the p-th root of d_{φ,p}(T1, T2)^p will depend on p and on the accuracy with which this root is computed. Assuming a constant cost for the computation of p-th roots with a given accuracy (notice that, in practice, for low p and accuracy, this step will be dominated by the computation of d_{φ,p}(T1, T2)^p), the total cost of computing d_{φ,p}(T1, T2) is O(n^2).

The next examples illustrate some features of these cophenetic metrics.

Example 2

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M51">View MathML</a>, let (u, v) be an arc of T with u or v unlabeled, and let T′ be the phylogenetic tree in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M52">View MathML</a> obtained by contracting (u, v): that is, by removing the node v and the arc (u, v), labeling u with the label of v if it was labeled, and replacing every arc (v, x) in T by an arc (u, x). Notice that, in the passage from T to T′, for every i, j ∈ S:

• If both i,j are descendants of v in T, then <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M53">View MathML</a>.

• In any other case, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M54">View MathML</a>.

As a consequence,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M55">View MathML</a>

and therefore, if n_v is the number of descendant taxa of v,

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M56">View MathML</a>

So the contraction of an arc in a tree T (which is Robinson-Foulds’ α-operation [15]) yields a new tree T′ at a cophenetic distance from T that increases with the number of descendant taxa of the head of the contracted arc.
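The effect of a contraction can be checked numerically on a small instance. The sketch below makes simplifying assumptions that are ours: the tree is unweighted, it is stored as a hypothetical child-to-parent dictionary, and the contracted arc (u, v) has an unlabeled head v, so no relabeling is needed.

```python
def ancestors(parent, v):
    """Nodes on the path from v up to the root, v included."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def phi(parent, i, j):
    """Depth of the LCA of i and j in an unweighted tree (phi(i, i) is the depth of i)."""
    above_i = set(ancestors(parent, i))
    lca = next(a for a in ancestors(parent, j) if a in above_i)
    return len(ancestors(parent, lca)) - 1

def contract(parent, u, v):
    """Contract the arc (u, v), assuming v is unlabeled: drop v, reattach its children to u."""
    if parent[v] != u:
        raise ValueError("(u, v) is not an arc of the tree")
    return {x: (u if p == v else p) for x, p in parent.items() if x != v}

# Hypothetical tree on taxa 1..4: r -> v, r -> 4, v -> 1, v -> 2, v -> 3.
taxa = [1, 2, 3, 4]
tree = {"r": None, "v": "r", 4: "r", 1: "v", 2: "v", 3: "v"}
contracted = contract(tree, "r", "v")

pairs = [(i, j) for i in taxa for j in taxa if i <= j]
changes = {pq: phi(tree, *pq) - phi(contracted, *pq) for pq in pairs}
changed_pairs = sorted(pq for pq, c in changes.items() if c != 0)
n_v = 3                                        # taxa below v: 1, 2, 3
d1 = sum(abs(c) for c in changes.values())     # Manhattan cophenetic distance
```

In this instance the entries that change are exactly those indexed by pairs i ≤ j of taxa below v, each by one unit, so there are n_v(n_v + 1)/2 = 6 of them and the Manhattan cophenetic distance between the two trees is 6.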

Example 3

Let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M57">View MathML</a>, for some m < n, let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M58">View MathML</a> be such that its subtree rooted at some node z is T0, and let <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M59">View MathML</a> be the tree obtained by replacing in T this subtree T0 by <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M60">View MathML</a>.

Notice that, for every i, j ∈ {1, …, n}, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M61">View MathML</a> if i, j ⩽ m, and φT(i, j) = φT(z, j) if i ⩽ m and j > m, and the same holds in T′, replacing T and T0 by T′ and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M62">View MathML</a>, respectively. Since, moreover, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M63">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M64">View MathML</a> for every j > m, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M65">View MathML</a> for every i, j > m, we conclude that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M66">View MathML</a>

and hence

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M67">View MathML</a>

So the cophenetic metrics are local, like other popular metrics such as the Robinson-Foulds and the triples metrics, but unlike others, such as the nodal metrics.

Results and discussion

Minimum and maximum values for cophenetic metrics

Our first goal is to find the smallest non-zero value of dφ,p on several spaces of phylogenetic trees, and the pairs of trees at which it is reached. These pairs of trees at minimum distance can be understood as ‘adjacent’ in the corresponding metric space, and their characterization is a first step towards understanding how cophenetic metrics measure the difference between two trees.

Notice that this problem makes no sense for weighted phylogenetic trees. For instance, if we add or subtract an ϵ > 0 to the weight of a pendant arc in a tree T, without changing its topology, the distance between T and the resulting tree will be ϵ, which can be as small as desired. So, we only consider this problem on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M68">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M69">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M70">View MathML</a>.

In order to simplify the statements, set

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M71">View MathML</a>

The following easy result, which is a direct consequence of the fact that <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M72">View MathML</a> for every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M73">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M74">View MathML</a>, will be used in the proof of the next propositions.

Lemma 1

Assume that, for every pair of different trees T1, T2 in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M75">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M76">View MathML</a> or <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M77">View MathML</a> such that D0(T1, T2) is minimum on this space, we have that Dp(T1, T2) = D0(T1,T2). Then, the minimum non-zero value of Dp on this space of trees is equal to the minimum non-zero value of D0 on it, and it is reached at exactly the same pairs of trees.

The least non-zero values of Dp, for p ∈ {0}∪[1, ∞[, on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M78">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M79">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M80">View MathML</a>, together with an explicit description of the pairs of trees where these minimum values are reached, are given by the next three propositions. Their proofs are given in Additional file 1.

Additional file 1. Proofs of propositions 1–4.


Proposition 1

The minimum non-zero value of Dp on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M81">View MathML</a>, for p ∈ {0}∪[1, ∞[ and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M82">View MathML</a>, is 1. And for every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M83">View MathML</a>, Dp(T, T′) = 1 if, and only if, one of them is obtained from the other by contracting a pendant arc.

So, not every tree in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M84">View MathML</a> has neighbors at cophenetic distance 1: only those trees with some leaf whose parent is unlabeled. Now, it is not difficult to check that, for a tree <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M85">View MathML</a> such that all its leaves have labeled parents, there is some tree T′ such that Dp(T, T′) = 2, which is the minimum value of Dp on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M86">View MathML</a> greater than 1. One such T′ is obtained by choosing a pendant arc in T and interchanging the labels of its source and its target nodes.

Proposition 2

The minimum non-zero value of Dp on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M87">View MathML</a>, for p ∈ {0}∪[1, ∞[ and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M88">View MathML</a>, is 3. And for every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M89">View MathML</a>, Dp(T, T′) = 3 if, and only if, one of them is obtained from the other by means of one of the following two operations:

(a) Contracting an arc ending in the parent of a cherry (see Figure 3)

(b) Pruning and regrafting a leaf that is a sibling of the root of a cherry, to make it a sibling of the leaves in the cherry (see Figure 4)

Figure 3. Contraction of an arc ending in the parent of a cherry.

Figure 4. Pruning and regrafting an uncle of a cherry to make it a sibling of them.

So, every tree <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M90">View MathML</a> has neighbors T′ such that Dp(T, T′) = 3. Indeed, take an internal node v in T of largest depth, so that all its children are leaves. If v has exactly two children, one such neighbor of T is obtained by contracting the arc ending in v. If v has more than two children, one such neighbor of T is obtained by replacing any two children of v by a cherry (that is, taking two children i, j of v, removing the arcs (v, i) and (v, j), and then adding a new node v0 and arcs (v, v0), (v0, i), and (v0, j)).

Proposition 3

The minimum non-zero value of Dp on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M91">View MathML</a>, for p ∈ {0}∪[1, ∞[ and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M92">View MathML</a>, is 4. And for every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M93">View MathML</a>, Dp(T, T′) = 4 if, and only if, one of them is obtained from the other by means of one of the following operations:

(a) Reorganizing a triplet (see Figure 5)

(b) Reorganizing a completely branched quartet (see Figure 6)

Figure 5. Reorganizing a triplet.

Figure 6. Reorganizing a completely branched quartet.

So again, every tree <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M94">View MathML</a> has neighbors T′ such that Dp(T, T′) = 4. Indeed, take an internal node v in T of largest depth, so that its two children are leaves. Let w be the parent of v. Then, either the other child of w is a leaf, in which case w is the root of a triplet and reorganizing its taxa we obtain a neighbor of T, or the other child of w is the parent of a cherry (it will have the same, maximum, depth as v), in which case w is the root of a completely branched quartet and reorganizing its taxa we obtain a neighbor of T.
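For very small n, the value 4 for binary trees can be checked by brute force. The sketch below is only an illustration under stated assumptions, not the proof given in Additional file 1: it represents rooted binary trees as hypothetical nested tuples with integer leaves, enumerates them by inserting one leaf at a time, takes cophenetic values to be LCA depths, and compares the resulting vectors with the Hamming distance (p = 0) or the Lp norm.

```python
from itertools import combinations

def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching leaf on an arc of tree or above its root."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)

def binary_trees(n):
    """All rooted binary trees with leaves 1..n (n >= 2), each topology generated once."""
    trees = [(1, 2)]
    for leaf in range(3, n + 1):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees

def cophenetic_vector(tree):
    """Map (i, i) to the depth of leaf i and (i, j), i < j, to the depth of their LCA."""
    vec = {}

    def visit(node, depth):
        # Returns the leaves below node and records values in vec as it goes.
        if not isinstance(node, tuple):
            vec[(node, node)] = depth
            return [node]
        left = visit(node[0], depth + 1)
        right = visit(node[1], depth + 1)
        for i in left:
            for j in right:
                vec[(min(i, j), max(i, j))] = depth
        return left + right

    visit(tree, 0)
    return vec

def d_phi(t1, t2, p=1):
    """Hamming distance if p == 0, otherwise the Lp distance between cophenetic vectors."""
    v1, v2 = cophenetic_vector(t1), cophenetic_vector(t2)
    diffs = [abs(v1[k] - v2[k]) for k in v1]
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    return sum(d ** p for d in diffs) ** (1 / p)

def min_nonzero(n, p=1):
    """Smallest distance between two different binary trees with n leaves."""
    return min(d_phi(a, b, p) for a, b in combinations(binary_trees(n), 2))

print(len(binary_trees(4)))   # 15
print(min_nonzero(4, p=1))    # 4.0
print(min_nonzero(4, p=0))    # 4
```

The enumeration grows as (2n − 3)!!, so this check is only practical for a handful of leaves. It is meant as a sanity test of the statement, not as a way to explore larger tree spaces.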

We focus now on the diameter, that is, the largest value of dφ,p on the spaces of unweighted phylogenetic trees (as in the case of the minimum non-zero value, and for the same reasons, the problem of finding the diameter makes no sense for weighted trees). Unfortunately, we have not been able to find exact formulas for it, but we have obtained its order, which we give in the next proposition. Its proof is also given in Additional file 1.

Proposition 4

The diameter of dφ,p on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M95">View MathML</a>, <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M96">View MathML</a>, and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M97">View MathML</a> is in Θ(n^2) if p = 0 and in Θ(n^((p + 2)/p)) if <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M98">View MathML</a>.

In particular, the diameter of dφ,1 on these spaces is in Θ(n^3), and the diameter of dφ,2 is in Θ(n^2).
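To get a feeling for the order n^3 stated above for dφ,1, the following sketch compares a caterpillar with a star tree on n leaves. These two trees are chosen here only for illustration and are not claimed to realize the diameter. The cophenetic vectors are written down directly from the shape of each tree (cophenetic values taken as LCA depths), and all function names are hypothetical.

```python
from itertools import combinations

def caterpillar_vector(n):
    """Cophenetic vector of the caterpillar in which leaf 1 hangs from the root,
    leaf 2 from the next internal node, and so on; leaves n-1 and n form the cherry."""
    vec = {}
    for i in range(1, n + 1):
        vec[(i, i)] = min(i, n - 1)   # leaves n-1 and n share the deepest parent
    for i, j in combinations(range(1, n + 1), 2):
        vec[(i, j)] = i - 1           # for i < j the LCA is the parent of leaf i
    return vec

def star_vector(n):
    """Cophenetic vector of the star tree: every leaf is a child of the root."""
    vec = {(i, i): 1 for i in range(1, n + 1)}
    for i, j in combinations(range(1, n + 1), 2):
        vec[(i, j)] = 0
    return vec

def d_phi_1(u, v):
    """L1 distance between two cophenetic vectors indexed by the same keys."""
    return sum(abs(u[k] - v[k]) for k in u)

for n in (10, 20, 40, 80):
    d = d_phi_1(caterpillar_vector(n), star_vector(n))
    print(n, d, round(d / n ** 3, 4))
```

For this particular pair the sum works out to C(n, 3) + (n − 2)(n + 1)/2, so the printed ratio d / n^3 tends to 1/6 as n grows, which is compatible with a cubic order for the diameter of dφ,1.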

Numerical experiments

We have performed several numerical experiments concerning the distributions of dφ,1 and dφ,2, and the correlation of these metrics with other phylogenetic tree comparison metrics. The results of all these experiments can be found on the web page http://bioinfo.uib.es/~recerca/phylotrees/cophidist/. In this section we report only on some significant results obtained through these experiments.

As a first experiment, we have generated all trees in <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M99','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M99">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M100">View MathML</a>, for n = 3, 4, 5, 6, and for all pairs of them we have computed:

• The cophenetic distances dφ,1 and dφ,2 on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M101">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M102">View MathML</a>.

• The Robinson-Foulds distance dRF on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M103','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M103">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M104">View MathML</a>[15].

• The classical nodal distances dnodal,1 and dnodal,2 on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M105">View MathML</a>, which compare the vectors of distances between pairs of taxa by means of the Manhattan and the Euclidean norms, respectively; see [5] and [18], respectively, as well as [20].

• The splitted nodal distances <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M106">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M107','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M107">View MathML</a> on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M108','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M108">View MathML</a>, which compare the matrices of splitted path lengths between pairs of taxa by means of the Manhattan and the Euclidean norms, respectively; see [20].

In order to analyze these data, we have plotted 2D-histograms for all pairs of metrics and we have computed their Spearman’s rank correlation coefficients. On the one hand, the 2D-histograms for <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M109','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M109">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M110','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M110">View MathML</a> (the most significant case) are given in Figures 7 and 8, respectively. For each pair of distances, we have divided the range of values of each distance into 25 subranges, and counted how many pairs of trees fall into each of the 25 × 25 resulting cells. Each cell is represented by a rectangle in a grid, whose darkness level is proportional to the number of pairs of trees in it. On the other hand, the Spearman’s rank correlation coefficients between the aforementioned distances in the most significant case of n = 6 are given in Tables 2 and 3.
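The two summaries described above, the 25 × 25 grid of counts and Spearman's rank correlation coefficient, can be sketched in plain Python as follows. This is a minimal illustration and not the code used for the experiments: the function names are hypothetical, ties receive average ranks, and the inputs are assumed to be two equally long lists holding the values of two distances on the same pairs of trees.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            r[k] = avg
        pos = end + 1
    return r

def spearman(xs, ys):
    """Spearman's rank correlation, computed as the Pearson correlation of the ranks.
    Raises ZeroDivisionError if one of the lists is constant."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def histogram_2d(xs, ys, bins=25):
    """Count how many (x, y) pairs fall in each cell of a bins x bins grid
    spanning the observed range of each variable."""
    def cell(v, lo, hi):
        if hi == lo:
            return 0
        # The maximum value is placed in the last cell instead of overflowing.
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    lox, hix, loy, hiy = min(xs), max(xs), min(ys), max(ys)
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        grid[cell(x, lox, hix)][cell(y, loy, hiy)] += 1
    return grid
```

Because Spearman's coefficient depends only on ranks, it is unchanged by any increasing transformation of either distance, which makes it a natural choice for comparing metrics that take values on very different scales.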

Figure 7. 2D-histograms showing the relationship between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M112','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M112">View MathML</a>.

Figure 8. 2D-histograms showing the relationship between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M114','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M114">View MathML</a>.

Table 2. Spearman’s rank correlation coefficient between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M115">View MathML</a>

Table 3. Spearman’s rank correlation coefficient between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M117','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M117">View MathML</a>

These histograms and tables show that dφ,1 and dφ,2 are highly correlated, and that each dφ,i, i = 1, 2, is highly correlated with the corresponding <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M123">View MathML</a> on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M124','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M124">View MathML</a>. This is not a surprise, because both types of metrics are based on encodings of phylogenetic trees related to the position in the tree of the LCA of every pair of leaves: see the relationship between depths, cophenetic values and splitted path lengths recalled in the Background section. More surprising to us is the low correlation between each dφ,i and the corresponding dnodal,i on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M125','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M125">View MathML</a>, given the relationship between depths, cophenetic values and patristic distances also recalled in the Background section. The very low correlation between the cophenetic metrics and the Robinson-Foulds metric simply shows that these metrics measure different notions of similarity.

Our second experiment is for values of n greater than 6. The numbers of trees in each of the spaces <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M126">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M127','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M127">View MathML</a> make it infeasible to compute the distances between all pairs of trees. Hence, we have randomly and uniformly generated pairs of trees in each of these spaces for <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M128','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M128">View MathML</a> until the estimated Spearman’s rank correlations of all pairs of distances converged up to 3 significant digits. The corresponding 2D-histograms and Spearman’s rank correlation coefficient tables for the most significant case of n = 100 are shown in Figures 9 and 10 and Tables 4 and 5. These diagrams and tables confirm the very high correlation between dφ,1 and dφ,2, and the very low correlation between these metrics and the nodal and Robinson-Foulds metrics. The correlation between each dφ,i, i = 1, 2, and the corresponding <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M129','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M129">View MathML</a> is still significant, but it decreases as n increases.

Figure 9. 2D-histograms showing the relationship between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M131">View MathML</a>.

Figure 10. 2D-histograms showing the relationship between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M133','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M133">View MathML</a>.

Table 4. Spearman’s rank correlation coefficient between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M134','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M134">View MathML</a>

Table 5. Spearman’s rank correlation coefficient between different distances on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M136','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M136">View MathML</a>

Finally, in Figure 11 we have plotted the histograms of the distributions of dφ,1 and dφ,2 on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M142','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M142">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M143','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M143">View MathML</a> for <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M144','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M144">View MathML</a>. As can be seen, they are positively skewed, like the splitted nodal metrics [20, Figure 5], but unlike other metrics such as the Robinson-Foulds metric [32] or the transposition distance [33, Figure 2], which are negatively skewed, or the triples metric [16], which is approximately normal.

Figure 11. Histograms of the distributions of dφ,1 and dφ,2 on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M148','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M148">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M149','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M149">View MathML</a> for <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M150','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M150">View MathML</a>.

Conclusions

Following a fifty-year-old idea of Sokal and Rohlf [4], we have encoded a weighted phylogenetic tree with nested taxa by means of its vector of cophenetic values of pairs of taxa, adding moreover to this vector the depths of single taxa. These positive real-valued vectors single out weighted phylogenetic trees with nested taxa, and therefore they can be used to define metrics to compare such trees. We have defined a family of metrics dφ,p, for p ∈ {0}∪[1, ∞[, by comparing these vectors through the Lp norm.

We cannot advocate the use of any cophenetic metric dφ,p over the other ones except, perhaps, warning against the use of the Hamming distance dφ,0 because it is too uninformative. Since the most popular norms on <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M151','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M151">View MathML</a> are the Manhattan L1 and the Euclidean L2, it seems natural to use dφ,1 or dφ,2. And since these two metrics are very highly correlated, the comparison of trees using one or the other will not differ greatly. Each one of these metrics has its own advantages.

On the one hand, the computation of dφ,1 does not involve roots, and therefore it can be computed exactly. Moreover, it takes integer values on unweighted trees and in this case its range of values is greater, thus being more discriminative. Actually, since ∥x∥p ⩽ ∥x∥1 for every <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M152','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M152">View MathML</a> and <a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M153','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M153">View MathML</a>, we have that

<a onClick="popup('http://www.biomedcentral.com/1471-2105/14/3/mathml/M154','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1471-2105/14/3/mathml/M154">View MathML</a>

On the other hand, the comparison of cophenetic vectors by means of the Euclidean norm enables the use of many geometric and clustering methods that are not available otherwise. In particular, it is possible to compute the mean value of the square of dφ,2 under different evolutionary models. We shall report on this elsewhere.

As a rule of thumb, and as we already advised in the context of splitted nodal metrics [20], we suggest using dφ,1 when the trees are unweighted, because these trees can be seen as discrete objects and thus their comparison through a discrete tool such as the Manhattan norm seems appropriate. When the trees have arbitrary positive real weights, they should be understood as belonging to a continuous space [34], and then the Euclidean norm is more appropriate.

Future work will include a deeper study of the distribution of dφ,1 and dφ,2 on different spaces of unweighted phylogenetic trees.

Endnotes

a There are some details to be filled in here, because for technical reasons we shall allow the root of our phylogenetic trees to have out-degree 1 without being labeled, and this case is not covered by [20, Thm. 10], but it is not difficult to modify the argument given above to cover also this case.

b http://en.wikipedia.org/wiki/Phylogenetic_tree

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AM and FR developed the theoretical part of the paper. GC, LR and DS implemented the algorithms and performed the numerical experiments. GC and DS prepared the Additional file 1 web page. FR prepared the first version of the manuscript. All authors revised, discussed, and amended the manuscript and approved its final version. All authors read and approved the final manuscript.

Acknowledgements

The research reported in this paper has been partially supported by the Spanish government and the EU FEDER program, through project MTM2009-07165. We thank the reviewers for their comments and suggestions, which have led to a substantial improvement of this paper.

References

  1. Hoef-Emden K: Molecular phylogenetic analyses and real-life data. Comput Sci Eng 2005, 7:86-91.

  2. Felsenstein J: Inferring Phylogenies. USA: Sinauer Associates Inc.; 2004.

  3. Rohlf F, Sokal R: Comparing numerical taxonomic studies. Syst Zool 1981, 30:459-490.

  4. Sokal R, Rohlf F: The Comparison of Dendrograms by Objective Methods. Taxon 1962, 11:33-40.

  5. Williams WT, Clifford HT: On the comparison of two classifications of the same set of elements. Taxon 1971, 20:519-522.

  6. Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21:3201-3212.

  7. Restrepo G, Mesa H, Llanos E: Three Dissimilarity Measures to Contrast Dendrograms. J Chem Inf Model 2007, 47:761-770.

  8. Steel MA, Penny D: Distributions of tree comparison metrics—some new results. Syst Biol 1993, 42:126-141.

  9. Waterman MS, Smith TF: On the similarity of dendrograms. J Theor Biol 1978, 73:789-800.

  10. Allen BL, Steel MA: Subtree transfer operations and their induced metrics on evolutionary trees. Ann Combinatorics 2001, 5:1-13.

  11. Finden C, Gordon A: Obtaining common pruned trees. J Classification 1985, 2:255-276.

  12. Goddard W, Kubicka E, Kubicki G, McMorris F: The agreement metric for labeled binary trees. Math Biosci 1994, 123:215-226.

  13. Zhong Y, Meacham C, Pramanik S: A general method for tree-comparison based on subtree similarity and its use in a taxonomic database. Biosystems 1997, 42:1-8.

  14. Robinson DF, Foulds LR: Comparison of weighted labelled trees. In Proc 6th Australian Conf Combinatorial Mathematics, Lecture Notes in Mathematics. Berlin Heidelberg: Springer; 1979:119-126.

  15. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci 1981, 53:131-147.

  16. Critchlow DE, Pearl DK, Qian C: The triples distance for rooted bifurcating phylogenetic trees. Syst Biol 1996, 45:323-334.

  17. Farris JS: A successive approximations approach to character weighting. Syst Zool 1969, 18:374-385.

  18. Farris JS: On comparing the shapes of taxonomic trees. Syst Zool 1973, 22:50-54.

  19. Phipps JB: Dendrogram topology. Syst Zool 1971, 20:306-308.

  20. Cardona G, Llabrés M, Rosselló F, Valiente G: Nodal distances for rooted phylogenetic trees. J Math Biol 2010, 61:253-276.

  21. Basford N, Butler J, Leone C, Rohlf F: Immunologic Comparisons of Selected Coleoptera With Analyses of Relationships Using Numerical Taxonomic Methods. Syst Biol 1968, 17:388-406.

  22. Chui V, Thornton I: A Numerical Taxonomic Study of the Endemic Ptycta Species of the Hawaiian Islands (Psocoptera: Psocidae). Syst Biol 1972, 21:7-22.

  23. Leelambikaa M, Sathyanarayanaa N: Genetic characterization of Indian Mucuna (Leguminoceae) species using morphometric and random amplification of polymorphic DNA (RAPD) approaches. Plant Biosystems 2011, 145:786-797.

  24. Hartigan J: Representation of similarity matrices by trees. J Am Stat Assoc 1967, 62:1140-1158.

  25. Harvey PH, Pagel M: The comparative method in evolutionary biology. USA: Oxford University Press; 1991.

  26. Pagel MD: Inferring the Historical Patterns of Biological Evolution. Nature 1999, 401:877-884.

  27. Farris JS, Kluge AG, Eckardt MJ: A numerical approach to phylogenetic systematics. Syst Zool 1970, 19:172-189.

  28. Johnson SC: Hierarchical clustering schemes. Psychometrika 1967, 32:241-254.

  29. Sneath P, Sokal R: Numerical Taxonomy. USA: Freeman and Co; 1973.

  30. Xu S, Atchley WR, Fitch WM: Phylogenetic inference under the pure drift model. Mol Biol Evol 1994, 11:949-960.

  31. Semple C, Steel M: Phylogenetics. USA: Oxford University Press; 2003.

  32. Steel M: Distribution of the symmetric difference metric on phylogenetic trees. SIAM J Discrete Mathematics 1988, 1:541-551.

  33. Alberich R, Cardona G, Rosselló F, Valiente G: An algebraic metric for phylogenetic trees. Appl Mathematics Lett 2009, 22:1320-1324.

  34. Billera LJ, Holmes SP, Vogtmann K: Geometry of the space of phylogenetic trees. Adv Appl Mathematics 2001, 27:733-767.