Bioinformatics Research Center, Aarhus University, Denmark

Department of Computer Science, Aarhus University, Denmark

MADALGO, Center for Massive Data Algorithms, a Center of the Danish National Research Foundation, Denmark

Department of Mathematics and Computer Science, University of Southern Denmark, Denmark

PUMPKIN, Center for Membrane Pumps in Cells and Disease, a Center of the Danish National Research Foundation, Denmark

Abstract

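The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by enumerating all subsets of three leaves and counting how often the induced topologies in the two trees are equal or different. We present an algorithm that computes the triplet distance between two rooted binary trees in time $O(n \log^2 n)$, where $n$ is the number of leaves. The algorithm is related to an algorithm for computing the quartet distance between two unrooted binary trees. While the quartet distance algorithm has an overhead that makes it impractical compared to $O(n^2)$ time algorithms, we show through experiments that the triplet distance algorithm can be implemented to give a competitive wall-time running time.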

Background

Using trees to represent relationships is widespread in many scientific fields, in particular in biology, where trees are used e.g. to represent species relationships (so-called phylogenies), the relationship between genes in gene families, or hierarchical clusterings of high-throughput experimental data. Common to these applications is that differences in the data used for constructing the trees, or differences in the computational approach for constructing the trees, can lead to slightly different trees on the same set of leaf IDs.

To compare such trees, distance measures are often used. Common distance measures include the Robinson-Foulds distance, the triplet distance, and the quartet distance.

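Efficient algorithms to compute these three distance measures exist. The Robinson-Foulds distance can be computed in time $O(n)$. The quartet distance can be computed in time $O(n \log n)$ for binary trees and in time $O(d^9 n \log n)$ for trees where all nodes have degree less than $d$. For the triplet distance, $O(n^2)$ time algorithms exist for both binary and general trees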

Brodal ^{2 }^{2}) time algorithm ^{2 }^{2 }^{2}) time algorithm based on

Methods

The triplet distance measure between two rooted trees with the same set of leaf IDs is based on the topologies induced by a tree when selecting three leaves of the tree. Whenever three leaves,

**Triplet topologies**. The four different triplet topologies.

The triplet distance is the number of triplets whose topologies differ in the two trees. It can naïvely be computed by enumerating all $O(n^3)$ sets of three leaves and comparing the induced topologies in the two trees, counting how often the trees agree or disagree on the topology. Triplet topologies in a tree, however, are not independent, and faster algorithms can exploit this dependence to compare whole sets of triplet topologies at once. Critchlow et al. give an $O(n^2)$ time algorithm for binary trees, while Bansal et al. give an $O(n^2)$ time algorithm for general trees.
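The naïve enumeration can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes rooted binary trees encoded as nested Python tuples `(left, right)` with hashable leaf labels, and the function names `leaves`, `cherry`, and `naive_triplet_distance` are hypothetical. For brevity it recomputes leaf sets while descending, so it is slower than a careful cubic implementation.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels below `tree` (a label or a (left, right) pair)."""
    if not isinstance(tree, tuple):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)


def cherry(tree, triplet):
    """Return the two leaves of `triplet` that the binary tree groups together."""
    node = tree
    while True:
        left, right = node
        in_left = triplet & leaves(left)
        if len(in_left) == 3:      # all three leaves are in the left subtree
            node = left
        elif len(in_left) == 0:    # all three leaves are in the right subtree
            node = right
        elif len(in_left) == 2:    # the pair in the left subtree is grouped
            return frozenset(in_left)
        else:                      # the pair in the right subtree is grouped
            return frozenset(triplet - in_left)


def naive_triplet_distance(t1, t2):
    """Count the triplets whose induced topology differs between t1 and t2."""
    labels = leaves(t1)
    assert labels == leaves(t2), "both trees must have the same leaf set"
    return sum(
        1
        for trip in combinations(list(labels), 3)
        if cherry(t1, frozenset(trip)) != cherry(t2, frozenset(trip))
    )
```

For example, the caterpillar tree `((("a", "b"), "c"), "d")` and the balanced tree `(("a", "b"), ("c", "d"))` agree on the triplets {a, b, c} and {a, b, d} and disagree on the other two.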

For the quartet distance, the analogue to the triplet distance for unrooted trees, Brodal

A naïve algorithm that computes the quartet distance between two unrooted trees by explicitly inspecting each of the $O(n^4)$ quartets can be modified to compute the triplet distance between two rooted trees without loss of time by adding a new leaf

In the following, we develop an efficient algorithm for computing the triplet distance between two rooted binary trees $T_1$ and $T_2$ with the same set of leaf IDs. Our key contribution is to show how all triplets in one tree, say $T_1$, can be captured by coloring the leaves, and how the smaller half trick lets us enumerate all such colorings in time $O(n \log n)$. We construct a hierarchical decomposition tree (HDT) of $T_2$ that counts its number of compatible triplets. Unlike the algorithms for computing the quartet distance

Counting shared triplets through leaf colorings

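A triplet is a set $\{i, j, k\}$ of three leaves. In a rooted binary tree $T$, we assign each triplet to the inner node $v$ that is the lowest common ancestor of its three leaves, so that two of the leaves lie in one subtree of $v$ and the third lies in the other, and we let $\tau_v$ denote the set of triplets assigned to $v$. Then $\{\tau_v \mid v \in T\}$ is a partition of the set of triplets, and $\{\tau_v \cap \tau_u \mid v \in T_1, u \in T_2\}$ is also a partition of the set of triplets. The number of triplets with the same topology in both trees is therefore

$$\sum_{v \in T_1} \sum_{u \in T_2} \mathrm{Shared}(\tau_v \cap \tau_u),$$

where $\mathrm{Shared}(S)$ is the number of triplets in $S$ that have the same topology in $T_1$ and $T_2$. The triplet distance of $T_1$ and $T_2$ is then

$$\binom{n}{3} - \sum_{v \in T_1} \sum_{u \in T_2} \mathrm{Shared}(\tau_v \cap \tau_u).$$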
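The partition property can be checked with a short calculation. The sketch below assumes that $\tau_v$ collects the triplets whose lowest common ancestor is $v$, so that a node whose two subtrees hold $a$ and $b$ leaves is assigned $\binom{a}{2} b + a \binom{b}{2}$ triplets. The nested-tuple tree encoding and the function name `triplets_per_node` are illustrative assumptions, not taken from the paper.

```python
from math import comb


def triplets_per_node(tree):
    """Return (leaf_count, sizes), where sizes lists |tau_v| for every inner node v.

    A tree is either a leaf label or a (left, right) pair. Sizes are listed in
    post-order: left subtree, right subtree, then the node itself.
    """
    if not isinstance(tree, tuple):
        return 1, []
    a, sizes_left = triplets_per_node(tree[0])
    b, sizes_right = triplets_per_node(tree[1])
    # Two leaves from one side and one from the other, in either direction.
    here = comb(a, 2) * b + a * comb(b, 2)
    return a + b, sizes_left + sizes_right + [here]
```

For the five-leaf tree `((("a", "b"), "c"), ("d", "e"))` the sizes are `[0, 1, 0, 9]`, which sum to $\binom{5}{3} = 10$, so every triplet is assigned to exactly one node.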

In the algorithm, we capture the triplets _{v }_{v }


**Coloring when visiting a node**. Coloring of a sub-tree rooted in node

For such a coloring according to a node _{1}, and for a node _{2}, the number Shared(_{v }∩ τ_{u }_{v }∩ τ_{u }

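Explicitly going through $T_1$ and coloring for each node would take time $O(n^2)$. We reduce this to $O(n \log n)$ by using the smaller half trick. Going through $T_2$ for each coloring and counting the number of compatible triplets would also take time $O(n^2)$. Using an HDT we reduce the total time spent on these counts to $O(n \log^2 n)$.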
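The quadratic baseline described here, recoloring for every node of the first tree and scanning the second tree for every coloring, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the color names `"red"` and `"blue"`, the nested-tuple tree encoding, and all function names are choices made here. The per-node formula counts triplets whose same-colored pair lies in one child subtree and whose remaining leaf, of the other color, lies in the other child subtree.

```python
from math import comb


def leaf_labels(tree):
    """Return the leaf labels below `tree` in time linear in its size."""
    stack, out = [tree], []
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            stack.extend(node)
        else:
            out.append(node)
    return out


def count_compatible(t2, color):
    """Return (reds, blues, shared) for the subtree `t2` under the coloring `color`."""
    if not isinstance(t2, tuple):
        c = color.get(t2)  # None means the leaf is colorless
        return int(c == "red"), int(c == "blue"), 0
    r1, b1, s1 = count_compatible(t2[0], color)
    r2, b2, s2 = count_compatible(t2[1], color)
    here = (comb(r1, 2) * b2 + comb(b1, 2) * r2
            + comb(r2, 2) * b1 + comb(b2, 2) * r1)
    return r1 + r2, b1 + b2, s1 + s2 + here


def shared_triplets(t1, t2):
    """Sum the compatible-triplet counts over the colorings of all inner nodes of t1."""
    if not isinstance(t1, tuple):
        return 0
    color = {leaf: "red" for leaf in leaf_labels(t1[0])}
    color.update({leaf: "blue" for leaf in leaf_labels(t1[1])})
    return (count_compatible(t2, color)[2]
            + shared_triplets(t1[0], t2)
            + shared_triplets(t1[1], t2))


def triplet_distance(t1, t2):
    """Triplet distance as the total number of triplets minus the shared ones."""
    n = len(leaf_labels(t1))
    return comb(n, 3) - shared_triplets(t1, t2)
```

Each inner node of the first tree costs one linear pass for the coloring and one linear pass over the second tree, which gives the quadratic total that the smaller half trick and the HDT are meant to avoid.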

Smaller half trick

We go through nodes

For


**Coloring algorithm**. The five steps of the coloring in the smaller-half trick.

1. Color

2. Remove the color for

3. Returning from the recursive call, the entire tree is colorless by invariant 2.

4. Color

5. Call recursively on

Using this recursive algorithm, we go through all colorings of the tree. In each instance (not counting recursive calls), we only color leaves in
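The cost argument behind the smaller half trick can be checked numerically. The sketch below assumes that the coloring work at each inner node is proportional to the number of leaves in its smaller child subtree; the nested-tuple tree encoding and the function names are illustrative assumptions. Summed over all inner nodes, this quantity stays below $n \log_2 n$ for any tree shape.

```python
from math import log2


def smaller_half_work(tree):
    """Return (leaf_count, work), where work sums the smaller child size over inner nodes."""
    if not isinstance(tree, tuple):
        return 1, 0
    a, work_a = smaller_half_work(tree[0])
    b, work_b = smaller_half_work(tree[1])
    return a + b, work_a + work_b + min(a, b)


def within_bound(tree):
    """Check that the total smaller-half work is at most n * log2(n)."""
    n, work = smaller_half_work(tree)
    return work <= n * log2(n)
```

A caterpillar tree with five leaves gives a total of 4 (every inner node has a smaller child of size one), while a perfectly balanced tree with eight leaves gives 12.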

Hierarchical decomposition tree

We build a data structure, the hierarchical decomposition tree (HDT), on top of the second tree $T_2$ in order to count the triplets in $T_2$ compatible with the coloring of leaves in the first tree $T_1$. The HDT is a balanced binary tree where each node corresponds to a connected part of $T_2$. Each node in the HDT, or component, keeps track of the number of compatible triplets that its part of $T_2$ contains, plus some additional book-keeping that makes it possible to compute this count in each component in constant time using the information stored in the component's children in the HDT.

The HDT contains three different kinds of components:

• **L**: A leaf in $T_2$,

• **I**: An inner node in $T_2$,

• **C**: A connected sub-part of $T_2$,

where for type **C** we require that at most two edges in $T_2$ cross the boundary of the component; at most one going up towards the root and at most one going down to a subtree.

The leaves and inner nodes of $T_2$ are transformed into **L** and **I** components, respectively, and constitute the leaves of the HDT. **C** components are then formed by pairwise joining other components along an edge in $T_2$ by one of two compositions (see the figure on component compositions). **C** components can be thought of as consisting of a path from a sub-tree below the **C** component going up towards the root of $T_2$, such that all trees branching off to other children along the path are contained in the component. In the following we show how the HDT of $T_2$ can be constructed in time

**Component types in the HDT**. The three different types of components. **L** and **I** components contain a single node from the underlying tree while **C** components contain a connected set of nodes.

The construction algorithm operates on **L, I**, or **C**. It has a _{2}.

In a single traversal of the tree, the algorithm initially builds a component for each node in the tree (an **L **component for each leaf and an **I **component for each inner node) and an edge for each edge in the tree. The **L **components and false for **I **components. The edges are put in a list **C **component via the constructions in Figure

**Component compositions in the construction of the HDT**. The two different ways of constructing a **C** component by merging two underlying components. The topmost of the components can either be a **C** component (a) or an **I** component (b) while the bottommost component, $C_1$, must be a **C** or **L** component. If the topmost component is an **I** component, the bottommost must be downwards closed, i.e. it cannot have a downwards edge crossing its boundary.
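The composition rules stated in this caption can be written down as a small validity check. The sketch below is one reading of those rules, not the authors' data structure: the `Component` class, its `down_closed` field, and the `compose` function are hypothetical names, and the way the flag is propagated to the merged component is derived here from the requirement that a **C** component has at most one downward boundary edge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    kind: str          # "L", "I", or "C"
    down_closed: bool  # True if no edge leaves the component downwards


def compose(top: Component, bottom: Component) -> Optional[Component]:
    """Merge `bottom` into `top` along their shared edge, or return None if not allowed."""
    if bottom.kind not in ("C", "L"):
        return None  # the bottommost component must be a C or L component
    if top.kind == "C":
        if top.down_closed:
            return None  # assumption: a down-closed top has no edge to merge along
        # The only remaining downward edge, if any, is the one below `bottom`.
        return Component("C", bottom.down_closed)
    if top.kind == "I":
        if not bottom.down_closed:
            return None  # would leave two downward edges crossing the boundary
        # The edge to the other child of the inner node still points downwards.
        return Component("C", False)
    return None  # an L component cannot be the topmost component
```

Under this reading, joining an **I** component with a leaf below it yields a **C** component that still has one downward edge, whereas joining a **C** component with a leaf below it yields a down-closed **C** component.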

**Algorithm for constructing the hierarchical decomposition tree**. The listing shows the algorithm run for each level of the HDT construction. This algorithm is repeated until the

In case 1, one of **C **component in this iteration and should not be contracted again. Case 2 is the situation in Figure **I **component or **I **component and **C **component resulting from joining the ends of the last edge. We now argue that height of the HDT is

We first argue that the number of contractible edges at the beginning of the iteration is at least **I **components might not be contractible, and that the number of down-closed components is at least one larger than the number of **I **components. If the number of **I **components is at most **I **components is more than

Since each contracted edge can prevent at most two other edges incident to the two merged components (see Figure

Counting triplets in the hierarchical decomposition tree

In each component we keep track of

By adding a little book-keeping information to each component we make it possible to compute ^{2 }

•

•

• **C **component. Let _{i}, i _{i}


**Counting triplets in a C component**. The two cases of counting triplets in a **C **component.

• **C **component.

• **C **component. Let _{i}, i _{i }_{i}

• **C **component.

We describe how the book-keeping variables and **L **and **I **components are constructed in only one way, while **C **components are constructed in one of two ways (see Figure

**L components: **For a leaf component,

**I components: **All counts are 0.

**C components, case Figure ****C **component, and let _{1 }and _{2}, respectively, with _{2 }above _{1 }in the underlying tree. Then

The triplet count is then computed as

**C components, case Figure **_{1}. Then

Results and discussion

We implemented the algorithm in C++, together with a simple $O(n^2)$ time algorithm used to check that it computes the correct triplet distance.

We then verified the running time of our algorithm, see Figure ^{2 }^{2 }


**Validation of running time**. (a) Total running time divided by ^{2 }^{2 }

When changing the color of leaves, we spent time

To render the use of the algorithm in practice, we implemented an Efficient ^{2}) time algorithm based on the quartet distance algorithm presented in ^{2}) time algorithm against the ^{2 }

**Comparison to $O(n^2)$ algorithm**. Ratio between the running times for the

Conclusions

We have presented an $O(n \log^2 n)$ time algorithm for computing the triplet distance between two rooted binary trees.

The algorithm builds upon the ideas in the ^{2 }

Compressing the HDT during the algorithm makes it possible to reduce the running time of the quartet distance algorithm to $O(n \log n)$.

Unlike the ^{2}) time algorithm of Bansal

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed to the design of the presented algorithm. AS implemented the data structures and the algorithm. AS, CNSP, and TM designed the experiments, and AS conducted these. All authors have contributed to, seen and approved the manuscript.

Declarations

The publication costs for this article were funded by PUMPKIN, Center for Membrane Pumps in Cells and Disease, a Center of the Danish National Research Foundation, Denmark.

This article has been published as part of