Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland
Abstract
Background
This paper is devoted to distance measures for leaf-labelled trees on a free leafset. A leaf-labelled tree is a special type of tree in which only the leaf (terminal) nodes are labelled. This data structure is used in bioinformatics to model the evolutionary history of genes and species, and in linguistics to model the evolutionary history of languages. Many domain-specific problems arise that need to be solved with the help of tree postprocessing techniques such as distance measures.
Results
Here we introduce a tree edit distance designed for leaf-labelled trees on a free leafset, which turns out to be a metric. It is presented together with the notion of a tree edit consensus tree. We provide a statistical evaluation of the proposed measure, using the RF, MAST and frequent-subsplit-based dissimilarity measures as reference measures.
Conclusions
The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, so its properties can be tuned depending on the needs of the user. Two of the presented methods have the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution, similar in shape to a normal distribution, and is good for both similar and dissimilar trees. NFC(2,1), on the other hand, is proportional or nearly proportional to the number of mutation operations used, irrespective of their type.
Background
This paper is devoted to distance measures for leaf-labelled trees on a free leafset. A leaf-labelled tree is a special type of tree in which only the leaf (terminal) nodes are labelled. This data structure is used in bioinformatics to model the evolutionary history of genes and species, and in linguistics to model the evolutionary history of languages. Many domain-specific problems arise that need to be solved with the help of tree postprocessing techniques such as distance measures, consensus trees and clustering. Distance measures play the most important role, as they are very often the starting point for more complicated techniques. One such problem is that of competing evolutionary hypotheses: in the process of phylogenetic tree reconstruction, different candidate trees may be obtained, and researchers have to determine the true tree of life.
Many existing techniques are designed for trees built on the same leafset, which is very limiting. Here we focus on techniques that do not require trees to contain the same set of leaves. Previously we introduced the simple z-restriction approach
Methods
Basic Notions
Here we provide the basic notions and describe the operations on leaf-labelled trees that were chosen as the basic operations for the new tree edit distance measure. Some derived notions are also presented here.
A leaf-labelled tree is a tree with labels assigned to its leaves. Unrooted leaf-labelled trees are very often represented as a set of splits
Definition [Split] The Split (or Bipartition)
In this paper, we will refer to the leafset of a given split
Definition [Contraction]
The Contraction of a tree T is obtained by removing a chosen internal edge from tree T and identifying adjacent nodes of the contracted edge.
Because each split corresponds to an edge (provided that no internal nodes of degree two occur), a contraction may be realised by removing a split from the splitset that represents the given tree. Figure
Contraction operation.
T1 is called a refinement of T2; however, T2 is also a subtree of T1 (in more general terms), so we will say that T2 is a c-subtree of T1.
Definition [c-subtree] A c-subtree of a tree T is a subtree constructed from its supertree T using only contraction operations.
Definition [Pruning] Pruning is the operation of removing a chosen leaf from a tree and afterwards removing the resulting nodes of degree two (which is called forced contraction). The pruning operation can be illustrated on a set of splits as the process of removing leaves from splits and then removing duplicate splits and invalid splits, which corresponds to forced contraction.
Figure
Pruning operation. T1: input tree; T2: tree after leaf d was removed; T3: tree after the additional forced contraction.
T3 is called an induced subtree of T1; here, however, we will call it a p-subtree.
Definition [p-subtree] A p-subtree PS of a tree T is a subtree constructed from its supertree T using only pruning operations.
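The split-based view of both operations can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names `split`, `contract` and `prune` are hypothetical, leaves are assumed to be single-character labels, and a tree is assumed to be stored as the set of its nontrivial splits only.

```python
from typing import FrozenSet, Set

Split = FrozenSet[FrozenSet[str]]  # an unordered pair of leaf sets {A, B}


def split(a: str, b: str) -> Split:
    """Build the split A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(a), frozenset(b)})


def contract(splits: Set[Split], s: Split) -> Set[Split]:
    """Contraction: drop one nontrivial split (internal edge) from the splitset."""
    return set(splits) - {s}


def prune(splits: Set[Split], leaf: str) -> Set[Split]:
    """Pruning: delete `leaf` from every split, then discard splits that became
    trivial (a side with fewer than two leaves). Collecting the results in a
    set merges duplicates, which plays the role of the forced contraction."""
    result = set()
    for s in splits:
        sides = [side - {leaf} for side in s]
        if all(len(side) >= 2 for side in sides):
            result.add(frozenset(sides))
    return result


# Hypothetical tree on leaves a..e with two internal edges: ab|cde and abc|de.
t1 = {split("ab", "cde"), split("abc", "de")}
print(prune(t1, "c") == {split("ab", "de")})  # True: both splits collapse into ab|de
print(prune(t1, "d") == {split("ab", "ce")})  # True: abc|e became trivial and is dropped
```

Storing only nontrivial splits keeps contraction a plain set difference; the price is that the leafset has to be tracked separately whenever it is needed.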
Definition [restricted tree, z-restricted tree, induced subtree] A z-restricted tree
Definition [Restricted Split Equality (z-equality)]
Figure
p-subtree and c-subtree. A tree together with its p-subtree (restricted subtree on z = acde) and c-subtree (edge
In
Definition [Subsplit and supersplit]
This can be presented alternatively as:
Common information extraction techniques
Definition [The strict consensus tree]
Strict consensus tree. Two leaflabelled trees together with their strict consensus tree.
Splits from T1:
Splits from T2:
The common splits of these trees, which build the strict consensus tree, are as follows:
Because the concept of a consensus tree is very strict, for many trees, a consensus tree can easily become a star (a tree built of only trivial splits). In order to deal with this problem, many variations of consensus trees have been proposed, among others, a majority rule consensus tree.
Definition [Majority rule consensus tree] The majority rule consensus tree is built from splits that occur in the majority of trees.
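Both consensus notions reduce to simple counting once every tree is stored as a set of splits. The following Python sketch illustrates this under stated assumptions: all trees share one leafset, "majority" is read as strictly more than half of the trees, and the names `split`, `strict_consensus` and `majority_consensus` are hypothetical.

```python
from collections import Counter


def split(a, b):
    """Build the split A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(a), frozenset(b)})


def strict_consensus(profile):
    """Splits that occur in every tree of the profile."""
    return set.intersection(*(set(tree) for tree in profile))


def majority_consensus(profile):
    """Splits that occur in more than half of the trees of the profile."""
    counts = Counter(s for tree in profile for s in set(tree))
    return {s for s, n in counts.items() if n > len(profile) / 2}


# Three hypothetical trees on the leafset {a, b, c, d, e}.
profile = [
    {split("ab", "cde"), split("abc", "de")},
    {split("ab", "cde"), split("abd", "ce")},
    {split("ab", "cde"), split("abc", "de")},
]
print(strict_consensus(profile) == {split("ab", "cde")})  # True
print(majority_consensus(profile) == {split("ab", "cde"), split("abc", "de")})  # True
```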
Definition [Maximum Agreement Subtree (MAST)]
An example of a MAST can be seen in Figure
MAST. Two leaflabelled trees together with their MAST on a, b, c, e, f, g.
Several versions of the MAST problem exist like RMAST, which considers only rooted trees, or UMAST for general unrooted trees.
A MAST problem without any restrictions is generally NP-hard
Distance measures
Robinson-Foulds distance
For example, in Figure
The MAST distance
The MAST distance between trees T1 and T2 is the number of leaves that need to be removed to obtain the Maximum Agreement Subtree.
For the trees from Figure
Representative Splitset and derived similarity measure
Here, we recall the basis of our representative splitset approach, which is the foundation for a new consensus technique and new similarity measure, applicable to trees where the leafset may vary without discarding any information. For the detailed information see
Notion of Representative Splitset
Definition [Frequent subsplit] Frequent subsplit s with support minsup in a profile of trees is a split that is a subsplit of at least one split in at least minsup of trees. The minsup parameter is called the minimal support. It may be an absolute value which denotes the minimum number of trees in which the split is supposed to be found (as a subsplit). It can also be given as a relative value, where it is a minimal percentage of the trees in which the split is supposed to be found.
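Counting support per tree, rather than per split, can be sketched as follows. The subsplit test used here is an assumption, since its definition is not reproduced in this text: a split A|B is taken to be a subsplit of C|D when A ⊆ C and B ⊆ D, or A ⊆ D and B ⊆ C. The function names and the convention that a float `minsup` is relative while an int is absolute are likewise hypothetical.

```python
def split(a, b):
    """Build the split A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(a), frozenset(b)})


def is_subsplit(s, t):
    """Assumed test: s = A|B is a subsplit of t = C|D if each side of s
    is contained in a different side of t."""
    a, b = tuple(s)
    c, d = tuple(t)
    return (a <= c and b <= d) or (a <= d and b <= c)


def support(s, profile):
    """Number of trees that contain at least one split of which s is a subsplit."""
    return sum(any(is_subsplit(s, t) for t in tree) for tree in profile)


def is_frequent(s, profile, minsup):
    """minsup is an absolute tree count (int) or a fraction of the profile (float)."""
    needed = minsup * len(profile) if isinstance(minsup, float) else minsup
    return support(s, profile) >= needed


# Two hypothetical trees on different leafsets: {a,b,c,d,e} and {a,b,c,d,f}.
profile = [
    {split("ab", "cde"), split("abc", "de")},
    {split("ab", "cdf"), split("abc", "df")},
]
print(support(split("ab", "cd"), profile))  # 2
print(support(split("ab", "ce"), profile))  # 1
```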
Consider the trees shown in Figure
Sample trees on different leafsets. Two leaf-labelled trees on different leafsets.
According to our approach, we count the number of trees in which the split occurs (as a subsplit of any split), rather than counting the number of splits, of which it is a subsplit. For example, in Figure
Definition [Representative splitset] A representative splitset is a set that contains maximal frequent subsplits
Definition [strict representative splitset SFS] The strict representative splitset
where
Definition [Majority-rule representative splitset MRFS] The majority-rule representative splitset is a representative splitset with minsup = 50%.
Frequent Splitset Interpretation
It is clear that, from the splits of FS, we cannot directly construct one tree because the splits in general have different leafsets.
The full reasoning about frequent interpretation was provided in
Conclusion 1: For each distinct leafset
Conclusion 2: Each split from the frequent splitset discussed above will occur in at least one tree, in a restricted form.
Conclusion 3: Conclusions 1 and 2 are also true for a tree based on the intersection of all the distinct leafsets from the frequent splitset.
Conclusion 4: The set of trees resulting from the frequent splitset will also contain a consensus tree, provided that the input dataset of trees was built on the same leafset.
For example, as the strict frequent splitset of trees from Figure
Illustration of strict frequent splitset. Two trees built from the strict frequent splitset of trees from Fig. 6.
Strict frequent splitset:
For a more difficult example, let us look at trees
Sample trees on the same leafset. Three leaflabelled trees on the same leafset.
Illustration of strict frequent splitset. Trees built from strict frequent splitset of trees T1 and T2 from Fig. 8.
FSbased Dissimilarity Measure
Based on the notion of a frequent subsplit, we defined a dissimilarity measure between two trees (or splitsets)
where
Such a measure determines the dissimilarity of two trees on the basis of how many subsplits they share. Let us compare this measure to the most popular one, the RF distance. Consider the example from Figure
Sample trees on the same leafset. Three different trees on the same leafset.
It is clear that the RF distance states that
The main drawback of this measure is that it is not a metric; however, it achieves very good statistical characteristics and clustering results, as described in the Results section. In this paper the method was compared to the RF, MAST and edit distances in a series of experiments.
Tree Edit Distance and Tree Edit Consensus for Leaf-Labelled Trees
Tree Edit Distance for Leaf-Labelled Trees
In the following sections we define a new distance and consensus notion based on editing operations on leaf-labelled trees. We choose contraction and pruning as the editing operations for leaf-labelled trees. If tree T3 is a subtree of T1, where both pruning and contraction operations are allowed, then we call it a pc-subtree or edit subtree. An example of transforming tree T1 into T3 using editing operations is shown in Fig.
Sample edit script. Example of transforming tree T1 into T3 using edit operations.
Definition [Edit script] An Edit script
Definition [Edit script Cost] The cost of an edit script
Definition [Tree edit distance for leaflabelled trees]
Having defined positive value costs for contraction and pruning operation, the tree edit distance for leaflabelled trees T1 and T2 is the minimal cost edit script
However, there is also an interesting variant of the edit distance in which forced contraction is ignored. The metric property of this variant is yet to be verified, but the measure will also be considered in the experiments due to its interesting features.
Definition [No Forced Contraction Dissimilarity Measure for leaf-labelled trees] Having defined positive value costs for contraction and pruning operation, the No Forced Contraction Dissimilarity Measure for leaf-labelled trees T1 and T2 is the minimal cost edit script
Tree Edit Distance versus RF Distance and MAST
As mentioned earlier, comparing distance measures is not a trivial task. Here, we provide a subjective opinion about why this measure is better than others; an objective statistical comparison is provided in the Results section.
The RF and MAST distances have some drawbacks which have emerged from the fact that the RF distance may use only contraction operations and MAST uses only pruning operations and forced contractions. There are of course some cases when all three distances perform equally well, as in the example of Figure
RF Distance Drawbacks
1) The first drawback of the RF distance is that it is totally useless for leaflabelled trees on a free leafset. For example, Figure
2) The second drawback of the RF distance is that even if the trees are on the same leafset, one noisy leaf may cause the trees to be considered totally different (all splits must be removed). Removal of one leaf may significantly reduce the distance between trees. Such a situation is illustrated in Figure
Edit-consensus tree. Trees T1 and T2 together with their edit-consensus tree TCe.
MAST Distance Drawbacks
1) The first drawback of the MAST distance occurs when the trees are similar except for one internal edge as in Figure
Two trees that differ by one internal edge only.
2) The second drawback is that MAST counts only the leaves that are removed from both input trees. If it is allowed to also count leaves that are present in only one tree in order to support a different leafset, then the distance will ignore some subtle changes. For example, in Figure
MAST. Example of the MAST tree between T1 and T2.
Edit Distance Advantage
In the previous sections, we showed the drawbacks of RF and MAST distances and showed that the Edit Distance is better because:
• it can be used for trees on a free leafset
• it can distinguish differences where the MAST distance cannot as it can use both contraction and pruning.
Let us compare the values of these distances. The Edit Distance is easily compared to the RF distance, provided the same cost of contraction is used to count both distances. The only difference between them is that the RF distance cannot use pruning. However, it is impossible to compare the values directly to MAST as this distance is not well defined if the leaf is removed from one tree only, and the cost of forced contractions is ignored. Therefore, in order to compare the values, we will use the RF distance (denoted here as the cdistance) and instead of MAST, we will count the cost of each pruning operation and forced contraction (denoted here as the pdistance). Some distance values are presented in Table
Values of c, p and edit distances for various examples.
Fig (trees) | c-dist | p-dist | Edit-dist
1 (T1, T2) | 1 | 2 | 1
2 (T1, T3) |  | 2 | 2
11 (T1, T3) |  | 5 | 3
14 (T1, T2) | 4 | 4 | 4
13 (T1, T2) | 1 | 4 | 1
15 (T1, T2) | 8 | 7 | 4
These results show that in some situations pruning operations are better at unifying trees, in others contractions are, and sometimes neither performs well on its own; there are also cases in which using both is better. To sum up, there are cases in which one editing operation is better than the other. Pruning is better: when the trees are not on the same leafset, pruning is necessary (figures
Two trees on the same leafset where the Edit distance is more suitable than others.
Cost Manipulation
The difference between the Edit Distance and other distances is visible especially when the cost of operations is not the same. Although in some cases both operations can be equally good, one may prefer for example contraction over pruning in some cases. The motivation can be for example the need to have as many leaves as possible in the tree edit consensus. Therefore, our distance uses the costs of editing operations. For example, consider the trees T1 and T2 in Figure
Tree Edit Distance Metric Proof
In order to show that our measure, denoted here by $d$, is a true metric, the following conditions must be proved:
• $d(T_1, T_2) = 0$ if and only if $T_1 = T_2$
• $d(T_1, T_2) = d(T_2, T_1)$
• $d(T_1, T_3) \le d(T_1, T_2) + d(T_2, T_3)$
The first two conditions are met by definition. The minimal edit script that unifies T1 with itself contains no operations, so the distance is equal to 0. Conversely, two different trees T1 and T2 can be unified only by applying some editing operations, and because every cost must be positive-valued, the distance between different trees cannot be 0.
As the Definition states that the distance is the minimal cost of unifying two trees, by applying the editing operations either to T1 or T2, it is therefore symmetric by Definition.
The third condition is slightly more complicated and requires more explanation:
Lemma : Having the edit scripts corresponding to distances
Proof:
Let us denote:
• TPCX: the tree edit consensus (unification) of trees T1 and T2.
• TPCY: the tree edit consensus (unification) of trees T2 and T3.
• Sx1(T1) = TPCX: the edit subscript that transforms T1 to TPCX
• Sx2(T2) = TPCX: the edit subscript that transforms T2 to TPCX
• Sy2(T2) = TPCY: the edit subscript that transforms T2 to TPCY
• Sy3(T3) = TPCY: the edit subscript that transforms T3 to TPCY
The mentioned artefacts are presented in Figure
Illustration for proof of lemma 1.
Because
Therefore, there exists some tree TPCZ, such that
Theorem: The tree edit distance for leaflabelled trees meets the third metric condition.
Proof: Due to Lemma, presented earlier, there exists an edit script
and because
therefore:
Edit Subscript Order
In this section we show that for a given edit subscript (i.e. a set of operations on one tree), changing the order of operations in it will not change the resulting tree. Therefore, it will also not increase the costs. In order to show this, we need to show that if edit script consists of operations:
Let us assume that a tree is represented with two sets
One may notice that pruning also removes some edges, but only trivial ones, which are not considered in edit distance and may be removed at any time. One may also notice that pruning changes the bipartition representation of all nontrivial splits. It is also not a problem, as the total number of edges is not affected. Although we use split representation very often, here the number of edges is important (not the form of their split representations).
As it was presented earlier in this paper, pruning may occasionally introduce forced contraction (see Figure
Let us represent pruning
The last thing that we mention is the edge matching. Forced contraction removes the edge, which is a duplicate of another edge with respect to their split representation. For example
Algorithm for Counting Edit Distance of Leaf-labelled Trees
The naive algorithm for this problem can be illustrated as follows:
where
The algorithm can also be presented with pseudocode as follows:
This algorithm is exponential with respect to the number of leaves. It may be possible to improve it so that it has the same complexity as MAST for two trees (which is polynomial), but further investigation is required. For the purpose of this paper, we used a dynamic programming algorithm, in which partial results are stored in memory and reused when necessary. It turned out that the algorithm needed to evaluate only a small part of all possible combinations, which gives grounds for optimism that a better algorithm will be found.
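Since the recurrence and pseudocode are not reproduced in this text, the following Python sketch shows one possible reading of a memoised naive search, not the authors' implementation. It assumes that a tree is given as a leafset plus its nontrivial splits, that leaves present in only one tree are pruned from that tree, that pruning a shared leaf is charged once per tree, and that every nontrivial split lost during a prune is a forced contraction charged at the contraction cost (`count_forced=False` gives a variant that ignores forced contractions). All names (`edit_distance`, `c_cost`, `p_cost`, `count_forced`) are hypothetical.

```python
from functools import lru_cache


def split(a, b):
    """Build the split A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(a), frozenset(b)})


def prune(splits, leaf):
    """Remove `leaf` from every split; drop trivial splits and merge duplicates."""
    kept = set()
    for s in splits:
        sides = [side - {leaf} for side in s]
        if all(len(side) >= 2 for side in sides):
            kept.add(frozenset(sides))
    return frozenset(kept)


def edit_distance(leaves1, splits1, leaves2, splits2,
                  c_cost=1, p_cost=1, count_forced=True):
    forced_cost = c_cost if count_forced else 0

    def prune_with_cost(splits, leaf):
        # Assumption: each nontrivial split lost by the prune is one forced contraction.
        pruned = prune(splits, leaf)
        return pruned, p_cost + forced_cost * (len(splits) - len(pruned))

    @lru_cache(maxsize=None)  # partial results are stored and reused
    def solve(leaves, s1, s2):
        # Option 1: keep all leaves and contract every unmatched split.
        best = c_cost * len(s1 ^ s2)
        if best == 0:
            return 0
        # Option 2: prune one common leaf from both trees and recurse.
        for leaf in leaves:
            n1, cost1 = prune_with_cost(s1, leaf)
            n2, cost2 = prune_with_cost(s2, leaf)
            if cost1 + cost2 < best:  # otherwise this branch cannot improve
                best = min(best, cost1 + cost2 + solve(leaves - {leaf}, n1, n2))
        return best

    l1, l2 = frozenset(leaves1), frozenset(leaves2)
    s1, s2 = frozenset(splits1), frozenset(splits2)
    total = 0
    for leaf in l1 - l2:  # leaves that occur only in the first tree
        s1, cost = prune_with_cost(s1, leaf)
        total += cost
    for leaf in l2 - l1:  # leaves that occur only in the second tree
        s2, cost = prune_with_cost(s2, leaf)
        total += cost
    return total + solve(l1 & l2, s1, s2)


# Two hypothetical caterpillar trees that differ only in the position of leaf x.
t1 = {split("xa", "bcde"), split("xab", "cde"), split("xabc", "de")}
t2 = {split("ab", "cdex"), split("abc", "dex"), split("abcd", "ex")}
print(edit_distance("abcdex", t1, "abcdex", t2))                      # 4
print(edit_distance("abcdex", t1, "abcdex", t2, count_forced=False))  # 2
print(edit_distance("abcdex", t1, "abcdex", t2, p_cost=3))            # 6
```

In this example contracting all six unmatched splits costs 6, whereas pruning the single misplaced leaf x from both trees costs 4 with unit costs; raising the pruning cost to 3 makes contraction the cheaper option again, which is the kind of tuning the cost parameters are meant to allow.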
Let us look at a few steps of the naive algorithm for trees T1 and T2 from Figure
T1:
T2:
Trees are built on the same leafset so we may directly calculate
T1':
T2':
plus the trivial splits
Curly braces denote splits that will be force-contracted.
So let us remove
T1':
T2':(
T1":(
T2":(
step: 4, so the total cost is equal to 8; thus we obtained a worse result. We will not continue with pruning other leaves, as it will not lead to a better result.
Therefore the best cost is 6, and the edit script contains two subscripts:
From T1: p(d),
From T2: p(d),
Tree Edit Consensus Tree
Similarly, we may define a new consensus method on the basis of editing operations called the Tree Edit Consensus Tree. The Tree Edit Consensus Tree is the maximal (with respect to leaves and edges) common subtree of the input trees, obtained by contraction and pruning operations and is defined as follows:
Definition [Tree edit consensus tree (PC-consensus tree)] Having defined the positive value costs of contraction and pruning operations, the tree edit consensus for leaf-labelled trees
Tree Edit Consensus Algorithm
As with the edit distance, the algorithm is based on the fact that if a pruning operation is used, it must be used on all input trees, and on the fact that contraction is performed only for splits that do not occur in all input trees. The naive dynamic programming algorithm, which computes the score of the tree edit consensus, may be defined as follows:
where
The tree edit consensus tree may be obtained by recording the pruning operations used along the optimal path. The recorded prunings must be applied to the input trees, and afterwards all unmatched edges must be contracted (strict consensus tree).
Quality of similarity measures
The quality of similarity measures is not easy to estimate. The best possible method would be one based on external criteria, i.e. on expert knowledge. In biological applications, it could be a comparison of the consensus tree with the true phylogenetic tree. The true tree, however, is not known. We agree with the opinion presented in
The methods that can be applied to measure the quality of distance measures and consensus techniques can be roughly divided into:
• qualitative methods which try to define properties that the given consensus method or similarity measure must meet as in
• quantitative methods which try to measure the quality of consensus or similarity methods such as
• statistical methods which display the statistical properties of the given method to help an expert score the method instead of scoring it automatically, because the quality of a metric may depend on the application.
In an axiomatic approach, the most common requirement for the similarity measure is that it meets metric properties, or at least pseudometric ones.
The quantitative approach is not very suitable for distance measures due to the lack of objective criteria. Even if we are supplied with biological data that contain groups of trees and can compute, for example, the ratio of inner-group distances to between-group distances, such an approach is not very trustworthy. This is because we see the effect of the distance measure only on selected sets, and it may differ for different parts of the tree space. We also ignore some potential properties of the distance, for example that it may be better for some tree topologies but worse for others, an observation that could give hints on where to use it and where not. Simply put, the quality of a metric may depend on the application.
Apart from proving the metric properties of some distances, we choose the statistical approach as described in
• Analysis of distance probability distribution
• Analysis of distance dynamics with respect to number of changes in trees.
In the first approach, we count the distance for a large number of randomly generated unrooted trees according to different distributions and examine the distribution of probability. The details of random generation of rooted trees can be found in
• whether the distance distribution has any regularities, follows any known distribution, which proves that the distance does not work in a random fashion
• whether the distance is sufficiently discriminative (takes a large number of values), and whether the discrimination property is equally strong for similar and different trees.
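The distribution analysis described above can be sketched as a small simulation. The sketch makes several assumptions that are not taken from the paper: trees are generated by repeatedly joining two random clusters (which is not necessarily any of the distributions used in the experiments), the distance is the RF distance computed as the size of the symmetric difference of the nontrivial splitsets, and the names `random_binary_tree` and `rf_distribution` are hypothetical.

```python
import random
from collections import Counter


def random_binary_tree(leaves, rng):
    """Random unrooted binary tree, returned as its set of nontrivial splits.
    Two random clusters are joined until only three clusters remain."""
    all_leaves = frozenset(leaves)
    clusters = [frozenset([x]) for x in leaves]
    splits = set()
    while len(clusters) > 3:
        a, b = rng.sample(range(len(clusters)), 2)
        merged = clusters[a] | clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        splits.add(frozenset({merged, all_leaves - merged}))
    return splits


def rf_distribution(n_pairs=1000, leaves="abcdefgh", seed=0):
    """Histogram of RF distances over randomly generated pairs of trees."""
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(n_pairs):
        t1 = random_binary_tree(leaves, rng)
        t2 = random_binary_tree(leaves, rng)
        hist[len(t1 ^ t2)] += 1
    return hist


hist = rf_distribution()
for distance in sorted(hist):
    print(distance, hist[distance])
```

For binary trees on 8 leaves every tree has 5 nontrivial splits, so only the even values 0 to 10 can occur; a histogram with so few bins is one way to see how weakly discriminative a measure is.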
The other approach is to mutate the random tree with different mutation operations and see how the distance changes.
More details about the experiments will be provided in the Results section.
Results and Discussion
In this section, an experimental evaluation of the proposed methods is presented. For the experiments, we use randomly generated trees with different distributions and evaluate the statistical properties of the similarity measures as described previously in this paper. Of our proposals, we decided to evaluate the Tree Edit Distance, the No Forced Contraction Similarity Measure (called NFC here) and the FS-based similarity measure.
For comparison with existing distance measures we have chosen the RF and MAST distances. RF is one of the most popular and computationally efficient distances, and MAST and RF are in a way the foundations of the Edit measures presented in this paper. We decided not to normalise the distance values because normalisation is sometimes not obvious (as in the case of the Edit distance). Normalisation is not necessary in the first experiment, as we study the distribution rather than absolute values. In the second experiment, the lack of normalisation does not prevent us from observing the dynamics; it only prevents us from spotting the crossing points of the distances. Normalising by the maximum observed value, as done in the literature, in our opinion distorts the results, because the graph is distorted whenever the real maximum value is not attained. The only modifications are made to the FS dissimilarity measure, whose values are scaled and biased so that they can be compared with other distances on the same chart.
In this experiment, trees with 8 leaves are presented, however tests were also performed with trees with up to 17 leaves for unconstrained trees and 12 leaves for binary trees, with similar results being obtained.
Distribution of Distance Probability
For this test, 1000 pairs of trees with 8 leaves were generated and the distribution of probability
Unrooted binary leaflabelled trees on the same leafset
First consider Figure
Comparison of distributions of selected measures. Comparison of RF, MAST and E(1,1) distributions.
Figure
Comparison of distributions of selected measures. Comparison of the E(1,1), E(1,2), NFC(1,1), NFC(1,2) and NFC(2,1) distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
The distances E(2,1) and E(3,1) are significantly different, especially E(3,1) which is compared to MAST and RF in Figure
Comparison of distributions of selected measures. Comparison of the E(3,1), MAST and RF distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Conclusion: The first conclusion is that by modifying the costs of the Edit distance, we can obtain a measure with very well-behaved properties: it is very discriminative and suitable for both similar and dissimilar trees. Moreover, the similarity of the Edit, RF and MAST distributions shows that the distance is not accidental.
The FS similarity measure is the hardest to interpret (see Figure
Comparison of distributions of selected measures. Comparison of FS dissimilarity measure with RF distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Unrooted unconstrained leaflabelled trees on the same leafset
This distribution leads to similar observations and conclusions. The E(1,1), E(1,2), NFC(1,1), NFC(1,2), NFC(2,1) distributions are similar or identical (the figure has been omitted). E(1,1) and RF are again similar, however the distribution does not rise asymptotically with increasing distance value Figure
Comparison of distributions of selected measures. Comparison of E(1,1) with RF and MAST distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Comparison of distributions of selected measures. Comparison of E(3,1) with RF and MAST distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Comparison of distributions of selected measures. Comparison of FS dissimilarity measure with RF and MAST distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Unrooted leaflabelled trees on a free leafset
In this experiment, trees with at most 8 leaves were generated. The binary and unconstrained versions are discussed together, as they differ only in the RF distance. The characteristics of the RF distribution in this experiment do not resemble a typical RF distribution. The main reason is that RF is unsuitable for comparing trees with different leafsets, as it will always return the maximum value, which also depends on the number of leaves of the trees. Therefore the distribution reflects the conditional probability of selecting two trees with the same leafset (left part of the graph) and trees with different leafsets (right part of the graph) of Figure
Comparison of distributions of selected measures. RF distributions for binary trees on free leafset, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Comparison of distributions of selected measures. RF distributions for unconstrained trees on free leafset, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Comparison of distributions of selected measures. Comparison of E(3,1) distance and MAST distributions, showing the number of obtained pairs of trees (y axis) with certain distance values (x axis) in 1000 trials.
Conclusion: To summarise the key points of this experiment:
• The RF distance is not very discriminative for binary trees and is also weak for distant trees. It is not suitable for trees with different leafsets.
• The MAST distance is good for the same and different leafset, and is good both for distant and similar trees, however it is only weakly discriminative.
• The Edit distance, with the variant where cost of contraction = 1 and pruning = 3, looks very promising as it has a wide range of values and is equally good for distant and similar trees.
• The FS dissimilarity measure is similar to the Edit distance, but it does not have a very regular distribution.
• NFC here behaves like E(1,1) i.e. it is equivalent to RF for the same leafset and equivalent to MAST for different leafsets, which is good. However it is still only very weakly discriminative.
Dynamics of Distances
For this test, one tree is randomly generated and then the second tree is obtained with k mutation operations. Here, we observe the dynamics of distance changes with respect to number and type of mutations. Due to the nature of most of the examined distances i.e. Edit Distance, No Forced Contraction Similarity Measure, MAST and RF, we use the following types of mutation:
• Contraction: we remove a randomly selected split
• Pruning: we remove a randomly selected leaf
• Nearest Non-Brother Interchange (NNBI).
Nearest Non-Brother Interchange (NNBI) is a modification of the NNI operation
Nearest Non-Brother Interchange of leaves f and d.
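The contraction and pruning mutations can be sketched on the splitset representation as follows. This is an illustrative sketch under assumptions: trees are stored as sets of nontrivial splits, the RF distance is taken as the size of the symmetric difference of two splitsets, and all function names are hypothetical. NNBI is left out because its full definition is not reproduced in this text.

```python
import random


def split(a, b):
    """Build the split A|B from two strings of single-character leaf labels."""
    return frozenset({frozenset(a), frozenset(b)})


def contraction_mutation(splits, rng):
    """Remove one randomly selected nontrivial split, if any is left."""
    if not splits:
        return set()
    victim = rng.choice(list(splits))
    return set(splits) - {victim}


def pruning_mutation(leaves, splits, rng):
    """Remove one randomly selected leaf and clean up the splitset."""
    leaf = rng.choice(sorted(leaves))
    pruned = set()
    for s in splits:
        sides = [side - {leaf} for side in s]
        if all(len(side) >= 2 for side in sides):  # drop trivial splits
            pruned.add(frozenset(sides))           # the set merges duplicates
    return set(leaves) - {leaf}, pruned


def rf(s1, s2):
    """RF distance as the number of splits present in exactly one tree."""
    return len(set(s1) ^ set(s2))


def contraction_dynamics(splits, max_k, seed=0):
    """RF distance to the original tree after 0, 1, ..., max_k contractions."""
    rng = random.Random(seed)
    mutated, curve = set(splits), [0]
    for _ in range(max_k):
        mutated = contraction_mutation(mutated, rng)
        curve.append(rf(splits, mutated))
    return curve


# Hypothetical starting tree: a binary caterpillar on 8 leaves.
LEAVES = "abcdefgh"
CATERPILLAR = {
    split("ab", "cdefgh"), split("abc", "defgh"), split("abcd", "efgh"),
    split("abcde", "fgh"), split("abcdef", "gh"),
}
print(contraction_dynamics(CATERPILLAR, 5))  # [0, 1, 2, 3, 4, 5]
```

Under these assumptions each contraction mutation raises the RF distance by exactly one until no internal edge is left, which gives a baseline against which the other measures can be plotted.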
To analyse the results, let us see the distances counted with respect to the contraction operation (Figure
Comparison of distances with respect to number of contraction mutations.
For a pruning mutation, the situation looks very similar (see Figure
Comparison of distances with respect to pruning mutation.
Comparison of distances with respect to all mutations. Comparison of distances with respect to pruning, contraction and NNBI mutation.
Conclusions
In this paper we have proposed a new technique for measuring the distance between leaf-labelled trees on a free leafset, and evaluated it against the frequent-subsplit-based method and other measures. The tree edit distance was proven to be a metric and has the advantage of using different costs for contraction and pruning, so its properties can be tuned depending on the needs of the user. It is difficult to pick the best distance measure, as they all have different interesting properties and may be used in different applications. Two of the presented methods have the most interesting properties. E(3,1) is very discriminative (having a wide range of values) and has a very regular distance distribution, similar in shape to a normal distribution, and is good for both similar and dissimilar trees. NFC(2,1), on the other hand, is proportional or nearly proportional to the number of mutation operations used, irrespective of their type. Both accept separate costs for contraction and pruning, so their behaviour can be tuned to the needs of the user. Future work will be dedicated to finding a more efficient algorithm for the tree edit distance and to a thorough experimental evaluation of the tree edit consensus method for leaf-labelled trees on the same leafset.
Authors' contributions
JK conceived the study, KW coordinated and supervised the study. All authors read and approved the final manuscript.