Email updates

Keep up to date with the latest news and content from BMC Evolutionary Biology and BioMed Central.

Open Access Highly Accessed Research article

Looking for trees in the forest: summary tree from posterior samples

Joseph Heled12* and Remco R Bouckaert12

Author Affiliations

1 Department of Computer Science, University of Auckland, Auckland New Zealand

2 Computational Evolution Group, University of Auckland, Auckland New Zealand

For all author emails, please log on.

BMC Evolutionary Biology 2013, 13:221  doi:10.1186/1471-2148-13-221

The electronic version of this article is the complete one and can be found online at:

Received:18 July 2013
Accepted:18 September 2013
Published:4 October 2013

© 2013 Heled and Bouckaert; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Bayesian phylogenetic analysis generates a set of trees which are often condensed into a single tree representing the whole set. Many methods exist for selecting a representative topology for a set of unrooted trees, few exist for assigning branch lengths to a fixed topology, and even fewer for simultaneously setting the topology and branch lengths. However, there is very little research into locating a good representative for a set of rooted time trees like the ones obtained from a BEAST analysis.


We empirically compare new and known methods for generating a summary tree. Some new methods are motivated by mathematical constructions such as tree metrics, while the rest employ tree concepts which work well in practice. These use more of the posterior than existing methods, which discard information not directly mapped to the chosen topology. Using results from a large number of simulations we assess the quality of a summary tree, measuring (a) how well it explains the sequence data under the model and (b) how close it is to the “truth”, i.e to the tree used to generate the sequences.


Our simulations indicate that no single method is “best”. Methods producing good divergence time estimates have poor branch lengths and lower model fit, and vice versa. Using the results presented here, a user can choose the appropriate method based on the purpose of the summary tree.


Bayesian Markov Chain Monte Carlo (MCMC) analysis provides powerful and popular techniques for performing phylogenetic analysis. The result of such an analysis is a set of trees drawn from the posterior distribution. The set of correlated draws is often condensed into a single tree for visualistion, comprehension, annotations and presentation. When most trees agree in topology and branch lengths, the most frequent tree topology, properly annotated, can give a fair representation of the posterior distribution.

However, a single tree can be misleading, especially when the agreement between posterior trees is small. Posterior tree topologies can be reduced to a set of common sub-topologies [1], but this is also a fragmented view of the posterior. Tree drawing programs such as FigTree [2] can annotate internal nodes with the clade posterior support (the fraction of posterior trees containing the clade), and the credible interval of internal node ages.

Still, the choice of any specific topology highlights one alternative at the expense of others. The tree drawing program DensiTree draws all posterior trees transparently [3]. Where most trees agree in topology and node height, lines are close to each other and distinct edges appear, while areas of uncertainty in topology or heights remain a blur. The composite image allows a direct assessment of posterior support and node height uncertainty by visual inspection. But even the display of the full posterior can be hard to interpret when the uncertainty gets large, and a summary tree overlaid on top can be useful in such situations.

There are many ways of obtaining a “representative” tree topology from a collection of trees. One group of methods look for consensus among the trees using splits, clusters or rooted triplets/quartets present in posterior trees [4]. The TreeAnnotator utility in BEAST [5] uses the clade frequencies as estimated by the posterior to score each tree, selecting the rooted topology of the highest scoring tree amongst the set. This results in a fully resolved topology with a non zero support, in contrast to consensus methods which often produce unresolved trees. A recently published method uses conditional clade splits probabilities to compute a probability for each posterior tree [6].

However, selecting a representative topology is only the first step in generating a summary tree from the output of programs such as BEAST. BEAST trees are rooted with branches proportional to time, as is the summary tree. In the second step, TreeAnnotator assigns a divergence time for each clade using the ages of matching clades from the posterior. If the number of trees containing the clade is small, the divergence time estimate can have high variance which may result in negative branch lengths. Clades in trees from the MCMC samples which do not appear in the summary tree are essentially ignored, and sometimes a large proportion of the posterior goes unrepresented. Ignoring non-matching parts appears to be the accepted practice and is used in the SumTrees utility in DendroPy [7].

In this paper we describe several new ways for building rooted summary trees. These new constructions use more of the information contained in the posterior even when the disagreement between posterior trees is high. Some of the methods are based on rooted tree distances, and are similar in spirit to the method developed by Huggins et al. for unrooted trees [8]. We perform an extensive simulation study and compare the trees from all methods using multiple criteria. Summary trees are assessed with respect to the posterior and by their distance to the tree used in generating the sequence data. The methods are implemented in biopy [9], and integrated with DensiTree, making it easy to examine the summary tree in the context of the full posterior.


Definitions and notations

We define a rooted tree as a collection of clades with ages. Specifically, a tree is a strict hierarchy of clades, where each clade is a subset of the taxa, and a non-negative age is associated with each clade.

Formally, a tree T is a triplet ( L , , h ) , where L = x 1 , x 2 , , x l is the set of taxa and = C 1 , C 2 , , C n is a set of clades. Each clade CiL is a subset of taxa, and h : 0 is a function assigning an age to the clade. The set describes only the clades hierarchy and is referred to as the tree topology. Sometimes we shall use cT as a shorthand for T = ( L , , h ) and c (“clade c is present in tree T”).

To qualify as a tree, the following conditions must hold:

i The tree contains all leaves: i { x i } .

ii The tree contains a root: L .

iii Strict hierarchy of clades: for any two clades c 1 , c 2 , either C1C2, C2C1 or C1C2=. (Note that C1C2 implies C1C2, otherwise we write C1C2.)

iv Non-Negative branches: for c 1 , c 2 , c1c2h(c1)≤h(c2).

For any clade c, the elements in the set A ( c ) = { y : c y } are called ancestors of c, and the minimal element P(c) in A is the parent of c. Every clade except the root has a parent and by association a branch to its parent with length b(c)=h(P(c))−h(c). For convenience, the branch length of a subset not in is defined as zero. Any subset of taxa x has a Most Recent Common Ancestor in the tree, the minimal clade containing all members of x. Formally, ca(x) is the minimal element of { y : x y } . For brevity we omit the tree when the context is clear, and use ca(c,T) to explicitly associate the clade with the tree T.

Extending the domain of b(·) to all taxa subsets simplifies definitions involving sets of trees with different topologies. We extend h(·) for the same reason and define the age of any subset xL to be the age of the common ancestor of x, h ̄ ( x ) = h ( ca ( x ) ) .

Using h ̄ , we define the heights error, a discrepancy score between clade ages of T = ( L , , h ) and a reference tree Tref,

ε H ( T , T ref ) = c | h ( c ) h ̄ ref ( c ) | . (1)

The heights error is the total sum of clade age errors, whether they appear in the reference tree or not. The age of a clade which is not in the reference tree is taken to be the age of the MRCA of the clade taxa, which spans a larger clade in the reference tree. Note that the definition is not symmetric. Alternatively we define the divergence times error which focuses on the time lineages split from each other. The divergence time for any clade xL is the mean divergence time of all pairs of x. Formally, we start with the pairs of taxa which split at the clade; those are the pairs in x whose common ancestor is the clade,

D ( x , T ) = a , b x : ca ( T , a , b ) = x . (2)

Now the average split time is the mean of all pair splits,

h ~ ( x , T ) = | D ( x , T ) | 1 a , b D ( x , T ) h ̄ ( { a , b } ) . (3)

Finally, The total error is,

ε D ( T , T ref ) = c T | h ( c ) h ~ ( c , T ref ) | (4)

The clade and divergence errors are equal for trees with the same topology, but they differ when topologies disagree, and the difference usually increases with the distance in topology.

The clades missed error counts the number of clades in Tref not present in T,

ε cm ( T , T ref ) = | ref | | c ref : b ( c ) > 0 | . (5)

This number is equal to (half) the Robinson-Foulds tree distance [10] when T has no zero length branches. A clade with a zero branch does not count as a match because it is potentially confused with its parent. The clades called error scores a 1 for correctly called clades and a -1 penalty for incorrectly called clades,

ε cc ( T , T ref ) = | c ref : b ( c ) > 0 | | c ref : b ( c ) > 0 | . (6)

A tree set T = T 1 , T 2 , , T k is a set of trees on shared taxa. Typically those sets are samples from a Bayesian analysis, and we define the posterior frequency F(x) of xL as the fraction of times x is present as a clade in the trees:

F ( x ) = 1 | T | | T T : x T | . (7)

The posterior frequency of a subset not in any of the trees is zero.

Distance between trees

The Rooted Branch Score (RBS) measures the distance between two rooted time trees, and is the total sum of the difference in branch lengths of matching clades. This definition is motivated by the distance between unrooted trees [11], but the space of rooted trees is more complex than its unrooted counterpart since branch lengths are not free to vary independently of each other [12]. Since by convention the branch length of a missing clade is zero, any clade present only in one tree contributes its total length to the score.

Formally, for T 1 = ( L , 1 , h 1 ) and T 2 = ( L , 2 , h 2 ) we have,

RBS ( T 1 , T 2 ) = c 1 2 | b ( 1 ) ( c ) b ( 2 ) ( c ) | . (8)

The Squared Branch Score (SRBS) is similar, but taking the square of the difference instead of the absolute value,

SRBS ( T 1 , T 2 ) = c 1 2 ( b ( 1 ) ( c ) b ( 2 ) ( c ) ) 2 . (9)

The Heights Score (HS) takes the difference between clade ages instead of branches. Like the RBS, branches appearing in only one tree are added to the sum,

HS ( T 1 , T 2 ) = c 1 2 | h ( 1 ) ( c ) h ( 2 ) ( c ) | + c 1 2 b ( c ) + c 2 1 b ( c ) . (10)

The heights score is a (non-optimal) edit distance, where the score is the total sum of a sequence of moves which transform one tree into the other. Each move involves sliding an internal node, and two nodes may “merge” into one when they meet.

The Rooted Agreement Score (RAS) measures the disagreement between branches by treating them as intervals. Two branches may be of the same length and still contribute to the distance if they span different intervals as measured from the time of the tips. The score, when divided by the sum of the length of the two trees, is the probability that a random point chosen uniformly on one of the trees has a corresponding point on the other tree. Formally,

RAS ( T 1 , T 2 ) = c 1 2 μ b ( 1 ) ( c ) b ( 2 ) ( c ) + c 1 2 b ( c ) + c 2 1 b ( c ) , (11)

where b ( x ) is the interval spanned by the clade branch, b ( x ) = [ h ( x ) , h ( x ) + b ( x ) ] and △ is the symmetric difference operator, that is

μ ( [ l 1 , h 1 ] [ l 2 , h 2 ] ) = ( h 1 l 1 ) + ( h 2 l 2 ) 2 max ( min ( h 1 , h 2 ) max ( l 1 , l 2 ) , 0 ) . (12)

RBS and RAS are metrics in tree space, while SRBS and HS are not. RBS is a metric since branches can be mapped to the vector space 2 n 1 [8], and a similar argument works for RAS. However, we only require that distances are semimetrics and make no use of the triangle inequality.

Summary trees

BEAST Tree annotator

The Tree Annotator utility in BEAST generates a summary tree using a two stage procedure. First, each posterior tree is assigned a score based on topology. The Clade Credibility of a tree is the product of posterior frequencies (equation (7)) of all clades in the tree,

CC ( T = ( L , C , h ) ) = c C F ( c ) .

The Maximal Clade Credibility (MCC) tree is the tree with the highest score, and we shall refer to its topology as the MCC topology. In the second step, each clade is assigned an age based on the clade age in posterior trees. Formally, the age is set as either the mean or the median of the set of ages

H ( c , T ) = h i ( c ) : ( L , i , h i ) T and c i .

Since each age is set independently, the end result is not guaranteed to be a tree (condition iii). A few “negative branches” are not an unusual occurrence in trees with a medium to large number of taxa and moderate posterior uncertainty.

Minimum distance trees

The distance between the tree set and the tree T is defined as the mean distance of T to all members of ,

d ( T , T ) = | T | 1 t i T d T ( T , t i ) , (13)

where dT is one of the tree scores defined previously. A Minimum Distance Tree is a tree which minimizes d ( T , T ) . While the definition is simple and natural, the details are not. First, the minimal tree is not necessarily unique; there might be several or even an infinite number of minimal trees in some cases. Second, with anything more than a few taxa the space of trees is vast and topologically complex, so there is no guarantee of finding the minimal tree. We therefore limit the search to the topologies present in the posterior, and designate this approach by a lowercase ‘m’ followed by the distance method (mRBS, mRAS, etc). However, even this can be time consuming when the posterior contains many topologies, and in addition we examine a family of methods which consider just a single topology, using one of the heuristics outlined in the next section. The details about the algorithm for searching the best branch assignment for a specific topology are in Appendix 2.

Selecting a topology

All of the two stage methods we considered selects a topology first and assign branch lengths conditional on that topology. We examined three alternatives to the MCC for selecting a topology.

The first alternative uses the recently published Conditional Clade Probability Distribution (CCD). The CCD computes a probability for each tree based upon the posterior probability of the splits in the tree, conditional on the clade posterior frequency [6]. The second is a the Total Clade Branch (TCB), which assigns a score to each clade in the tree equal to the total length of matching branches in the posterior. The total length reflects the support for a clade by combining both the frequency (the number of trees with the clade) and confidence, under the assumption that longer branches are more likely to be “real” than shorter branches. The third is the Highest Posterior Frequency (HPF), which picks the topology of the tree most frequent in the posterior. To break ties, the HPF picks the tree whose height is closest to the mean root height of the posterior.

CA Tree

Negative branches in the TreeAnnotator tree result from using a different subset of posterior trees for estimating each clade age. In the Common Ancestor Tree (CAT), every clade x is assigned an age using the mean of the clade age in all posterior trees. Formally,

h ( c ) : = | T | 1 T i T h ̄ ( c , T i ) (14)

The generated ages always produce a tree, since x y h ̄ ( x , T i ) h ̄ ( y , T i ) . Unlike TreeAnnotator, which may end up using a small number of values for some clades, CAT uses | T | posterior values for estimating the age of each clade.

Taxa partitions tree

We now present the Taxa Partition (TP) tree, a single stage method which does not commit to a particular topology before assigning ages. The TP is inspired by the tree operator described by Mau et al[13]. In this representation each internal node is assigned a left/right orientation, inducing a linear order on the taxa and positioning each internal node between two tips (Figures one and two in [13]). We reverse the process by first ordering the taxa, then using the posterior to assign the ages between tips and finally reconstructing the tree topology from the ages.

For a given ordering of taxa, each posterior tree provides ages according to its topology. A clade contributes an age if it spans an unbroken range in the ordering. For example, for the order [a b c d], the tree (((a,b),c),d) contributes the age of (a,b) to [a | bcd], the age of ((a,b),c) to [ab | cd] and the root height to [abc | d]. The tree ((a,d),(b,c)) contributes only the age of (b,c) to [ab | cd]. (a,((d,b),c)) contribute only its root height to [a | bcd].

After collecting ages for all splits from the posterior, a point estimate of the height at each split is used to build the tree. The precise definitions are given in Appendix 2.

TP incorporates clade ages from competing topologies before committing to the final topology. For example, take the set with a mixture of two topologies, ((a,b),c) and (a,(b,c)). With the obvious ordering [a b c], TP uses all ages in every tree, and the choice between the two topologies is determined by the age of the [ab | c] and [a | bc] splits. If [ab | c] is higher we end up with ((a,b),c), otherwise with (a,(b,c)).

Finding an optimal ordering is hard. Assigning an orientation which minimizes the distance between taxa orders of just two trees is NP complete [14]. We use a fast heuristic which proved effective in practice: build a distance matrix for pairs of taxa and use simple clustering to build the ordering. The distance between taxa a and b in each tree is the size of the clade of their common ancestor, d(a,b)=|ca({a,b})|. The overall distance is the mean of pair distances over all posterior trees. The clustering starts with each taxon in its own group, then progressively joins the two closest groups.

Test cases

To evaluate the various methods we generated 2000 test cases, divided into 20 groups of 100 repeats. For each case, a tree with n tips was drawn from the Kingman coalescent distribution [15] with population size Ne. All repeats shared the same n and Ne, and each group was assigned one pair from the 5x4 grid formed by n=8,16,32,64,128 and Ne=1,2,4,8.

A sequence of length 800bp was generated for the tips of the tree, starting with an ancestral sequence at the root and mutating the sequence along the branches using the Jukes-Cantor substitution model [16] with a mutation rate of 0.005. The sequences were analyzed using BEAST-2 [17] under the same model (Jukes-Cantor and a coalescent prior with constant population size). The tree and population size were estimated but the mutation rate was fixed at its true value. The chain was 2.2M steps, sampled every 2k steps. 200k of the initial samples were discarded (burn-in), leaving 1000 posterior samples. Those were used as input for building a summary tree by each of the methods under consideration.

The test trees contain 8 to 128 tips and range (on average) from a height of 0.01 substitutions to 0.08, or 2 to 16 million years for a nuclear mammalian gene. Sampling the posterior of such trees normally requires a longer MCMC chain, but here a relatively short one is sufficient. The data was generated under a simple model and the exact same model is used for inference, resulting in excellent mixing. Not only was the effective sample size high for all parameters, we made sure the clades were adequately sampled by running a second independent chain, starting with a different seed. We then computed the maximum of the absolute difference between posterior frequency of all clades; this number was well below 5% in most settings, and around 6% for the most diffuse case (128 tips and height of 0.01 substitutions).

The posterior for trees with 32 and more taxa was completely diffuse, with a distinct topology for each sample. Even the easiest cases (n=8 and Ne=8) contained between 1 and 45 distinct topologies, with a mean of 6. Also note that even when the posterior has a single topology, a method may do better that others by setting more accurate branch lengths.

Summary trees were compared using two main criteria: accuracy in estimating ages and model fit. The first criteria was broken into 3 related error measures: accuracy in estimating the root height, accuracy in estimating clade ages (equation 1) and accuracy in estimating divergence times (equation 4). The second criteria was also divided into three: the log-likelihood of the sequence data given the tree (tree likelihood), the log-likelihood of the tree under the coalescent (coalescent likelihood), and the overall model fit, which is the sum of the tree and coalescent likelihood.

How methods are ranked

The methods were compared by aggregating the results from all test cases. Let us take the root height as an example. For each test case, an error value is computed for each method by taking the absolute difference between the summery and true tree heights. Next, the methods are ranked by error using dense ranking (the 1-2-2-3 rule). Finally, the mean rank of each method is computed by averaging its rank over all 2000 tests.

This scoring procedure was repeated (bootstrapped) 4000 times. In each repeat 2000 test cases are sampled (with replacement) from the pool of 2000 test cases, and a mean score computed for each method. Method A was deemed better than B only if A’s mean ranking was greater than B’s in 90% (3600) of the bootstraps. The method gets a final score of 0 (best) if no other method is better, and a score of R+1 if there is a better method of score R.

The same process is repeated, using not the rankings of errors but the normalized error values. The normalization takes the errors of each case and transforms them to have a mean of 0 and a variance of 1. This ranks the methods by the magnitude of the error they make compared to other methods.

This may seem overly complex but making a fair comparison requires extra care. The methods and error measures are correlated in both obvious and subtle ways. Multiple criteria allows for a more nuanced comparison. Ideally, the particular mix of methods should not matter: adding a duplicate (or a very close variant) of one method should not penalize the ranking of lesser methods. Using dense ranking should minimize those effects. Strong correlations exist between the test settings (Ne and n) and the magnitude of errors, so aggregating results from the 20 groups requires some care. Rankings based on comparison alone are insensitive to those correlations, and the normalization of errors makes aggregation possible without going through the complex exercise of modeling the relations between settings. Another reason for using two rankings is that method A may be slightly better than B in (say) 60% of the cases, yet its errors in the other 40% are large. The difference between the two ranks would alert us to such situations.

Finally, any number of test cases, 2000 included, is small when considering the size of tree space. Bootstrapping provides some confidence that the results are stable and not due to random noise.


Table 1 lists the rankings of 22 methods for building summary trees. The table lists the comparison and error magnitude ranks for each of the 7 error measures: root height, clades missed and called (equations 5 and 6), ages and divergence times errors (equation 1 and 4), model fit, tree likelihood and coalescent likelihood. See Additional files 1 and 2 for the complete table and detailed per method rank graphs. Table 2 provides condensed rankings for the 22 methods together with performance statistics for each method obtained by averaging over the 2000 summary trees produced by each method.

Table 1. Rankings of methods for building a summary tree

Additional file 1. Supplementary material. Information about lesser performing methods which are mentioned only briefly in the main text.

Format: PDF Size: 935KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Additional file 2. Posterior summary rank graphs. Method rank graphs for each error measure.

Format: GZ Size: 8.9MB Download fileOpen Data

Table 2. Condensed rankings for methods in Table1with additional performance numbers


Clearly no method in Table 1 is “best”, but several interesting trends and patterns can be identified. The agreement of ranking by comparison and magnitude is excellent, suggesting a similar distribution of errors for all methods. The table shows 22 of the 55 methods examined; most of the reduction comes from removing methods using CCD and HPF to select a topology, as MCC/TCB were significantly better for almost all combinations of methods and error criteria. This is slightly surprising, especially since we expected CCD, which assigns a proper probability to every tree topology, to fare better than heuristics such as TCB or MCC. The on-line supplement compares the four selection methods in more detail.

As expected there is a strong correlation between model fit and tree/coalescent likelihood (r=0.89 and r=0.98), but in addition the tree and coalescent likelihood are strongly correlated as well (r=0.85). Basically, methods generating trees with a good model fit tend to do well on both counts. The only exception is TP(avg) with a good tree likelihood but bad coalescent likelihood. Also, low clade age errors and low divergence errors go together (r=0.79), again with TP(avg) as the exception. Slightly unexpected at first sight is the negative correlation (r=−0.88) between clades missed and clades called. Either a method plays it safe by calling only definite clades, and tends to miss a lot (CONS), or calls everything and makes more mistakes (TP).

The table shows a second unexpected result: strong negative correlation between clade age errors and model fit (r=−0.94). Since model fit is highly correlated with branch lengths (r=0.87), no method provides good clade ages and good branch lengths/model fit. Methods optimizing branches, such as RBS, generate trees with good fit but worse ages, and methods optimizing ages exhibit the opposite. This negative correlation exists between all measures of age and fit. It is quite interesting that the two variants of the TP end up at different ends: medians give better model fit while means gives lower divergence errors.

Another performance split can be observed between pairs employing the same method for setting branch lengths but using MCC and TCB for selecting the topology. The MCC variant has better model fit, while the TCB fares better with clade calls and misses.

While Table 1 makes it easy to compare pairs of methods, it is quite hard to interpret as a whole. Table 2 complements it by aggregating some performance ranking and adding a few per-method statistics. The first statistic is the mean number of zero length branches in the summery tree, which effectively create polytomies. The methods in the table are divided into three groups: those who never create polytomies, those with occasional polytomies (up to 5%), and those with a high number (20% or more). The number of polytomies is strongly correlated with missed and called clades: methods which “resolve” conflict in the posterior by not committing and creating zero length branches miss more true clades but make less mistakes, and so have a high clade calls. Somewhat surprisingly there is no connection between zero length branches and model fit. We suspected that short branches were the main cause for low model fit, since they create non-coalescent like trees, however we see that RBS methods manage to have high model fit and around 24% polytomies. The other three statistics are the mean model fit percentile, clade age errors per clade as a percent of tree height, and the percent of missed clades from the total number of non-trivial clades. Those numbers can help in deciding how a difference in ranking translates to performance: for example, TP(avg) is seven ranks higher than TP(med) in clade time errors, but this amounts only to a difference of 0.13%, about 1/7 of the total range. On the other hand there are a seven ranks between TP(avg) and CAT in model fit, but here the difference is very large - from 45% to 93%.

Figure 1 illustrates visually how conflict in the posterior affects the summery trees generated by four methods. The posterior trees are from a preliminary analysis of the rps16 intron of Quercus, part of a niche evolution study (Xu et al., in prep). We use this example because the weak phylogenetic signal makes the differences stand out. HS sets to zero all branches with low support, effectively creating polytomies where competing topologies exist. CAT takes the chosen topology as the truth and treats the conflicting information as “noise” to eliminate. TP is somewhere in between, and RBS creates very short branches, because a long branch for a clade with low support is penalized when the tree is matched with the many posterior trees missing the clade. Clearly RBS goes somewhat astray here, but one should keep in mind that some branches produced by other methods are unreliable too. Large discrepancy between summary trees indicates a large amount of uncertainty in the posterior, and in those cases no single tree is a good representative of the full posterior.

thumbnailFigure 1. Four Summary Trees. Four summary trees generated from the same data set drawn over a DensiTree. A DensiTree draws all trees in a set of trees using transparancy so that in places where the trees in the tree set agrees there is dark colouring, while in places where there is a lot of variation there is light colouring. The bars in the tree from RBS shows the 95% credible intervals for all clades.


Properly analysing the test cases proved to be as challenging as the research itself. The number of possible methods for constructing summary trees, coupled with the number of ways of assessing the results can be overwhelming. In addition, the domain of trees is vast and the evaluation and construction methods are not independent. For example, the distance between the summary and true tree seems the most natural error measure. However, we have four ways of measuring distance and four related methods, each searching for the minimal distance tree using that distance. Not unexpectedly, each distance score finds the tree produced by its counterpart to be closer to the true tree than the trees generated using other scores. Interconnections such as these show the importance of using multiple error criteria when comparing methods. The space of tree sets is complex, and each measure sheds light on different aspects of that space. Both “Clades Missing” and “Clades Called” measure topological distance via success in detecting clades, both seem reasonable and valid, yet one is the reverse of the other. Having only one of them would give a biased view. Simultaneously examining many methods – while complicating the comparison process – can reveal general performance trends.

By examining results from a large simulation study we found there is no clear “winner”. Having low clade age errors and good branch lengths in a tree seems fundamentally exclusive. Methods setting clade ages from posterior ages tend to have lower age errors while methods matching the branch lengths produce trees with a better fit to the model.

Therefore, it makes sense to consider the purpose of the summary tree when choosing a method. If divergence times matters most, use either HS or CAT. If only topology matters, use the consensus (CONS) or TP. In both cases the decision between the two alternatives depend on whether you are conservative and prefer unresolved clades (polytomies) in areas of conflict, or whether you wish a “the best guess” at a fully resolved tree. Use RBS to get a tree with good model fit and therefore closer to a Maximum Likelihood tree.

TP(med) provides a good compromise: good model fit and low missed clades, with middle of the pack ages/divergence errors (but still better than RBS). All of these are better than the MCC as implemented in TreeAnnotator, which is middle of the pack in all measures except for doing worse on clades called and well on root height.

While the simulations show a few surprising results, we were most surprised by the performance of the “theory based” methods. We set out to replace heuristics with methods based upon firmer theoretical consideration, and strongly expected that RAS, a tree metric which takes into account both ages and branch lengths, will outperform the alternatives. Likewise, we expected the CCD to fare better than other methods for selecting a topology. However, heuristics seem to do better when measured against the main objective - recovering the true tree.

We think the different types of summaries are all valuable when the posterior trees are in conflict. Together with the full posterior as drawn by DensiTree, they provide different insights into the information contained in the posterior. We suggest that researchers generating a summary tree for annotation or publication use one of the newer methods since all of them outperform the existing consensus methods and BEAST’s own TreeAnnotator.

While we focused on obtaining a single point estimate from posterior MCMC samples, we would like to emphasize that researchers should treat single point estimates as end points, and use the full posterior whenever possible, especially for secondary analyses. In addition, one should look at several methods for extracting a point estimate when dealing with the complex space of phylogenetic trees.


A Taxa partitions formal definition

Formally, For taxa ordering L = ( x 1 , x 2 , ) and clade c, let I = ( i 1 , i 2 , ) be the set of ordered indices of c in L, that is c = ( x i 1 , x i 2 , ) and i1<i2<…<i|c|. The span of the clade is the range of consecutive integers covering I,

sp ( c , L ) = ( i 1 , i 1 + 1 , i 1 + 2 , , i | c | ) .

Now, c is compatible with L at position k if

compat ( c , L , k ) | c | = | sp ( c , L ) | and k sp ( c , L ) and sp ( S 1 ( c ) , L ) sp ( S 2 ( c ) , L ) =

where S1,2(c) are the left and right sons of c.

The contribution for the k’th split comes from all trees containing a compatible clade at this point,

V k = h ( C ) : ( L , , h ) T and C and compat ( c , L , k )

The ages are computed by taking the median (or mean) of Vk. The tree is reconstructed by picking the maximal age as the root, and recursively building the sub-trees to the left and right of the split.

B Implementation: Minimum distance tree

For a tree set (draws from the posterior) and a target topology our objective is to find a tree with topology which minimizes the total distance to under a tree distance metric dT (Equation 13). We use a generic multivariate optimizer to do part of the heavy lifting, but transforming the problem into a suitable form is far from trivial. While the details vary slightly for each distance metric, we found the following four steps essential:

1. Represent the tree as a point in m .

2. Pre-process posterior trees to speed up the evaluation of the total distance to all posterior trees.

3. Analytically compute the derivative.

4. Find a good initial starting point.

Tree parametrization

The tree T = ( L , , h ) is represented as a vector of real numbers z = ( h r , α 2 , , α m ) m , where m = | | | L | is the number of internal clades in the tree. hr is the height of T, and α 2 , α 3 , , α m are m−1 values, one per internal clade, equal to the ratio of the clade age to the age of its parent. To retrieve a clade age from z, multiply the root height by the α for all the clade ancestors. That is, h r k α k , where k ranges over the clade ancestors. By traversing the tree in pre-order (clade before its descendants) all ages can be extracted from z using just m−1 multiplications, and an additional 2(m−1) subtractions would extract all branch lengths. Each component in z has a simple bound independent of other components; 0≤hr< and 0≤αi≤1. This makes the tree a suitable optimization target for a method such as L-BGFS-B, a quasi newton algorithm for minimizing a multivariate function with simple bounds [18].

Pre-processing of posterior trees

The search for the minimum distance tree involves many evaluations of the target function, the mean distance d ( T , T ) . This evaluation is sped up by transforming the expression, which is a sum on trees, into a sum over clades. The details vary somewhat, depending on the distance metric dT. Here we elaborate for the rooted branch score case (Equation 8), and the interested reader should consult the code for details of the other metrics.

For the tree T = ( L , , h ) the total distance is expanded as follows,

d ( T , T ) = t i T RBS ( T , t i ) = t i T x i | b i ( x ) b ( x ) | = t i T x i | b i ( x ) b ( x ) | + x i b ( x ) + x i b i ( x ) = t i T x i | b i ( x ) b ( x ) | + t i T x i b ( x ) + t i T x i b i ( x ) = t i T x i | b i ( x ) b ( x ) | + x b ( x ) t i T x i + t i T x i b i ( x ) .

The terms in parentheses do not depend on T and can be precomputed, so the last two terms take O ( | | ) operations to evaluate. The first term appears to require O ( | | | T | ) but we can cut this down to O ( | | log ( | T | ) ) .

t i T x i | b i ( x ) b ( x ) | = x t i T x i | b i ( x ) b ( x ) | = x t i T x i b ( x ) > b i ( x ) b ( x ) b i ( x ) + t i T x i b ( x ) b i ( x ) b ( x ) + b i ( x ) = x t i T x i b ( x ) > b i ( x ) 1 t i T x i b ( x ) b i ( x ) 1 b ( x ) t i T x i b ( x ) > b i ( x ) b i ( x ) + t i T x i b ( x ) b i ( x ) b i ( x )

The reason for this complicated looking expression is that the last two terms in parentheses can be pre-computed, and the first is simply the number of branches in the posterior greater than b(x) minus the number of branches smaller than it. After we pre-sort the branches from the posterior for each clade, this number can be found by a binary search, taking at most O ( log ( | T | ) ) since there can be at most ( | T | ) matched branches, one for each tree in the posterior.

Analytical derivative

The search is significantly faster when a derivative can be computed analytically, since estimating a derivative requires at least m evaluations (the number of dimensions). While the details are tedious the calculations themselves are simple, since the target function is composed in a series of multiplications and additions/subtractions, so the derivative is easy to compute using the chain rule at each stage. Again the interested reader should consult the code for the exact details in each case.

Search initialization

We found that a good starting point can be vital, as under some settings the number of multiple local minima can be large. While the procedure to obtain the initial tree seems natural and obvious in hindsight, several other obvious looking approaches did not perform well at all.

The initial tree is obtained by first examining each branch independently. Each branch has its own optimal length, based on the matching branches in the posterior and the distance metric. This optimal value is computed for each branch, but since branch lengths are not independent, the next step builds a tree from those optimal values. The build assigns an age to each clade, proceeding in post-order, that is assigning an age to all descendants of a clade before assigning the clade age. The age of the clade is obtained by averaging the expected age from the direct descendants, which is the sum of their own assigned age and their optimal branch length.


AVG: Summary tree method which sets internal nodes heights using average of posterior heights; ca(): Common ancestor; CAE: Clades ages error. Equation (1); CAT: Common ancestor tree. Equation (14); CC: Clade credibility; CCD: Conditional clade probability distribution; CCE: Clades called error. Equation (6); CLL: Coalescent likelihood; CME: Clades missed error. Equation (5); DVE: Divergence times error. Equation (2); HS: Heights score. Equation (10); HSO: Heights only summary tree method; HPF: Highest posterior frequency; MCC: Maximal clade credibility; MED: Summary tree method which sets internal nodes heights using medians of posterior heights; RAS: Rooted agreement score. Equation (11); RBS: Rooted branch score. Equation (8); RH: Root height; SRBS: Squared rooted branch score. Equation (9); TCB: Total clade branch; TP: Taxa partition tree.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JH developed the methods and wrote the biopy implementation. RB Implemented the DensiTree integration. Both authors contributed to the manuscript. Both authors read and approved the final manuscript.


Many thanks to Alexei Drummond, David Bryant and Mike Steel for their useful suggestions and comments. JH and RB were funded by AJD’s Rutherford discovery fellowship from the Royal Society of NZ. JH partially funded by AWC.


  1. Cranston K, Rannala B: Summarizing a posterior distribution of trees using agreement subtrees.

    Syst Biol 2007, 56(4):578-590. PubMed Abstract | Publisher Full Text OpenURL

  2. Rambaut A: FigTree.


    [ webcite]

  3. Bouckaert RR: DensiTree: making sense of sets of phylogenetic trees.

    Bioinformatics 2010, 26(10):1372-1373. PubMed Abstract | Publisher Full Text OpenURL

  4. Bryant D: A classification of consensus methods for phylogenetics.

    DIMACS Series Discrete Math Theor Comput Sci 2003, 61:163-184. OpenURL

  5. Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees.

    BMC Evol Biol 2007, 7:214. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  6. Larget B: The estimation of tree posterior probabilities using conditional clade probability distributions.

    Syst Biol 2013, 62(4):501-511. PubMed Abstract | Publisher Full Text OpenURL

  7. Sukumaran J, Holder M: DendroPy: a Python library for phylogenetic computing.

    Bioinformatics 2010, 26(12):1569-1571. PubMed Abstract | Publisher Full Text OpenURL

  8. Huggins P, Li W, Haws D, Friedrich T, Liu J, Yoshida R: Bayes estimators for phylogenetic reconstruction.

    Syst Biol 2011, 60(4):528-540. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Heled J: Small collection of phylogenetics functions.


    [ webcite]

  10. Robinson D, Foulds L: Comparison of phylogenetic trees.

    Math Biosci 1981, 53:131-147. Publisher Full Text OpenURL

  11. Billera L, Holmes S, Vogtmann K: Geometry of the space of phylogenetic trees.

    Adv Appl Math 2001, 27(4):733-767. Publisher Full Text OpenURL

  12. Owen M, Provan J: A fast algorithm for computing geodesic distances in tree space.

    IEEE/ACM Trans Comput Biol Bioinformatics (TCBB) 2011, 8:2-13. OpenURL

  13. Mau B, Newton M, Larget B: Bayesian phylogenetic inference via Markov Chain Monte Carlo methods.

    Biometrics 1999, 55:1-12. PubMed Abstract | Publisher Full Text OpenURL

  14. Fernau H, Kaufmann M, Poths M: Comparing trees via crossing minimization. In FSTTCS 2005: Foundations Software Technology Theoretical Computer Science. Heidelberg: Springer; 2005:457-469. OpenURL

  15. Kingman J: The coalescent.

    Stochastic Process Appl 1982, 13(3):235-248. Publisher Full Text OpenURL

  16. Jukes T, Cantor C: Evolution of protein molecules.

    Mamm Protein Metab 1969, 3:21-132. OpenURL

  17. Remco B, Heled J, Kuehnert D, Vaughan T, Wu CH, Xie D, Suchard M, Rambaut A, Drummond A: BEAST 2: A software platform for Bayesian evolutionary analysis.


  18. Zhu C, Byrd R, Lu P, Nocedal J: Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization.

    ACM Trans Math Softw (TOMS) 1997, 23(4):550-560. Publisher Full Text OpenURL