We analyze phylogenetic tree building methods from molecular sequences (PTMS). These are methods which base their construction solely on sequences, coding DNA or amino acids.
Our first result is a statistically significant evaluation of 176 PTMSs done by comparing trees derived from 193138 orthologous groups of proteins using a new measure of quality between trees. This new measure, called the Intra measure, is very consistent between different groups of species and strong in the sense that it separates the methods with high confidence.
The second result is the comparison of the trees against trees derived from accepted taxonomies, the Taxon measure. We consider the NCBI taxonomic classification and their derived topologies as the most accepted biological consensus on phylogenies, which are also available in electronic form. The correlation between the two measures is remarkably high, which supports both measures simultaneously.
The big surprise of the evaluation is that the maximum likelihood methods do not score well, minimal evolution distance methods over MSA-induced alignments score consistently better. This comparison also allows us to rank different components of the tree building methods, like MSAs, substitution matrices, ML tree builders, distance methods, etc. It is also clear that there is a difference between Metazoa and the rest, which points out to evolution leaving different molecular traces. We also think that these measures of quality of trees will motivate the design of new PTMSs as it is now easier to evaluate them with certainty.
Keywords:Phylogenetic trees; Tree building methods; Maximum likelihood; Distance measures; Multiple sequence alignments; Substitution matrices; Molecular sequences
Phylogenetic tree reconstruction from molecular sequences (PTMS) was first suggested by Emile Zuckerkandl and Linus Pauling  and is now one of the major tools in the arsenal of bioinformatics. By PTMS we will understand methods which build a phylogenetic tree based solely on sequences, either coding DNA or amino acids. Of the many people who have contributed to this field, J. Felsenstein deserves special mention for his many contributions summarized in his book .
Computing phylogenies is ubiquitous, and not only of academic interest, but also quite practical: selecting model organisms , tracing disease , finding vectors , finding suitable defenses to new viruses , maximizing diversity for species conservation,  tracing ancestry and population movements [8,9] and many other problems are solved with the aid of good phylogenetic trees.
The state of testing of PTMS is far from satisfactory. This is obvious when we see the discrepancies between the results from bioinformatics and the accepted taxonomies produced by biologists, and the high confidence measures that bioinformatics has tried to attach to their results [10-12]. In short, in our experience, the distrust that biologists may have on PTMS is justifiable.
Most results in the literature supporting PTMSs use:(i) extensive simulations, (ii)measures of quality, (iii) small scale comparisons of some specific trees, (iv) some intuition. These techniques are useful, but limited. Specifically, simulations are excellent to discover errors and to find the variability that we may expect from the methods. Yet simulations usually rely on a model of evolution (e.g. Markovian evolution). It is then expected that a method which uses the same model will perform best. Measures of quality include bootstrapping, branch support confidence and indices on trees (like least squares error in distance trees or likelihood in maximum likelihood (ML) trees). These measures also rely on some statistical model which is essentially an approximation of reality. Bootstrapping values have suffered from over-confidence and/or misinterpreted and are sensitive to model violations [13-16]. Furthermore these techniques are directed towards assessing a particular tree rather than assessing the methods. Small scale comparisons are valuable but usually lack the sample size to make the results statistically strong. We consider any evidence which is in numbers less than 100 to be “anecdotal”. Any study where a subset of cases is selected is a candidate to suffer from the bias arising from an author trying to show the best examples for his/her method. Finally, intuitions are very valuable, but cannot stand scientific scrutiny. We refer as intuitions, decisions which are not based on strict optimality criteria. E.g. character weights in traditional parsimony methods; using global or local alignments; various methods for MSA computation; various measures of distances, etc.
The main problem is that there is no “gold-standard” against which methods can be evaluated. Hopefully this paper will provide two such standards.
Computing phylogenetic trees consumes millions of hours in computers around the world. Because some of these computations are so expensive and not reliable, biologists are tempted to use faster, lower quality, methods. This evaluation (which itself consumed hundreds of thousands of hours) will help bioinformaticians extract the most of their computations. In particular, as we show, some of the best PTMS are remarkably fast to compute.
We measure the quality of the PTMS in two ways, by their average difference on trees which have followed the same evolution and by their average distance to taxonomic trees. This allows us to find the best methods, and by averaging in different ways, the best components of the methods.
There is no single method that is best in all circumstances. Some of the classes of species show a preference for a particular method. This should not come as a surprise, different organisms may leave different molecular imprints of their evolution.
We now introduce the two measures on PTMSs.
The Intra measure
For a given PTMS and several orthologous groups (OGs) we can construct a tree for every OG. The trees should all follow the same evolutionary history, hence the trees should all be compatible (Figure 1, shaded yellow). The average distance between trees built from different OGs is thus a measure of quality of the method (the smaller the distance, the better the method). We call this measure the Intra measure. Since the PTMS does not get any information about the species of the input sequences, the only way for it to produce a smaller distance between trees is by extracting information from the sequences. In this sense, the best algorithm is the algorithm which extracts the most relevant information from the sequences to derive the phylogeny; which is exactly what we want. In mathematical terms the Intra measure of a PTMS M is the expected value:
where gi and gj are two different orthologous groups. The distance d(.,.) is the Robinson-Foulds distance  between two trees built with the same PTMS over different OGs. It is computed only over the species appearing in both OGs (Figure 2). We estimate this expected value from all the available pairs of OGs. The measure will be incorrect for the cases of lateral gene transfers (LGT), where sequences do not follow the same evolution. LGT events will be few and since all methods will be affected we do not expect a bias from them.
Figure 1. The process of computation of the Intra measure (shaded yellow, vertical) and the Taxon measure (shaded blue, horizontal).
Figure 2. An example of the evolution of several species recovered by two proteins and the basis for the Intra measure.
The Taxon measure
This measures how far the computed tree is from the true taxonomic tree. A smaller distance, averaged over a large number of OGs, means a better method. For a given PTMS and several orthologous groups (OGs) we compute the distance between the tree built on each OG and the true taxonomic tree (or its approximation from NCBI, Figure 1, shaded blue). We call this average distance the Taxon measure. The trees derived from the taxonomy represent the consensus and summary of many scientific papers, databases and experts and could be described as the “state of the art”. Errors in the taxonomy should affect all methods equally and will be like random noise.(Biases derived from the use of these methods for building the Taxonomy are discussed in the Caveats (iv) section.) In mathematical terms the Taxon measure of a PTMS M is the expected value:
where d(.,.) is the RF distance between two trees, g is an orthologous group, M(g) is the tree produced by M applied to the sequences in g and Tg is the taxonomic tree for the species in the group g. We estimate the expected value by the average over all the orthologous groups available to us. Notice, that while the taxonomic tree is a single tree, we will be sampling tens of thousands of different subsets of this single tree (and many hundreds of totally independent subsets). See Methods, Table 1, for full results. In [18,19] a similar idea is used, that of comparing the trees against a small, indisputable, topology.
Table 1. Taxon and Intra measure, output 1
To achieve statistical significance we consider complete genomes and apply the methods to all the OGs possible (with at least 4 species) according to the OMA database [20,21]. This gives us very large sample sizes and an unbiased sample, as almost nothing is excluded (see methods for details).
To describe the PTMSs unambiguously we need to use a descriptive name for each one. The convention that we use describes the steps which are used to build the tree. For example, stands for the name of the procedure which starts by making a multiple sequence alignment (MSA) using ClustalW, then derives the distances from the pairwise alignments induced by the MSA and finally builds a tree from these distances using the BioNJ algorithm. A method is then a sequence of components which start from the molecular sequences and end with a phylogenetic tree. The components of the tree building methods used here are listed in Table 2.
Table 2. Classification of component methods of the PTMS (see methods for full details)
Most of the possible compatible combinations were tried. Notice that the total number of methods can grow very quickly, for this study 176 PTMSs were tested.
Our main results are: the introduction of the Intra and Taxon measures to evaluate PTMSs; the excellent correlation between them; the top rated PTMSs for Metazoa and non-Metazoa; the results on best components, i.e. best MSA methods, best tree building methods and best pairwise alignment methods.
Figure 3 shows a plot of the PTMSs on their Intra vs Taxon measures. It can be seen that the two measures are extremely well correlated. Table 3 shows the same correlation in numerical form and for each species class. Here a “class” means a convenient group of related species, explained in more detail in the methods section.
Figure 3. Intra(M) vs Taxon(M) measures for all methods.
Table 3. Intra/Taxon correlation coefficients over all PTMS
It should be noted that, for a given class or set of classes, the numerical values of the Intra measure for all the PTMSs are comparable (lower values mean better methods). So are the values of the Taxon measure. But, for a given class, the Intra and Taxon measures are not numerically comparable, as they are taken over different sets, in one case over all the pairs of OGs which intersect on 4 or more leaves in the second case over all the OGs. This is why we compare the orderings (usually by computing Pearson’s correlation coefficient) of the PTMSs by each measure, but not the corresponding numerical values.
Tables 4 and 5 show the best (Taxon) scoring PTMS. The first table shows the top 3 methods for Metazoa and for non-Metazoa. The results group well in two sets. Metazoa favors codon-based methods whereas the rest favor induced distance methods. In terms of sample sizes this division is quite even, the number of OGs are in a 1:2 relation but since Metazoa has larger groups and longer sequences, the total amino acids involved are close to 1:1 (Table 6).
The symbol “≫” stands for a method which is better than another with statistical significance better than 1 in a million (p-value < 1e-6). The symbol “>” stands for a p-value < 0.05 and the symbol “≥” means its p-value > 0.05.
To justify the grouping of the classes we have computed the correlations between the classes. Table 7 shows Pearson’s correlation coefficients of the Intra measures for all the classes against each other.
Table 4. The top PTMS according to the Taxon measure
Table 5. The top PTMS according to the Taxon measure
Table 6. Sizes of the classes
Table 7. Correlation of the Intra measure of all PTMS between classes
The average correlation for non-Metazoa is 0.9867 in a tight range, from 0.9696 to 0.9964. Notice also that OtherEukaryota share the same preferences for the methods as Archaea and Bacteria away from Metazoa. All the correlations with Metazoa are much lower. The natural grouping of the classes is to have one group with Metazoa and another group with the rest. The very high correlations of the different non-Metazoa classes are the main argument supporting the quality of the Taxon measure. The measure is strong enough to replicate the rankings on several groups. This is a form of bootstrapping, as the results are replicated from independent different samples.
Averaging over the component methods
Tables 89101112 and 13 show results over component methods for the Taxon measure. We are working under the assumption that better trees derived from variants of the components (e.g. MSAs) mean better components (e.g. better MSAs). While this may be controversial, it is very difficult to argue the opposite, see . These results are aggregations of various classes and various methods. In all cases care is taken to include the same companion methods for each comparison. The numerical value Δ shows the difference of the Taxon measures (and its 95% confidence margin) between the methods. It measures the average difference of RF distances or wrong splits, e.g. Δ=1 means that on the average one method makes one additional mistake per tree. n indicates the number of OGs which have been used to measure this difference (in some cases the OGs end up used more than once, for example for different MSAs when comparing ML methods). See Methods, Table 14.
Table 8. Ranking and differences among tree builders based on MSAs
Table 9. Ranking and differences among pairwise alignments
Table 10. Ranking and differences among pairwise alignments derived from MSAs
Table 11. Ranking and differences among tree building methods based on distances
Table 12. Ranking and differences among empirical substitution matrices
Table 13. Ranking and differences among MSAs
Table 14. Output of the comparison of Mafft against Probcons over Metazoa
PhyML is the best tree builder using MSAs (Table 8). The results are consistent accross classes except for a significant worsening of Parsimony and Gap for Metazoa.
Global alignments  dominate the pairwise alignments methods (Table 9). The most significant difference between Metazoa and non-Metazoa is that CodonPAM is propelled to the front by a significant margin in Metazoa. It should be noted that the CodonPAM mutation matrix is an empirical mutation matrix based on data from vertebrates . The genomes included in Metazoa have diverged more recently than for other classes, like Archaea, which also explains the better performance of the codon-based methods. From an information-theoretic point of view, codons are over an alphabet of size 61 as opposed to 20 for amino acids, so they must carry more information. Regardless of the reason, the advantage of codon-based methods is an order of magnitude larger than the differences between the other methods. Hence codon-based methods appear unavoidable for Metazoa. Table 10 confirms the same difference at the level of MSA-induced alignments.
The distance methods, Table 11, see LST changing from last position in non-Metazoa to first position in Metazoa. In this case the absolute differences are relatively small.
Table 12 shows the comparisons of empirical substitution matrices. The differences between the best and the worst matrix are statistically significant but very minor.
Table 13 shows the results for MSAs. PartialOrder, which is an algorithm designed to deal with alternative splicings, works better for Metazoa. ClustalW, from a middle ranking in non-Metazoa, drops to a clear last for Metazoa. The rest of the rankings remain quite consistent for all species.
The most important message coming out of these results is that the best methods are minimal evolution (distance) methods over pairwise alignments induced by MSAs. A method like Mafft_InducDist_BioME is 2-3 orders of magnitude faster than ML methods and outperforms them all by a good margin.
It may appear surprising that the best method for non-Metazoa starts by using Mafft which is not the best MSA (Table 13). In general, the best PTMS may not include the best components, and vice-versa, the best individual components may not give the best PTMS. Components may combine/exploit their abilities/weaknesses. For example, an MSA method which does a very good job with amino acids but a mediocre job with gaps, may compose very well with ML methods but poorly with Gap trees. We have to remember that the analysis of components, Tables 8, 9, 10, 11, 12 and 13 is done over an average of many situations.
The statistical significance of the difference between methods is one aspect, the magnitude of the difference is also important. The testing was done over such large samples that often minor differences are still statistically significant. We consider that a difference of less than Δ=0.01 (that is in 100 trees, on the average, we get one less error) is without practical significance. A Δ=0.05 difference, on the other hand, means that one method will produce one better branch every 20 trees, which can be considered significant.
Mafft_InducDist_BioME is the top method for non-Metazoa under the Taxon measure and is ahead of the top ML method, Prank_PhyML, by Δ=0.048 correct branches per tree. The number of incorrect branches per tree of each method is 1.425 and 1.473 respectively. (For Metazoa the best methods are PartialOrder_CodonDist_BioNJ and Prank_RAxMLG with errors 2.281 and 2.555 respectively.) This shows that there is a long way to perfection. Some of this distance (the 1.425 or 2.281 in these cases) is due to the inherent randomness of the molecular mutations left by evolution, some of it may be due to imperfect PTMSs.
Caveats, what can go wrong?
Here we describe some problems that may affect the power/correctness of the PTMS evaluation (for dependencies on absolute/relative distances, number of leaves and sequence length see Methods).
· The OGs should follow the same evolutionary history. This is normally the case, except when we have lateral gene transfers (LGT) or OGs which do not follow Fitch’s definition of orthology  precisely. For the purpose of testing the methods, it is much better to skip dubious OGs. The OMA orthologous database fits best our needs [25,26], as it sacrifices recall in favor of precision.
· The Intra test measures the ability of recovering a phylogenetic signal from sequences. Other reasons for mutation of the sequences may leave their trace in the conclusions. For example, it is known that the environmental temperature affects the GC content of the sequences due to DNA stability . Consequently, we will expect a bias at the codon level that will tend to group together organisms that live in a high-temperature environment.
· The methods should produce trees with complete structure, i.e. no multifurcations, all nodes must be binary. A method which produces a tree with multifurcations will have an advantage as it will normally make fewer mistakes. In the extreme, a star tree is always correct.
· Since PTMSs have been in use for many years, the preferred methods of the community may show an undeserved good performance under the Taxon measure (but not under the Intra measure!). This is not unreasonable, since for bacteria many classical phylogenetic methods do not apply (e.g. bacterial paleontology has very few results), and taxonomies may have been constructed with some of these methods. A method which has been the favourite of the taxonomists will be displaced to the left of the main line in Figure 3 (it reduces the Taxon measure and leaves the Intra unchanged). We can see that parsimony, RaxML and PhyML show a small shift to the left and hence it is possible that these methods have biased the building of the Taxonomies. This shift is noticeable but quite small, so we can conclude that this is not a major bias.
· Finally, the Intra measure, by being a consistency measure, may be insensitive to systematic mistakes of the tree building. This would be something that affects the Intra measure but not the Taxon measure. Stephane Guindon suggested that long branch attraction (LBA) could mislead the Intra measure for methods that suffer from it, by systematically computing one of the incorrect trees. To study this properly we generated a new class called LBAExamples which is composed of a quartet ((A,C),(B,D)), where the branches leading to the leaves C and D are much longer than the other branches. This quartet is sometimes called the “Felsenstein example” and is used to demonstrate how some methods, like parsimony, systematically will reconstruct the wrong tree (and hence the name LBA). See Figure 4.We built 500 such quartets for random values of α(the length of the shortest branches) and for 5 different sets of leaves C and D with the values f=5,7,5,10,12.5 and 15 (the ratio of the long branches to the short branches). That is 100 examples of each quartet. The results are available in their entirety in the same website repository as all the other results. The most significant results of the study of these quartets are:
Figure 4. An extreme tree with 4 leaves illustrating the problems of Long Branch Attraction (LBA).
There is a clear division of methods under the Taxon measure (in this case the taxonomic tree is the correct topology) and the Intra measure. All the methods using Parsimony, Gap, SynPAM and PrankGuide suffer from LBA (and score a high value, Taxon ≥ 0.16 and Intra ≥ 0.252). All the other methods (which do not suffer from LBA) score much lower, Taxon ≤ 0.036 and Intra ≤ 0.0697. Remember that there are only 3 possible quartets and that an error in a quartet gives a distance of 1, hence for quartets the values of the Taxon measure coincide with the number of incorrect trees. The gap that separates the non-LBA from LBA methods is large in absolute and in relative values. So we can conclude that LBA is successfully detected by both measures.
Methods based on k-mer statistics (not reported here, but also evaluated in our computations) fare much worse than all the other methods in general. These are methods which count the number of, for example, tri-mers, and use as distances a statistical test (like chi-squared) on the tri-mer frequencies. For example, a method based on DNA tetra-mers scores 0.952 for the Taxon measure (it gets 95.2% of the quartets wrong!) but scores 0.0917 in the Intra measure. This is quite extreme, (the method is very poor in every context) but supports the observation that the Intra measure is a consistency measure and if the method systematically fails and there is only one way of failing, then the consistency is good. In terms of the plot of Figure 3, these cases will be points displaced to the right of the main line. This extreme case reinforces our recommendation of using both measures to conclude the performance of methods.
This side study showed additional surprises. The guide tree produced by Prank, usually quite good, but unfortunately it suffers severely from LBA (Taxon = 0.532). We were also unaware that the SynPAM methods, which are maximum likelihood methods, also suffer from LBA.
The above caveats indicate that the problems are relatively few and seldom apply to both measures. Consequently a method which does well under both measures is a very strong candidate.
We show, through a comparison of methods against trees involving tens of millions of data points, which are the most effective PTMSs. This uncovers a big surprise as one of the favorite methods among the community, the ML methods, score poorly. Methods based on MSA induced pairwise alignments and minimal evolution not only produce better trees, but are 100 to 1000 times faster to compute. This should revolutionize this niche of bioinformatics.
We also show that a new measure of quality, the Intra measure, is highly correlated with the Taxon measure (closeness to taxonomic trees) and it does not suffer from the biases of the practice. These new measures are likely to be extremely helpful in the development of new and better algorithms.
We cannot show all the computations and results in this publication because of their size (about 57Gb). We have developed a web site which allows the exploration of all the data and all the results to their most minute detail. We intend to maintain this website for as long as possible, and to upgrade it periodically both with new genomes and with new methods. It contains very useful information in our view. This can be found in:
The study was done over complete genomes for three reasons: species coverage is quite ample, 755 complete genomes were used, we can obtain a large number of very reliable OGs and since all OGs from entire genomes are used, no selection bias is possible. A complete description of the classes can be found in the OMA1000 database which is accessible at:
To do the analysis, the genomes were grouped in the classes shown in Table 6 (in this work we will call any of these groups a “class”). The column “OGs kept” shows the number of groups which had 4 or more acceptable sequences.
The study includes all the publicly available genomes as of Nov/2010, the release of OMA1000 . For proteobacteria and firmicutes, which are relatively overrepresented and many species sequenced multiple times (e.g., there are 26 genomes of different strains of E.coli), only 265 genomes of 452 were chosen for proteobacteria (89 of 177 for firmicutes) as follows: For each pair of genomes an average evolutionary distance was computed. Iteratively, one of the members of the closest pair of genomes was discarded. The discarded one was the one with lower “quality index” (a simple ad-hoc measure of quality of complete genomes). In this way we retained the “best”, most diverse, 265 proteobacteria and the most diverse 89 firmicutes. All the major versions of model organisms ended up in these classes. As a control, we also computed the same trees over all the genomes of firmicutes (177). The correlation coefficient of the Taxon measure between the full class of all firmicutes and the class with 89 genomes is 0.994058. Knowing this value we are confident that the results are not affected by removing very similar (or repeated) genomes.
Comparing within classes is better than grouping the classes together for the following reasons:
· The classes are more uniform and may reveal biases (as they do) specific to the classes.
· The missing relations - between classes - are usually so obvious that almost no method will get them wrong. It is the fine grain differences that matter.
· The computation time would be out of reach for some methods.
· The problems of some well documented LGTs, like proteins of mitochondrial origin, are avoided.
The selection of the OMA database of OGs was done because OMA is particularly careful about removing paralogous sequences at the expense of sometimes splitting groups (precision at the expense of recall). A split OG is a minor loss of data, of little consequence given our sample size, whereas the inclusion of paralogous sequences breaks the basic assumption for the correctness of the Taxon and Intra measures. The main assumption that links the Intra measure with the quality of the methods is that any pair of groups represents the same evolution path. If one group contains an orthologous pair and the corresponding pair in another group is paralogous, these will correspond to different evolutionary histories and the comparison is wrong.
Only the OGs with 4 or more sequences are used (2 or 3 sequences will never show different unrooted topologies). We also removed all but one copy of identical sequences. A method which is given a few identical sequences will most likely place them all in a single subtree. The shape of this subtree will be unrelated to the phylogeny of the sequences (there is no information available to make a decision). Since this adds noise to the results (not necessarily bias), we remove all identical sequences but one from the OGs. Additionally we remove sequences for which more than 5% of their amino acids or codons are unknown (“X”) as this is a sign of poor quality of the sequence. Both policies together remove about 3.5% of the sequences.
Bayesian methods for tree building have not been included in this study because they do not follow the PTMS definition. In principle a bayesian tree building method produces a probability distribution over all trees given the corresponding priors. If the priors are ignored and only the tree with highest probability is selected, then this is ML, not bayesian. Approaches which build consensus trees from several of the most probable trees produce multifurcating trees which contain less information and hence are not comparable to fully determined trees. Any prior which contains information about the tree which is not extracted from the sequences themselves will violate our assumptions for PTMS.
Tree building methods
We have computed 176 trees per OG, that is a total of 176 × 193138 trees or about 34 million trees. The tree building methods are a combination of several components, for example Mafft_PhyML represents the method composed of building an MSA with Mafft and then using the PhyML program. The component methods are only a subset of the existing methods. The ones chosen are the ones that we perceive as the most popular and effective in the community plus the ones which have been written locally. We welcome suggestions of promising new components to test.
Multiple sequence alignment methods from amino acid sequences:
· ClustalW - a widely used MSA program based on a guide tree computed from pairwise alignments, version 2.0.10 
· ClustalO - a recent improvement of ClustalW, 
· PartialOrder - A method based on partial order graphs, currently being developed at the CBRG in the ETH Zurich, designed to accommodate alternative splicings.
· Poa - a progressive multiple sequence alignment based on a graph representation where each new sequence is aligned by pairwise dynamic programming 
· Prank - a phylogeny-aware gap placement MSA, version 100802 . The guide tree used by Prank has its own merits and is used as one of the possible trees, under the name PrankGuide.
· Probcons - a probabilistic consistency-based MSA, version 1.12 
· Prograph - a method of progressive graph alignment, similar to Prank, currently under development at the CBRG in the ETH Zurich.
Methods which produce a tree from an MSA:
· Gap - produce a tree by parsimony, replacing all amino acids by a single symbol . In this way the only information left is gap or no-gap.
· Parsimony - equal character cost, counting gaps as a special character, as implemented in Darwin 
Pairwise alignment methods which compute a distance and variance matrix from amino acid or coding-DNA sequences. Every sequence is aligned to every other sequence. In all cases, after the alignments are done, the distances between pairs of sequences are estimated by ML:
· GlobalCodonPAM, LocalCodonPAM - global/local alignments using a codon substitution matrix (61×61) 
· LogDelGlobal, LogDelLocal - global/local alignments using GCB and a special deletion cost function based on the observed zipfian distribution of gap lengths .
· GlobalSynPAM, LocalSynPAM - global/local alignments using a codon substitution matrix (61×61) which ignores all mutations except the synonymous ones .
Pairwise alignment methods which compute a distance and variance matrix from the sequences in an MSA:
· CodonDist - estimate the ML CodonPAM distance from pairwise alignments induced by an MSA. The MSA is over amino acids, and the corresponding codons from the protein are used to replace the amino acids. 
· InducDist - estimate the ML distance from pairwise alignments induced by an MSA with the GCB rate matrices.
Distance methods which produce a tree from a distance/variance matrix:
· BioNJ - an improved version of the NJ algorithm, .
· FastME - build a tree using the minimum evolution principle, version 2.07, .
· BioME - a version of FastME with iterative improvements
· NJ - the neighbor joining method, .
We chose the NCBI [53,54] taxonomies to build the taxonomic trees, the basis of our Taxon measure. The NCBI database is detailed and extensive and it covers all the species that were included in OMA1000. The ITIS database, another well known taxonomic database, is not as complete, in particular for bacteria, where many of the entries we need are absent.
The computations were carried out in our own cluster of Linux machines, about 300 cores. These were done using Darwin  as a host system. Additionally we used the Brutus cluster, a central service of ETH. We estimate that we used about 646,000 hours of individual CPUs. Table 15 shows the top most time consuming tasks.
Table 15. Top uses of cpu time
Of the classes, most of them took time proportional to the number of OGs. The exception being Metazoa, which has bigger OGs and longer sequences. As a consequence the lion’s share of computation was taken by Metazoa.
Correlations as the main test
As mentioned above, the Intra and Taxon measures are not directly comparable (they are expected values over very different populations). Any of the measures is not comparable accross classes of species either. This is shown to be the case with Metazoa which behaves differently to the others classes. The distances are also radically different for different classes. On the other hand we are always measuring an average RF distance, hence there should be a linear relation between the different measures for different classes when comparing the different PTMSs. In other words, a suitable comparison could be through a linear transformation or a linear regression of one into the other. For the regression, the coefficients are not important, the quality of the fit is the important aspect. This is exactly what is captured by Pearson’s correlation coefficient, and hence this is the main tool we use to compare measures of different PTMS accross different populations.
Distances between trees
We use the Robinson-Foulds (RF)  distance to measure distances between trees. The RF distance basically counts how many internal branches of the unrooted trees do not have a corresponding branch which divides the leaves in the same two sets. For trees with n leaves, the RF distance may be as high as n−3.
When the taxonomic tree is not completely determined (that is, some nodes have more than two descendants), we have to correct the computation of the RF distance. This is relatively straightforward to fix. The maximum distance in these cases is less than n−3.
Absolute vs relative distances vs 0-1 distances
There are arguments to use the absolute RF distance and arguments to use relative RF distances (the absolute distance divided by n−3). Fortunately, the results are remarkably consistent for the absolute and relative RF distance. Table 16 shows Pearson’s and Spearman’s correlation coefficients between the absolute and relative measures, per class for all PTMSs. Clearly the rankings are not affected by this choice.
Table 16. Correlations between the absolute and relative distances (all leaves)
There are also arguments that the RF distance may not reflect evolutionary distance. That is to say, that sometimes a small evolutionary change produces a tree which has a large RF distance to the original and other times a large evolutionary change produces a tree which has a small RF distance. This has been recently discussed in . To take this concern in consideration we also computed the 0-1 distances, a Hamming-style distance, 0 if the trees are equal, 1 otherwise. The 0-1 distance may not be as sensitive as other distances, since it collapses all wrong trees into a single case, in particular it will give very little information about large trees when there is almost always some error. The correlation coefficients between the 0-1 distance and the RF distance for the Taxon measure are 0.9905 (non Metazoa) and 0.9004 (Metazoa). The correlation is excellent for non Metazoa, and the rankings for the Taxon or 0-1 distances have relatively insignificant differences. The correlation for Metazoa is good but lower and some methods, notably ML methods, move ahead. If we take as an example in Metazoa, we find that its 0-1 distance is 0.5417. Excluding the trees with 10 or less leaves, it is 0.9614 and excluding the trees with 20 or less leaves it is 0.9824. Even medium size trees are mostly wrong. Clearly the 0-1 measure loses too much information for large trees and reflects the quality of small trees alone. This has motivated us to study the impact of the distance functions used for the Taxon and Intra measures in depth, which will be reported in a future work. The 0-1 distances are shown as a separate column in the full Tables 1 and 14. To make safer conclusions about comparisons of methods, we should use the Taxon, Intra and 0-1 measures.
Large trees vs small trees
It may be argued that small trees are too simple and bigger trees are the important ones. To analyze this effect we divide the OGs into two groups, the ones with 15 or fewer leaves and the ones with more than 15 leaves. We then compute the correlation coefficient for the measures on all PTMS for these two groups. Table 17 shows the results for each of the classes. All the correlations are high, and those for the non Metazoa are remarkably high. From these correlations we can conclude that the number of leaves used for the quality analysis does not influence the results.
Table 17. Correlation of the Taxon measure of all PTMS for OGs with 15 or less leaves and the rest
Long sequences vs short sequences
In a similar way it may be argued that groups with long sequences behave differently than groups with short sequences. To analyze this effect we again divide the OGs into two groups, the ones with average sequence length less or equal to the median and those with average above the median. We then compute the correlation coefficient for the measures on all PTMS for these two groups. Table 18 shows the results for each of the classes. As above, the correlations are high and even higher for non Metazoa. From these correlations we can conclude that the average length of the sequences used for the quality analysis does not influence the results. In these last two comparisons, where we select the groups based on properties of the OG (like number of sequences and average length), we have to use the Taxon measure which is based on distances of a single group. The Intra measure is based on pairs of groups, and hence not suitable for these splits.
Table 18. Correlation of the Taxon measure of all PTMS for OG above and below the average sequence length median
Variance reduction techniques
To compare two building methods in the Taxon measure, we can use the average distances to the taxonomic tree over all the OGs. These averages will have a relatively large variance and the difference may not be statistically significant. To refine the comparison of two particular methods, we study the difference of distances of the two methods for each OG. The expected value of the difference coincides with the difference of the averages, but the confidence margins are much better because the variance of the difference is normally smaller. This is a well known variance reduction technique .
The author declares that he has not competing interests.
The author would like to thank and acknowledge the support of: the ETH through grant ETHIRA 0-20722-11 which supported part of this research; the ETH-Brutus cluster, which was used to do about 20% of the computation, Dr. Christophe Dessimoz and Dr. Manuel Gil who provided careful and encouraging comments to the manuscript and Stefan Zoller who wrote the web server to explore the data and results, created all the figures and helped with the typesetting. Prof. Stephane Guindon pointed out an important area of concern with the Intra measure which motivated the study of the LBA effects on PTMSs. The author is also grateful for the comments of all referees. This project was started as a framework to evaluate the PartialOrder MSA algorithm along one of the ideas of the PhD thesis of Manuel Gil.
Stat Sci 2003, 18(2):256-267. Publisher Full Text
Syst Biol 2001, 50(4):525-539. PubMed Abstract
Math Biosci 1981, 53(1-2):131-147. Publisher Full Text
Dessimoz C, Cannarozzi G, Gil M, Margadant D, Roth A, Schneider A, Gonnet G: OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements. In RECOMB 2005 Workshop on Comparative Genomics, Volume LNBI 3678 of Lecture Notes in Bioinformatics. Edited by McLysath A, Huson DH. Springer-Verlag; 2005::61-72.
BMC Bioinf 2008, 9:518. BioMed Central Full Text
Sievers F, Wilm A, Dineen D, Gibson T, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, et al.: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.
Gonnet GH, Benner SA: Probabilistic ancestral sequences and multiple alignments. In Algorithm Theory - SWAT ’96, 5th Scandinavian Workshop on Algorithm Theory, Reykjavík, Iceland, July 3-5, 1996, Proceedings, Volume 1097 of Lecture Notes in Computer Science. Edited by Karlsson RG, Lingas A. Springer, Reykjavik, Iceland; 1996::380-391.
Cannarozzi GM, Schneider A, Gonnet GH: Probabilistic ancestral sequences based on the Markovian Model of Evolution – algorithms and applications. In Ancestral Sequence Reconstruction. Edited by Liberles DA. Oxford University Press, ; 2007.
Schneider A, Gonnet GH, Cannarozzi GM: Synonymous codon substitution matrix. In ICCS 2006: 6th International Conference Proceedings, Part II, Volume LNCS 3992 of Lecture Notes in Computer Science. Edited by Alexandrov VN, van Albada GD, Sloot PMA, Dongarra J. Springer-Verlag, ; 2006::630-637.
Lin Y, Rajan V, Moret B: A metric for phylogenetic trees based on matching. In Bioinformatics Research and Applications, Volume 6674 of Lecture Notes in Computer Science. Edited by Chen J, Wang J, Zelikovsky A. Springer Berlin/Heidelberg; 2011::197-208. Publisher Full Text
ACM Comput Surv 1992, 24(2):195-212. Publisher Full Text