Abstract
Background
Most analysis programs for inferring molecular phylogenies are difficult to use, in particular for researchers with little programming experience.
Results
TREEFINDER is an easy-to-use, integrative, platform-independent analysis environment for molecular phylogenetics. In this paper the main features of TREEFINDER (version of April 2004) are described. TREEFINDER is written in ANSI C and Java and implements powerful statistical approaches for inferring gene trees and for related analyses. In addition, it provides a user-friendly graphical interface and a phylogenetic programming language.
Conclusions
TREEFINDER is a versatile framework for analyzing phylogenetic data across different platforms that is suited to both exploratory and advanced studies.
Background
Computational inference of molecular phylogenies has a wide spectrum of applications in the analysis of DNA sequences, ranging from systematic biology to population genetics and comparative genomics [1].
As a result, a large body of theoretical methodology has developed [2], along with numerous specialist software packages. However, the most advanced of these programs typically provide only a very Spartan user interface and hence are difficult to use without additional training, especially for novices in phylogenetics. One notable exception is the popular, commercially distributed PAUP* software [3], which implements powerful probabilistic methods for modeling and inferring gene trees and at the same time offers a friendly graphical user interface (GUI). Unfortunately, this GUI is currently available only on the Macintosh platform.
On the other hand, a more experienced user will quickly outgrow the limits of a graphical user interface. Consequently, scripting languages have been developed to facilitate complex sequence analyses. For example, in PAUP* all elements of its GUI can also be invoked on the command line. However, for the rapid deployment of specialized phylogenetic analysis tools one still needs the additional flexibility of a programming language rather than a scripting language.
Therefore, an integrative general-purpose phylogenetic analysis environment should ideally take several complementary objectives into account:
• platform independence and modular design,
• an easy-to-use GUI which masks the complexity of tree inference from the non-expert user,
• a phylogenetic computer language that allows both scripting of all GUI functions and generic programming, and
• availability of powerful tree inference and related analysis approaches, and means for simulation of data and trees.
The development of the TREEFINDER software is an attempt to address these issues and to provide a unified, powerful framework for phylogenetic analysis for both occasional and experienced users across different platforms.
Implementation
General design
The TREEFINDER software has a modular design. It consists of a graphical front-end (written in Java) and a computational kernel (written in ANSI C). The two communicate in the special-purpose language TL ("TREEFINDER's language"). The front-end translates mouse clicks and keystrokes into TL commands that are sent to the kernel. The kernel evaluates these commands and sends the results back to the window interface, where they are displayed.
TREEFINDER has grown into a fairly large project. The current version as of this writing (April 2004) consists of approximately 30,000 lines of C code, 9,000 lines of Java and 2,500 lines of TL. TREEFINDER is portable to any operating system for which an ANSI C compiler and a Java virtual machine are available.
TREEFINDER components and language
The Java front-end provides a tree and PostScript viewer, a text editor, a graphical user interface for common tasks in phylogenetic analysis, and a command line terminal for entering TL commands (see Figures 1, 2, 3 and 4). The graphical user interface makes the use of TREEFINDER very intuitive. Data files and reconstruction parameters can be chosen interactively, and the tree viewer also offers basic tree rearrangement functionality.
Figure 1. A typical screenshot of an analysis with TREEFINDER: a tree viewing window showing an inferred maximum-likelihood tree.
Figure 2. TREEFINDER's built-in text editor allows sequence data sets to be viewed and manipulated.
Figure 3. The graphical user interface for tree reconstruction showing the corresponding methods and options available in TREEFINDER.
Figure 4. TREEFINDER's command line terminal, ready for entering TL commands.
The kernel performs the actual analysis. For an overview of the currently implemented phylogenetic procedures and algorithms see section "Results" below. In addition to these specialized tasks, the kernel implements many other general mathematical and statistical functions, including pdf, cdf, and quantile functions of common statistical distributions and most functions from the public-domain CEPHES library [4]. It is also possible to run the kernel without the graphical front-end. In this case TL commands may simply be typed in at the operating system shell prompt or read from a text file.
The computer language TL developed for use with TREEFINDER is a functional language, similar to LISP and Mathematica. This makes TL ideally suited to the processing of lists and trees. The language is interpreted and provides all the common programming elements such as flow control, variables, operator notations and a large set of basic routines. It supports stack-orientated programming as well as rule-based data transformations. The TL language is extensively documented in the TREEFINDER software package.
Note that the clear separation of kernel and front-end and the use of the TL language for communication between the two components greatly facilitates the writing of third-party plug-ins to extend the capabilities of the kernel. Correspondingly, a substantial part of the computational library of TREEFINDER is itself written in TL.
Results
Available phylogenetic methods
The phylogenetic analysis procedures currently implemented in TREEFINDER focus mainly on probabilistic and statistical approaches. One important reason for this choice is that these methods consistently provide the most powerful and accurate inferences [2]. The following is a non-exhaustive list of features present in the TREEFINDER version of April 2004.
Substitution models
The program offers the standard set of evolutionary models for nucleotide substitution (GTR and submodels [5-7]) and two different models of rate heterogeneity among sites (Gamma [8], different rates for each codon position). All model parameters including the rate heterogeneity and base frequencies can be estimated from the data.
Estimation of branch lengths and absolute rates
Branch length estimates are obtained by the method of maximum likelihood [9], with the optional application of a clock constraint. In addition, absolute evolutionary rates can be assigned to each edge and a corresponding calibration of the tree nodes in time can be obtained using the method of non-parametric rate smoothing [10]. TREEFINDER can also plot the resulting ratograms and chronograms (see Figure 5).
Figure 5. An example of an inferred chronogram computed by TREEFINDER using non-parametric rate smoothing [10].
Tree topology search
TREEFINDER employs a genetic algorithm to search for the most likely tree topology [11]. Genetic algorithms are global search procedures and, unlike local rearrangement methods, are less prone to getting trapped in local maxima. The specific details of the genetic algorithm implemented in TREEFINDER are explained in the manual. In addition to exploring the whole tree space, the tree search may also be constrained by a guide tree (a tree whose multifurcations represent the remaining degrees of freedom in the search space). To speed up the evaluation of the likelihood for different tree topologies, a technique called "likelihood hashing" is employed.
Tree manipulation
The tree viewer built into TREEFINDER opens trees in various data formats and provides access to basic rearrangement capabilities, such as re-rooting, midpoint rooting, placing outgroups, and collapsing of small edges. The transformations may also be applied simultaneously to whole sets of trees. More advanced tree manipulation is available through the TL programming interface. For instance, comparison of trees and other expressions is straightforward in TL due to its functional nature.
Simulation of data and trees
A method to simulate sequence data along a specified gene tree under a given model of nucleotide substitution is implemented [12]. In addition, a simple procedure to generate random bifurcating trees is available. Simulation of trees and data is useful to assess the accuracy of phylogenetic methods and to generate empirical distributions for test statistics [13,14].
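The two simulation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch and not TREEFINDER's implementation: it assumes the Jukes-Cantor model for brevity (TREEFINDER offers GTR and its submodels), exponentially distributed branch lengths, and a hypothetical tree representation in which a node is either a leaf name or a list of (child, branch length) pairs. All function names are invented for this example.

```python
import math
import random

BASES = "ACGT"

def random_tree(names, rng, mean_length=0.1):
    """Join randomly chosen subtrees pairwise until two remain at the root.

    A node is a leaf name (str) or a list of (child, branch_length) pairs.
    Exponential branch lengths are an assumption of this sketch.
    """
    nodes = [(name, rng.expovariate(1.0 / mean_length)) for name in names]
    while len(nodes) > 2:
        i, j = sorted(rng.sample(range(len(nodes)), 2), reverse=True)
        first, second = nodes.pop(i), nodes.pop(j)  # larger index is popped first
        nodes.append(([first, second], rng.expovariate(1.0 / mean_length)))
    return nodes

def evolve(seq, length, rng):
    """Evolve a sequence along one branch under the Jukes-Cantor model."""
    # Probability that a site ends up in a different state after `length`
    # expected substitutions per site.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_diff:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_alignment(tree, n_sites, rng):
    """Draw a random root sequence and evolve it down to every leaf."""
    root = "".join(rng.choice(BASES) for _ in range(n_sites))
    alignment = {}

    def descend(node, seq):
        if isinstance(node, str):
            alignment[node] = seq
            return
        for child, length in node:
            descend(child, evolve(seq, length, rng))

    descend(tree, root)
    return alignment

rng = random.Random(1)
tree = random_tree(["a", "b", "c", "d", "e"], rng)
for name, seq in sorted(simulate_alignment(tree, 30, rng).items()):
    print(name, seq)
```

Passing an explicit random number generator keeps each simulated data set reproducible, which matters when the same trees are later used to compare several programs.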
Rate profiles
The computation of rate and mutation profiles [15] along sequence alignments is implemented in TREEFINDER. These plots are useful in the inference of functional regions and in investigating the selective forces acting on DNA sequences.
Other features
The confidence of inferred evolutionary relationships may be assessed by bootstrap analysis [16]. Corresponding routines for computing consensus trees [17] with the option to count and output the distinct topologies in the set of samples are available. Further TL procedures include checks for compositional bias in the data and functions for reading, writing and manipulating sequence alignments.
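The resampling step at the heart of a bootstrap analysis [16] can be illustrated as follows. This is a minimal Python sketch under the assumption that an alignment is stored as a dictionary mapping taxon names to equal-length sequences; the function names are hypothetical and do not correspond to TL routines. Tree inference on each replicate and the subsequent consensus computation are outside the scope of the sketch.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return a list of independent bootstrap replicates of the alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

alignment = {"a": "ACGTAC", "b": "ACGTTT", "c": "GGGTAC"}
for replicate in bootstrap_replicates(alignment, 3):
    print(replicate)
```

Each replicate has the same taxa and the same length as the original alignment; only the multiset of columns changes.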
User interface and TL language
A major design goal of TREEFINDER is to provide both a simple, easy-to-use graphical user interface and a correspondingly powerful programming language for phylogenetic analysis.
Figures 1, 2, 3 and 4 give an impression of the graphical user interface for typical standard tasks: tree viewing (Figure 1), editing alignments (Figure 2), reconstructing trees (Figure 3), and the TL shell for entering commands (Figure 4). Examples of an inferred chronogram [10] and of a rate profile plot [15] are shown in Figures 5 and 6. Most GUI elements are self-explanatory, and a detailed description of each control is available in the TREEFINDER manual.
Figure 6. An example of a rate profile, i.e. relative evolutionary rates along an alignment as computed by TREEFINDER.
All analyses can also be carried out on the command line or at the script level. For instance, to reconstruct a phylogenetic tree from a sequence alignment contained in some "file", one enters the following simple command:
ReconstructPhylogeny ["file",
SubstitutionModel->"HKY"].
In this case, "SubstitutionModel" is a named optional argument (it takes a default value and hence need not be specified), whereas "file" is a required positional argument.
A noteworthy detail about TL is that a programmer can choose at any time between functional and stack-orientated calls. The stack-orientated form of the above command is:
"file",
SubstitutionModel->"HKY",
ReconstructPhylogeny
A phylogenetic tree in TL is an object of the form
{{"a",{"b","c"}},"d",{"e","f"}}
and with edge lengths
(*) {{"a":0.15,{"b":0.1,"c":0.1}
:0.001}:0.1, "d":0.2,
{"e":0.1,"f":0.1}:0.001}.
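To illustrate how such nested lists correspond to bracket notation, the following Python sketch converts an analogous nested structure into a NEWICK string. The Python representation is an assumption made for this example only: an inner node is a list, a leaf is a string, and an edge length is attached by wrapping a node in a (node, length) tuple, mirroring the `node:length` form used in TL.

```python
def to_newick(node):
    """Convert a nested-list tree into NEWICK bracket notation (no trailing ';')."""
    if isinstance(node, str):
        return node
    parts = []
    for item in node:
        if isinstance(item, tuple):
            # (subtree, edge_length) pair, the analogue of TL's subtree:length
            child, length = item
            parts.append(f"{to_newick(child)}:{length}")
        else:
            parts.append(to_newick(item))
    return "(" + ",".join(parts) + ")"

# Topology only, analogous to {{"a",{"b","c"}},"d",{"e","f"}}
topology = [["a", ["b", "c"]], "d", ["e", "f"]]

# With edge lengths, analogous to example (*)
with_lengths = [
    ([("a", 0.15), ([("b", 0.1), ("c", 0.1)], 0.001)], 0.1),
    ("d", 0.2),
    ([("e", 0.1), ("f", 0.1)], 0.001),
]

print(to_newick(topology) + ";")
print(to_newick(with_lengths) + ";")
```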
The structure of the nested list represents the topology, as in the familiar NEWICK bracket notation for phylogenetic trees. However, the above examples are not merely a data format for storing tree information in files. They are genuine TL expressions, and as such may be subjected to further transformations. For instance, if you have a "treelist" variable with value
{tree1, tree2, tree3, tree4}
one easily computes the corresponding strict consensus tree via the simple command
ConsensusTree [treelist, Strict->True].
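For readers unfamiliar with strict consensus trees, the idea can be sketched in Python as follows. This is not the algorithm used by TREEFINDER's ConsensusTree routine, whose internals are not described here; it is a simplified illustration that treats the trees as rooted, assumes all trees share the same leaf set, and uses hypothetical function names. A clade is kept only if it occurs in every input tree.

```python
def clades(tree):
    """Return the set of leaf sets (frozensets) below every inner node."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    visit(tree)
    return found

def build(clade, shared):
    """Rebuild a nested list from a set of mutually compatible clades."""
    subs = [c for c in shared if c < clade]                    # proper sub-clades
    maximal = [c for c in subs if not any(c < d for d in subs)]
    covered = frozenset().union(*maximal)
    children = [build(c, shared) for c in sorted(maximal, key=sorted)]
    children += sorted(clade - covered)                        # leaves attached directly
    return children

def strict_consensus(trees):
    """Keep only the clades present in all trees (rooted interpretation)."""
    shared = set.intersection(*(clades(t) for t in trees))
    root = max(shared, key=len)
    return build(root, shared)

tree1 = [["a", ["b", "c"]], "d", ["e", "f"]]
tree2 = [[["a", "b"], "c"], "d", ["e", "f"]]
print(strict_consensus([tree1, tree2]))
```

In this example the two trees disagree about the grouping inside the clade containing a, b and c, so that clade is returned as an unresolved multifurcation.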
One of TL's most advanced features is rule-based data transformation. Given a variable 'tree' containing a tree with edge lengths, it requires less than one line of code to collapse the short edges into multifurcations:
tree={{__b}:_e->__b/?_e<0.01}
will transform the above example (*) into
{{"a":0.15,"b":0.1,"c":0.1}:0.1,
"d":0.2,"e":0.1,"f":0.1}.
Furthermore, TL internally also applies rule-based transformations to algebraic expressions, e.g.,
a+a+b
will be simplified to
b+2*a.
This pattern-matching property is heavily relied on in TREEFINDER's internal TL kernel routines, but it may also prove useful for writing extensions.
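To make the effect of the edge-collapsing rule on example (*) concrete outside TL, the same transformation can be written as a short recursive function in Python. This is a sketch under assumptions: the tree is represented as a list of (child, edge length) pairs with strings as leaves, which is a hypothetical stand-in for the TL expression, and explicit recursion replaces TL's pattern matching.

```python
def collapse_short_edges(node, threshold):
    """Splice the children of every inner node whose edge is shorter than
    `threshold` into its parent, producing a multifurcation."""
    if isinstance(node, str):
        return node
    result = []
    for child, length in node:
        child = collapse_short_edges(child, threshold)
        if not isinstance(child, str) and length < threshold:
            result.extend(child)            # drop the short edge, keep grandchildren
        else:
            result.append((child, length))
    return result

# Python analogue of example (*)
tree = [
    ([("a", 0.15), ([("b", 0.1), ("c", 0.1)], 0.001)], 0.1),
    ("d", 0.2),
    ([("e", 0.1), ("f", 0.1)], 0.001),
]

print(collapse_short_edges(tree, 0.01))
```

The result corresponds to the collapsed tree shown in the text: the clade (b, c) is merged into its parent, and e and f are attached directly to the root. Leaf edges are never collapsed, regardless of their length.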
These are only some illustrative examples of TL programming. Further code examples can be found in the TL documentation and in the 'Kernel' directory of the TREEFINDER distribution.
Relative speed and accuracy
The algorithms implemented in TREEFINDER have been tested and cross-compared with those of other likelihood-based phylogeny programs, such as PAUP* [3], PHYLIP [18], fastDNAml [19], and TREE-PUZZLE [20].
Specifically, we conducted a simulation study to investigate the computation time, the accuracy (i.e. the probability of recovering the exact true tree topology), and the expected dissimilarity of the true and inferred trees [21] for several widely used programs and TREEFINDER. Varying the number of taxa between 4 and 60, we generated a set of random trees. Subsequently, sequence data of length 1,000 nucleotides were artificially evolved along these trees. The resulting alignments were then used to infer the original trees using TREEFINDER (version of April 2004), PAUP* (version 4b10), TREE-PUZZLE (version 5.1) and fastDNAml (version 1.2.2). For each tree size (4–60 leaves) this procedure was repeated 100 times to assess the average relative performance of each program.
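The dissimilarity measure of Robinson and Foulds [21] counts the bipartitions (splits) of the leaf set that occur in one tree but not in the other. The Python sketch below illustrates the measure on topologies given as nested lists of leaf names; it assumes that both trees have the same leaves, treats the trees as unrooted, and uses hypothetical function names. It is not the code used in the simulation study.

```python
def bipartitions(tree):
    """Return the non-trivial splits of an unrooted tree given as nested lists."""
    leaf_sets = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        leaf_sets.append(leaves)
        return leaves

    all_leaves = visit(tree)
    reference = min(all_leaves)
    splits = set()
    for side in leaf_sets:
        # Canonical form: always store the side that lacks the reference leaf.
        if reference in side:
            side = all_leaves - side
        # Skip trivial splits (a single leaf, or nothing, on one side).
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits

def robinson_foulds(tree1, tree2):
    """Number of splits found in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))

true_tree = [["a", ["b", "c"]], "d", ["e", "f"]]
inferred = [[["a", "b"], "c"], "d", ["e", "f"]]
print(robinson_foulds(true_tree, inferred))
```

A distance of zero means that the inferred topology is identical to the true one, which is the event counted by the accuracy measure above.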
The results from our simulations are summarized in Figure 7. Essentially, the accuracy of TREEFINDER with regard to correctly inferring tree topologies and estimating branch lengths is comparable to that of other likelihood programs such as PAUP* and fastDNAml, which are often used as "gold standards". In terms of speed, however, TREEFINDER drastically outperformed all other investigated programs, in particular for large trees containing more than 30 sequences.
Figure 7. Comparison of computation times, accuracy and performance of three widely used likelihood tree reconstruction programs with TREEFINDER.
Future work
The TREEFINDER environment, while already a versatile analysis framework in its present version, offers many options for further enhancement. These include, most importantly, substitution models for amino acids, e.g., the classic Dayhoff model [22] or the more recent WAG model [23]. Other desirable directions for extension are the implementation of modern population genetic methods, such as tools for coalescent simulation and estimation of demographic parameters [24]. These and other procedures are scheduled for inclusion in future releases of TREEFINDER.
Conclusions
The TREEFINDER project is an ongoing effort to provide an easy-to-use yet powerful platform-independent analysis environment for molecular phylogenetics. Currently, it offers a solid set of well-tested statistical methods for inferring gene trees and for related analyses, with its functional programming interface providing additional flexibility. This article presents the status of TREEFINDER as of the version of April 2004. With a release cycle of 3–4 updates per year, as in 2003, we expect that TREEFINDER will mature further and provide, especially for beginners, a convenient and quick route to phylogenetic analysis.
Availability and requirements
The TREEFINDER analysis environment can be downloaded free of charge from the web page http://www.treefinder.de. Packages are currently provided for the Windows, MacOS X, SUN Solaris, and Intel Linux platforms. TREEFINDER requires the prior installation of a Java virtual machine (preferably version 1.4 or later). The TREEFINDER software is provided "as is" with no guarantee or warranty of any kind. It may be distributed non-commercially, provided that neither its manual nor any other component of the software is changed (for details refer to the web page or the manual).
Authors' contributions
G.J. initiated the TREEFINDER project in 1997. He is the principal designer and programmer of TREEFINDER. A.v.H. and K.S. contributed ideas and suggestions for improving TREEFINDER and supervised the project while G.J. worked in their respective research groups. G.J. and K.S. prepared the manuscript.
Acknowledgements
Development of TREEFINDER was in part supported by grants from the Deutsche Forschungsgemeinschaft and the Max Planck Gesellschaft to A.v.H. and K.S. We also thank Wolfgang Ludwig (TU Munich) for many valuable suggestions that greatly helped the TREEFINDER project.
References

1. Page RDM, Holmes EC: Molecular Evolution: A Phylogenetic Approach. Oxford: Blackwell Science; 1998.

2. Felsenstein J: Inferring Phylogenies. Sunderland, MA: Sinauer Associates; 2004.

3. Swofford DL: PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4. Sunderland, MA: Sinauer Associates; 1998.

4. Moshier SL: Methods and Programs for Mathematical Functions. Upper Saddle River, New Jersey: Prentice-Hall; 1989.

5. Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evol 1984, 20:86-93.

6. Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol 1993, 10:512-526.

7. Hasegawa M, Kishino H, Yano K: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 1985, 22:160-174.

8. Yang Z: Maximum-likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol 1994, 39:306-314.

9. Felsenstein J: Evolutionary trees from DNA sequences: A maximum-likelihood approach. J Mol Evol 1981, 17:368-76.

10. Sanderson MJ: A nonparametric approach to estimating divergence times in the absence of rate constancy.

11. Lewis PO: A genetic algorithm for maximum-likelihood inference using nucleotide sequence data. Mol Biol Evol 1998, 15:277-283.

12. Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees.

13. Goldman N: Statistical tests of models of DNA substitution. J Mol Evol 1993, 36:182-198.

14. Goldman N: Simple diagnostic statistical tests of models for DNA substitution. J Mol Evol 1993, 37:650-661.

15. Simon AL, Stone EA, Sidow A: Inference of functional regions in proteins by quantifications of evolutionary constraints. Proc Natl Acad Sci USA 2002, 99:2912-2917.

16. Felsenstein J: Confidence limits on phylogenies: An approach using the bootstrap.

17. Wilkinson M: Majority-rule reduced consensus trees and their use in bootstrapping. Mol Biol Evol 1996, 13:437-444.

18. Felsenstein J: PHYLIP: Phylogenetic Inference Package, version 3.5c. Seattle: Department of Genetics, University of Washington; 1993.

19. Olsen GJ, Matsuda H, Hagstrom R, Overbeek R: fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood.

20. Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum-likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002, 18:502-504.

21. Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci 1981, 53:131-147.

22. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequences and Structure. Volume 5. Edited by Dayhoff MO. Silver Springs: Natl. Biomed. Res. Found.; 1978:345-352.

23. Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 2001, 18:691-699.

24. Nordborg M: Coalescent Theory. In Handbook of Statistical Genetics. Edited by Balding D, Bishop M, Cannings C. Chichester: Wiley; 2001:179-212.