Abstract
Background
The development, over the last decade, of stochastic heuristics implemented in robust application software has made large phylogeny inference a key step in most comparative studies involving molecular sequences. Still, the choice of a phylogeny inference program is often dictated not by the raw performance of the implemented algorithm(s) but by practical issues such as ergonomics and/or the availability of specific functionalities.
Results
Here, we present MetaPIGA v2.0, a robust implementation of several stochastic heuristics for large phylogeny inference (under maximum likelihood), including a Simulated Annealing algorithm, a classical Genetic Algorithm, and the Metapopulation Genetic Algorithm (metaGA), together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. MetaPIGA v2.0 also implements the Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion for automated selection of the substitution model that best fits the data. Heuristics and substitution models are highly customizable through manual batch files and command-line processing. However, MetaPIGA v2.0 also offers an extensive graphical user interface for setting parameters, generating and running batch files, following run progress, and manipulating result trees. MetaPIGA v2.0 uses standard formats for data sets and trees, is platform independent, runs on 32- and 64-bit systems, and takes advantage of multiprocessor and multicore computers.
Conclusions
The metaGA resolves the major problem inherent to classical Genetic Algorithms by maintaining high inter-population variation even under strong intra-population selection. Implementing the metaGA together with additional stochastic heuristics in a single program will allow rigorous optimization of each heuristic as well as a meaningful comparison of performances among these algorithms. MetaPIGA v2.0 gives access both to high customization for the phylogeneticist and to an ergonomic interface and functionalities assisting the non-specialist in sound inference of large phylogenetic trees using nucleotide sequences. MetaPIGA v2.0 and its extensive user manual are freely available to academics at http://www.metapiga.org.
Background
Phylogeny inference allows, among other applications, detecting orthology/paralogy relationships among gene family members (e.g., [1,2]), estimating divergence times and evolutionary rates (e.g., [3-5]), reconstructing ancestral sequences (e.g., [6-9]), identifying molecular characters constrained by purifying selection or prone to positive selection (e.g., [10]), uncovering hidden biodiversity (e.g., [11]), and mapping the evolution of morphological, physiological, epidemiological, biogeographical and even behavioral characters [12,13]. Molecular phylogeny inference is now a mature science, and an important part of the maturation process was the realization (in the last 10 years) that the quest for the Holy Grail of the absolute best tree should be abandoned for a much more meaningful goal: the inference of clade and tree robustness. Still, that objective remained intractable in practice because of (a) the NP-hard nature of optimality-criterion-based phylogeny inference (i.e., no known algorithm can solve it in polynomial time; [14,15]) and (b) the computing-time requirements of using complex substitution models (and rate heterogeneity across sites) in the framework of what had been identified as probably the most robust optimality criterion: Maximum Likelihood (ML; [16-18]). Today, large phylogeny inference is incorporated, across biological disciplines, as an essential step in most comparative studies involving nucleotide or protein sequences. This has been made possible thanks to both theoretical and practical developments.
First, one key advance that made large phylogeny inference tractable is the implementation in this field of stochastic heuristics with inter-step optimization, i.e., a family of approaches that had existed for decades in physics and computer science and that explore multidimensional solution spaces much more efficiently than the older intra-step-optimization hill-climbing methods. Indeed, in the latter, one prime parameter (typically, the topology of the tree) is modified and all other parameters are optimized before the new solution is evaluated whereas, in stochastic heuristics, all free parameters are optimized while the search proceeds. Inter-step optimization methods include MCMC approximations of the Bayesian approach [19,20], stochastic simulated annealing [21], and genetic algorithms [22-26]. The efficiency of stochastic heuristics is quite counterintuitive but can be explained by several factors: (a) poorer solutions are accepted with a non-null probability (contrary to hill climbing, which strictly restricts moves toward better likelihood values), such that valleys in likelihood space can eventually be crossed; (b) parameters are not over-optimized (e.g., starting and intermediate trees are generally largely suboptimal; hence, optimizing model parameters on these topologies is a clear example of overfitting). We think that avoiding over-optimization at every topology evaluation generates a flatter likelihood space, such that valleys are more easily crossed and local optima more easily escaped. This suggestion however requires further investigation.
Second, several stochastic methods have been incorporated into robust application software. The importance of that point should not be underestimated. For example, the recent success of Bayesian methods is probably due as much to their incorporation into a robust and efficient program (MrBayes; [27]) as to the theoretical appeal of generating marginal posterior probabilities [19]. The program RAxML [28] enjoys well-deserved popularity because it is one of the fastest ML phylogeny inference programs available to date (even though it does not incorporate stochastic heuristics), thanks to the implementation of approximations to rate heterogeneity across sites and of smart computer-science tricks speeding up likelihood computation: optimized parallel code and 'Subtree Equality Vectors' (i.e., the extension of character compression to the subtree level). Similarly, highly efficient parallel code has recently been implemented for the evaluation of phylogenies on graphics processing units (GPUs), resulting in an about 100-fold speed increase over an optimized CPU-based computation [29]. This efficient use of new hardware, existing stochastic heuristics (in this case, an MCMC approach in a Bayesian framework), and smart code parallelization for efficient harnessing of the hundreds of GPU processing cores allowed the authors to successfully use a 60-state codon model on a dataset of 62 complete mitochondrial genomes [29].
The availability of multiple excellent programs implementing different robust heuristics is clearly an asset for the end user: reliable results might be identified because they remain stable across programs and methods. However, most users choose one single main program for their analyses, and this choice is sometimes dictated by the availability of functionalities that are important but do not pertain to the performance of the specific heuristic implemented (e.g., the ability to perform batch analyses, the availability of the GTR nucleotide substitution model [30] or of rate heterogeneity [31-33], or the possibility to partition data). Finally, given that the need to infer large trees is critical in multiple biological disciplines, the non-specialist can be baffled by the large number of available heuristics, parameters, and programs, such that the most user-friendly tools are often preferred even if more robust or more efficient (but less user-friendly) alternatives are available. There is therefore a challenge to supply software that is easy to use for the non-specialist, provides flexibility for the specialist, and allows fast and robust inference for both.
The Metapopulation Genetic Algorithm (metaGA; [23]) is an evolutionary computation heuristic in which several populations of trees exchange topological information that is used to guide the GA operators for much faster convergence. Although the metaGA had been implemented in a simple and unoptimized program (metaPIGA v1), together with simple nucleotide substitution models, an approximate rate heterogeneity method, and only a low number of functionalities, it has been shown to be one of the most efficient heuristics under the ML criterion [23,34,35]. Furthermore, it has been suggested that multiple metaGA searches provide an estimate of the posterior probability distribution of possible trees [23], although this proposition clearly warrants much further investigation. Here, we present MetaPIGA2.0, the first phase of a robust implementation of the metaGA (and other stochastic methods such as a classical Genetic Algorithm and Simulated Annealing) together with complex substitution models, rate heterogeneity, and high parameterization for the phylogeneticist, as well as an ergonomic interface and easy-to-use functionalities for the non-specialist.
Implementation and Results
ML framework
Trees are estimated in MetaPIGA2.0 under the Maximum Likelihood criterion (ML) using any of five nucleotide substitution models ([2] and refs therein, [30]): Jukes-Cantor (JC), Kimura's two-parameter (K2P), Hasegawa-Kishino-Yano 1985 (HKY85), Tamura-Nei 1993 (TN93), and General Time Reversible (GTR). Analyses can be performed with rate heterogeneity among sites using a proportion of invariant sites (Pinv) [33] and/or a discrete Gamma distribution of rates (γ-distr) [31,32]. All parameters of the model (the transition/transversion ratio or the components of the rate matrix, the shape parameter of the γ-distr, and Pinv) can be set by the user or estimated from a Neighbor Joining (NJ) tree [36]. The same parameters, plus branch lengths and among-partition relative rates, can undergo intra-step optimization either periodically during the search and/or at the end of the search.
Datasets can be partitioned into character sets ("charsets") either using a graphical tool (see below) or by writing the corresponding commands in a batch file. In MetaPIGA2.0, we assume that all partitions evolve on the same topology (we therefore consider, as in the vast majority of phylogeny inference programs, that the analysis is performed on a non-recombining piece of DNA, such that the phenomena of hybridization and incomplete lineage sorting are ignored), but all other parameters (base frequencies, substitution matrix rates, the shape parameter of the γ-distr, and Pinv) are optimized separately for each partition. Among-partition rate variation parameters enter the likelihood equation as a factor that modifies branch lengths for the corresponding partition. Branch lengths are optimized as usual, but the relative rates of partitions are optimized separately (with the constraint that the average of among-partition rates, weighted by partition size, is 1).
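The weighted-average constraint on among-partition relative rates can be sketched as follows (a minimal illustration of the constraint described above; the class and method names are ours, not MetaPIGA's code):

```java
// Hypothetical sketch: rescaling among-partition relative rates so that
// their average, weighted by partition size, equals 1 (the constraint
// stated in the text). Not MetaPIGA's actual implementation.
public class PartitionRates {
    // rates[k] is the unnormalized relative rate of partition k;
    // sizes[k] is the number of sites in partition k.
    static double[] normalize(double[] rates, int[] sizes) {
        double totalSites = 0.0, weighted = 0.0;
        for (int k = 0; k < rates.length; k++) {
            totalSites += sizes[k];
            weighted += rates[k] * sizes[k];
        }
        double mean = weighted / totalSites; // current weighted average
        double[] out = new double[rates.length];
        for (int k = 0; k < rates.length; k++) out[k] = rates[k] / mean;
        return out;
    }
}
```

Dividing every rate by the current weighted mean guarantees that the new weighted mean is exactly 1, so branch lengths keep their usual interpretation as averages over partitions.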
Tools shared among heuristics
Phylogeny estimation is an NP-hard problem [14,15] with unknown search-space topography. MetaPIGA2.0 implements four different heuristics for searching solution space (see below). A set of tools is shared by all these heuristics: the starting tree generators, the operators, and some of the stopping rules.
A Tree Generator is used to produce the starting tree(s) either as NJ tree(s) [36], as random tree(s) (i.e., with random topology and random branch lengths), or as "Loose Neighbor Joining" (LNJ) tree(s), i.e., trees with a pseudo-random topology (modified from [23]). For generating an LNJ tree, the user specifies a proportion value (p ∈ [0,1]) and, at each step of the NJ algorithm, the two nodes to cluster, instead of corresponding to the smallest distance value, are randomly chosen from a list containing the smallest distances (the size of that list is a function of p and of NTax, the number of taxa, i.e., sequences, in the dataset). Branch lengths are computed as in the NJ method [36]. In other words, the LNJ tree is an NJ tree with an amount of topology randomization defined by the user. This approach is a particularly useful compromise between random starting topologies (p = 1), which require long heuristic runs for optimization, and a good but fixed topology (the NJ tree, i.e., p = 0), which is prone to generate solutions around a local optimum. The distance matrix used for building NJ or LNJ starting trees can be computed using any of the five substitution models (see above), with or without Pinv and/or the γ-distr. Arbitrary starting trees can also be imported by the user.
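The "loose" pair choice in one LNJ step can be sketched as follows. This is an illustration of the idea only: the exact size of the candidate list as a function of p and NTax is given in the MetaPIGA papers, so the list-size formula below is an assumption, and the flat array of pairwise distances is a simplification of the NJ distance matrix.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of one LNJ clustering choice: instead of always
// joining the pair with the smallest distance (p = 0, plain NJ), the pair
// is drawn uniformly from a candidate list of the smallest distances.
public class LooseChoice {
    // pairDistances holds the current pairwise distances (flattened);
    // returns the index of the chosen pair.
    static int choosePair(double[] pairDistances, double p, Random rng) {
        Integer[] idx = new Integer[pairDistances.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort pair indices by increasing distance
        Arrays.sort(idx, (a, b) -> Double.compare(pairDistances[a], pairDistances[b]));
        // assumed candidate-list size: 1 (strict NJ) up to all pairs (p = 1)
        int k = 1 + (int) Math.round(p * (pairDistances.length - 1));
        return idx[rng.nextInt(k)];
    }
}
```

With p = 0 the list has length 1 and the method reduces to the deterministic NJ choice; with p = 1 any pair can be joined, yielding a fully randomized topology.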
At the core of all stochastic heuristics are the Operators, i.e., the topology and parameter modifiers allowing the heuristic to explore solution space. In MetaPIGA2.0, we implemented five operators for perturbing the tree topology (Nearest Neighbor Interchange, NNI; Subtree Pruning Regrafting, SPR; Tree Bisection Reconnection, TBR; Taxa Swap, TXS; Subtree Swap, STS; see [37] and [23] for details) and six operators for perturbing model parameters (branch lengths, internal branch lengths, rate matrix parameters, the γ-distr shape parameter, Pinv, and among-partition rate variation). These operators can be used in any combination, either at equal or at user-defined frequencies. The user can let these frequencies change dynamically during the search, i.e., MetaPIGA periodically evaluates the relative gains in likelihood produced by each operator and adjusts their frequencies proportionally. Minimum frequencies can be set such that operators that are inefficient early in the search remain available for increased use later in the search.
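The dynamic re-weighting of operator frequencies can be sketched as follows (a generic illustration of the idea; the update rule and names are ours, not MetaPIGA's code):

```java
// Hypothetical sketch: operator frequencies set proportional to the
// likelihood gain each operator recently produced, with a minimum
// frequency floor so inefficient operators stay available, then
// renormalized to sum to 1.
public class OperatorScheduler {
    static double[] adapt(double[] gains, double minFreq) {
        double total = 0.0;
        for (double g : gains) total += Math.max(g, 0.0);
        double[] freq = new double[gains.length];
        // proportional share, or uniform if no operator produced any gain
        for (int i = 0; i < gains.length; i++)
            freq[i] = total > 0 ? Math.max(gains[i], 0.0) / total : 1.0 / gains.length;
        // apply the floor, then renormalize so the frequencies sum to 1
        double sum = 0.0;
        for (int i = 0; i < freq.length; i++) {
            freq[i] = Math.max(freq[i], minFreq);
            sum += freq[i];
        }
        for (int i = 0; i < freq.length; i++) freq[i] /= sum;
        return freq;
    }
}
```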
All stochastic heuristics require a stopping condition. In MetaPIGA2.0, the user can choose any combination of the following criteria: number of steps, elapsed time, likelihood stability, and convergence of branch support distribution (for replicated searches). When using the metaGA heuristic, one can additionally use a stopping condition, within each replicate, based on convergence of the populations of solutions (see below).
The heuristics
We implemented four heuristics in MetaPIGA2.0: a Simulated Annealing algorithm (SA; [21]), a classical Genetic Algorithm (GA; [22,24,25,38]), and the metapopulation Genetic Algorithm based on the Consensus Pruning principle (metaGA; [23]). As a reference, we also implemented a simple Hill Climbing (HC) algorithm that generates a new solution tree at each step (using the available operators) and accepts it only if its likelihood is better than that of the current solution. HC algorithms are fast but tend to generate solutions trapped in local optima; they are therefore highly dependent on the location of the starting tree in tree space as well as on the (unknown) topography of that space.
The simulated annealing (SA)
The SA algorithm uses statistical mechanics principles to solve combinatorial optimization problems [39]; i.e., it mimics the process of minimal-energy annealing in solids. SA starts with some initial state (the starting tree) and randomly perturbs that solution (using the available tree operators). If the new state is better (lower energy, better likelihood), it is kept as the new current state; if the new state is worse (higher energy, worse likelihood), it is accepted as the current state with the Boltzmann probability e^{ΔE/T}, where ΔE is the (negative) difference in energy (here, the difference in likelihood) between the two states, and T is the so-called 'temperature' of the system. If T is lowered slowly enough, the algorithm is guaranteed to find the optimal solution. The obvious asset of the algorithm is its ability to temporarily accept suboptimal solutions, allowing it to escape local optima, whereas its obvious drawback is the difficulty of defining the shape and speed of the "cooling schedule" (i.e., the rate of the decrease in T); efficient schedules depend strongly on the dataset. We implemented 14 highly parametrized cooling schedules in MetaPIGA2.0, including one specifically developed for phylogeny inference [40]. The user can control all cooling schedule parameters: the starting-temperature computation method, the maximum acceptance probability, the temperature decrease frequency, and the possibility of 'reheating'.
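The acceptance rule above can be sketched in a few lines (a minimal illustration of the Boltzmann criterion, not MetaPIGA's code; working with log-likelihoods, as is usual in ML inference, is our assumption):

```java
import java.util.Random;

// Minimal sketch of the SA acceptance rule: a better likelihood is always
// accepted; a worse one is accepted with the Boltzmann probability
// e^(deltaE / T), where deltaE (negative for a worse state) is the
// difference in score and T is the current temperature.
public class Anneal {
    static boolean accept(double currentLnL, double newLnL, double T, Random rng) {
        if (newLnL >= currentLnL) return true;   // better (or equal): keep it
        double deltaE = newLnL - currentLnL;     // negative for a worse state
        return rng.nextDouble() < Math.exp(deltaE / T);
    }
}
```

At high T the exponential is close to 1 and almost any move is accepted; as T cools, worse moves are accepted ever more rarely, which is exactly the behavior the cooling schedule controls.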
The Genetic Algorithm (GA)
The GA is an evolutionary computation approach that implements a set of operators mimicking processes of biological evolution such as mutation, recombination, selection, and reproduction (e.g., [41]). After an initial step of generating a population of trees, the individuals (specific trees and model parameters) within that population are (i) subjected to mutation (a stochastic alteration of the topology, of branch lengths, or of any model parameter) and/or recombination, and (ii) allowed to reproduce with a probability that is a function of their relative fitness value (here, their likelihood). Because selection preferentially retains changes that improve the likelihood, the mean score of the population improves across generations. However, because suboptimal solutions can survive in the population (with probabilities that depend on the selection scheme), the GA allows, in principle, escaping local optima. In MetaPIGA2.0, we implemented five alternative selection schemes ("Rank", "Tournament", "Replacement", "Improve", and "Keep the Best"; see [23]) and one recombination scheme in which each suboptimal individual has a probability (set by the user) to recombine with a better individual. Recombination is performed by exchanging subtrees defined by one (if any) of the identical taxa partitions in the two parental trees (i.e., one internal branch that defines subtrees including the same taxa but with potentially different subtopologies). A recombination can be viewed as a large number of simultaneous topological mutations. Besides the selection scheme, the major parameter in the GA is the population size (set by the user).
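As an illustration of one of the selection schemes named above, a generic tournament selection over a population of scored trees can be sketched as follows (this is standard GA practice, not MetaPIGA's exact implementation of its "Tournament" scheme):

```java
import java.util.Random;

// Illustrative tournament selection: draw tournamentSize individuals at
// random (with replacement) and return the fittest of them. Weaker
// individuals can still win when they are never drawn against a fitter
// one, which is how suboptimal solutions survive in the population.
public class Tournament {
    // lnL[i] is the log-likelihood of individual i; higher is fitter.
    static int select(double[] lnL, int tournamentSize, Random rng) {
        int best = rng.nextInt(lnL.length);
        for (int k = 1; k < tournamentSize; k++) {
            int challenger = rng.nextInt(lnL.length);
            if (lnL[challenger] > lnL[best]) best = challenger;
        }
        return best;
    }
}
```

The tournament size plays the role of selection strength: larger tournaments approach strict hill climbing, smaller ones keep the population diverse.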
The metapopulation Genetic Algorithm (metaGA)
This approach relies on the coexistence of P interacting populations [23] of I individuals each (P and I are defined by the user): the populations are not fully independent, as they cooperate in the search for optimal solutions. Within each population, a classical GA is performed: trees are subjected to evaluation, selection (the five alternative selection schemes of the GA are available, see above), and mutation events. However, all topological operators are guided through inter-population comparisons defined and controlled by 'Consensus Pruning' (CP; [23]): the topological consensus among trees across populations defines the probability with which different portions of each tree are subjected to topological mutations. These comparisons allow the dynamic differentiation between internal branches that are likely correct (hence, that should be changed with nil or low probability) and those that are likely incorrect (hence, that should be modified with high probability). Although CP allows for the elaboration of many alternative inter-population communication procedures [23], we implemented in MetaPIGA2.0 the two that we identified (data not shown) as the most useful: 'Strict CP' (internal branches shared by all trees across all populations cannot be affected by topological mutations; all other internal branches are unconstrained) and 'Stochastic CP' (topological mutations affecting a given branch are rejected with a probability proportional to the percentage of trees across all populations that agree on that branch).
As entirely preventing an internal branch from being affected by topological mutations necessarily increases the probability of being trapped in a local optimum, a tolerance parameter t (defined by the user) is implemented, allowing any internal branch to be affected with a probability t even if the branch is shared by all trees. The user of MetaPIGA2.0 can choose between a 'blind' and a 'supervised' procedure for handling constrained partitions. In the former, a topological mutation that affects a constrained branch is simply aborted and the tree is left unchanged, whereas in the latter, topological operators exclusively target branches in a pool of acceptable (unconstrained) candidates.
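The 'Stochastic CP' decision with the tolerance parameter t can be sketched as follows (our reading of the rule described above, not MetaPIGA's code):

```java
import java.util.Random;

// Sketch of the 'Stochastic CP' decision with tolerance t: a mutation
// targeting an internal branch is rejected with a probability
// proportional to the fraction of trees, across all populations, that
// share the branch; a branch shared by all trees can still be mutated
// with probability t (the tolerance escape).
public class ConsensusPruning {
    // agreement in [0,1]: fraction of trees sharing the branch
    static boolean allowMutation(double agreement, double t, Random rng) {
        if (agreement >= 1.0) return rng.nextDouble() < t; // tolerance escape
        return rng.nextDouble() >= agreement;              // reject w.p. agreement
    }
}
```

With t = 0 this reduces to the behavior of 'Strict CP' for universally shared branches, while branches on which the populations disagree remain freely mutable.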
The metaGA allows for two non-mutually-exclusive recombination flavors: 'intra-population recombination', where each suboptimal individual at each generation has a probability (instead of being mutated) to recombine with a better individual from that population (as in the GA above), and 'inter-population hybridization' (Figure 1), where, at each generation, there is a probability (defined by the user) that all suboptimal individuals from one random population are recombined with one individual from another population; suboptimal individuals from the other populations experience the normal mutation procedure.
Figure 1. The MetaPIGA2.0 heuristic setting window. The user can set the parameters of the chosen heuristic: here, for the metaGA, the consensus type, the selection scheme, the operator behavior, the number of cores/processors (here, 2 cores) onto which the populations are distributed, the number of populations and the number of individuals per population, the tolerance parameter (frequency with which internal branches are affected by mutational operators even if that branch is present in all trees across all populations), and the frequency of hybridization among populations. See text for details.
Comparing, across generations, the frequencies of internal branches shared among the P*I trees (irrespective of how the trees are assigned to populations) provides a means of assessing whether the populations converge towards a stable set of solutions, i.e., towards a consensus with stable branch frequencies. Hence, a stopping rule, not available to the other heuristics, can be used under CP: the user can choose to stop the search when a series of mean relative error (MRE) values remains, across generations, below a user-defined threshold. To increase independence among samples, consensus trees are sampled every n > 1 (i.e., non-successive) generations. For example, given two consensus trees, T_i and T_j, corresponding to the consensuses among the P*I trees at generations 5000 and 5005, respectively, the MRE is computed as follows:
MRE = (1/nPartition) × Σ_p [ |c_p,i − c_p,j| / max(c_p,i, c_p,j) ], where nPartition is the number of taxa bipartitions observed in T_i and T_j (identical partitions are counted once), and c_p,i and c_p,j are the consensus values of bipartition p in trees T_i and T_j, respectively. The term for bipartition p is nil if both c_p,i and c_p,j are nil, and c_p,i (respectively, c_p,j) is nil if the corresponding internal branch does not exist in T_i (respectively, T_j). Internal branches that are absent from both T_i and T_j are not considered. If MRE(gen5000, gen5005) is above the user-defined threshold (e.g., 5%), it is discarded and a new MRE is computed for the comparison of generations 5005 and 5010. On the other hand, if MRE(gen5000, gen5005) is below the threshold, a counter is incremented and a new MRE is computed for the comparison of generation 5000 with the next sample (here, corresponding to generation 5010). The user defines for how many samples the MRE must remain below the specified threshold before the search stops.
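The MRE computation between two consensus trees can be sketched as follows (the data layout, with bipartition consensus values stored in maps keyed by a bipartition identifier, is our assumption; the skipping rules follow the text):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the mean relative error (MRE) between two consensus trees:
// each map gives, per taxa bipartition, its consensus value in that tree.
// Bipartitions absent from both trees contribute nothing; a bipartition
// absent from one tree has a nil consensus value there.
public class MreStop {
    static double mre(Map<String, Double> ci, Map<String, Double> cj) {
        Set<String> partitions = new HashSet<>(ci.keySet());
        partitions.addAll(cj.keySet()); // identical partitions counted once
        double sum = 0.0;
        int n = 0;
        for (String p : partitions) {
            double a = ci.getOrDefault(p, 0.0);
            double b = cj.getOrDefault(p, 0.0);
            if (a == 0.0 && b == 0.0) continue; // absent from both: skip
            sum += Math.abs(a - b) / Math.max(a, b);
            n++;
        }
        return n == 0 ? 0.0 : sum / n;
    }
}
```

Two samples whose bipartition frequencies have stabilized yield an MRE near zero, which is what the intra-replicate stopping rule monitors.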
Replicates
For any heuristic chosen by the user in MetaPIGA2.0, the search can be repeated many times, generating a majority-rule consensus tree among the replicates. Note that, when the metaGA is the selected heuristic, it has been suggested that the frequencies of clades in the among-replicates consensus might approximate the corresponding posterior probabilities [23]. The user can either fix the number of replicates, or specify a minimum and a maximum number of replicates and let MetaPIGA2.0 stop automatically, exploiting the MRE metric in a similar way as for the consensus across populations in a single metaGA search (see above). Here, however, the MRE is computed using consensuses across replicates, i.e., T_i is the consensus among the final trees obtained in replicates 1 to i. No additional replicate is produced when the MRE among N replicates (consecutive replicates can be used here because they are independent) remains below a given threshold. As an example, if N is set to 10 and the first MRE below the user-defined threshold (e.g., 5%) involves replicates 1241 and 1242, the MRE is computed 9 additional times, i.e., between the reference consensus T_1241 and T_j, for j corresponding to replicates 1243, then 1244, then 1245, etc. The search stops if the inter-replicate MRE remains below 5% for 10 consecutive replicates. On the other hand, the counter is reset to zero as soon as the MRE exceeds 5%, and the reference consensus for computing the MRE is then set to that of the current replicate.
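The consecutive-below-threshold bookkeeping just described can be sketched as a small state machine (variable names are ours):

```java
// Sketch of the inter-replicate stopping counter: the analysis may stop
// once `needed` consecutive MRE values stay below the threshold; the
// counter resets whenever an MRE value exceeds it (at which point the
// reference consensus would also move to the current replicate).
public class ReplicateStopper {
    final double threshold;
    final int needed;
    int below = 0;

    ReplicateStopper(double threshold, int needed) {
        this.threshold = threshold;
        this.needed = needed;
    }

    // feed one MRE value (reference consensus vs. latest replicate);
    // returns true when the analysis may stop
    boolean offer(double mre) {
        if (mre < threshold) below++;
        else below = 0; // reset on a violation
        return below >= needed;
    }
}
```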
The inter-generation (= intra-replicate) MRE stopping rule can be used in combination with the inter-replicate MRE stopping rule, letting MetaPIGA decide both when to stop each replicate and when to stop executing additional replicates (i.e., when to stop the entire analysis).
Language, formats, and interface
MetaPIGA2.0 is written in Java 1.6 such that a single code base runs on 32- and 64-bit platforms under MacOS X, Linux, and Windows. Computing and storing the likelihood of large trees requires large amounts of Random-Access Memory (RAM). Whereas 32-bit systems can allocate a maximum of ~2 GB of memory to the Java Virtual Machine (JVM), 64-bit systems are virtually limited only by the amount of memory installed on the computer (as the theoretical limit is about 18 billion gigabytes). MetaPIGA2.0 uses Java multithreading to take advantage of multiprocessor and multicore computers, such that some tasks can be run in parallel. As replicates are independent, they are particularly amenable to parallelization: any number of cores/processors can be assigned to different replicates. In addition, the metaGA heuristic itself is well suited to parallel implementation because processes such as mutation, selection, and likelihood computation are unrelated to CP and are therefore independent across populations. Hence, different metaGA populations can be distributed to different cores/processors. Parallelization of metaGA populations can be combined with parallelization of replicates (e.g., 16 cores allow running 4 metaGA replicates simultaneously, with 4 populations per replicate).
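The replicate-level parallelization can be illustrated with standard Java threading (a generic sketch in the spirit described above; the replicate body is a dummy stand-in, not MetaPIGA's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative parallel execution of independent replicates on a fixed
// pool of worker threads; each task stands in for one heuristic replicate
// and returns its best score.
public class ParallelReplicates {
    static List<Double> run(int replicates, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int r = 0; r < replicates; r++) {
                final int seed = r;
                // dummy "best lnL" of replicate r, in place of a real search
                futures.add(pool.submit(() -> -1000.0 - seed));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) results.add(f.get());
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the replicates never communicate, this pattern scales with the number of cores; nesting a second pool per replicate for the metaGA populations follows the same idea.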
MetaPIGA2.0 uses standard formats: datasets are read and written in Nexus format [42] and trees in Newick format (http://evolution.genetics.washington.edu/phylip/newicktree.html). All search settings can be saved in a metaPIGA block incorporated into the Nexus file, allowing easy management and command-line runs. A Nexus file without a metaPIGA block will be correctly interpreted by MetaPIGA2.0 and will run with default parameters.
MetaPIGA2.0 can be run from the command line, but it also offers an extensive graphical user interface (GUI) giving access to all search settings: defining and managing charsets; including/excluding taxa, characters, and charsets; managing dataset partitions; choosing and parametrizing heuristics (Figure 1); defining substitution models and their parameters (Figure 2); choosing starting-tree options; controlling operators (Figure 3); and defining stopping criteria and replicates. All settings are associated with an interactive "mouse-over" help system. MetaPIGA2.0 also implements three statistical methods (Figure 2) for selecting the substitution model that best fits the data ([43] and refs therein): the Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion. The MetaPIGA2.0 GUI provides a detailed run window showing graphs specific to the chosen heuristic (e.g., for a metaGA search with replicates: the current best likelihood progression of each population, as well as the current topology, branch support values, and the average branch lengths of the consensus tree; Figure 4).
Figure 2. The MetaPIGA2.0 model setting window. The user can choose a substitution model and set the corresponding parameters: here, the GTR model (with estimated starting values of the rate matrix) with rate heterogeneity (discrete Gamma model) and no proportion of invariant sites has been selected automatically after performing a likelihood ratio test (lower left buttons). The user can also choose how and when intra-step optimization of target parameters (here, branch lengths, rate matrix parameters, and the alpha shape parameter of the Gamma distribution) will be performed (here, at the end of the search, using a genetic algorithm). Note that, as the metaGA is a stochastic heuristic, most of the parameter optimization occurs inter-step, i.e., across generations under the effect of operators (see Figure 3).
Figure 3. The MetaPIGA2.0 operators setting window. The user selects the operators (affecting the topology, branch lengths, and model parameters), their frequencies (unless they are ordered or randomly selected; upper right radio-button panel), and whether their frequencies are dynamically adapted (here, every 100 generations but never set below 4%) depending on their relative efficiencies in improving the best-tree likelihood. See text for details.
Figure 4. The MetaPIGA2.0 run window. Here, a metaGA search with multiple replicates has been chosen. Hence, the run window shows, for 3 successive replicates (lower left panel), the current best-tree likelihood progression in each population, as well as (right panel) the current topology, metaGA branch support values, and average branch lengths of the consensus among the best trees (one for each population) from all replicates.
Batch files are particularly useful for sequentially running a single dataset under multiple different settings and/or several datasets with the same settings. MetaPIGA2.0 supports batch files that can be written manually or generated using tools available in the GUI: datasets and their settings can be duplicated, settings can be copy-pasted from one dataset to another, and multiple combinations of datasets and settings can be saved in a batch file that can be run either in the GUI (with various graphical information on search progress) or from the command line.
Input and result trees are manipulated in Newick format but are visualized graphically in the GUI and can be exported to other programs. MetaPIGA2.0 also integrates a Tree Viewer (Figure 5) that allows viewing, re-rooting, and printing trees, as well as computing the likelihood of any tree (under any substitution model) and optimizing its model parameters. Three other tools are implemented in MetaPIGA2.0: a tree generator (using the starting-tree settings), a consensus builder (using user trees and/or trees saved in the Tree Viewer), and a memory setting tool defining the maximum amount of memory allocated to the program.
Figure 5. The MetaPIGA2.0 Tree Viewer. The tree selected in the list of available trees (left panel) is shown in the right panel with its likelihood and model parameters. Trees can be viewed, re-rooted, and printed. The likelihood can be recomputed after changing the substitution model or optimizing model parameters. Various tools allow modifying the list of trees.
Discussion and Conclusions
The metaGA resolves the major problem inherent to classical GA approaches: should one use a soft or a stringent selection scheme? Indeed, strong selection produces good solutions in a short computing time but tends to generate suboptimal solutions around local optima, whereas mild selection schemes considerably improve the probability of escaping local optima and finding better solutions but greatly increase computing time. As the metaGA involves several parallel searches, the initial inter-population variation can be very high (especially if random or LNJ pseudo-random starting trees are used) and is partly maintained during the search, even under extreme intra-population selection.
Although the metaGA has been shown to perform very well [23,34,35], it had not previously been implemented together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. Here, we performed such an implementation, together with a hill climbing, a classical GA, and an SA algorithm. This implementation in a single program will allow a rigorous identification of the optimal parameter values under each of these heuristics as well as a meaningful comparison of performances among these algorithms. Comparing the performances (speed and accuracy) of MetaPIGA2.0 with those of other popular programs such as, among others, MrBayes [27], RAxML [28], Garli [38], and PhyML [44] is well beyond the scope of the present manuscript. However, our preliminary analyses (data not shown) with large datasets indicate that MetaPIGA2.0 and MrBayes 3.1.2 generate very similar candidate trees and consensus cladograms (under stopping rules based on the inter-replicate MRE metric or the average standard deviation of split frequencies, respectively) and require similar running times.
An in-depth assessment of the statistical significance of metaGA branch support values is warranted. There might be some correspondence between metaGA branch support values and posterior probabilities [23], but theoretical and additional empirical analyses are required. For example, it would be important to assess how changing metaGA (versus MCMC) settings affects the sampling and estimation of probability distributions.
MetaPIGA v2.0 will constitute a platform into which we will incorporate additional functionalities (e.g., amino-acid and codon substitution models, and inference of ancestral sequences), improve performance (e.g., through parallelization on graphics processing units), identify optimal combinations of default parameter values, improve current heuristics, and possibly combine them into higher-level metaheuristics. Meanwhile, MetaPIGA v2.0 already offers both high customizability for the phylogeneticist and an ergonomic interface with functionalities assisting the non-specialist in the sound inference of large phylogenetic trees from nucleotide sequences.
Availability and Requirements
Project name: MetaPIGA v2.0
Project home page: http://www.metapiga.org and http://www.lanevol.org
Operating systems: Platform independent
Programming language: Java
Other requirements: Java 1.6 virtual machine
License: free for academics; a license is needed for non-academic use.
Authors' contributions
MCM conceived and designed the software, and wrote the manuscript. RH improved the design of many functionalities and wrote the code. Both authors read and approved the final manuscript.
Authors' Information
MCM heads the Laboratory of Artificial & Natural Evolution (LANE) at the University of Geneva (Switzerland) and works on various aspects of Evo-Devo, phylogenomics, phyloinformatics, experimental evolution, and conservation genetics. RH is a computer scientist in the Department of Biology at the University of Namur (Belgium) and specializes in the development of bioinformatics tools.
Acknowledgements
We are grateful to Alan Lemmon, anonymous reviewers, and the Editor of BMC Bioinformatics for useful suggestions regarding software implementation and critical comments on previous versions of the manuscript. We thank Eric Depiereux for providing office space and additional computing facilities to RH. Funds were provided to MCM by the University of Geneva (Switzerland), the Swiss National Science Foundation (FNSNF, grant 31003A_125060), the Société Académique de Genève, the Georges and Antoine Claraz Foundation, and the Ernst and Lucie Schmidheiny Foundation. Funds were provided to RH by the Facultés Universitaires Notre-Dame de la Paix (Namur, Belgium).
References

Gabaldon T: Large-scale assignment of orthology: back to phylogenetics?
Genome Biol 2008, 9(10):235.

Li WH: Molecular Evolution. Sunderland, MA: Sinauer; 1997.

Thorne JL, Kishino H: Divergence time and evolutionary rate estimation with multilocus data.
Syst Biol 2002, 51(5):689-702.

Cassens I, Vicario S, Waddell VG, Balchowsky H, Van Belle D, Ding W, Fan C, Mohan RS, Simoes-Lopes PC, Bastida R, et al.: Independent adaptation to riverine habitats allowed survival of ancient cetacean lineages.
Proc Natl Acad Sci USA 2000, 97(21):11343-11347.

Thorne JL, Kishino H, Painter IS: Estimating the rate of evolution of the rate of molecular evolution.
Mol Biol Evol 1998, 15(12):1647-1657.

Blanchette M, Green ED, Miller W, Haussler D: Reconstructing large regions of an ancestral mammalian genome in silico.
Genome Res 2004, 14(12):2412-2423.

Chang BS, Jonsson K, Kazmi MA, Donoghue MJ, Sakmar TP: Recreating a functional ancestral archosaur visual pigment.
Mol Biol Evol 2002, 19(9):1483-1489.

Chang BS, Ugalde JA, Matz MV: Applications of ancestral protein reconstruction in understanding protein function: GFP-like proteins.
Methods Enzymol 2005, 395:652-670.

Williams PD, Pollock DD, Blackburne BP, Goldstein RA: Assessing the accuracy of ancestral protein reconstruction methods.
PLoS Comput Biol 2006, 2(6):e69.

Zhang J, Nielsen R, Yang Z: Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level.
Mol Biol Evol 2005, 22(12):2472-2479.

Meegaskumbura M, Bossuyt F, Pethiyagoda R, Manamendra-Arachchi K, Bahir M, Milinkovitch MC, Schneider CJ: Sri Lanka: an amphibian hot spot.
Science 2002, 298(5592):379.

Springer MS, Stanhope MJ, Madsen O, de Jong WW: Molecules consolidate the placental mammal tree.
Trends Ecol Evol 2004, 19(8):430-438.

Bossuyt F, Brown RM, Hillis DM, Cannatella DC, Milinkovitch MC: Phylogeny and biogeography of a cosmopolitan frog radiation: Late Cretaceous diversification resulted in continent-scale endemism in the family Ranidae.
Syst Biol 2006, 55(4):579-594.

Graham RL, Foulds LR: Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time.
Math Biosci 1982, 60:133-142.

Chor B, Tuller T: Maximum likelihood of evolutionary trees: hardness and approximation.
Bioinformatics 2005, 21(Suppl 1):i97-i106.

Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach.
J Mol Evol 1981, 17:368-376.

Felsenstein J: Inferring Phylogenies. Sunderland: Sinauer Associates; 2004.

Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers JS: Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods.
Syst Biol 2001, 50(4):525-539.

Holder M, Lewis PO: Phylogeny estimation: traditional and Bayesian approaches.
Nat Rev Genet 2003, 4(4):275-284.

Huelsenbeck JP, Larget B, Miller RE, Ronquist F: Potential applications and pitfalls of Bayesian inference of phylogeny.
Syst Biol 2002, 51(5):673-688.

Salter LA, Pearl DK: Stochastic search strategy for estimation of maximum likelihood phylogenetic trees.
Syst Biol 2001, 50(1):7-17.

Katoh K, Kuma K, Miyata T: Genetic algorithm-based maximum-likelihood analysis for molecular phylogeny.
J Mol Evol 2001, 53(4-5):477-484.

Lemmon AR, Milinkovitch MC: The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation.
Proc Natl Acad Sci USA 2002, 99(16):10516-10521.

Lewis PO: A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data.
Mol Biol Evol 1998, 15(3):277-283.

Matsuda H: Protein phylogenetic inference using maximum likelihood with a genetic algorithm. In Pacific Symposium on Biocomputing '96. London: World Scientific; 1996:512-523.

Zwickl DJ: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Austin, TX, USA: The University of Texas; 2006.

Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models.
Bioinformatics 2003, 19(12):1572-1574.

Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 2006, 22(21):2688-2690.

Suchard MA, Rambaut A: Many-core algorithms for statistical phylogenetics.
Bioinformatics 2009, 25(11):1370-1376.

Tavaré S: Some probabilistic and statistical problems in the analysis of DNA sequences.
Lectures on Mathematics in the Life Sciences 1986, 17:57-86.

Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods.
J Mol Evol 1994, 39(3):306-314.

Yang Z: Among-site rate variation and its impact on phylogenetic analyses.
Trends Ecol Evol 1996, 11(9):367-372.

Gu X, Fu YX, Li WH: Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites.
Mol Biol Evol 1995, 12(4):546-557.

Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.
Syst Biol 2003, 52(5):696-704.

Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees.
Bioinformatics 2005, 21(4):456-463.

Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees.
Mol Biol Evol 1987, 4(4):406-425.

Felsenstein J: Inferring Phylogenies. Sunderland: Sinauer Associates; 2004.

Zwickl DJ: Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Austin, TX, USA: The University of Texas; 2006.

Kirkpatrick S, Gelatt CD Jr, Vecchi MP: Optimization by simulated annealing.
Science 1983, 220(4598):671-680.

Lundy M: Applications of the annealing algorithm to combinatorial problems in statistics.
Biometrika 1985, 72(1):191-198.

Holland J: Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press; 1975.

Maddison DR, Swofford DL, Maddison WP: NEXUS: an extensible file format for systematic information.
Syst Biol 1997, 46(4):590-621.

Posada D, Crandall KA: Selecting the best-fit model of nucleotide substitution.
Syst Biol 2001, 50(4):580-601.

Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0.
Syst Biol 2010, 59(3):307-321.