Methodology article (Open Access, Highly Accessed)

Selecting optimal partitioning schemes for phylogenomic datasets

Robert Lanfear1,2*, Brett Calcott3, David Kainer1, Christoph Mayer4 and Alexandros Stamatakis5,6

Author Affiliations

1 Ecology Evolution and Genetics, Research School of Biology, Australian National University, Canberra, ACT, Australia

2 National Evolutionary Synthesis Center, Durham, NC, USA

3 Philosophy Program, Research School of Social Sciences, Australian National University, Canberra, ACT, Australia

4 Zoologisches Forschungsmuseum Alexander Koenig, Bonn, Germany

5 The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

6 Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach 6980, 76128 Karlsruhe, Germany

BMC Evolutionary Biology 2014, 14:82  doi:10.1186/1471-2148-14-82

Published: 17 April 2014

Abstract

Background

Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of fewer than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics.

Methods

We develop two novel methods for estimating best-fit partitioning schemes on large phylogenomic datasets: strict and relaxed hierarchical clustering. These methods use information from the underlying data to cluster together similar subsets of sites in an alignment, and build on clustering approaches that have been proposed elsewhere.
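The general idea of clustering similar subsets can be illustrated with a small sketch. This is not the PartitionFinder implementation. It assumes, purely for illustration, that each subset is summarized by a vector of parameter estimates, that similarity is plain Euclidean distance, and that a merged subset can be approximated by a site-weighted average of its parts (a real analysis would re-estimate the model on the merged subset). The function names `closest_pair` and `greedy_merge` are hypothetical.

```python
import math
from itertools import combinations


def closest_pair(features):
    """Return the two subset keys whose feature vectors are closest (Euclidean)."""
    return min(
        combinations(sorted(features), 2),
        key=lambda pair: math.dist(features[pair[0]], features[pair[1]]),
    )


def greedy_merge(features, sizes):
    """Merge the closest pair of subsets repeatedly and record every scheme.

    features: dict mapping a tuple of subset names to a parameter vector
              (assumed summary of the subset; hypothetical input format).
    sizes:    dict mapping the same keys to the number of sites.
    Returns a list of schemes, from fully partitioned to fully merged.
    """
    features, sizes = dict(features), dict(sizes)
    schemes = [sorted(features)]
    while len(features) > 1:
        a, b = closest_pair(features)
        total = sizes[a] + sizes[b]
        # Assumption: approximate the merged subset by a site-weighted mean
        # instead of re-estimating its model parameters.
        merged_vec = [
            (x * sizes[a] + y * sizes[b]) / total
            for x, y in zip(features[a], features[b])
        ]
        for key in (a, b):
            del features[key]
            del sizes[key]
        features[a + b] = merged_vec
        sizes[a + b] = total
        schemes.append(sorted(features))
    return schemes
```

Each scheme in the returned list could then be scored with an information criterion, and the best-scoring scheme retained.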

Results

We compare the performance of our methods to each other, and to existing methods for selecting partitioning schemes. We demonstrate that while strict hierarchical clustering has the best computational efficiency on very large datasets, relaxed hierarchical clustering provides scalable efficiency and returns dramatically better partitioning schemes as assessed by common criteria such as AICc and BIC scores.
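For readers unfamiliar with the criteria named above, the following sketch computes AIC, AICc and BIC from their standard textbook definitions. The inputs are assumptions for illustration: `log_likelihood` is the maximized log-likelihood of a partitioning scheme, `k` the number of free parameters, and `n` the sample size (often taken as the number of alignment sites). Lower scores indicate a preferred scheme.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n - k - 1 <= 0")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 lnL."""
    return k * math.log(n) - 2 * log_likelihood
```

Because the BIC penalty grows with ln(n), it tends to favor schemes with fewer parameters than AICc does on large alignments.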

Conclusions

These two methods provide the best current approaches to inferring partitioning schemes for very large datasets. We provide free open-source implementations of the methods in the PartitionFinder software. We hope that the use of these methods will help to improve the inferences made from large phylogenomic datasets.

Keywords:
Model selection; Partitioning; PartitionFinder; BIC; AICc; AIC; Phylogenetics; Phylogenomics; Clustering; Hierarchical clustering