An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines
1 Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
2 Department of Biology, Linfield College, McMinnville, OR 97128, USA
3 Section of Evolution and Ecology & Center for Population Biology, University of California, Davis, CA 95616, USA
4 Department of Biology, Virginia Commonwealth University, Richmond, VA 23284, USA
5 Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
6 Pacific Northwest Research Station, USDA Forest Service, Corvallis, OR 97331, USA
7 Department of Plant Sciences, University of California, Davis, CA 95616, USA
8 Department of Biology, Stanford University, Stanford, CA 94305, USA
BMC Evolutionary Biology 2014, 14:67 doi:10.1186/1471-2148-14-67
Published: 29 March 2014
As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.
Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference, each applied with three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ∼47 kilobases of sequence at 121 loci. Each “strategy” for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that use more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that are in general more balanced than those inferred by the Concatenation, RTC, SMRT, STAR, and STEAC methods and that contain clades not present in the trees estimated by those methods.
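The comparison of strategies by the clades their trees contain can be illustrated with a small sketch. The example below is not the analysis pipeline used in the study. It assumes toy rooted trees written as nested tuples, hypothetical taxon labels (A to D), and hypothetical strategy names. In place of principal components analysis or hierarchical clustering, it uses a simpler stand-in: the number of clades found in one tree but not the other, as a pairwise distance between strategies, and the fraction of strategies recovering each clade, as a measure of robustness.

```python
from itertools import combinations


def clades_of(tree):
    """Return the non-trivial clades (as frozensets of tip labels) of a
    rooted tree written as nested tuples, e.g. ((("A", "B"), "C"), "D")."""
    found = set()

    def tips_below(node):
        if isinstance(node, str):          # a tip: no clade to record
            return frozenset([node])
        tips = frozenset().union(*(tips_below(child) for child in node))
        found.add(tips)                    # every internal node defines a clade
        return tips

    all_tips = tips_below(tree)
    found.discard(all_tips)                # the root clade is in every tree
    return found


def pairwise_distances(trees):
    """Count clades present in exactly one tree for each pair of strategies."""
    clades = {name: clades_of(tree) for name, tree in trees.items()}
    return {
        (a, b): len(clades[a] ^ clades[b])
        for a, b in combinations(clades, 2)
    }


def robust_clades(trees, min_fraction):
    """Return clades recovered by at least min_fraction of the strategies."""
    counts = {}
    for tree in trees.values():
        for clade in clades_of(tree):
            counts[clade] = counts.get(clade, 0) + 1
    return {c for c, n in counts.items() if n / len(trees) >= min_fraction}


# Hypothetical species tree estimates from three made-up strategies.
trees = {
    "strategy_1": ((("A", "B"), "C"), "D"),
    "strategy_2": (("A", "B"), ("C", "D")),
    "strategy_3": (("A", "C"), ("B", "D")),
}

distances = pairwise_distances(trees)
shared = robust_clades(trees, min_fraction=0.6)
```

In this toy input, the first two strategies differ by two clades, each differs from the third by four, and only the clade {A, B} is recovered by a majority of strategies. A distance table of this kind could serve as input to a clustering routine, although the choice of distance and clustering method here is an assumption and not taken from the study.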
For researchers constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences among species tree estimates obtained via different approaches that share a two-stage structure: one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.