Abstract
Background
As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size.
Results
Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ∼47 kilobases of sequence at 121 loci. Each “strategy” for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional nontopological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimizedeepcoalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies.
Conclusions
When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a twostage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.
Background
In phylogenetic studies, it has become increasingly common to sequence large numbers of individuals at many loci (e.g., [14]). While these multilocus datasets provide the potential to improve the accuracy of phylogeny inferences over large sets of taxa, for a variety of reasons, topologies of trees inferred at different loci might not match [5]. One source of this gene tree discordance is incomplete lineage sorting, the phenomenon in which sets of sampled lineages fail to coalesce in the population in which they are first capable of coalescing [6]. With incomplete lineage sorting, several species tree inference methods—including Concatenation [7], Democratic Vote Consensus [8], Greedy Consensus [9], MajorityRule Consensus [9], Matrix Representation with Parsimony [10], and Minimize Deep Coalescences [11]—can be misled by discordance of gene trees across loci.
Numerous approaches have been used for estimating species tree topologies from multilocus sequence data. Consensus methods construct species tree topologies from gene trees according to deterministic rules applied to the input set of trees [12,13]. These methods take as input a set of gene trees produced from individual loci, allowing for separate input evolutionary histories at each locus. Because genetic lineages in different species sometimes have relatively few sequence differences, however, the information in a locus can be insufficient to accurately infer gene trees, thereby allowing incorrect gene trees to adversely influence the constructed species tree (e.g., [14,15]). Concatenation methods concatenate a set of multiple alignments and construct a tree from the concatenated alignment, treating the estimated “supergene” tree as a species tree estimate [6,13,16]. Because concatenation combines all loci to form a single locus, and because different loci can have different evolutionary histories that are disregarded in the analysis of the concatenated alignment, the analysis of loci in this way can lead to incorrect species tree inferences [7]. Consensus and concatenation have in common that they are “twostage” methods, in which species trees are inferred in two steps—inference of gene trees and then species tree inference for consensus, and inference of the supergene tree followed by a conceptually substantial though methodologically trivial pronouncement that this tree is the species tree estimate for concatenation [6,12,13]. A third class of approaches can be labeled “singlestage” methods, in which species trees are inferred by simultaneously modeling the evolution of sequences among all sampled loci to output a species tree estimate [1720]. These singlestage modelbased methods often have desirable statistical properties, but because they typically explore large spaces of possibilities rather than algorithmically constructing estimated trees, they can be computationally intensive and applicable only to smaller datasets.
Properties of species tree inference methods can be examined using a variety of frameworks, including theory, simulations, and empirical assessments. Theoretical investigations are often concerned with limiting properties as the number of loci approaches infinity [911,15,2127]. Indepth explorations of inference methods often rely on simulation studies, which are commonly used to investigate the performance of species tree inference methods on simulated multilocus datasets [10,2832]. These theoretical and simulationbased studies have the advantage of knowing the true species tree, but the disadvantage that the scenarios they examine lack the complexity of empirical data.
An alternative approach is to evaluate methods on an empirical dataset in which the space of parameter values is defined by the evolutionary history of a group of species. Recent studies have empirically investigated the performance of species tree methods from multilocus datasets in a variety of organisms, including birds [3,3336], insects [37,38], newts [39], plants [40], primates [41,42], rice [1], rodents [43], snakes [44], and yeast [4,16,36,45,46]. While some of these studies constructed highlysupported species trees, others did not, possibly due to high levels of genealogical discordance resulting from incomplete lineage sorting, hybridization, and ancient rapid radiations.
In one such study, [47] found that samples from a multilocus dataset of North American pines displayed widespread genealogical discordance. This pattern of incomplete lineage sorting is a common feature of longlived shrubs and trees (e.g., [4850]), and likely arises from factors such as large effective population sizes and long generation times [51]. Because gene tree discordance is needed for different algorithms to produce different estimates, high levels of gene tree discordance make North American pines an interesting group in which to compare species tree inference methods.
In this study, we take an empirical approach to the evaluation of species tree inference methods by examining the performance of 72 strategies for inferring species tree topologies using multilocus data from North American pines. Each “phylogenetic inference strategy” consists of three components: a method of constructing species trees from gene trees (e.g., consensus or concatenation), a gene tree inference method (e.g., maximum likelihood, maximum parsimony, or neighborjoining), and an outgroup species. Our framework thus focuses on twostage inference strategies that can be separated into gene tree inference and species tree inference steps, so that the effect of the choices of gene tree and species tree estimators can be directly evaluated. We examine ∼47 kilobases (kb) of sequence spanning 121 nuclear loci sequenced in 120 individuals from eight ingroup species of Pinus subsection Strobus (Table 1) and three outgroup species of Pinus subsection Gerardianae. With our empirical approach, unlike in simulationbased and theoretical evaluations, the true species tree is not known. It is still possible, however, to evaluate properties of species tree estimators without knowledge of a true tree, by comparing the features of species trees inferred by different methods. We apply techniques from multivariate statistical analysis to sets of inferred species trees to compare characteristics of species trees estimated by different strategies and to identify groups of strategies that behave similarly.
Table 1. Ranges and morphological characteristics differentiating eight North American species of Pinus subsection Strobus
Methods
North American white pine dataset
A total of 120 individuals were sequenced in eight ingroup species of North American white pines from Pinus subsection Strobus (Pinus albicaulis Engelmann, P. ayacahuite Ehrenberg ex Schlechtendal, P. chiapensis (Martínez) Andresen, P. flexilis James, P. lambertiana Douglas, P. monticola Douglas ex D. Don, P. strobiformis Engelmann, and P. strobus L.) and three outgroup species from Pinus subsection Gerardianae (P. bungeana Zuccarini ex Endlicher, P. gerardiana Wallich ex D. Don, P. squamata X. W. Li), the identified sister lineage to Pinus subsection Strobus[47,54]. Sequencing was conducted on haploid templates generated from DNA extractions of seed megagametophyte tissue; as a single haploid sequence was generated for each individual at each locus, no phasing was necessary. Gene sequences were obtained from 245 putative nuclear loci chosen from among ∼7,500 loci recently resequenced for loblolly pine (Pinus taeda L., http://loblolly.ucdavis.edu/bipod/ftp/ webcite) using single pass, bidirectional Sanger sequencing of PCR products amplified from haploid megagametophyte tissue excised from seeds of each species. Further description of laboratory protocols appears in [55]. Sequence data were preprocessed and organized using PINESAP[56], a bioinformatics pipeline that combines PHRED[57], PHRAP[58], and MUSCLE[59,60] to call bases and align sequencing reads. Reported nucleotide sequences consisted only of A, C, G, T, missing, and gap information, with no other ambiguity codes used. After preprocessing, the data were manually assembled and aligned using CODONCODE (CodonCode Corporation, Dedham, MA). Bases were called using a minimum PHRED score [57,61] of 25 for aligned bases. All polymorphisms were visually validated. All alignments were further aligned to resequencing data from P. taeda (unpublished data) using the profileprofile option in MUSCLE[59,60]. These alignments are publicly available as part of the Dendrome project (http://loblolly.ucdavis.edu/bipod/ftp/ webcite). GenBank accession numbers for sequences in the study appear in Additional file 1: Table S1.
Additional file 1. GenBank accession numbers for sampled sequences.
Format: XLSX Size: 411KB Download file
Of 245 loci sequenced initially, 37 were dropped from further consideration due to low overall quality of sequence reads. An additional 15 loci were discarded due to possible chloroplast or mitochondrial contamination, on the basis of BLAST analysis against pine organellar sequences in GENBANK[62]. Two loci were dropped due to sequence similarity to retroelementlike proteins, leaving 191 highquality nuclear gene alignments. We then eliminated 70 loci for which at least one of the 11 species contained no data. This filter reduced the dataset to 121 loci, covering ∼47 kb of aligned sequence data.
Coding regions (i.e., site annotations) could be confidently identified for 112 of the 121 loci by further analysis using TBLASTX against proteincoding genes in Arabidopsis, Oryza, Picea, and Populus. For these 112 loci, the gene for the highestscoring TBLASTX hit, in combination with the expressed sequence tag from loblolly pine, was used to identify coding regions. Site annotations for each alignment were validated with BLASTP analysis of the amino acid sequences derived from the inferred coding intervals against the gene that was used to derive the site annotations. For the data from the 112 annotated loci, ∼62% represents exonic regions, ∼18% represents intronic regions, ∼1% is from 5’ UTRs, and ∼19% is from 3’ UTRs. Because 112 of the loci could be confidently identified as belonging to coding regions, with a substantial fraction of exonic sequence, the data likely contain a mixture of nonneutral and neutral regions.
Overview of the analysis
The procedure for obtaining results for each of the 72 phylogenetic inference strategies (listed in Additional file 2: Table S1) appears in Figure 1. For a given strategy, we started from a dataset D with L loci. To generate distributions on the set of clades inferred by a given strategy, we used the bootstrap, creating bootstrap replicates by randomly choosing with replacement B sets of L loci. As many of the loci are coding and the eight pine species are closely related, we chose not to bootstrap across sites within a locus to ensure that bootstrapped alignments would contain reasonable levels of variation. Next, we applied a gene tree inference method to each bootstrap replicate dataset. Based on the set of inferred gene trees in a bootstrap replicate, we then applied a species tree construction method to estimate a species tree topology with one of the three outgroup species. For each phylogenetic inference strategy, we constructed B=1000 independent bootstrap datasets, thereby estimating 1000 species tree topologies. From these topologies, we created a list of clades, each with a corresponding count of its number of appearances in the 1000 bootstrap replicates. Clade lists were then analyzed to assess differences among the estimates produced by different strategies.
Additional file 2. Supplementary information on phylogenetic inference strategies, sample collection locations, and Procrustes analysis.
Format: PDF Size: 263KB Download file
This file can be viewed with: Adobe Acrobat Reader
Figure 1. Flow diagram representing the procedure in which we obtained results on the behavior of phylogenetic inference strategies. A boldface number attached to a downward arrow indicates the number of phylogenetic inference strategies generated by the box immediately above the arrow. Absence of a number indicates a value of 1. The number of strategies for a particular path from the topmost box to the bottom layer is calculated as the product of the boldface numbers visited during the traversal of the path. The number of phylogenetic inference strategies analyzed is 72, the sum over all paths from the topmost box to boxes in the bottom layer.
Creating datasets
Our final set of 121 loci contains many loci that are highly conserved across multiple species. Because of the high level of conservation, for these loci, little information exists for identifying relationships among lineages. Thus, if methods for inferring gene trees were applied to certain loci, the resulting gene trees would be highly unresolved and would therefore provide little information to species tree construction methods. This issue motivates the construction of datasets that attempt to reduce the chance of inferring highly unresolved trees, and that provide phylogenetic inference strategies with the maximal amount of sequence data available.
We therefore analyzed four carefully selected subsets of the initial dataset (Table 2; Additional file 3). Two of these are datasets of multiple alignments that contain information on a single individual per species ( and ). The other two contain information on multiple individuals per species ( and ). These four datasets are constructed such that each possesses desirable properties for certain strategies in the collection of 72 phylogenetic inference strategies, providing the strategies with as much information as possible to infer resolved phylogenies. For example, because it is desirable for a pair of species to have nonzero distance, we require pairs of distinct species to be separated by at least one observed mutation. Furthermore, because it is desirable to minimize missing data, we choose individuals that yield minimal missing data in a multiple alignment. One of the two datasets with a single individual sampled per species is optimized for locusbylocus gene tree inference (), whereas the other is optimized for gene tree inference from a concatenated alignment (). Similarly, one of the two datasets with multiple individuals sampled per species is optimized for locusbylocus gene tree inference (), whereas the other is optimized for gene tree inference using multiple loci simultaneously (). The procedures used for constructing these datasets appear in Sections on “Datasets with one individual per species” and “Datasets with multiple individuals per species”.
Table 2. Datasets
Additional file 3. Zip archive containing data files,,,, and a description of these files.
Format: ZIP Size: 375KB Download file
Let S_{k}, k=1,2,…,11, denote the set of individuals from pine species k, considering eight ingroup species (S_{1},S_{2},…,S_{8}) and three outgroup species (S_{9},S_{10},S_{11}). Denote the amount of overlapping nongap nonmissing sequence between a pair of individuals x and y by n_{xy} and denote the number of nongap nonmissing nucleotide differences between a pair of individuals x and y by d_{xy}(0≤d_{xy}≤n_{xy}). Further, denote the final dataset of L=121 loci by , where is the set of aligned sequences at locus ℓ for individuals from all 11 species. It is from dataset that we create the four optimized datasets as summarized in Table 2 and Figure 2.
Figure 2. Schematic for creating the four subsets,,, and from dataset. For the matrices of datasets , , , , and (see Table 2), each row is an individual and each column is a locus. Thick black lines in these matrices separate the individuals in different species. Gray boxes indicate missing sequences. (A) At each locus, a single sequence from each species (indicated in red) is selected from dataset . These selected sequences are used to create such that there exists a single sequence sampled per species at each locus. Sequences from a subset of loci in (indicated in yellow) are used to create dataset such that each locus has at least one nucleotide difference between each distinct pair of species other than pairs from distinct outgroups. (B) Dataset is the full starting dataset . At each locus ℓ, a distance matrix is created according to eq. 2. Sequences from a subset of loci (indicated in red) in are used to create dataset such that each locus has a nonzero pdistance between each distinct pair of species other than pairs from distinct outgroups. Observe that the matrix includes loci 3 and 7, which are not included in the matrix. Loci 3 and 7 are included in but not in because in , pairs of species contain at least one pair of individuals with different sequences, whereas in , at least one pair of the 11 selected individuals have identical sequences. Therefore, the set of loci in is a superset of the set of loci in , and the number of loci in is always greater than or equal to the number of loci in .
Datasets with one individual per species
The first dataset, , consists of alignments with a single individual sampled per species at each locus (not necessarily the same individual across loci). That is, we generate a dataset of multiple alignments at each of L loci with only one individual per species, thereby creating multiple alignments of 11 individuals. This dataset is used by phylogenetic inference strategies that utilize the concatenationbased species tree construction methods with the maximum likelihood, maximum parsimony, and neighborjoining gene tree inference methods (see “Inferring gene trees” and “Inferring species trees”). To create , we choose the subset of 11 sequences at locus ℓ by first maximizing the total overlap sequence and then, if there is a tie for the overlap , maximizing the total number of substitutions . In other words, for any other set of aligned sequences at locus ℓ with a set R of only one individual sampled per species, the amount of overlapping nongap nonmissing sequence in is no larger than in , i.e., . We note that the quantity n_{xy} represents a calculation only on a pair of individuals x and y, whereas considers all pairs of individuals. Further, for any other set of aligned sequences at locus ℓ with a set R of only one individual sampled per species and , the total number of pairwise sequence differences in is no larger than in , i.e., . If multiple sets R of 11 individuals share the same values of n and d, then we choose the set of 11 individuals randomly among the tied sets. We choose the “optimal” set of 11 individuals at each locus in this way both to maximize the sequence contributions of individual loci to the inference of gene trees (maximizing n) and to maximize the potential for creating resolved gene trees (maximizing d).
The second dataset, , is a subset of with L_{s,0}≤L loci that consists of only those loci in for which there exists at least one nucleotide difference between each distinct pair of species (other than pairs of outgroup species). In other words, for any pair of individuals x and y with and x≠y, d_{xy}≥0, and d_{xy}>0 if x, y, or both are from species 1 through 8. This condition of at least one nucleotide difference between species pairs assists in constructing gene trees that are bifurcating. Dataset is used by phylogenetic inference strategies that utilize consensus methods with maximum likelihood, maximum parsimony, and neighborjoining (see “Inferring gene trees” and “Inferring species trees” for details).
Datasets with multiple individuals per species
The third dataset, , is identical to our starting dataset . Thus, strategies that use consider all available sequences. Dataset is used by phylogenetic inference strategies that employ the concatenationbased species tree construction methods with the neighborjoining gene tree inference method using multiple individuals (see “Inferring gene trees” and “Inferring species trees”).
Consider a dataset D of L loci sampled randomly with replacement from . Define
where the indicator random variable equals 1 if x∈S_{i} and y∈S_{j} and 0 otherwise. The distance matrix defined by eq. 1 is used to estimate gene trees for all strategies applied to . Given distinct species S_{i} and S_{j}, represents the pdistance (fraction of nucleotide differences; [63]) averaged over pairs of individuals, one from species i and the other from species j. Note that eq. 1 represents a weighted rather than unweighted average for the mean pdistance between species i and j. Although the distance is weighted, it is the same as a distance between pairs of species calculated on a concatenated alignment.
The fourth dataset, , is a subset of with L_{p,0}≤L loci. This subset consists of only those loci in for which there exists a pair of individuals in each distinct pair of species (other than pairs from distinct outgroups) with at least one nucleotide difference between them. Define
where is an indicator random variable that equals 1 if x∈S_{i} and y∈S_{j} and 0 otherwise. The numerator of represents the number of pairwise sequence differences, summed over pairs of individuals, one from species S_{i} and the other from species S_{j}, at locus ℓ. The denominator represents the sum across pairs of individuals, one from S_{i} and the other from S_{j}, of the nongap nonmissing sequence shared between pairs of individuals at locus ℓ. To construct , we create a subset of that consists only of those loci in for which the pdistance () is nonzero between each distinct pair of species (excluding pairs from distinct outgroups). This dataset is utilized by phylogenetic inference strategies that employ consensus methods with gene trees inferred by neighborjoining using multiple individuals (see “Inferring gene trees” and “Inferring species trees”). Similarly to dataset , this condition of a nonzero pdistance between species pairs assists in constructing bifurcating gene trees. We note that the species tree estimation approach taken in this study neither requires pairs of individuals in the same species to have nonzero distances nor to have distances of zero. We only enforce that the distance calculated between pairs of species is nonzero.
Inferring gene trees
For each of the four datasets , , , and , we inferred gene trees from bootstrap samples [6365] that contain loci randomly sampled with replacement from the dataset. For strategies applied to datasets and , we inferred gene trees from sequence alignments by applying either maximum likelihood (ML; [63], ch. 9) under a general timereversible substitution model ([63], ch. 13), maximum parsimony (MP; [63], ch. 1), or neighborjoining (NJ; [63], ch. 11) to a pdistance matrix calculated between pairs of alignments. For strategies applied to and , we inferred gene trees by applying neighborjoining to the P^{all} and P^{ℓ}pdistance matrices, respectively. We term the method for inferring gene trees from the P^{all} and P^{ℓ}pdistance matrices “neighborjoining using multiple individuals” (M). Gene trees were inferred using PAUP^{∗}[66]. Note that the estimation of gene trees on the scale explored in this study would be computationally intensive on the full set of sampled individuals; thus, we do not consider gene tree inference directly from alignments with multiple lineages sampled within species, and when exploring multiple lineages (as in M), we do so only with distance matrices between pairs of species rather than pairs of lineages.
Inferring species trees
We view as a species tree inference method any method that outputs a species tree estimate. The six species tree inference methods in this study are Concatenation [16,67], SuperMatrix Rooted Triple (SMRT; [22]), STEAC [21], STAR [21], Rooted Triple Consensus (RTC; [68]), and Minimize Deep Coalescences (MDC; [69,70]). Concatenation and SMRT are concatenationbased, and STEAC, STAR, RTC, and MDC are consensus methods. Because we have adopted a unified twostage framework for phylogenetic inference strategies in which gene trees are first inferred by one approach and species trees are then inferred from gene trees by a second approach, we did not investigate singlestage approaches such as BEST [18,19], and *BEAST [20] that bypass gene tree inference or that perform gene tree inference simultaneously with species tree inference. Our analysis pipeline explores the performance of twostage inference strategies when the roles of gene tree and species tree inference are separated, and it therefore requires that strategies estimate species trees from inferred gene trees and that they permit different gene tree inference methods to provide input to a given species tree method. The six species tree methods investigated in this article satisfy both of these conditions, whereas species tree methods such as BUCKy [17], BEST [18,19], and *BEAST [20] do not. Further, the methods we have selected are wellsuited to a computationally intensive bootstrap approach included in our pipeline for generating distributions of species tree topologies, and the more computationally intensive of the singlestage methods would not be easily accommodated within this framework. Given the large number of twostage methods available, it would not be possible to be comprehensive; we have thus chosen a limited number of methods that represent a range of underlying principles. Our choice of methods permits a diverse set of criteria for estimating species trees to be evaluated, and the conceptual differences in the underlying methods enable some differentiation in behavior across methods.
Consider a set of L loci (multiple alignments) with m ingroup and one outgroup species. Concatenation methods concatenate the L alignments to create a single “super locus” consisting of an alignment of the m+1 species across L loci. From this alignment, a gene tree is inferred by either maximum likelihood, maximum parsimony, or neighborjoining—note that the definition of Concatenation does not require that gene trees be estimated using any specific method—and is then taken as the species tree estimate. Similarly, SMRT creates a concatenated alignment of the m+1 species from a set of L alignments. However, SMRT then constructs from this concatenated alignment all concatenated alignments of three ingroup species and an outgroup species. Rooted threetaxon gene trees are then inferred from each of the concatenated alignments. A supertree algorithm is then applied to the set of rooted threetaxon gene trees to estimate an mtaxon species tree topology. This study uses the Modified Mincut supertree algorithm implemented in the program SUPERTREE[71] to construct a species tree from rooted threetaxon gene trees.
Consider a set of (m+1)taxon gene trees (m ingroup and one outgroup species) inferred at each of L loci. STEAC estimates a species tree topology by using estimated mean coalescence times. For distinct species S_{i} and S_{j}, the mean coalescence time is computed as the estimated coalescence time for S_{i} and S_{j}, averaged over all L gene trees. This resulting mean is placed into a matrix of distances between species, to which neighborjoining is applied to estimate the species tree topology (using the R PHYBASE package).
STAR estimates a species tree topology by using average coalescence ranks. It assumes that the rank of the root of a gene tree is equal to the number of species in the tree (m + 1 in our case). An internal node of a gene tree is then assigned one less than the rank of its immediate ancestor. For distinct species S_{i} and S_{j}, the average coalescence rank is computed as the rank of the node that connects S_{i} and S_{j}, averaged over all L gene trees. Similarly to STEAC, these average coalescence ranks specify a matrix of distances between species pairs. Neighborjoining is applied to the matrix to estimate the species tree topology using PHYBASE.
RTC estimates a species tree from rooted threetaxon tree topologies. At each locus ℓ, ℓ=1,2,…,L, RTC finds the set of rooted tree topologies of three ingroup and one outgroup species that are displayed by the inferred gene tree at locus ℓ. RTC then applies quartet puzzling [72] to the topologies to estimate the species tree topology (using the program TRIPLEC).
A coalescence event between a pair of lineages is considered “deep” if the coalescence does not occur in the first population in which the pair of lineages is capable of coalescing. Given a gene tree, the number of deep coalescences on a species tree is defined as the total number of “extra lineages”, summed across branches of the species tree topology, that is needed to fit the gene tree within the species tree topology. Here, the number of extra lineages for a branch is one fewer than the number of lineages that survive to the ancestral node of the branch; if incomplete lineage sorting does not occur, then only one lineage persists from a branch to a more ancestral branch, and there are no extra lineages. For a set of L gene trees, the number of deep coalescences for a species tree is the total number of deep coalescences for the species tree given a gene tree, summed across the L gene trees. MDC estimates a species tree topology by minimizing the number of deep coalescences. That is, MDC finds a species tree topology for which the number of deep coalescences that will fit the set of L gene trees within the species tree topology is minimal. This study utilizes the MDC implementation in PHYLONET[70].
Multivariate analysis
We aim to determine which of the 72 phylogenetic inference strategies perform similarly, and we use multivariate analyses to define clusters of strategies that provide similar species tree estimates. Consider a 72×145dimensional data matrix S in which rows represent strategies and columns represent 145 observed clades, among the possible nontrivial clades (i.e., clades that contain more than one species and fewer than all analyzed species) of eight species. Entry S_{ij} in column i and row j of S is the number of times that strategy i infers clade j in 1000 bootstrap replicates across loci.
Principal components analysis (PCA) was applied to S to create a 72×2dimensional matrix V, with rows representing strategies and the first and second columns representing the first and second principal components, respectively. Plotting strategies onto the space defined by these principal components yields a twodimensional spatial “map” of phylogenetic inference strategies. We similarly applied multidimensional scaling (MDS) to a distance matrix for all pairs of strategies, computing pairwise distance as the mean RobinsonFoulds distance [73] across all 10^{6} pairs of bootstrap trees, and extracting the first two components. We calculated the RobinsonFoulds distance using TREEDIST in PHYLIP.
To compare spatial maps of phylogenetic inference strategies, we used Procrustes analysis [7476]. In particular, we compared the spatial distribution of a subset of 72r strategies when analyzed alone to the spatial distribution for all strategies. The comparison enabled us to quantify the influence that a set of r strategies with a particular feature (i.e., species tree construction method, gene tree inference method, or outgroup species) has on the full spatial distribution. Consider a proper subset Σ={σ_{1},σ_{2},…,σ_{72r}} of the full set of strategies. Consider a (72r)×145dimensional data matrix S_{Σ} in which rows represent the strategies in set Σ and columns represent observed clades. S_{Σ} is a submatrix of S, in which the rows corresponding to strategies in Σ are selected from S. Consider a (72r)×2 target matrix X and a (72r)×2 comparison matrix Y. X is matrix V restricted to the set of strategies Σ. Y represents the first two principal components in the PCA applied to matrix S_{Σ}. Now consider a (72r)×2 matrix Z=bYT+C, where b is a scaling factor, T is a 2×2 matrix that rotates and reflects Y, and C is a (72r)×2 matrix that has constant columns and that is used to translate the matrix. Procrustes analysis seeks to find b, T, and C to minimize the sum of squared differences between X and some (72r)×2 matrix Z^{⋆}=bYT+C. That is, Z^{⋆} is formally defined as . The dissimilarity measure between X and Z^{⋆} is computed as
where is the jth entry of the centroid of X. This measure takes the sum of squared differences between points on the spatial maps defined by X and Z^{⋆} and normalizes it by the sum of squared differences between the points on the spatial map defined by X and their centroid.
Define a cluster as a set of strategies and let the centroid of a cluster be the location in the 145dimensional space of clades whose coordinates are the means of those of all strategies in the cluster. Hierarchical clustering was performed by first creating a matrix of Euclidean distances between all pairs of 145dimensional vectors represented by the matrix S. Define the withincluster sum of squared Euclidean distance as the squared Euclidean distance between a point in a cluster and the cluster centroid, summed over all points in the cluster. From the 72×72dimensional matrix of Euclidean distances between strategies, a dendrogram relating the strategies was constructed using the Ward algorithm [77], which iteratively merges clusters until all points are contained within a single cluster. For a given iteration, two clusters are merged if their merged cluster has a smaller withincluster sum of squared Euclidean distances than any other potential merged cluster. The nesting of clusters created by the algorithm defines the dendrogram.
We performed Kmeans clustering on the 72 145dimensional vectors, using K clusters, K=2,3,…,9. Given K, strategies were separated into K clusters on the basis of the squared Euclidean distance between all pairs of the strategies in a 145dimensional space. We ran 10^{4} replicates with random starting locations. Each replicate yielded a total withincluster sum of squared distances for the set of K clusters, representing the withincluster sum of squared distances between points in a cluster and the cluster centroid, summed over all K clusters. We then chose the set of cluster assignments that had the minimum total withincluster sum of squared distances, where the minimum was taken over all 10^{4} replicate starting locations.
To compute the Pearson correlation coefficient between a pair of strategies, we only used points in the 145dimensional vector that were nonzero in both strategies being compared.
Results
We accounted for the variable outcomes of individual phylogenetic inference strategies by applying the strategies to bootstrap datasets instead of their respective full datasets. Our analysis identified 145 distinct clades observed in the set of 72 phylogenetic inference strategies, among 246 possible nontrivial clades on eight species, across 1000 bootstrap replicates for each strategy. From these clades, we created a 72×145 matrix S in which each row is a strategy and each column is a clade. The value of S_{ij}, the cell in row i and column j, is the number of times among the 1000 bootstrap replicates that strategy i inferred a species tree with clade j. This summarized dataset S of clade counts was used for all further analyses.
Clade size
We first investigated the level of balance [7881] in the tree topologies inferred by each phylogenetic inference strategy. The distribution of clade sizes (number of taxa within a clade) provides a basis for measuring tree topological balance. Topologies with numerous small clades tend to be more balanced than topologies with large clades. For example, consider the topologies T_{bal}=(((AB)(CD))((EF)(GH))) and T_{unbal}=(((((((AB)C)D)E)F)G)H). Topology T_{bal} is the most balanced eighttaxon topology whereas T_{unbal} is the most unbalanced eighttaxon topology. Considering nontrivial clades, T_{bal} has four clades of size two and two clades of size four. T_{unbal} has one clade each of size two, three, four, five, six, and seven. Thus, the clades of T_{bal} are smaller than those of T_{unbal}. The mean clade size for T_{bal} is ∼2.67 and the mean clade size for T_{unbal} is 4.5.
Figure 3A displays the cumulative distribution of clade sizes for each of the 72 phylogenetic inference strategies, considering all 1000 bootstrap replicate species trees for each strategy. This cumulative distribution increases most quickly for strategies based on MDC, for which most of the distribution is located in clades of size two. By contrast, it increases most slowly for strategies based on SMRT and STEAC, for which much of the distribution is located in clades of size six and seven. Figure 3B displays a bar graph of the mean clade size for each of the 72 phylogenetic inference strategies. This graph shows that among all six species tree construction methods, the 12 MDC strategies have the smallest mean clade size as well as the smallest variance in mean clade size across the 12 combinations of outgroup and gene tree inference method. In contrast, SMRT and STEAC in general have the largest mean clade size. However, all 12 SMRT strategies infer trees with large mean clade size, whereas the mean clade size of STEAC varies across the 12 combinations of outgroup and gene tree inference method. Interestingly, the mean clade size averaged over all 12 strategies based on MDC is ∼2.79, a value that is close to the mean clade size for T_{bal} of ∼2.67.
Figure 3. Distribution of clade size for phylogenetic inference strategies.(A) Cumulative distribution of clade sizes. Each bar represents a strategy, of which there are 12 per color. (B) Mean clade size for phylogenetic inference strategies, calculated as the mean size of all clades inferred across 1000 bootstrap replicates. Vertical lines centered at the top of vertical bars represent standard errors of mean clade sizes.
Clustering of strategies
We next used PCA, MDS, hierarchical clustering, Kmeans clustering, and correlation analysis on the matrix of clades S to identify phylogenetic inference strategies that perform similarly. Figure 4 displays plots of the first two principal components, which account for 38.94% and 18.96% of the variation across strategies, respectively. Figure 4A shows that separate clusters are formed by strategies that are based on Concatenation, SMRT, and STEAC, and that strategies based on STAR, RTC, and MDC form a cluster together. Further, a larger cluster is formed by strategies that are based on Concatenation, SMRT, and STEAC, and another larger cluster is formed by strategies that are based on STAR, RTC, and MDC. These larger clusters have a simple interpretation in that one of the larger clusters contains topologicallybased strategies (STAR, RTC, and MDC) and the other contains strategies that are not strictly topologicallybased (Concatenation, SMRT, and STEAC). Strategies are classified as topologicallybased if they only use information on tree topologies to construct a species tree. In contrast, strategies are classified as not strictly topologicallybased if they use information other than the gene tree topologies, such as sequence or branch length information, to construct a species tree. Relabeling the points in Figure 4A according to gene tree inference method, Figure 4B shows that strategies that are based on M (i.e., multiple individuals) form a cluster, and that there are no separate clusters for strategies that are based on ML, MP, or NJ. Figure 4C, which labels points according to outgroup, shows that no strategies separate into clusters based on the choice of outgroup. When we apply MDS to RobinsonFoulds distances between the sets of bootstrap replicate trees produced by pairs of strategies (Figure 4), we obtain similar observations of the clusters of strategies, detecting an important role for M and for the difference between topologicallybased and nontopologicallybased strategies, and no strong signal for the outgroup choice.
Figure 4. Principal components analysis (PCA) and multidimensional scaling (MDS) of phylogenetic inference strategies. PCA was applied to 72 phylogenetic inference strategies, in which each strategy is viewed as a point in a 145dimensional space of clades. MDS was applied to a distance matrix between all pairs of strategies, where the distance between a pair of strategies was the mean RobinsonFoulds distance between all pairs of bootstrap trees from one strategy and another strategy (i.e., mean of 10^{6} comparisons). The plots show the first and second components. On a given row, each plot represents the same 72 points in the space of components 1 and 2; the three plots are colored differently to highlight different features. (A) Colors represent different methods for constructing species trees. (B) Colors represent different gene tree inference methods. (C) Colors represent different outgroups. The points on each graph represent different combinations of the three factors that form phylogenetic inference strategies. Each line in part A represents the resultant vector (scaled by a constant to lie within the span of the 72 points) for all 12 points of a certain method for constructing species trees. Each line in part B represents the resultant vector for all 18 points of a gene tree inference method (scaled by a constant). Each line in part C represents the resultant vector for all 24 points of an outgroup (scaled by a constant). The scaling constants in parts A, B, and C are distinct. Each of the shaded regions in parts A and B is a convex hull of the points from a particular method.
From Figure 4, we can see that much of the variation across the 72 phylogenetic inference strategies, as explained by PCA and MDS, is caused by M. Strategies based on M are more similar in clade outcomes to other strategies based on M than they are to other strategies that are not based on M. The magnitude of this effect can be quantified using Procrustes analysis, which demonstrates that M has a large influence on the spatial relationship among all other phylogenetic inference strategies (Additional file 2: Figure S1).
Figure 5 shows the results of our cluster and correlation analyses. The main clusters formed by phylogenetic inference strategies involve strategies based on the species tree construction methods Concatenation, SMRT, STEAC, and MDC or the gene tree inference method M (Figure 5). The clusters of strategies formed by Kmeans and the large groupings of strategies formed by hierarchical clustering are quite similar. Additionally, the correlation coefficient between clade vectors inferred by pairs of phylogenetic inference strategies is generally higher for pairs of strategies that are placed into the same cluster by either Kmeans or hierarchical clustering than for pairs of strategies that are not placed into the same cluster (Figure 5).
Figure 5. Cluster and correlation analysis of phylogenetic inference strategies. Each leaf of the dendrogram corresponds to a different phylogenetic inference strategy for obtaining the rooted phylogeny of eight ingroup pine species. Blue squares directly below the dendrogram indicate the features used to construct a rooted phylogeny for the eight pine species. The first six rows below the dendrogram represent different species tree construction methods. The next four rows below the dendrogram represent gene tree inference methods. The following three rows below the dendrogram represent the outgroup species. The dendrogram was constructed by hierarchical clustering using the Ward algorithm [77] applied to a matrix of Euclidean distances between all pairs of 145dimensional vectors (each dimension representing a distinct clade). The remaining nine rows below the outgroups show the results of Kmeans clustering applied to the 72 145dimensional vectors with K clusters, K=2,3,…,9. Below the cluster analysis is a heat map of the correlation coefficients between all pairs of phylogenetic inference strategies. An entry in the heat map represents the Pearson correlation coefficient between a pair of strategies by only using points in the 145dimensional vector that were nonzero in both strategies being compared.
Interestingly, the clustering of strategies by PCA and MDS in Figure 4 matches well with the groupings observed in Figure 5, which is likely driven by similar signals. In Figure 5, three large clusters are represented by the subtree to the left of the root of the dendrogram (i.e., the blue color in the Kmeans clustering with K=3) and by two subtrees to the right of the root (i.e., the pink and orange colors at K=3). The two subtrees to the right of the root (or pink and orange clusters defined by Kmeans clustering) involve strategies that are based on M (pink Kmeans cluster or left subtree on the right of the root of the dendrogram) or strategies that are based on species tree construction methods that are topologicallybased (orange Kmeans cluster or right subtree on the right of the root of the dendrogram). That is, strategies that correspond to the orange cluster are based on either STAR, RTC, or MDC. In contrast, the subtree to the left of the root (or the blue cluster defined by Kmeans clustering) contains only strategies that use species tree construction methods that are not strictly topologicallybased (i.e., Concatenation, SMRT, or STEAC).
From Figures 4 and 5, we find that phylogenetic inference strategies form three basic clusters: a cluster that involves strategies that are based on M, a cluster that involves strategies that are topologicallybased, and a cluster that involves strategies that are not strictly topologicallybased.
Clade flow
Following the “haplotype flow” computations of [82], we can view “clade flow” as a proportion of clades inferred by one phylogenetic inference strategy that are also inferred by another strategy. Figure 6 displays a heat map that represents a form of clade flow, where the cell at row i and column j in the heat map represents the fraction of clades inferred by strategy i that were not inferred by strategy j. By definition, the heat map is not symmetric. As can be seen from the mostly white and yellow boxes for rows corresponding to strategies based on M, these strategies tend to infer clades that are supported by other strategies. That is, if a species tree topology is inferred by a strategy that is based on M, then clades displayed by that topology will often also be present on species tree topologies inferred by other strategies. In Additional file 2: Figure S1, strategies based on M contribute to the most variation across strategies. A possible explanation for this observation is that the flow of clades is largely unidirectional. That is, if a strategy is based on M, then clades that are inferred by that strategy also tend to be supported by other strategies; however, if a strategy not based on M infers a clade, then that clade is not often supported by strategies based on M. Because clades inferred by strategies based on M also tend to be supported by other strategies, it follows that strategies based on M tend to infer clades that are also supported by other strategies based on M. This sharing of clades among strategies based on M causes those strategies to be more similar to each other than they are to strategies not based on M. In contrast to the results for M, as can be seen from the mostly dark boxes in rows for strategies based on MDC, strategies based on MDC tend to infer clades that are not supported by other strategies (especially when compared with strategies based on M).
Figure 6. Heat map representing the “flow” of clades between phylogenetic inference strategies. We use clade flow to measure the proportion of clades inferred by one strategy that were not inferred by a different strategy. Phylogenetic inference strategies are ordered using the dendrogram. The cell at row i and column j represents the fraction of clades inferred by strategy i that were not inferred by strategy j. By definition, the heat map is asymmetric. Darker colors indicate lower levels of “flow” from a row to a column.
Similarly to the behavior of MDC, strategies that are based on Concatenation, SMRT, and STEAC together with ML, MP, or NJ share more clades with other such strategies (mostly white and yellow boxes) than with the remaining strategies (mostly dark boxes). In contrast, as was observed with M, strategies based on STAR and RTC together with ML, MP, or NJ share similar numbers of clades among other such strategies as with the remaining strategies (mostly yellow boxes). These results suggest that strategies that are topologicallybased (i.e., STAR and RTC) tend to infer clades that are also supported both by other topologicallybased strategies and by strategies that are not strictly topologicallybased, whereas strategies that are not strictly topologicallybased (i.e., Concatenation, SMRT, and STEAC) tend to infer clades that are not supported by strategies that are strictly topologicallybased (i.e., STAR, RTC, and MDC).
Representative topologies
We next wanted to use a set of representative species tree topologies to highlight similarities and differences in topologies constructed by various strategies. Topologies were estimated using the Greedy Consensus algorithm [12] applied to clade counts. Because our previous results (Figures 45) indicate that the choice of outgroup species does not strongly influence the overall inferred topologies, it is sensible to average across outgroups. Therefore, we first present topologies for each of the 24 species tree–gene tree inference method pairs constructed from clade counts that were averaged over the three outgroups (Figure 7). Next, to obtain a clearer picture of the types of topologies that are inferred by the six species tree inference methods, we present topologies for each of the six species tree inference methods, constructed from clade counts that were averaged over gene tree inference methods and outgroup species (Figure 8). Finally, to assess the influence that various gene tree inference methods have on the overall inferred species tree topology, we present topologies for each of the four gene tree inference methods, constructed from clade counts that were averaged over species tree inference methods and outgroup species (Figure 9).
Figure 7. Rooted consensus trees of phylogenetic inference strategies averaged over outgroups. For a given subset of the 72 phylogenetic inference strategies considered, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created using the clade counts in the set. Each clade count in the set has a maximum value of 1000, because each element of the set is an average over values that each have a maximum of 1000. Each consensus tree is the greedy consensus tree based on clade counts averaged over outgroup species. These trees disregard branchlength information.
Figure 8. Rooted consensus trees of phylogenetic inference strategies averaged over outgroups and gene tree inference methods. For a given subset of the 72 phylogenetic inference strategies considered in this article, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created using the clade counts in the set. Each clade count in the set has a maximum value of 1000, because each element of the set is an average over values that each have a maximum of 1000. These trees disregard branchlength information. (A) Trees constructed using the 12 strategies that utilize Concatenation; (B) SMRT; (C) STEAC; (D) STAR; (E) RTC; (F) MDC.
Figure 9. Rooted consensus trees of phylogenetic inference strategies averaged over outgroups and species tree construction methods. For a given subset of the 72 phylogenetic inference strategies considered in this article, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created using the clade counts in the set. Each clade count in the set has a maximum value of 1000, because each element of the set is an average over values that each have a maximum of 1000. These trees disregard branchlength information. (A) Trees constructed using the 18 strategies that utilize ML; (B) MP; (C) NJ; (D) M.
Figure 7 displays 24 topologies with clade support values for each combination of a species tree construction method and a gene tree inference method. The clade {P. chiapensis, P. strobus} is generally highly supported, appearing for 22 of 24 strategies, with support ranging from 382 to 982 among 1000 bootstrap replicates. The smallest support values for {P. chiapensis, P. strobus} occur in strategies that use SMRT with ML, MP, and NJ, producing support values of 382, 406, and 395, respectively. The largest support values for this clade occur in strategies that use M, with values ranging from 824 to 982. Further, although strategies based on SMRT with ML, MP, and NJ yield lower support values than other strategies, when SMRT is combined with M, the support for {P. chiapensis, P. strobus} is 905. In addition, although two of the strategies based on STEAC do not support {P. chiapensis, P. strobus}, when STEAC is combined with M, the support for the clade is 982. Another clade that is highly supported is {P. ayacahuite, P. flexilis, P. strobiformis}. This clade is observed across all strategies, with support among nonMDC strategies out of 1000 bootstrap replicates ranging from 858 to 1000. Strategies that use MDC with ML, MP, and NJ yield support values for {P. ayacahuite, P. flexilis, P. strobiformis} of 560, 407, and 415, respectively. However, using MDC with M yields a support value of 933 for {P. ayacahuite, P. flexilis, P. strobiformis}. Across the 24 trees, the topological positions of P. albicaulis, P. lambertiana, and P. monticola are variable and are generally poorly supported. Each of these species is found in a variety of positions across all trees.
Figure 8 displays six topologies with clade support values for each species tree construction method. Similarly to Figure 7, the clade {P. chiapensis, P. strobus} is generally highly supported across all six species tree construction methods, with support ranging from 522 to 876 among 1000 bootstrap replicates. Also, as in Figure 7, the clade {P. ayacahuite, P. flexilis, P. strobiformis} is highly supported across all six species tree construction methods, with support ranging from 579 to 999 among 1000 bootstrap replicates. From these topologies, we can also observe that in agreement with the clade size distribution, strategies based on Concatenation, SMRT, and STEAC tend to produce more unbalanced trees than strategies based on STAR, RTC, and MDC (Figure 3). Strategies based on Concatenation, SMRT, and STEAC support topologies in which P. lambertiana is on the opposite side of the root from the other seven species. In contrast, strategies based on STAR, RTC, and MDC place P. monticola and P. lambertiana as sister species. These results support the observations from Figures 4, 5, and 6 that strategies based on species tree construction methods that are topologicallybased behave differently from strategies that are not strictly topologicallybased.
Figure 9 displays four topologies with clade support values, considering each gene tree inference method and combining species tree construction methods for each gene tree inference method. As in Figures 7 and 8, the clades {P. chiapensis, P. strobus} and {P. ayacahuite, P. flexilis, P. strobiformis} are generally highly supported across all four gene tree inference methods, with supports among 1000 bootstrap replicates respectively ranging from 610 to 931 and from 858 to 988.
Discussion
In this article, we have empirically evaluated strategies for inferring species tree topologies from multilocus sequence data. We have found that MDC tends to infer balanced topologies, whereas SMRT and STEAC tend to infer more unbalanced topologies. This bias toward balanced topologies exhibited by MDC is a consequence of the nature of the criterion that MDC uses to construct species trees, reflecting a theoretical finding that species trees with more balance have lower deep coalescence costs [83].
The strategies that we have examined fall into three classes in terms of the species tree inferences they produce: strategies applied only to datasets including all available sequenced individuals (i.e., M), topologicallybased strategies (i.e., STAR, RTC and MDC), and strategies that are not strictly topologicallybased (i.e., Concatenation, SMRT, and STEAC). While it is not unexpected that some approaches would behave similarly, it is surprising that strategies did not cluster based on the dataset or approach used (e.g., consensus or concatenation). Instead, strategies that take quite different species tree construction approaches (e.g., consensusbased STEAC and concatenationbased Concatenation and SMRT) form a cluster. Topologicallybased strategies tend to infer clades that are supported by other strategies, whereas strategies that are not strictly topologicallybased tend to infer clades that are not always wellsupported by other strategies. For example, clades inferred from strategies that are not strictly topologicallybased tend not to appear on trees that were inferred from strategies that are topologicallybased. A possible reason for this observation could be a lack of phylogenetic signal to properly infer branch lengths; that is, if a phylogenetic inference strategy is not strictly topologicallybased, then sequences with little phylogenetic signal (e.g., due to low substitution rate or short length) can strongly influence the species tree inferred by that strategy. Because STEAC uses branchlength information to infer a species tree topology, sequences with little signal can reduce its performance relative to topologicallybased methods such as STAR [15].
Although our main goal has been to use North American pines to investigate relationships among phylogenetic inference strategies, our results also provide some information about the phylogenetic placement of the pine species in the study. This analysis is the first multilocus study to provide substantial confidence for a sister relationship of P. chiapensis and P. strobus. Pinus chiapensis is a threatened species of Mexico and Guatemala whose phylogenetic affinity has been uncertain. Morphological and molecular evidence have been used to alternately argue for a sister relationship between P. chiapensis and P. strobus, from eastern North America, or between P. chiapensis and P. monticola, from western North America [47]. Here, 22 of 24 trees in Figure 7 grouped P. chiapensis and P. strobus as sister taxa, mostly with reasonably high bootstrap support. When phylogenetic inference strategies were averaged over either gene tree methods (Figure 8) or species tree methods (Figure 9), the {P. chiapensis, P. strobus} clade was always recovered.
The close phylogenetic affiliation of P. ayacahuite, P. flexilis, and P. strobiformis has long been suspected, as these three species represent similar forms that are continuously distributed from southern British Columbia and Alberta into Honduras [84]. Here, the {P. ayacahuite, P. flexilis, P. strobiformis} clade is wellsupported, although relationships among these three species are less clear. Two possibilities, namely ((P. flexilis, P. strobiformis), P. ayacahuite) and ((P. ayacahuite, P. strobiformis), P. flexilis), appear more likely based on our analysis (Figures 7, 8 and 9). Interestingly, Figure 7 finds that the ((P. ayacahuite, P. strobiformis), P. flexilis) clade is wellsupported by all strategies that use M.
Beyond these clades, the full phylogeny of this group of pines remains unclear. Considering the trees inferred in Figure 7, relationships among P. albicaulis, P. lambertiana, P. monticola, and the clades {P. chiapensis, P. strobus} and {P. ayacahuite, P. flexilis, P. strobiformis} are not stable across inference strategies, and bootstrap support is generally low. We might have expected greater resolution in this study, due to the exhaustive sample of the ingroup, extensive intraspecific sampling, large molecular dataset, and ease of species delimitation (the eight ingroup species include welldefined taxa that are morphologically, ecologically, phenologically, and generally geographically distinct).
We can attribute the lack of resolution in the pine phylogeny to several possible sources. First, the loci in the study were chosen because they amplify across a broad range of taxa from subgenus Strobus (only eight of whose members are included here), and might therefore be more slowly evolving and less informative for phylogenetic inference than typical loci. Thus, the size of the dataset might not be indicative of its information content for phylogenetic inference. Second, we have focused on strategies that have been implemented for ease of comparison and have not explored the full collection of available methods (e.g., [6,20,85,86]), nor have we considered such techniques as investigation of different subsets of taxa or loci on the basis of the strategies that we have studied (e.g., [16]). A study with a primary goal of resolving the pine phylogeny might achieve greater resolution through analyses that deviate from our standardized procedure. Third, the speciation events of interest might have occurred fast enough that retention of ancestral polymorphisms, as has been observed elsewhere among conifers [47,48,50,87], might inhibit convergence on a stable, wellsupported topology. Further work with more loci or fasterevolving loci will be important for distinguishing among these alternatives.
One caveat for interpreting our results is that except in our analyses based on M, we only considered a single lineage sampled within a species. Information on multiple lineages of the same species can have a significant effect on the performance of species tree inference methods, and many methods can use information on coalescences within and between species as part of the inference process (e.g., [1921,23,2527,8890]). Therefore, it is important to keep in mind that we have used one of a number of potential schemes for sampling individuals within our data, as sampling scheme can have an impact on the efficacy of species tree estimators [15,20,36,91,92].
Another caveat is that some of the datasets were obtained from procedures designed to maximize information content at each locus. These optimization procedures yielded datasets with one sequence sampled per species. Because the sequences within these optimized datasets are no longer randomly sampled within each species, a possible concern is that our results are not representative of random samples. This concern might be warranted when considering the inferred relationships of the various pine species in Figures 7, 8 and 9. However, as the strategies applied to each of these optimized datasets retain their general relationships across datasets (e.g., those that are topologicallybased and those that are not strictly topologicallybased), the conclusions drawn in this article should also hold for large noncoding datasets. Additionally, it is important to mention that an identical dataset is not used for all strategies considered, and notably, strategies that used M relied on different datasets from those that used ML, MP, and NJ. Although this is a limitation, the clustering pattern suggests it is not a major concern. In our principal components (Figure 4), cluster (Figure 5), and correlation (Figure 5) analyses, though the strategies were split by whether they used M or were either topologicallybased or not strictly topologicallybased, these three categories do not precisely map onto the different datasets. Therefore, though the dataset varies across the 72 strategies, other factors beyond the difference in datasets are contributing substantially to the difference in results.
Finally, a third caveat is that to maintain a uniform procedure across strategies, we did not estimate the nucleotide substitution model before applying maximum likelihood. This choice might have caused some systematic bias in ML gene tree estimates by overparametrizing the substitution model. However, we found that our inference strategies did not cluster by whether maximum likelihood, maximum parsimony, or neighborjoining was used (see Figures 4B and 5), suggesting that any systematic bias due to using a general timereversible substitution model did not drive our observed clustering patterns.
Conclusions
Based on a collection of twostage strategies that we have investigated, representing a subset of all available methods, our analyses have highlighted several aspects of phylogenetic inference strategies that enable recommendations for inferring rooted phylogenies from largescale multilocus data. First, it is beneficial to examine multiple strategies [93], considering some methods that use only topological information (e.g., STAR, RTC, and MDC) and others that also incorporate additional information (e.g., Concatenation, SMRT, and STEAC). If species tree topologies returned by these different classes of species tree construction methods agree, then an investigator can be more confident in the inferred tree topology. Second, estimates should not be based solely on species tree construction methods that appear to be biased toward certain types of topologies (e.g., MDC). Instead, it is preferable to utilize these types of methods in conjunction with other approaches. For example, after obtaining an unbalanced inferred tree from an inference method, if MDC also infers the same unbalanced topology, then we can be more confident that the true species topology is actually unbalanced. Finally, it is best to utilize as much information as is available on individuals at every locus. That is, if multiple individuals are sampled within a species at a given locus, then we should use all available sequence data from the species (i.e. as many sampled individuals as possible, as in strategies that are based on M). This point is supported by the observation that clades inferred by M tend to “flow” to other strategies (Figure 6). Based on our findings, we recommend the joint consideration of multiple approaches to estimating species trees that originate in different locations in the space of methods and that exhibit diverse properties in their species tree estimates.
Availability of supporting data
The data used in this study are available in Additional file 3.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MD, JS, and NAR designed the project. DBN, AJE, JS, AL, and RC generated and prepared the data. MD and JS performed the data analysis. MD, JS, and NAR wrote the paper, with contributions from all authors. All authors read and approved the final manuscript.
Acknowledgements
We thank Zachary Szpiech, Cuong Than, Chaolong Wang, and two anonymous reviewers for their helpful comments and Ben Figueroa, Ismael Grachico, Erik Grimstad, Brian Knaus, John Liechty, and Jill Wegrzyn for help with data generation and analysis. This work was supported by NSF grants DBI1103639, DBI0638502, DBI1146722, DEB0317103, and by the Murdock College Science Research Program.
References

Cranston KA, Hurwitz B, Ware D, Stein L, Wing RA: Species trees from highly incongruent gene trees in rice.
Syst Biol 2009, 58:489500. PubMed Abstract  Publisher Full Text

Song S, Liu L, Edwards SV, Wu S: Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model.
Proc Natl Acad Sci USA 2012, 109:1494214947. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

McCormack JE, Harvey MG, Faircloth BC, Crawford NG, Glenn TC, Brumfield RT: A phylogeny of birds based on over 1,500 loci collected by target enrichment and highthroughput sequencing.
PLoS One 2013, 8:e54848. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Salichos L, Rokas A: Inferring ancient divergences requires genes with strong phylogenetic signals.
Nature 2013, 497:327331. PubMed Abstract  Publisher Full Text

Rannala B, Yang Z: Phylogenetic inference using whole genomes.
Annu Rev Genomics Hum Genet 2008, 9:217231. PubMed Abstract  Publisher Full Text

Degnan JH, Rosenberg NA: Gene tree discordance, phylogenetic inference and the multispecies coalescent.
Trends Ecol Evol 2009, 24:332340. PubMed Abstract  Publisher Full Text

Kubatko LS, Degnan JH: Inconsistency of phylogenetic estimates from concatenated data under coalescence.
Syst Biol 2007, 56:1724. PubMed Abstract  Publisher Full Text

Degnan JH, Rosenberg NA: Discordance of species trees with their most likely gene trees.
PLoS Genet 2006, 2:e68. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Degnan JH, DeGiorgio M, Bryant D, Rosenberg NA: Properties of consensus methods for estimating species trees from gene trees.
Syst Biol 2009, 58:3554. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang Y, Degnan JH: The performance of matrix representation with parsimony for inferring species from gene trees.

Than CV, Rosenberg NA: Consistency properties of species tree inference by minimizing deep coalescences.
J Comput Biol 2011, 18:115. PubMed Abstract  Publisher Full Text

Bryant D: A classification of consensus methods for phylogenies. In BioConsensus. Edited by Janowitz M, Lapointe FJ, McMorris FR, Mirkin B, Roberts FS. Providence, Rhode Island: American Mathematical Society Press; 2003:163183.
DIMACS Series in Discrete Mathematics and Theoretical Computer Science

Gadagkar SR, Rosenberg MS, Kumar S: Inferring species phylogenies from multiple genes: Concatenated sequence tree versus consensus gene tree.
J Exp Zool B Mol Dev Evol 2005, 304:6474. PubMed Abstract  Publisher Full Text

Huang H, Knowles LL: What is the danger of the anomaly zone for empirical phylogenetics?
Syst Biol 2009, 58:527536. PubMed Abstract  Publisher Full Text

DeGiorgio M, Degnan JH: Robustness to divergence time underestimation when inferring species trees from estimated gene trees.
Syst Biol 2014, 63:6682. PubMed Abstract  Publisher Full Text

Rokas A, Williams BL, King N, Carroll SB: Genomescale approaches to resolving incongruence in molecular phylogenies.
Nature 2003, 425:798804. PubMed Abstract  Publisher Full Text

Ané C, Larget B, Baum DA, Smith SD, Rokas A: Bayesian estimation of concordance factors.
Mol Biol Evol 2007, 24:412426. PubMed Abstract  Publisher Full Text

Liu L, Pearl DK: Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions.
Syst Biol 2007, 56:504514. PubMed Abstract  Publisher Full Text

Liu L: BEST: Bayesian estimation of species trees under the coalescent model.
Bioinformatics 2008, 24:25422543. PubMed Abstract  Publisher Full Text

Heled J, Drummond AJ: Bayesian inference of species trees from multilocus data.
Mol Biol Evol 2010, 27:570580. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu L, Yu L, Pearl DK, Edwards SV: Estimating species phylogenies using coalescence times among sequences.
Syst Biol 2009, 58:468477. PubMed Abstract  Publisher Full Text

DeGiorgio M, Degnan JH: Fast and consistent estimation of species trees using supermatrix rooted triples.
Mol Biol Evol 2010, 27:552569. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu L, Yu L, Pearl DK: Maximum tree: a consistent estimator of the species tree.
J Math Biol 2010, 60:95106. PubMed Abstract  Publisher Full Text

Liu L, Yu L, Edwards SV: A maximum pseudolikelihood approach for estimating species trees under the coalescent.

Mossel E, Roch S: Incomplete lineage sorting: consistent phylogeny estimation from multiple loci.

Helmkamp LJ, Jewett EM, Rosenberg NA: Improvements to a class of distance matrix methods for inferring species trees from gene trees.
J Comput Biol 2012, 19:632649. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jewett EM, Rosenberg NA: iGLASS: an improvement to the GLASS method for estimating species trees from gene trees.
J Comput Biol 2012, 19:293315. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Huang H, He Q, Kubatko LS, Knowles LL: Sources of error inherent in speciestree estimation: impact of mutational and coalescent effects on accuracy and implications for choosing among different methods.
Syst Biol 2010, 59:573583. PubMed Abstract  Publisher Full Text

Chung Y, Ané C: Comparing two Bayesian methods for gene tree/species tree reconstruction: simulations with incomplete lineage sorting and horizontal gene transfer.
Syst Biol 2011, 60:261275. PubMed Abstract  Publisher Full Text

Leaché AD, Rannala B: The accuracy of species tree estimation under simulation: a comparison of methods.
Syst Biol 2011, 60:126137. PubMed Abstract  Publisher Full Text

Swenson MS, Suri R, Linder CR, Warnow T: An experimental study of Quartets MaxCut and other supertree methods.
Algorithms Mol Biol 2011, 6:7. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Yang J, Warnow T: Fast and accurate methods for phylogenomic analyses.

Jennings WB, Edwards SV: Speciational history of Australian grass finches (Poephila) inferred from thirty gene trees.
Evolution 2005, 59:20332047. PubMed Abstract

Brumfield RT, Liu L, Lum DE, Edwards SV: Comparison of species tree methods for reconstructing the phylogeny of bearded manakins (Aves: Pipridae, Manacus) from multilocus sequence data.
Syst Biol 2008, 57:719731. PubMed Abstract  Publisher Full Text

Carling MD, Brumfield RT: Integrating phylogenetic and population genetic analyses of multiple loci to test species divergence hypotheses in Passerina buntings.
Genetics 2008, 178:363377. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Liu L, Pearl DK, Brumfield RT, Edwards SV: Estimating species trees using multipleallele DNA sequence data.
Evolution 2008, 62:20802091. PubMed Abstract  Publisher Full Text

Carstens BC, Knowles LL: Estimating species phylogeny from genetree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers.
Syst Biol 2007, 56:400411. PubMed Abstract  Publisher Full Text

Linnen CR, Farrell BD: Comparison of methods for speciestree inference in the sawfly genus Neodiprion (Hymenoptera: Diprionidae).
Syst Biol 2008, 57:876890. PubMed Abstract  Publisher Full Text

Espregueira Themudo G, Wielstra B, Arntzen JW: Multiple nuclear and mitochondrial genes resolve the branching order of a rapid radiation of crested newts (Triturus, Salamandridae).
Mol Phylogenet Evol 2009, 52:321328. PubMed Abstract  Publisher Full Text

Buerki S, Forest F, Salamin N, Alvarez N: Comparative performance of supertree algorithms in large data sets using the soapberry family (Sapindaceae) as a case study.
Syst Biol 2011, 60:3244. PubMed Abstract  Publisher Full Text

Takezaki N, Nei M: Empirical tests of the reliability of phylogenetic trees constructed with microsatellite DNA.
Genetics 2008, 178:385392. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hird S, Kubatko L, Carstens B: Rapid and accurate species tree estimation for phylogenetic investigations using replicated subsampling.
Mol Phylogenet Evol 2010, 57:888898. PubMed Abstract  Publisher Full Text

Belfiore NM, Liu L, Moritz C: Multilocus phylogenetics of a rapid radiation in the genus Thomomys (Rodentia: Geomyidae).
Syst Biol 2008, 57:294310. PubMed Abstract  Publisher Full Text

Kubatko LS, Gibbs HL, Bloomquist EW: Inferring specieslevel phylogenies and taxonomic distinctiveness using multilocus data in Sistrus rattlesnakes.
Syst Biol 2011, 60:393409. PubMed Abstract  Publisher Full Text

Gatesy J, Baker RH: Hidden likelihood support in genomic data: can fortyfive wrongs make a right?
Syst Biol 2005, 54:483492. PubMed Abstract  Publisher Full Text

Edwards SV, Liu L, Pearl DK: Highresolution species trees without concatenation.
Proc Natl Acad Sci USA 2007, 104:59365941. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Syring J, Farrell K, Businský R, Cronn R, Liston A: Widespread genealogical nonmonophyly in species of Pinus subgenus Strobus.
Syst Biol 2007, 56:163181. PubMed Abstract  Publisher Full Text

Bouillé M, Bousquet J: Transspecies shared polymorphisms at orthologous nuclear gene loci among distant species in the conifer Picea (Pinaceae): implications for the longterm maintenance of genetic diversity in trees.
Am J Bot 2005, 92:6373. PubMed Abstract  Publisher Full Text

Ma XF, Szmidt AE, Wang XR: Genetic structure and evolutionary history of a diploid hybrid pine Pinus densata inferred from the nucleotide variation at seven gene loci.
Mol Biol Evol 2006, 23:807816. PubMed Abstract  Publisher Full Text

Willyard A, Cronn R, Liston A: Reticulate evolution and incomplete lineage sorting among the ponderosa pines.
Mol Phylogenet Evol 2009, 52:498511. PubMed Abstract  Publisher Full Text

Savolainen O, Pyhäjärvi T: Genomic diversity in forest trees.
Curr Opin Plant Biol 2007, 10:162167. PubMed Abstract  Publisher Full Text

Kral R: Pinus. In Flora of North America (North of Mexico). Edited by Flora of North America editorial committee F, Flora of North America editorial committee F . New York: Oxford University Press; 1993:373398.

Perry JP: The Pines of Mexico and Central America. Portland: Timber Press; 1991.

Syring J, Willyard A, Cronn R, Liston A: Evolutionary relationships among Pinus (Pinaceae) subsections inferred from multiple lowcopy nuclear loci.
Am J Bot 2005, 92:20862100. PubMed Abstract  Publisher Full Text

Eckert AJ, van Heerwaarden J, Wegrzyn JL, Nelson CD, RossIbarra J, GonzálezMartínez SC, Neale DB: Patterns of population structure and environmental associations to aridity across the range of Loblolly pine (Pinus taeda L., Pinaceae).
Genetics 2010, 185:969982. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wegrzyn JL, Lee JM, Liechty J, Neale DB: PineSAP—sequence alignment and SNP identification pipeline.
Bioinformatics 2009, 25:26092610. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ewing B, Hillier L, Wendl MC, Green P: Basecalling of automated sequencer traces using phred. I. Accuracy assessment.
Genome Res 1998, 8:175185. PubMed Abstract  Publisher Full Text

Lee WH, Vega VB: Heterogeneity detector: finding heterogeneous positions in Phred/Phrap assemblies.
Bioinformatics 2004, 20:28632864. PubMed Abstract  Publisher Full Text

Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
BMC Bioinformatics 2004, 5:113. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res 2004, 32:17921797. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Ewing B, Green P: Basecalling of automated sequencer traces using phred. II. Error probabilities.
Genome Res 1998, 8:186194. PubMed Abstract  Publisher Full Text

Parks M, Cronn R, Liston A: Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes.
BMC Biol 2009, 7:84. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Felsenstein J: Inferring Phylogenies. Sunderland, MA: Sinauer Associates; 2004.

Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap.
Evolution 1985, 39:783791. Publisher Full Text

Efron B, Tibshirani RJ: An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.

Swofford DL: PAUP*. Phylogenetic analysis Using Parsimony (* and Other Methods). Version 4. Sunderland, MA: Sinauer Associates; 2003.

de Queiroz A, Gatesy J: The supermatrix approach to systematics.
Trends Ecol Evol 2007, 22:3441. PubMed Abstract  Publisher Full Text

Ewing GB, Ebersberger I, Schmidt HA, von Haeseler A: Rooted triple consensus and anomalous gene trees.
BMC Evol Biol 2008, 8:118. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Maddison WP: Gene trees in species trees.
Syst Biol 1997, 46:523536. Publisher Full Text

Than C, Nakhleh L: Species tree inference by minimizing deep coalescences.
PLoS Comput Biol 2009, 5:e1000501. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Page RDM: Modified mincut supertrees. In Algorithms in Bioinformatics, Second International Workshop, WABI, 2002, Rome, Italy, September 17–21, 2002, Proceedings (Lecture Notes in Computer Science Vol. 2452). Edited by Guigó R, Gusfield D. Berlin: Springer; 2002:537552.

Strimmer K, von Haeseler A: Quartet puzzling: a quartet maximumlikelihood method for reconstructing tree topologies.
Mol Biol Evol 1996, 13:964969. Publisher Full Text

Robinson DR, Foulds LR: Comparison of phylogenetic trees.
Math Biosci 1981, 53:131147. Publisher Full Text

Dryden IL, Mardia KV: Statistical Shape Analysis. Chichester: Wiley; 1998.

Cox TF, Cox MAA: Multidimensional Scaling, 2nd edition. Boca Raton: Chapman and Hall; 2001.

Gower JC, Dijksterhuis GB: Procrustes Problems. New York: Oxford University Press; 2004.

Ward JH: Hierarchical grouping to optimize an objective function.
J Am Stat Assoc 1963, 58:236244. Publisher Full Text

Sackin MJ: “Good” and “bad” phenograms.
Syst Zool 1972, 21:225226. Publisher Full Text

Colless DH: Phylogenetics, the theory and practice of phylogenetic systematics (book review).
Syst Zool 1982, 31:100104. Publisher Full Text

Shao KT, Sokal RR: Tree balance.
Syst Zool 1990, 39:266276. Publisher Full Text

Kirkpatrick M, Slatkin M: Searching for evolutionary patterns in the shape of a phylogenetic tree.
Evolution 1993, 47:11711181. Publisher Full Text

Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK: A worldwide survey of haplotype variation and linkage disequilibrium in the human genome.
Nat Genet 2006, 38:12511260. PubMed Abstract  Publisher Full Text

Than CV, Rosenberg NA: Mathematical properties of the deep coalescence cost.

Critchfield WB, Little EL Jr: Geographic Distribution of the Pines of the World. Washington, DC: USDA Forest Service Miscellaneous Publication 991; 1966.

Liu L, Yu L, Kubatko L, Pearl DK, Edwards SV: Coalescent methods for estimating phylogenetic trees.
Mol Phylogenet Evol 2009, 53:320328. PubMed Abstract  Publisher Full Text

Knowles LL, Kubatko LS: Estimating Species Trees: Practical and Theoretical Aspects. Hoboken: WileyBlackwell; 2010.

Syring J, del Castillo RF, Cronn R, Liston A: Multiple nuclear loci reveal the distinctiveness of the threatened neotropical pine Pinus chiapensis.
Syst Bot 2007, 32:703717. Publisher Full Text

Kubatko LS, Carstens BC, Knowles LL: STEM: species tree estimation using maximum likelihood for gene trees under coalescence.
Bioinformatics 2009, 25:971973. PubMed Abstract  Publisher Full Text

Liu L, Yu L: Estimating species trees from unrooted gene trees.

Wu Y: Coalescentbased species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood.
Evolution 2012, 66:763775. PubMed Abstract  Publisher Full Text

McCormack JE, Huang H, Knowles LL: Maximum likelihood estimates of species trees: how accuracy of phylogenetic inference depends upon the divergence history and sampling design.
Syst Biol 2009, 58:501508. PubMed Abstract  Publisher Full Text

Camargo A, Avila LJ, Morando M, Sites JW Jr: Accuracy and precision of species trees: effects of locus, individual, and base pair sampling on inference of species trees in lizards of the Liolaemus darwinii group (Squamata, Liolaemidae).
Syst Biol 2012, 61:272288. PubMed Abstract  Publisher Full Text

Anisimova M, Liberles DA, Philippe H, Provan J, Pupko T, von Haeseler A: Stateof the art methodologies dictate new standards for phylogenetic analysis.
BMC Evol Biol 2013, 13:161. PubMed Abstract  Publisher Full Text  PubMed Central Full Text