Department of Biological Science, Louisiana State University, Baton Rouge, LA, 70803, USA

Department of Ecology, Evolution and Organismal Biology, The Ohio State University, Columbus, OH, 43210, USA

Abstract

Background

Species are considered the fundamental unit in many ecological and evolutionary analyses, yet accurate, complete, accessible taxonomic frameworks with which to identify them are often unavailable to researchers. In such cases DNA sequence-based species delimitation has been proposed as a means of estimating species boundaries for further analysis. Several methods have been proposed to accomplish this. Here we present a Bayesian implementation of an evolutionary model-based method, the general mixed Yule-coalescent model (GMYC). Our implementation integrates over the parameters of the model and uncertainty in phylogenetic relationships using the output of widely available phylogenetic models and Markov-Chain Monte Carlo (MCMC) simulation in order to produce marginal probabilities of species identities.

Results

We conducted simulations testing the effects of species evolutionary history, levels of intraspecific sampling and number of nucleotides sequenced. We also reanalyzed the dataset used to introduce the original GMYC model. We found that the model results improve with additional DNA sequence and increased sampling, although these improvements have limits. The most important factor in the success of the model is the underlying phylogenetic history of the species under consideration. Recent and rapid divergences result in greater uncertainty in the model and eventually cause the model to fail to accurately assess uncertainty in species limits.

Conclusion

Our results suggest that the GMYC model can be useful under a wide variety of circumstances, particularly in cases where divergences are deeper, or taxon sampling is incomplete, as in many studies of ecological communities, but that, in accordance with expectations from coalescent theory, rapid, recent radiations may yield inaccurate results. Our implementation differs from existing ones in two ways: it accounts for important sources of uncertainty (in the phylogeny and in the parameters specific to the model), and it allows the specification of informative prior distributions that can increase the precision of the model. We have incorporated this model into a user-friendly R package available on the authors’ websites.

Background

A common challenge faced by empirical researchers in studies of ecological communities is to identify individuals at the species level from limited information collected from a broad taxonomic range of organisms. In many cases, useful taxonomic keys for particular groups or regions are not available. This is because many diverse groups are morphologically cryptic or contain many undescribed taxa, or because the existing taxonomic literature is conflicting, an issue referred to as the “taxonomic impediment”.

Methods used for delimitation of species from barcode data are a subset of those developed for the larger problem of species delimitation. They can be considered species discovery methods because they must be functional in the absence of good

Pons et al.

The GMYC model as presently implemented, however, does not account for three potentially large sources of error. First, it is widely recognized that a variety of factors can cause the genealogy from a particular locus to be discordant with the true history of speciation

In order to address the second and third potential sources of error, we introduce a Bayesian implementation of this model with flexible prior distributions in the statistical scripting language R

Methods

Model

Given an ultrametric phylogenetic tree estimated from a set of sequences consisting of multiple species and multiple individuals within species, the GMYC model decomposes the tree into its component waiting times between branching events. These waiting times are the data to be modeled

where the waiting times (x_{i}) are assumed to be exponentially distributed and a function of: the branching rate (λ), the number of lineages in waiting interval i (n_{i}), and a rate change parameter that accounts for the possibility of increasing or decreasing diversification rate with time (p).

where the branching rate (λ) can be interpreted as 1/N_{e}μ (where μ is the per generation mutation rate, or the number of generations per year, depending on the branch length units of the tree) and the rate change parameter (

The GMYC model combines the above models, and the Likelihood of the full model is calculated by assigning lineages in each waiting interval to either the Yule process or one of the coalescent processes such that:

Making the full Likelihood of a waiting time:
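As a sketch in the notation of the definitions that follow, and assuming the standard GMYC formulation of Pons et al. (a Yule term plus one coalescent term per cluster), the rate and the likelihood of a waiting time x_{i} can be written as:

```latex
b_i = \lambda_{k+1}\, n_{i,k+1}^{\,p_{k+1}}
      + \sum_{j=1}^{k} \lambda_j \left( n_{i,j}\,(n_{i,j}-1) \right)^{p_j},
\qquad
L(x_i) = b_i \, e^{-b_i x_i}
```

Here k is the number of coalescent clusters implied by the threshold, and the index k+1 denotes the single Yule process.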

where λ_{k+1} and p_{k+1} are the branching rate and rate change parameters for the Yule process, and λ_{j} and p_{j} are the branching rate and population size change parameters for the coalescent process. Following Pons et al., we constrain λ_{j} and p_{j} to be identical across coalescent processes. The numbers of lineages assigned to the Yule and coalescent processes in each waiting interval are n_{i,k+1} and n_{i,j}, respectively. Assignment of lineages in this case is determined by the selection of a threshold.
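A minimal Python sketch of how a single waiting time could be scored under this mixed model is given below. It assumes the rate has the form used by Pons et al., a Yule term λ n^p plus a coalescent term λ (n(n-1))^p summed over clusters, with λ and p shared across coalescent clusters. The function and argument names are illustrative and are not taken from the R package.

```python
import math

def waiting_time_loglik(x, n_yule, n_coal, lam_yule, p_yule, lam_coal, p_coal):
    """Log-likelihood of one waiting time x, assuming L(x) = b * exp(-b * x).

    n_yule: number of lineages assigned to the Yule process in this interval
    n_coal: list with the number of lineages in each coalescent cluster
    """
    # Yule contribution: lam * n^p for lineages older than the threshold.
    yule_rate = lam_yule * n_yule ** p_yule if n_yule > 0 else 0.0
    # Coalescent contribution: one term per cluster; a cluster with a single
    # lineage cannot coalesce and contributes nothing.
    coal_rate = sum(lam_coal * (n * (n - 1)) ** p_coal for n in n_coal if n > 1)
    b = yule_rate + coal_rate
    if b <= 0.0:
        raise ValueError("no lineages able to branch in this interval")
    return math.log(b) - b * x

def tree_loglik(intervals, lam_yule, p_yule, lam_coal, p_coal):
    """Sum the log-likelihood over intervals given as (x, n_yule, n_coal)."""
    return sum(
        waiting_time_loglik(x, ny, nc, lam_yule, p_yule, lam_coal, p_coal)
        for x, ny, nc in intervals
    )
```

Moving the threshold changes which lineages are counted in `n_yule` and which in `n_coal` for each interval, which is how the threshold enters the likelihood in this sketch.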

Because the sequence data employed in these analyses are typically from short fragments, and thus likely to yield trees with high levels of uncertainty in topology and branch lengths, we implemented this model in a Bayesian statistical framework. This framework eliminates the reliance on point estimates of the phylogeny and model parameters and, by estimating the marginal probabilities of the identity of species, allows one to incorporate that uncertainty in downstream analyses. Our implementation operates as follows. First, the posterior distribution of trees and branch lengths is characterized using BEAST

where

Simulation testing

We evaluated the utility of this implementation of the GMYC using three simulation experiments. In each, we simulated gene trees from species trees using ms

In the first experiment we examined the effect of tree depth on model accuracy. We simulated 50 species trees as above and scaled them to four different depths (20 N, 40 N, 80 N, 160 N generations, where N is the effective number of diploid individuals in the species). When considering how the results translate to haploid, maternally inherited organellar DNA, the equivalent tree depths are halved (e.g. 10 N, 20 N…) and N becomes the effective number of females in the population. We then simulated a single gene tree from each species tree at each depth, sampling five alleles per species. For each of these trees we sampled from the posterior for 100,000 generations, discarding the first 10,000 generations as burn-in and thinning every 100 generations, assessed stationarity by examining plots of the parameters by eye, and characterized the posterior distribution of the threshold parameter, which determines the species limits given a tree. Priors on all parameters were uniform distributions; in the case of the threshold parameter, from U(2,250) and for the

In the second experiment we looked at the influence of sampling. The species trees with a depth of 80 N from the first experiment were used with four different sampling schemes: 2 alleles per species, 5 alleles per species, 10 alleles per species, and a random number of alleles per species, drawn from a lognormal distribution, with a mean and standard deviation of 1 (an average of 5 alleles per species; approximately 17% of species were represented by singletons). We used the lognormal distribution because it approximates some real species-abundance distributions
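The lognormal sampling scheme can be sketched in Python as follows. Two assumptions are made here that the text does not state: the mean and standard deviation of 1 are taken to be on the log scale, and draws are rounded up to the next integer so that every species has at least one allele. Under these assumptions the expected singleton fraction is Φ(-1), roughly 16%, and the mean count is close to 5, in line with the figures quoted above. The function names are hypothetical.

```python
import math
import random

def sample_allele_counts(n_species, mu=1.0, sigma=1.0, seed=None):
    """Draw an integer number of sampled alleles for each species.

    Assumes mu and sigma are on the log scale and that draws are rounded up.
    """
    rng = random.Random(seed)
    return [
        max(1, math.ceil(rng.lognormvariate(mu, sigma)))
        for _ in range(n_species)
    ]

def lognormal_cdf(x, mu=1.0, sigma=1.0):
    """Cumulative distribution function of the lognormal distribution."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# With ceiling rounding, a species is a singleton when its draw is <= 1.
expected_singleton_fraction = lognormal_cdf(1.0)
```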

In the third experiment, we tested the effect of nucleotide sampling and tree estimation on the accuracy of the model (in our simulations, sequence length is directly correlated with the number of variable sites). We selected 10 of the simulated gene trees from 10 species trees scaled to 160 N generations for which the confidence intervals in the analysis overlapped the true value of 50 species. We then simulated DNA sequences on those gene trees of 300 bp, 600 bp, 1200 bp and 2400 bp using Seq-Gen (N_{e} of 250,000 and a mutation rate of 1.5% per million generations) and an HKY + G model. We characterized the posterior distribution of trees using the true model of sequence evolution and a strict clock model in BEAST. We pruned all identical sequences and ran BEAST for 10 million generations, discarding the first million as burn-in, at which point all parameters for all replicates had effective sample sizes above 150 and most above 200. We then ran independent GMYC MCMC analyses on 100 trees sampled every 50,000 generations from the BEAST posterior distribution of trees, pooled the results and characterized the marginal posterior distribution of the threshold parameter compared to the distribution produced using the true tree.
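The pooling step can be illustrated with a short Python sketch. It assumes each tree's GMYC run has already produced a list of threshold samples, and that every tree should contribute the same number of post-burn-in samples to the pooled (marginal) distribution. Both helper names are hypothetical and do not correspond to functions in the R package.

```python
import random

def burn_and_thin(samples, burn_in, thin):
    """Drop the first burn_in samples, then keep every thin-th sample."""
    return samples[burn_in::thin]

def pool_posterior_samples(chains, per_chain, seed=None):
    """Pool an equal number of samples from each per-tree chain.

    Giving every tree the same weight approximates integrating the
    threshold posterior over the sampled trees.
    """
    rng = random.Random(seed)
    pooled = []
    for chain in chains:
        if len(chain) < per_chain:
            raise ValueError("chain has fewer samples than requested")
        pooled.extend(rng.sample(chain, per_chain))
    return pooled
```

The pooled list can then be summarized (posterior mean, HPD interval) in the same way as the samples from a single tree.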

Empirical data analyses

To illustrate how this implementation of the GMYC could be applied, we downloaded from GenBank and reanalyzed the dataset from Pons et al., the original publication of the GMYC (Coleoptera:Carabidae:
k+1
}

Results and Discussion

Simulation tests

We first tested the influence of tree depth on model performance. When deeper trees are simulated, coalescent and Yule branching processes are expected to occur on more distinct time scales, and thus in general the model should perform better. The influence of tree depth is actually confounded by two issues, however. First, as the tree depth becomes shallower, the implied rate of speciation increases because all trees contain 50 species. If the rate of speciation approaches the rate of coalescence within species, then a sharp transition between processes should not be detectable. Second, as the implied rate of speciation increases, more species originate relatively recently. The expected time to coalescence for a diploid, panmictic population is 4 N generations. Cladogenetic events occurring more recently than this are expected to be increasingly difficult to delimit for two reasons: they are more likely to yield species that are not monophyletic and thus impossible to accurately identify under this model, and the most recent common ancestor (MRCA) of the daughter species is more likely to occur more recently than the threshold point. Assuming species monophyly, the expected time to the MRCA for two species that diverged 4 N generations ago is 6 N generations (4 N since divergence plus an expected 2 N for the two ancestral lineages to coalesce). Therefore all probability should be on thresholds older than 4 N generations, and most on thresholds older than 6 N generations. Again, when considering maternally inherited, haploid, organellar DNA, equivalent times in N generations are halved, and N becomes the effective number of females in the population. This would give an expected time to MRCA of 3 N generations.
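The arithmetic behind these expectations can be checked with a small Python sketch. It assumes the standard neutral coalescent result that n lineages sampled from a panmictic population have an expected time to their MRCA of 2·ploidy·(1 - 1/n) in units of N generations. The function names are illustrative.

```python
def expected_tmrca(n_lineages, ploidy=2):
    """Expected time to the MRCA of n lineages, in units of N generations.

    ploidy=2 for diploid nuclear loci, ploidy=1 for haploid organellar DNA.
    """
    if n_lineages < 2:
        raise ValueError("need at least two lineages")
    return 2.0 * ploidy * (1.0 - 1.0 / n_lineages)

def expected_sister_species_mrca(divergence_time, ploidy=2):
    """Expected age of the MRCA of two monophyletic sister species.

    One lineage from each species enters the ancestral population at the
    divergence time, and the pair then needs expected_tmrca(2) to coalesce.
    """
    return divergence_time + expected_tmrca(2, ploidy)

diploid_limit = expected_tmrca(10**9)                      # approaches 4 N
diploid_pair = expected_sister_species_mrca(4.0)           # 4 N + 2 N
haploid_pair = expected_sister_species_mrca(2.0, ploidy=1) # 2 N + 1 N
```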

The results of this test are dramatic (Figure

**Figures S1, S2, S3.** These figures display the distribution of MCMC samples for each treatment and each replicate within treatments for simulated data. **S1** shows the results from the tree depth simulation, **S2** the results from the allele sampling simulation and **S3** the results from the nucleotide sampling simulation.



**Testing the effects of tree depth.** Legend: The left pane shows the effect of increasing tree depth on the threshold parameter (number of species). The dots are posterior means of individual replicates. The blue line shows the expected number of species (50) and the red line shows the mean number of species diverging earlier than 4 N generations. The right pane shows the 95% HPD interval for each replicate. Gray points indicate the HPD overlapped the true number of species, black ones that it did not.

These results indicate that the model performs well under demographic or sampling conditions that result in coalescent and Yule processes occurring on very different time scales. It does not, however, perform optimally when those conditions are not met.

Ideally, as inference of the threshold point becomes more difficult, the 95% HPDs would widen but still encompass the true value 95% of the time. This is not the case at the 20 N and 40 N tree depths. HPDs generally become broader, but for increasing numbers of simulation replicates, they fail to encompass the true value. Fifty species arising in 40 N generations constitutes a very rapid radiation, with an average of 89% of branches in the species tree shorter than the expected population coalescence time of 4 N generations. Failure to accurately assess credibility intervals in this case is likely because, in this area of parameter space, the GMYC is no longer an accurate approximation of the real branching process in the gene tree. Rather than there being a threshold between coalescent and speciation branching processes, the two processes are intermixed because there is little time for the independent evolution of lineages prior to speciation. Note that these conditions will cause any DNA barcode-based method of species discovery to fail and will also challenge more realistic models utilizing multilocus data and prior information on population assignment.
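The coverage check described here can be sketched in Python. The example assumes the HPD interval is estimated empirically as the shortest window containing 95% of the sorted MCMC samples, which is a common estimator but not necessarily the one used in the R package. The function names are hypothetical.

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("no samples")
    # Number of samples that must fall inside the interval; the small
    # offset guards against floating-point error in mass * n.
    k = min(n, max(1, math.ceil(mass * n - 1e-9)))
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def hpd_coverage(replicates, true_value, mass=0.95):
    """Fraction of replicates whose HPD interval contains the true value."""
    hits = 0
    for samples in replicates:
        lo, hi = hpd_interval(samples, mass)
        if lo <= true_value <= hi:
            hits += 1
    return hits / len(replicates)
```

A well-calibrated analysis would give a coverage near 0.95 across simulation replicates. Values well below that, as at the shallow tree depths, indicate that the intervals understate the real uncertainty.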

Next we examined the effect of intraspecific sampling. Because the data points used by the model are waiting times between branching events, we expected that with 50 species, we would not need extremely high sampling to accurately characterize the model, and that the distribution of samples among species would not be particularly important. Our expectations were met. We found that sampling of 2 individuals per species yielded poor results (Figure


**Testing the effect of allele sampling within species.** Legend: We chose four sampling schemes: 2, 5, 10 alleles and a lognormally distributed (mean = 5) number of alleles per species. All species trees had a depth of 80 N generations. The left pane shows the effect of sampling scheme on the threshold parameter (number of species). The dots are posterior means of individual replicates. The blue line indicates the number of expected species (50) and the red line shows the mean number of species diverging earlier than 4 N generations. The right pane shows the 95% HPD interval for each replicate. Gray points indicate the HPD overlapped the true number of species, black ones that it did not.

Finally, we tested the effects of nucleotide sampling and the incorporation of phylogenetic uncertainty. We expected to find wider HPDs with less sequence data, as uncertainty, particularly in branch lengths, should be greater. We found a mild reduction in accuracy of the posterior means with up to 600 bp of sequence, but after that, posterior means converged on those of the true tree. The 95% HPDs improved with the addition of more sequence, but had not quite converged on those estimated from the true tree, even at 2400 bp (Figure


**Testing the effect of DNA sequence length.** Legend: We compared results from 300 bp, 600 bp, 1200 bp and 2400 bp with those estimated from the true tree. The left pane shows the effect of nucleotide sampling on the threshold parameter (number of species). The dots are posterior means of individual replicates. The blue line indicates the number of expected species (50) and the red line shows the mean number of species diverging earlier than 4 N generations. The right pane shows the 95% HPD interval for each replicate. Gray points indicate the HPD overlapped the true number of species, black ones that it did not.

Three factors that could influence the accuracy of the model were not explored here: migration, population substructure and selection. Papadopoulou et al.

Papadopoulou et al.’s simulations assumed complete demic sampling, but Lohse

While Lohse shows convincingly that this interaction of parameter space with sampling can mislead the GMYC, it is not clear to what extent these problematic areas of parameter space exist in real datasets. We simulated 10 genealogies using ms under the conditions above and observed that the average time to coalescence of all lineages was 3,940 N generations (N is the size of a population in one deme), with the scattering phase taking the first 4-6 N generations. If we assumed that these 200 demes were species level taxa, each with

Empirical analyses

We reanalyzed the empirical data used by Pons et al. to illustrate the original formulation of the GMYC so as to provide a direct comparison of the implementations using representative data. The BEAST run converged after 27 million generations and we discarded 2.7 million trees as burn-in. The estimate of the standard deviation of the lognormal distribution of rates did not overlap 0, so we could not use a strict clock with these data. When using samples of trees from the BEAST posterior distribution, the mean number of species estimated by the Bayesian GMYC was 44 and the 95% HPD ranged from 34 to 57. The rate change parameter for the Yule process ranged as high as 1.9. In this model, the fold change in speciation rate from the root to the last speciation event is equal to ^{
p
}/


**Comparing Methodological Approaches.** Legend: We compared the new Bayesian implementation with the Likelihood method, along with the effect of varying priors and the inclusion of phylogenetic uncertainty. **4A** shows Akaike weights from the Likelihood method at each threshold point (gray circles), posterior probabilities given a U(0,2) prior (black triangles) and a U(0,1.2) prior (black crosses) on the Yule rate-change parameter. All three results were calculated from the maximum clade credibility tree. **4B** shows the same results, except the posterior probabilities were calculated by running the analysis on 100 trees sampled from the posterior distribution of trees generated in BEAST.
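Akaike weights of the kind plotted for the Likelihood method follow the usual formula w_i = exp(-Δ_i/2) / Σ exp(-Δ/2), where Δ_i is the difference between a model's AIC and the lowest AIC. A minimal Python sketch, assuming a list of per-threshold AIC scores is already available (the function name is hypothetical):

```python
import math

def akaike_weights(aic_values):
    """Convert AIC scores into Akaike weights that sum to one."""
    best = min(aic_values)
    # Relative likelihood of each model compared with the best one.
    rel = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(rel)
    return [r / total for r in rel]
```

When every candidate threshold has the same number of parameters, these weights are proportional to the likelihoods themselves.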


**Summary of empirical analyses.** Legend: We compare the results of Likelihood and Bayesian analyses of the Pons et al.

Conclusions

Our results demonstrate that the Bayesian implementation of the GMYC model is reasonably reliable given two caveats. First, the amount of data is important: when we sampled only 300 bp, or only 2 alleles per species, the performance of the model declined strongly. Second, the model is only useful when the underlying history of the species under consideration lies in particular regions of parameter space. Species that have recently diverged, or clades undergoing rapid radiation, are unlikely to be identifiable under the model. In the latter case, the model may provide misleading estimates and confidence. Cases such as these, however, may be recognizable because the results may be highly unexpected in the context of other sources of data such as morphology or geography.

Our implementation of the model provides two main improvements over the original. First, it allows the specification of prior probabilities on model parameters. It is our experience that very high values of the Yule process rate change parameter sometimes have high likelihood and result in high uncertainty in the threshold parameter (unpublished empirical data). These high values may be biologically unrealistic, and the specification of an informative prior can reduce the posterior probability of those areas and produce a more accurate estimate of diversity. Second, it allows for the characterization of species limits without use of a point estimate of the phylogeny. We know that many datasets are associated with substantial uncertainty owing to limited sequence data collection. The Bayesian GMYC method provides marginal probabilities of species identities and will allow downstream estimates of species diversity and community structure (which are often the goal of environmental sequencing studies;

An important future direction for this work is to implement the multiple-threshold version of the model proposed by Monaghan et al.

It is widely acknowledged that single-locus data are not optimal for the inference of phylogeny, historical demography, or species limits

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

NR and BC developed the original concept and designed the simulation experiments. NR wrote all software and conducted the analyses described in the manuscript. NR and BC interpreted results and wrote the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We thank the National Science Foundation (DEB-0918212) for funding aspects of this work. We thank Jeremy Brown for conversations that initiated this research, and members of the Carstens Lab (Sarah M. Hird, John D. McVay, Tara A. Pelletier and Jordan Satler) at Louisiana State University for discussions related to and comments on this manuscript. We thank Dr. Timothy Barraclough and two anonymous reviewers for helpful correspondence regarding this work and comments on drafts of the manuscript.