Allan Wilson Centre for Molecular Ecology and Evolution, University of Auckland, Private Bag 92019, Auckland, New Zealand

Computational Evolution Group, University of Auckland, Private Bag 92019, Auckland, New Zealand

Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095, USA

Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90095, USA

Abstract

Background

Relaxed molecular clock models allow divergence time dating and "relaxed phylogenetic" inference, in which a time tree is estimated in the face of unequal rates across lineages. We present a new method for relaxing the assumption of a strict molecular clock using Markov chain Monte Carlo to implement Bayesian model averaging over random local molecular clocks. The new method approaches the problem of rate variation among lineages by proposing a series of local molecular clocks, each extending over a subregion of the full phylogeny. Each branch in a phylogeny (subtending a clade) is a possible location for a change of rate from one local clock to a new one. Thus, including both the global molecular clock and the unconstrained model results, there are a total of 2^{2n-2} possible rate models available for averaging with 1, 2, ..., 2

Results

We propose an efficient method to sample this model space while simultaneously estimating the phylogeny. The new method conveniently allows a direct test of the strict molecular clock, in which one rate rules them all, against a large array of alternative local molecular clock models. We illustrate the method's utility on three example data sets involving mammal, primate and influenza evolution. Finally, we explore methods to visualize the complex posterior distribution that results from inference under such models.

Conclusions

The examples suggest that large sequence datasets may only require a small number of local molecular clocks to reconcile their branch lengths with a time scale. All of the analyses described here are implemented in the open access software package BEAST 1.5.4 (

Background

In 1967, Allan Wilson and his then doctoral student Vincent Sarich described an "evolutionary clock" for albumin proteins and exploited the clock to date the common ancestor of humans and chimpanzees to five million years ago

Researchers have grappled with the tension between molecular and non-molecular evidence for evolutionary time scales ever since. Recently, a number of authors

Local molecular clocks are another alternative to the global molecular clock ^{2n−2}, where

In this paper we employ Markov chain Monte Carlo (MCMC) to investigate a Bayesian random local clock (RLC) model, in which all possible local clock configurations are nested. We implement our method in the BEAST 1.x ^{2n− 2 }possible local clock models on all possible rooted trees. Because the RLC model includes the possibility of zero rate changes, it also serves to test whether one rate is sufficient to rule all the gene sequences at hand, as was Wilson and Sarich's view of the African primate albumins.
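To give a sense of the size of this model space, the count 2^{2n-2} can be evaluated directly. The sketch below assumes one on/off rate-change indicator for each of the 2n - 2 branches of a rooted binary tree with n tips; the function name is hypothetical, and the tip counts are those of the three example data sets.

```python
def count_local_clock_models(n_tips: int) -> int:
    """Number of local clock configurations on one rooted binary tree.

    Assumption: each of the 2n - 2 branches either carries a rate change
    or does not, giving 2 ** (2n - 2) configurations.
    """
    if n_tips < 2:
        raise ValueError("a rooted tree needs at least two tips")
    n_branches = 2 * n_tips - 2
    return 2 ** n_branches


# Tip counts taken from the three examples (anthropoids, mammals, influenza).
for label, n_tips in [("anthropoids", 7), ("mammals", 42), ("influenza", 69)]:
    print(label, n_tips, count_local_clock_models(n_tips))
```

Even for seven sequences there are 4096 configurations per tree, which is why exhaustive enumeration is impractical and a sampling approach is needed.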

Methods

Basic evolutionary model

We begin by considering data **Y**, consisting of aligned molecular sequences of length **Y **= (**Y**_{1}, ..., **Y**_{S}**Y**_{s }** τ **. The tree

Letting **Φ **= (**Λ**,**Y**_{s}**τ**, **Φ**). Felsenstein's peeling/pruning algorithm **Y**_{s}**τ**,**Φ**). Assuming that sites are independent and identically distributed given (**τ**,**Φ**) yields the complete data likelihood

Branch-specific rate variation

We take the opinion that variation in the rate of molecular evolution is widespread

Model parameterization

We introduce the RLC model that allows for sparse, possibly large-scale changes while maintaining spatial correlation along the tree. We start at the unobserved branch leading to the most recent common ancestor (MRCA) of the tree and define the composite rate _{MRCA }= 1. Substitutions then occur on each branch

where pa(** ϕ **= (

Allowing all elements in ** ϕ **to vary independently leads to a completely non-clock-like model with, even worse, far too many free parameters for identifiability with the divergence times in

Bayesian stochastic search variable selection

To infer which branch-specific rates _{k }**X**_{1},...,**X**_{P }**Y**. For example, the full model becomes **Y **= [**X**_{1},...,**X**_{P}** β + ϵ**, where

Recent work in BSSVS ** δ **= (

To map BSSVS into the setting of rate variation, let _{k }_{k }_{pa(k)}. Conversely, when _{k }_{k }_{pa(k) }implying that _{k }** ϕ **play an analogous role to the regression coefficients in BSSVS. An important difference is that

Prior specification

To specify a prior distribution over ** δ **= (

where λ is the prior expected number of rate changes along the tree **τ**. Choosing λ = log 2, for example, sets 50% prior probability on the hypothesis of no rate changes.
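The 50% figure can be checked numerically. The sketch below assumes that the number of rate changes follows a plain Poisson distribution with mean λ, so that the probability of zero changes is exp(-λ); any truncation of the Poisson at the number of branches is ignored here, and the function names are hypothetical.

```python
import math


def prior_prob_no_rate_change(lam: float) -> float:
    """P(K = 0) for K ~ Poisson(lam), assuming no truncation."""
    return math.exp(-lam)


def lambda_for_prior_prob(p_zero: float) -> float:
    """Inverse relation: the lam giving a desired prior P(K = 0)."""
    if not 0.0 < p_zero < 1.0:
        raise ValueError("p_zero must lie strictly between 0 and 1")
    return -math.log(p_zero)


lam = math.log(2.0)
print(prior_prob_no_rate_change(lam))   # prior probability of a global clock
print(lambda_for_prior_prob(0.5))       # recovers log 2
```

The same inverse relation lets an analyst pick λ to encode a different prior belief in the strict clock.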

Completing the RLC prior specification, we assume that all rate multipliers in ** ϕ **are

When _{k }_{k }_{k }_{k }

Normalization

To translate between the expected number of substitutions _{k }_{k}

where ** τ **, the parameterization in Equation (5) again leads to more degrees-of-freedom than are identifiable. We solve this difficulty through a further normalization constraint

To maintain this scaling, we sum Equation (5) over all branches and substitute the result into Equation (6). This eliminates the unknown

Posterior simulation

We take a Bayesian approach to data analysis and draw inference under the RLC model via MCMC. MCMC straightforwardly generates random draws with first-order dependence through the construction of a Markov chain that explores the posterior distribution. Via the Ergodic Theorem, simple tabulation of a chain realization {**θ**^{(1)},...,**θ**^{(L)}} can provide adequate empirical estimates. To generate a Markov chain using the Metropolis-Hastings algorithm, one starts from the current state **θ**^{(ℓ)} and randomly proposes a new state **θ*** drawn from an arbitrary distribution with density

The first term in the acceptance probability above is the ratio of posterior densities and the term involving the transition kernel is the Hastings ratio. The beauty of the algorithm is that the posterior densities only appear as a ratio, so that the intractable normalizing constant cancels out.
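As an illustration of the general recipe, the following is a minimal, generic Metropolis-Hastings sketch on a one-dimensional toy target. It is not the sampler implemented in BEAST; the target (a standard normal known only up to a constant), the Gaussian random-walk proposal and all function names are assumptions chosen for brevity.

```python
import math
import random


def metropolis_hastings(log_post, propose, log_q, theta0, n_steps, rng):
    """Generic Metropolis-Hastings loop.

    log_post(theta): log posterior density up to an additive constant.
    propose(theta, rng): draws a candidate state given the current one.
    log_q(new, old): log density of proposing `new` from `old`.
    """
    theta, lp = theta0, log_post(theta0)
    chain = [theta]
    for _ in range(n_steps):
        theta_star = propose(theta, rng)
        lp_star = log_post(theta_star)
        # Log posterior ratio plus log Hastings ratio q(theta | theta*) / q(theta* | theta).
        log_alpha = (lp_star - lp) + (log_q(theta, theta_star) - log_q(theta_star, theta))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta, lp = theta_star, lp_star
        chain.append(theta)  # a rejected proposal repeats the current state
    return chain


STEP = 1.0  # proposal standard deviation (illustrative choice)


def log_target(x):
    # Standard normal log density with its normalizing constant omitted.
    return -0.5 * x * x


def propose(x, rng):
    return x + rng.gauss(0.0, STEP)


def log_q(new, old):
    # Symmetric Gaussian random walk, so the Hastings ratio equals one.
    return -0.5 * ((new - old) / STEP) ** 2


chain = metropolis_hastings(log_target, propose, log_q, 0.0, 20000, random.Random(1))
print(sum(chain) / len(chain))  # empirical mean of the tabulated chain
```

Because only differences of `log_post` enter `log_alpha`, the omitted normalizing constant never has to be computed.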

Transition kernels

We employ standard phylogenetic transition kernels via a Metropolis-within-Gibbs scheme, as implemented in BEAST ** ϕ **and all possible local clock indicator

where 0 _{f }<

Transition kernels on ** δ **are more challenging. One natural way to construct a Markov chain on a bit-vector state space, such as

At first glance, the transition kernel density ** δ***|

To determine _{k }_{k }

This derivation provides an important lesson for those new to MCMC implementation: the Hastings ratio may vary depending on the model parameterization. It is therefore necessary to calculate the ratio under the same parameterization as the prior.
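One way to see why the parameterization matters is the following minimal sketch. It rests on illustrative assumptions that are not taken from the derivation above: a Poisson prior on the number of rate changes K, configurations with equal K treated as equally likely, a tree with B branches, and a proposal that flips one uniformly chosen indicator. The prior-times-Hastings part of the acceptance ratio for switching one indicator on is computed twice, once entirely in terms of the indicator vector and once entirely in terms of K; the likelihood ratio is the same in both cases and is omitted. All function names are hypothetical.

```python
import math


def log_poisson(k: int, lam: float) -> float:
    """Log Poisson(lam) mass at k; a truncation constant would cancel in ratios."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def log_ratio_indicator_space(K: int, B: int, lam: float) -> float:
    """Move K -> K + 1 with prior and proposal both expressed on the indicators.

    Assumed prior on one configuration: p(K) / C(B, K).
    Flipping a uniformly chosen bit is symmetric, so the log Hastings ratio is 0.
    """
    log_prior_new = log_poisson(K + 1, lam) - math.log(math.comb(B, K + 1))
    log_prior_old = log_poisson(K, lam) - math.log(math.comb(B, K))
    return log_prior_new - log_prior_old + 0.0


def log_ratio_count_space(K: int, B: int, lam: float) -> float:
    """Same move with prior and proposal both expressed on the count K."""
    log_prior = log_poisson(K + 1, lam) - log_poisson(K, lam)
    # Forward move picks one of the B - K 'off' bits; reverse picks one of the K + 1 'on' bits.
    log_hastings = math.log((K + 1) / B) - math.log((B - K) / B)
    return log_prior + log_hastings


def log_ratio_mismatched(K: int, B: int, lam: float) -> float:
    """Inconsistent mix: prior on K but the indicator-space Hastings ratio of 1."""
    return log_poisson(K + 1, lam) - log_poisson(K, lam)


B, lam = 12, math.log(2.0)
for K in range(3):
    print(K,
          log_ratio_indicator_space(K, B, lam),
          log_ratio_count_space(K, B, lam),
          log_ratio_mismatched(K, B, lam))
```

Under these assumptions the two consistent calculations agree, whereas pairing a prior written on K with a Hastings ratio written on the indicators gives a different, incorrect acceptance ratio.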

In cases where the swap event relaxes the prior variance on the rate multiplier _{k}

Proposals involving changes to the tree topology are based on existing tree proposal moves in the BEAST software framework with a small modification to track the augmented data at the nodes [see Additional file

**Supplementary Information**. This is a PDF file describing some additional details of the described methods including (i) a description of the proposal distribution for trees used in the RLC model and (ii) a summary of the analysis of the influenza data using a "fixed epoch" model that allows the rate of evolution to change at a specific time in the past.

Model selection

Statistical inference divides into two intertwined approaches: parameter estimation and model selection. For the former, parameter inference relies on empirical estimates of the posterior p(**θ**|**Y**) that we tabulate from the MCMC draws. Model selection often represents a more formidable task. The natural selection criterion in a Bayesian framework is the Bayes factor

and informs the phylogeneticist how she (he) should change her (his) prior belief p(ℳ_{1})/p(ℳ_{0}) about the competing models in the face of the observed data. Involving the evaluation of two different normalizing constants, Bayes factors are often challenging to estimate.

By fortuitous construction, we side-step this computational limitation when estimating the Bayes factor in favor of a global clock (GC) model ℳ_{GC} over the RLC model ℳ_{RLC}. Model ℳ_{GC} occurs when the number of rate changes K = 0 within model ℳ_{RLC}. Consequently, p(K = 0|ℳ_{RLC}) equals the prior probability of ℳ_{GC}, and p(K = 0|**Y**,ℳ_{RLC}) yields p(ℳ_{GC}|**Y**). Given this, a Bayes factor test of ℳ_{GC} only requires simulation under the RLC model. The Bayes factor in favor of a global clock

To calculate the ratio of marginal likelihoods we need only an estimator of p(K = 0|**Y**, ℳ_{RLC}). The Ergodic Theorem suggests that we let

where 1{·} is the indicator function. Occasionally the estimate of p(K = 0|**Y**,ℳ_{RLC}) decreases below ϵ or increases above 1 - ϵ for ϵ ≈ 1/L. In such situations, there are alternatives that depend on MCMC chains generated under several different prior probabilities p(K = 0|ℳ_{RLC})
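A minimal sketch of this calculation follows. It assumes that the posterior probability of the global clock is estimated by the fraction of MCMC draws with zero rate changes, and that the Bayes factor is formed as the posterior odds of zero rate changes divided by the corresponding prior odds. The function name, its arguments and the toy draws are hypothetical.

```python
def estimate_bayes_factor_gc(rate_change_counts, prior_prob_gc):
    """Bayes factor in favor of a global clock from draws under the RLC model.

    rate_change_counts: number of rate changes recorded at each MCMC draw.
    prior_prob_gc: prior probability of zero rate changes under the RLC prior.
    """
    n_draws = len(rate_change_counts)
    if n_draws == 0:
        raise ValueError("no MCMC draws supplied")
    if not 0.0 < prior_prob_gc < 1.0:
        raise ValueError("prior probability must lie strictly between 0 and 1")
    # Empirical average of the indicator 1{K = 0} over the chain.
    post_prob_gc = sum(1 for k in rate_change_counts if k == 0) / n_draws
    if post_prob_gc == 0.0 or post_prob_gc == 1.0:
        # The estimate is at the resolution limit of roughly 1/L of the chain.
        raise ValueError("posterior estimate is 0 or 1; odds are undefined")
    posterior_odds = post_prob_gc / (1.0 - post_prob_gc)
    prior_odds = prior_prob_gc / (1.0 - prior_prob_gc)
    return posterior_odds / prior_odds


# Toy draws: three of four samples have no rate change, with a 50% prior.
print(estimate_bayes_factor_gc([0, 0, 0, 1], 0.5))
```

The guard on estimates of exactly 0 or 1 reflects the resolution limit noted above, in which case a different prior or a longer chain would be needed.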

Results

To explore the utility of the RLC model, we consider three well-studied examples that span the evolutionary scales from millions of years down to annual seasons. The first example investigates rate variation of several nuclear genes across the radiation of mammals

Radiation of rodents and other mammals punctuated by local clocks

**Bayesian inference of random local clocks on mammalian data**. Most probable evolutionary tree relating three nuclear genes from 42 mammals

Amongst the very small collection of local clock models that

**Prior and posterior distributions of the number of rate changes for three molecular data sets**. Comparison of posterior (red) to prior (blue) probability mass functions of the number of rate changes for the **(a)** mammal, **(b)** primate and **(c)** influenza examples. In all examples, the prior probability of a global molecular clock (

Anthropoids' global clock

As an example in which a global clock should hold, we re-examine the seven anthropoid sequences under the RLC model. We employ the HKY85

**Inferred mtDNA rates for primate phylogeny**. Most probable evolutionary tree relating seven mtDNA sequences from primates _{k }

An important use of the molecular clock hypothesis is in estimating divergence times, and this ability remains under the RLC model. Near the tree branches in the figure, we also report 95% BCIs for the branch-specific relative rates _{k}_{GC }from knowledge of the model prior and an estimate _{GC }= 3.3. While this Bayes factor is far from offering extreme support

Temporal rate patterns in influenza

We examine hemagglutinin gene evolution from 69 strains of human influenza A

**Human.H3.81-98 _{-}local_{-}gamma.xml**. This is a BEAST XML input file compatible with BEAST 1.5.4 that implements the model combination used to analyze the influenza data set under the RLC model.

**Influenza A data analysis**. **(a)** Most probable evolutionary tree relating 69 hemagglutinin sequences from human influenza A. Branch coloring indicates inferred rates of nucleotide substitution, with blue denoting the slowest rates and red the fastest. **(b)** Rate heterogeneity of hemagglutinin sequence evolution over time. The plot traces the marginal distribution of relative substitution rates across time. White indicates low posterior density, and yellow/red indicates high density. The estimated rates are higher towards the present, with notable jumps in rate approximately six and ten years before the last sequence sample.

We caution against over-interpretation of the punctuated form of the transitions between epochs seen in Figure

**Human.H3.81-98 _{-}2rate.xml**. This is a BEAST XML input file compatible with BEAST 1.5.4 that implements the "fixed epoch" model used to confirm the signal for a temporal rate change in the influenza data set.

Discussion

Although it has been clear for quite some time that no universal molecular clock exists, a new question is emerging: what is the phylogenetic footprint of local molecular clocks? With increasingly densely sampled phylogenetic trees, we should start to obtain estimates of the extent of local clocks.

A major limitation of local clock models has been a dearth of methods to appraise all the possible rate assignments for various lineages ^{2n-2 }possible local clock models and automatically returns the most parsimonious descriptions of the data.

The RLC description finds notable similarity to a compound Poisson process for rate variation

Compared to the auto-correlated rate models

Further, hybrid models remain within reach in which rate multipliers ** ϕ **draw

While the transition kernels we employ in this paper successfully explore the posterior distribution for the three examples, we can envision datasets for which our algorithm would have difficulties producing accurate estimates of the posterior distribution. High correlation most likely exists between the evolutionary tree ** τ **and location indicators

Alternatively, ** τ **or all possible

Conclusions

We have proposed an efficient method to sample over random local molecular clocks while simultaneously estimating the phylogeny. The new method conveniently allows a comparison of the strict molecular clock against a large array of alternative local molecular clock models. We have illustrated the method's utility on three example data sets involving mammal, primate and influenza evolution. We also explored a method to visualize the complex posterior distribution on the influenza data set, which led to the discovery of a strong temporal signal for the evolutionary rate in that data set, although this observation may well be attributable to temporal variation in sampling pattern. The examples that we have investigated suggest that large sequence datasets may only require a relatively small number of local molecular clocks to reconcile their branch lengths with a time scale. All of the analyses described here are implemented in the open access software package BEAST 1.5.4.

Authors' contributions

Both authors developed the idea and conducted the main experiments. AJD implemented the Bayesian stochastic search variable selection in the BEAST 1.5 and BEAST 2 open source software packages. Both authors debugged the software and wrote supporting software to analyze and visualize the results. Both authors were involved in the writing of the manuscript.

Acknowledgements

This paper was conceived in New Zealand, the new Middle Earth. We thank the Department of Computer Science, University of Auckland for hosting M.A.S. as an Honorary Research Fellow. We thank Andrew Rambaut for assisting with the fixed-epoch analysis of the Influenza data set. This work is supported in part by the John Simon Guggenheim Memorial Foundation, the National Evolutionary Synthesis Center (NSF #EF-0423641) and NIH R01 GM086887.