Department of Computer Science, University of Auckland, Auckland, New Zealand

Allan Wilson Centre for Molecular Ecology and Evolution, New Zealand

Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand

Abstract

Background

The multispecies coalescent model has become popular in recent years as a framework for inferring a species phylogeny from multilocus genetic data collected from multiple individuals. The model assumes that speciation occurs at a specific point in time, after which the two sub-species evolve in total isolation. In reality, however, speciation may occur over an extended period of time, during which sister lineages remain in partial contact. Inference of multispecies phylogenies under those conditions is difficult. Indeed, even designing simulators which correctly sample gene histories under these conditions is non-trivial.

Results

In this paper we present a method and software which simulate gene trees under both the multispecies coalescent and migration. Our approach allows both population sizes and migration rates to change over the species' lifetime. Migration rates are specified as the fraction of emigrants per time unit, which makes them easier to interpret. Overall this setup covers a wide range of migration scenarios. The software can be used to investigate properties of gene trees under different migration settings and to generate test cases for programs which infer species trees and/or migration from sequence data. Using simulated data we investigate the effect of migration between sister lineages on the inference of multispecies phylogenies and on post-analysis detection of migration.

Conclusions

Our results indicate that while estimation of species tree topology can be quite robust to the presence of gene flow, the inference and detection of migration is problematic, even with methods based on full likelihood models.

Background

The multispecies coalescent model

In all of these implementations, strict divergence is a standard assumption of the multispecies coalescent. Under strict divergence, a species is a perfectly mixing Wright-Fisher population until the moment of splitting, and from that point onwards the two sub-species evolve in total isolation. Strict divergence is a simplifying assumption, one which is violated by the presence of horizontal gene transfer, reassortment, migration or any other means of gene flow. Such simplifying assumptions are common in scientific models due to incomplete understanding of the processes involved, unavailability of analytical solutions or limitations in computational resources.

Here we focus on the effect of violating the central assumption of strict divergence. We model one specific type of gene flow – migration – and investigate its effects on the Bayesian inference of multispecies phylogenies. There are several software packages which infer species trees from multiple loci

Models of genetic differentiation in subdivided populations go back more than 70 years. In 1943 Wright introduced the “Island Model” in which “

There are a large number of existing coalescent simulators

Given that the gradual decline of gene flow after divergence could well be a likely occurrence, we consider the effect this migration has on inference of species trees. It has been previously shown

Wright

Implementation

Model for two species with time-dependent migration rates and population sizes

We begin by extending the two species model (Figure

**Classical migration model for two populations.** In the standard model, population sizes and migration rates are constant throughout the species' time-span.

The model specifies how lineages from two species interact over time. Just like the coalescent, it is best viewed as going back in time. Starting at time zero (present) with _{a} and _{b} lineages from _{a} by one. Also, a lineage may “jump” from

The instantaneous rate at which coalescent events occur depends on the effective population size, $N_e$( $N_e$ ($10^{-6}$ Myr) those translate to one hundred thousand individuals (i.e. $N_e$ = $10^{-6}$) over one million years. If on the other hand time was measured in thousands of years, with a generation time of 1 year $10^{-3}$ the same parameter values would equate to 100 individuals over one thousand years. If time is measured in generations ($N_e$ and

The instantaneous rate of coalescence is 1/

Modelling two species with reciprocal migration requires two population functions, _{a}(_{b}(_{a→b}(_{b→a}(

**Migration for time-dependent population sizes and migration rates for two populations.** Migration model for two populations where population size and migration rates vary over time. A migration rate of zero indicates complete separation.

Migration rates are specified in terms of _{a→b}(_{a}(_{a→b}(_{a}(_{a→b}(

It may seem that unequal migration rates would cause population sizes to change over time, but this is not the case. The model is in fact an extension of the classic Wright-Fisher model; under Wright-Fisher the parent of every individual is chosen uniformly at random from all individuals in the previous generation. When migration is allowed, the ancestor of _{a→b} for having an ancestor from the other population is the ratio of emigrants and effective population size,

Since migration is a non-homogeneous Poisson process, the density of migration waiting time from

Equation (2) is the continuous equivalent of the “backward migration rate” (Lemma 1 in

Migration under a species tree

The two-population model can be extended to a species tree in a natural way. When population $B$ splits into $B_1$ and $B_2$ there are six migration processes operating in parallel between the three populations: two between $B_1$ and $B_2$, two between $A$ and $B_1$, and two between $A$ and $B_2$ (Figure
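The count of six follows from having one directed migration process per ordered pair of populations. A minimal sketch of that bookkeeping, where the labels `A`, `B1` and `B2` and the function name are chosen purely for illustration:

```python
from itertools import permutations


def migration_processes(populations):
    """Return every ordered (source, destination) pair of distinct populations."""
    return list(permutations(populations, 2))


# Three populations after the split: one process per direction per pair.
processes = migration_processes(["A", "B1", "B2"])
for source, destination in processes:
    print(f"{source} -> {destination}")
print(len(processes))
```

With `k` populations alive at the same time the same enumeration gives `k * (k - 1)` processes, which is why explicit specification quickly becomes tedious for larger trees.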

**Migration rates for a species tree.** (**A**) Migration between A and B. (**B**) Migration between A, $B_1$ and $B_2$.

The total rate between $B_1 \cup B_2$ after the second split, going forward in time, is **as if** the two $B_1$ and $B_2$. The same logic applies to additional splits.

Note that in principle there are many possible ways a split may affect the migration. Here, we assume that the split is B’s “internal affair” and that the ability of individuals to migrate is unaffected by the split (Figure

**Three possible effects of a split on migration.** (**A**) Graphical view of migration between $B_{1,2}$ after the split of $B_1$ or $B_2$ to migrate to (**B**) An alternative way for a split. Now, only migration between $B_1$ and (**C**) A second alternative showing uneven contact between $B_{1,2}$.

Drawing event waiting times for two populations

We begin by describing the simulation process for two sister lineages. With two species there are four possible events at any time, two coalescences and two migrations, each with its own rate. Since those processes are independent and memoryless, the waiting times starting at zero (now) and going back

1. Start at time _{a},_{b} lineages in populations

2. Independently draw waiting times for each possible event. Let

3. Terminate if

4. Record the event with the smallest time. For example, if this is a coalescence in _{a} by one, and if this is a migration from _{b} by 1 and decrease _{a} by 1.

5. Increase
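For the special case of constant population sizes and constant migration fractions, the loop described in these steps can be sketched as follows. This is an illustrative sketch and not the authors' implementation: the function names, the `t_max` stopping time, and the rate formulas are assumptions made for the example. It assumes a coalescence rate of `1/N` per pair of lineages, and a backward migration rate per lineage equal to the number of emigrants divided by the size of the receiving population.

```python
import math
import random


def draw_wait(rate, rng):
    """Exponential waiting time; an impossible event (rate 0) never happens."""
    return rng.expovariate(rate) if rate > 0 else math.inf


def simulate_two_populations(n_a, n_b, N_a, N_b, m_ab, m_ba, t_max, seed=0):
    """Competing-events loop for two populations with constant parameters.

    n_a, n_b   : lineage counts at time zero (hypothetical names)
    N_a, N_b   : effective population sizes
    m_ab, m_ba : fraction of emigrants per time unit, forward in time
    t_max      : stop once the next event would fall beyond this time
    """
    rng = random.Random(seed)
    t, events = 0.0, []
    while n_a + n_b > 1:
        rates = {
            "coalescence_a": n_a * (n_a - 1) / 2 / N_a,
            "coalescence_b": n_b * (n_b - 1) / 2 / N_b,
            # Going back in time, a lineage in A moves to B when its ancestor
            # was one of the m_ba * N_b emigrants from B (assumed form).
            "jump_a_to_b": n_a * m_ba * N_b / N_a,
            "jump_b_to_a": n_b * m_ab * N_a / N_b,
        }
        # Draw a waiting time for every event and keep the smallest.
        waits = {name: draw_wait(rate, rng) for name, rate in rates.items()}
        event = min(waits, key=waits.get)
        if t + waits[event] > t_max:
            break
        t += waits[event]
        if event == "coalescence_a":
            n_a -= 1
        elif event == "coalescence_b":
            n_b -= 1
        elif event == "jump_a_to_b":
            n_a, n_b = n_a - 1, n_b + 1
        else:
            n_a, n_b = n_a + 1, n_b - 1
        events.append((t, event))
    return events, n_a, n_b


# Without migration each population coalesces down to a single lineage.
events, n_a_left, n_b_left = simulate_two_populations(
    3, 2, 1.0, 1.0, 0.0, 0.0, t_max=1e9
)
print(events, n_a_left, n_b_left)
```

Drawing one waiting time per event and taking the minimum mirrors the general procedure; with time-dependent rates only `draw_wait` would need to change.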

Impossible events, such as coalescence with fewer than two lineages or migration with zero lineages, are assigned an infinite waiting time. When all the population and migration functions are constant this reduces to the classic model. In that case it is possible to draw a single number – the waiting time to the first event – instead of drawing all times as we do in step 2. This computational speedup is not available here since we let both population sizes and migration rates vary over time. Drawing the required waiting times is relatively straightforward using the classic inverse transform which can be applied to sample from any density

where

On those sub-intervals, the migration fraction and both of the effective population size functions are linear, so the migration rate can be rewritten as follows,

for suitable coefficients _{0}, _{1} and _{2}. All those terms are easily integrated.
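As a concrete instance of inverse transform sampling on linear sub-intervals, consider the coalescence waiting time when the effective population size is piecewise linear. The sketch below is an illustration under stated assumptions: the function name and argument layout are invented here, only the coalescence rate `pairs / N(t)` is handled, and the population size is taken to be constant after the last breakpoint. Migration rates would additionally need the terms described above.

```python
import math


def coalescent_wait(pairs, times, sizes, t_start, u):
    """Event time for rate pairs / N(t), with N(t) piecewise linear.

    times   : increasing breakpoints, with times[0] <= t_start
    sizes   : positive N(t) at each breakpoint (constant after the last one)
    u       : a uniform(0, 1) draw
    Returns the absolute time of the event, or infinity if pairs == 0.
    """
    if pairs == 0:
        return math.inf
    target = -math.log(1.0 - u)  # cumulative hazard that must be accumulated
    t = t_start
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t >= t1:
            continue
        slope = (sizes[i + 1] - sizes[i]) / (t1 - t0)
        n_t = sizes[i] + slope * (t - t0)  # N at the current time
        # Integrated rate over [t, t1]: a logarithm when N is linear.
        if slope == 0:
            segment = pairs * (t1 - t) / n_t
        else:
            segment = pairs / slope * math.log(sizes[i + 1] / n_t)
        if segment >= target:
            # Invert the cumulative hazard inside this sub-interval.
            if slope == 0:
                return t + target * n_t / pairs
            return t + n_t * math.expm1(target * slope / pairs) / slope
        target -= segment
        t = t1
    return t + target * sizes[-1] / pairs


# N(t) = 1 + t on [0, 10]; the hazard reaches ln 2 at t = 1.
wait = coalescent_wait(1, [0.0, 10.0], [1.0, 11.0], 0.0, 0.5)
print(wait)  # approximately 1.0
```

Walking the sub-intervals and subtracting the hazard already accumulated keeps every step in closed form, so no numerical root finding is needed.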

Simulating a gene tree with migration under the multispecies coalescent

Simulating migration and coalescence for two species can be generalized to _{s} species in a straightforward way. Again, we move back in time from the present (_{k} species with _{i} lineages in species _{i}>1. We pick the event with the smallest waiting time and apply it as previously explained, unless the event passes over a divergence time (i.e. a species union when going back in time). In that case the event is rejected, time is advanced to the divergence point, the species lineages are merged, and the number of species is reduced by one.

A simple parametrization of migration on a species tree

While a user can explicitly specify the migration rates for any species tree, doing so for more than a few cases is time consuming and prone to human bias. A specification via a more generic scenario where migration rates are set stochastically from a few parameters is more convenient and enables generating sets of test cases for quantitative exploration of the effect of migration on species inference.

One natural scenario is _{d}, which declines linearly to zero at complete separation _{s}, that is, _{s}≤_{d}.
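A linearly declining migration rate of this kind can be sketched as a function of time measured backwards from the present. The names `t_d`, `t_s` and `m_max` are illustrative assumptions: the rate is taken to be `m_max` at the divergence time `t_d` and zero from the time of complete separation `t_s` onwards (towards the present), with `t_s <= t_d`.

```python
def gradual_separation_rate(t, t_d, t_s, m_max):
    """Migration rate at backward time t under a linear decline.

    Forward in time the rate starts at m_max when the species diverge (t_d)
    and reaches zero at complete separation (t_s), so 0 <= t_s <= t_d.
    """
    if not 0 <= t_s <= t_d:
        raise ValueError("expected 0 <= t_s <= t_d")
    if t <= t_s or t_d == t_s:
        return 0.0  # fully separated (or instantaneous separation)
    if t >= t_d:
        return m_max
    return m_max * (t - t_s) / (t_d - t_s)


# Halfway between complete separation (t=1) and divergence (t=2).
print(gradual_separation_rate(1.5, t_d=2.0, t_s=1.0, m_max=0.4))
```

Setting `t_s == t_d` recovers strict divergence, since the rate is then zero at all times.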

Results

To quantitatively explore the effect of migration we generated several data sets using the gradual separation scenario. Unless stated otherwise, each set is composed of several test cases generated as follows: first draw a species tree at random using a Yule birth model with a rate $S/(2\lambda)$ and standard deviation 0.25 in log space, $1/(2\lambda)$ being the expected length of the species tree branch in a Yule tree _{1}/ _{2} is not allowed after immigration between _{1} and _{2} stops. Note that this is not a limitation in the model, and our software allows continued migration if required.

With the setup and methods as described, gene trees for 5 species with 10 individuals per species were simulated subject to coalescence and migration. The species tree has an average height of 1.6

Migration events as a function of

How do values of

**Expected number of migration events.** (**A**) Expected number of migration events in one gene tree. (**B**) Expected number of gene tree coalescences inconsistent with the species tree.

The near symmetry around the

In real multilocus sequence data, migration events are not observed directly; instead, they alter the relation between the species tree and gene trees. Their effect can vary: a coalescence involving a migrating lineage can create an inconsistency between a gene tree and the strict species tree. Migrations not involved in such a coalescence have a more subtle effect, altering coalescence waiting times. Note that the number of inconsistent coalescences (Figure

Also note that those are expected values. With

Weak and strong speciation

In the presence of migration there are several interpretations for the divergence times in inferred species trees. At one end, there is the

To explore this issue, we simulated data for a range of M and S values. For each combination of parameters, we generated 100 replicate data sets, each set comprising 4 loci of 1600 bp for 5 species, with 10 individuals per species. We generated samples from the posterior distribution using ⋆BEAST. While BEST

For each data set we generated a chain of 8.8M trees, discarding the first 10% as burn-in, and then computed the posterior mean distance from trees in the sample to the weak and strong species trees respectively. We used the normalized rooted branch score of

Table _{d},_{s},_{w} are the estimated divergence, divergence time in the strong tree and in the weak tree, respectively.

Percentage of test cases where posterior trees are closer to the weak speciation. Closer here means a smaller distance between trees based on the Rooted Branch Score. Shown are a few choice values of

| **M** | **S** | **1600 bp** | **∞** |
|---|---|---|---|
| 0.5 | 0.5 | 97% | 85% |
| 1 | 0.5 | 83% | 71% |
| 2 | 0.5 | 66% | 43% |
| 3 | 0.5 | 46% | 30% |
| 3 | 0.8 | 24% | 12% |

Three different measures assessing the relation of posterior samples from a ⋆BEAST run and their strong and weak species tree. Same data set as for Table

| **M** | **S** | **Branch score** | **Pair divergence times** | **Mean pair location** |
|---|---|---|---|---|
| 0.5 | 0.5 | 97% | 90% | 0.93 |
| 1 | 0.5 | 83% | 91% | 0.79 |
| 2 | 0.5 | 66% | 67% | 0.68 |
| 3 | 0.5 | 46% | 57% | 0.63 |
| 3 | 0.8 | 24% | 37% | 0.51 |

One should keep in mind that there is not a single obvious way to match divergence times from the posterior to those from a fixed tree. The approach we took here is to use divergence times from all possible taxa pairs. This may lead to various types of bias which may depend on details of the species tree, or on the fact that there are more pairs with earlier divergence times than with later ones.

Post analysis detection

The interplay between gradual speciation and divergence time estimation would be expected to have a significant impact on those methods using divergence times to test for gene flow and hybridization. One such method is JML, a program for detecting hybridization events using posterior predictive checking

To test the performance of JML in detecting gradual separation we generated 100 species trees for 5 species using a pure birth (Yule) process with a birth rate of 0.4. Population sizes were assigned randomly with a spread of ±20 $\tfrac{5}{8}$ (half of the expected species lifetime, $\tfrac{1}{2} \times \tfrac{1}{2 \times 0.4}$). The same 100 species trees were used to generate two ⋆BEAST data sets, one without migration and the other with $\tfrac{1}{2}$. Both sets used the same number of loci and individuals (4 and 6 respectively), the Jukes-Cantor substitution model with a strict clock at a rate of 0.005, and sequences of length 1600 bp. With those settings, the sequence identity of two random individuals is on average 99.4%, while two individuals from different species are 97.1% identical, for the set of trees without migration
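The value 5/8 can be checked directly, taking the expected branch length of a Yule tree with birth rate λ to be 1/(2λ) as stated earlier. A short check using exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

birth_rate = Fraction(2, 5)             # lambda = 0.4
expected_branch = 1 / (2 * birth_rate)  # 1 / (2 * lambda) = 5/4
half_lifetime = expected_branch / 2     # half of the expected species lifetime
print(half_lifetime)
```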

⋆BEAST was run for each analysis and JML version 1.00 was run for each of the 4 loci with a significance threshold of 0.1. JML detected migration in 63 cases out of the 100 for the first set (without migration), detecting 1 migration in 38 cases and 2 migrations in 13 cases. JML detected migration in 69 cases out of the 100 in the second set. Out of a total of 887 pairs containing an inconsistent coalescence event occurring after the pair divergence, JML correctly detected 95 and falsely detected 48.
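To put the pair counts in proportion, the following sketch computes two summary ratios from the reported numbers. The interpretation is an assumption: it reads the 95 correct detections as a share of the 887 pairs with an inconsistent coalescence, and the 48 false detections as pairs flagged without one.

```python
pairs_with_inconsistent_coalescence = 887
correctly_detected = 95
falsely_detected = 48

# Share of affected pairs that JML flagged.
detection_rate = correctly_detected / pairs_with_inconsistent_coalescence
# Share of all flagged pairs that were flagged correctly.
correct_share = correctly_detected / (correctly_detected + falsely_detected)

print(f"detected: {detection_rate:.1%}, correct among flagged: {correct_share:.1%}")
```

Under this reading roughly one affected pair in ten is detected, while about a third of the flagged pairs are false positives.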

Validation

The software code was tested extensively by comparing event time distributions from the code with distributions from a simpler process which proceeds backwards in time as follows: in each small time step

Additionally we can derive the coalescence time distributions under basic settings, and compare those with the results from a large set of simulated trees. Figure

**Root Height distribution in two simple cases.** The distribution of the root height in two simple cases. The values from 20,000 simulated trees are shown in blue, while the theoretical values are shown by the red line. (**A**) 2 species _{a}=1, _{b}=2 and _{ab}=1. 1 lineage in each species. (**B**) Same settings, with 2 lineages from

Discussion

It is somewhat surprising to find that ⋆BEAST detects incipient species before they are fully separated! Only at around 3 migrants per generation, over half of the species’ lifetime, does the tide turn towards estimates of the species divergence times that reflect the respective times of complete species separation. This result does not seem to depend on the method used to measure the distance; the three measures shown in Table

It is fairly obvious that small

Those observations may clarify the large difference between our analysis results for simulations of finite short sequences versus infinite length sequences. With infinite length sequences, coalescence events in gene trees are fixed in time, so we get estimates corresponding to complete separation times with smaller

**Comparing distance distributions.** Comparing the distributions of posterior rooted branch scores of data sets with migration and without. The fundamental difference between short sequences (left) and infinite sequences (right) is clear.

When considering the results presented here we should keep in mind that migration can be modelled in many ways. We examined mainly gradual separation in species undergoing rapid radiation, where the amount of genetic diversity between sequences is relatively low. We have examined only an infinitesimal part of the problem domain. We used only a fixed birth rate, assigned population sizes in a particular way, considered only a few combinations of species, individuals and loci, and used

Conclusions

We describe a technique for simulating genealogies according to the multispecies coalescent with time-dependent migration. Coalescent-based simulators have the advantage of being more computationally efficient than forward simulators; however, the constraints of the coalescent sometimes make it more difficult to model complex evolutionary phenomena.

A key feature of our simulator is that it can incorporate variation in migration rates along the lifetime of a species. This is particularly important when exploring the dynamics of speciation, and the impact different forms of speciation have on the inference of species trees and demographics.

The complexity inherent in considering both (gradual) migration and incomplete lineage sorting necessitates an incomplete treatment of the problem. We have investigated only a tiny fraction of the parameter space that could be simulated. We have also not considered other inference packages (Bayesian or otherwise) that treat incomplete lineage sorting. The effects of gradual migration on these other methods remain to be determined. Some work has been done on the effect of migration on the estimation of species delimitation

Our experimental results suggest, however, that inference of migration from observed data is difficult, even with a full-likelihood model. Even the simpler task of detecting migration is problematic, as demonstrated by JML “finding” migrations in approximately 2/3 of the test cases with and without migration. Naturally, the amount of signal will vary by context and the full extent to which parameters can be identified in practice remains unknown. Nevertheless, our initial observations signal the need both for caution and continued research. The simulation software we have presented here should provide a tool for this investigation.

Availability and requirements

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JH, DB and AJD designed the research plan. JH wrote the code, performed the analyses and wrote the first draft of the manuscript. All authors contributed to the final manuscript.

Acknowledgements

JH and AJD were funded by a Rutherford Discovery Fellowship from the Royal Society of New Zealand awarded to AJD.