ARL Division of Biotechnology, University of Arizona, AZ 85721, USA

Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94143, USA

Abstract

Background

Despite intensive efforts devoted to collecting human polymorphism data, little is known about the role of gene flow in the ancestry of human populations. This is partly because most analyses have applied one of two simple models of population structure, the island model or the splitting model, which make unrealistic biological assumptions.

Results

Here, we analyze 98 kb of DNA sequence from 20 independently evolving intergenic regions on the X chromosome in a sample of 90 humans from six globally diverse populations. We employ an isolation-with-migration (IM) model, which assumes that populations split and subsequently exchange migrants, to independently estimate effective population sizes and migration rates. While the maximum effective size of modern humans is estimated at ~10,000, individual populations vary substantially in size, with African populations tending to be larger (2,300–9,000) than non-African populations (300–3,300). We estimate mean rates of bidirectional gene flow at 4.8 × 10^{-4}/generation. Bidirectional migration rates are ~5-fold higher among non-African populations (1.5 × 10^{-3}) than among African populations (2.7 × 10^{-4}). Interestingly, because effective sizes and migration rates are inversely related in African and non-African populations, population migration rates are similar within Africa and Eurasia (e.g., global mean Nm = 2.4).

Conclusion

We conclude that gene flow has played an important role in structuring global human populations and that migration rates should be incorporated as critical parameters in models of human demography.

Background

Reconstructing human history requires an accurate picture of global human population structure.

**Models of population structure reflecting the (A) island, (B) splitting and (C) isolation-with-migration (IM) models**. The island model assumes equilibrium gene flow (m) between subpopulations that have no shared ancestry. The splitting (divergence) model describes an ancestral population that splits at time t into two daughter populations, which do not exchange genes in subsequent generations. The isolation-with-migration model describes a constant-sized ancestral population that splits into two daughter populations that can exchange genes and change in size. There are seven parameters in the isolation-with-migration model: effective population size of the ancestral deme (N_{A}), effective population sizes of the two descendent demes (N_{1} and N_{2}), unidirectional migration rates between the descendent populations (m_{1} and m_{2}), proportion of the ancestral population founding deme 1 (S), and population divergence time (t).

Furthermore, human population structure is often considered only from the perspective of a single summary statistic, F_{ST}, which is a standardized measure of the genetic variation shared between populations. However, F_{ST} depends on the effective size (N), migration rate (m) and divergence time (t) of individual demes, and by itself has no straightforward demographic interpretation. A small F_{ST} between two populations may indicate large effective population sizes, high rates of gene flow since diverging from a common ancestral population, or recent population divergence. For instance, under Wright's island model, Nm and F_{ST} are linked by a nonlinear relationship independent of t (equation 2), so an X chromosome F_{ST} of 0.2, combined with an effective population size of 10^{4}, translates directly into a migration rate. Conversely, under a splitting model, t scaled by N and F_{ST} are related by a nonlinear relationship independent of m (equation 4), and an F_{ST} of 0.2 suggests that global human populations diverged on average ~84 kya. However, despite being in common usage, gene flow-only and divergence-only models probably have little relevance to actual human demographic history.

Here, we examine the structure of human populations by means of the isolation-with-migration (IM) model. The IM model has seven parameters: effective population size of the ancestral deme (N_{A}), effective population sizes of the two descendent demes (N_{1} and N_{2}), unidirectional migration rates between the descendent populations (m_{1} and m_{2}), proportion of the ancestral population founding the first deme (S), and population divergence time (t). Unlike the island model, which assumes infinite divergence times, or the splitting model, which assumes zero migration, IM makes no such assumptions.

Here, we apply a Bayesian inference framework together with a maximum likelihood algorithm _{ST})

Results

Population differentiation

Wright's F_{ST} for our global sample averages 0.25. When we calculate F_{ST} between all pairs of populations, we find the greatest F_{ST} values between sub-Saharan African and non-African groups (i.e., F_{ST} ranges from 0.160–0.450). We also calculate F_{ST} within African groups (mean F_{ST} of 0.137) and within non-African groups (mean F_{ST} of 0.174), with F_{ST} values as high as 0.226 between Melanesians and Basque.

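Mean effective population sizes, migration rates, divergence times and F_{ST}.

| Comparison | Pop 1 | Pop 2 | N_{A} | N_{1} | N_{2} | m_{12}/gen ×10^{-4} | m_{21}/gen ×10^{-4} | Nm | t (kya) | F_{ST} |
|---|---|---|---|---|---|---|---|---|---|---|
| African | BIA | MAN | 9,980 | 3,980 | 6,600 | 2.8 | 1.9 | 4.9 | 48.7 | 0.117 |
| African | BIA | SAN | 6,620 | 5,560 | 5,340 | 0.86 | 1.9 | 3.0 | 50.0 | 0.169 |
| African | MAN | SAN | 9,530 | 6,930 | 3,790 | 0.72 | 0.020 | 0.8 | 46.8 | 0.126 |
| African/Non-African | BIA | BAS | 11,200 | 4,650 | 3,250 | 1.1 | 0.43 | 1.2 | 61.4 | 0.311 |
| African/Non-African | BIA | HAN | 12,800 | 2,330 | 2,600 | 0.30 | 0.075 | 0.2 | 27.7 | 0.374 |
| African/Non-African | BIA | MEL | 11,600 | 6,900 | 1,570 | 0.48 | 0.71 | 1.0 | 88.5 | 0.331 |
| African/Non-African | MAN | BAS | 9,970 | 4,530 | 2,750 | 5.8 | 0.036 | 4.2 | 23.4 | 0.160 |
| African/Non-African | MAN | HAN | 11,000 | n/a | 2,750 | 2.2 | 0.18 | n/a | 15.5 | 0.236 |
| African/Non-African | MAN | MEL | 9,420 | 7,110 | 318 | 2.1 | 3.0 | 3.8 | 12.4 | 0.221 |
| African/Non-African | SAN | BAS | 10,400 | 5,820 | 2,490 | 0.42 | 0.0012 | 0.3 | 83.0 | 0.344 |
| African/Non-African | SAN | HAN | 10,200 | 7,270 | 1,880 | 0.24 | 0.46 | 0.6 | 151 | 0.450 |
| African/Non-African | SAN | MEL | 10,500 | 8,990 | 1,210 | <<0.001 | 1.6 | 1.6 | 68.5 | 0.390 |
| Non-African/Non-African | BAS | HAN | 11,900 | 2,230 | 1,940 | 9.2 | 1.4 | 4.4 | 85.6 | 0.085 |
| Non-African/Non-African | BAS | MEL | 11,600 | 2,120 | 283 | 0.26 | 12 | 3.0 | 62.5 | 0.226 |
| Non-African/Non-African | HAN | MEL | 11,300 | 1,770 | 592 | 0.18 | 21 | 5.0 | 61.2 | 0.210 |

Abbreviations: BIA, Biaka; MAN, Mandenka; SAN, San; BAS, Basque; HAN, Han; MEL, Melanesians; n/a, value not available.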

Demographic inference under the isolation-with-migration model

Marginal Bayesian posterior probabilities were calculated using Markov chain Monte Carlo, and best-fit parameterizations were inferred for each unique population pair (e.g., see Figure 1 in the Additional file).

**Supplemental Materials for "Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow".** This document contains tables and figures showing additional results referenced in the main text.

Effective population sizes

Modern effective population sizes (N) were inferred multiple times under the IM model (i.e., once for each paired population, Tables 1 and 2 in Additional file _{20 }= 6.9, _{0 }<1,500). Mean ancestral sizes (10,500, range 6,600–12,800) are generally larger than modern effective sizes, and are also often larger than the sum of their descendant populations (_{19 }= 3.42,

Population split proportions

Our dataset has little power to infer how ancestral effective sizes were apportioned among descendent demes. Most estimates of the split proportion, S, have large confidence intervals (Table 3 in the Additional file).

Migration rates

Stationary estimates of long-term unidirectional migration rates average 2.4 × 10^{-4}/generation (range: 8.7 × 10^{-8} – 2.1 × 10^{-3}/generation), suggesting that gene flow between global populations is relatively frequent (Table 4 in the Additional file). The mean bidirectional migration rate (4.8 × 10^{-4}/generation; highest rate: 2.1 × 10^{-3}; lowest rate: 3.8 × 10^{-5}) implies a movement of ~1 X chromosome every 2 years. Bidirectional migration rates within and between continents vary significantly (F_{2,12} = 21.5). Mean bidirectional migration rates within Africa (2.7 × 10^{-4}), and between Africans and Eurasians (2.1 × 10^{-4}), are relatively low compared to migration rates within Eurasia (1.5 × 10^{-3}). Furthermore, migration patterns between populations are largely symmetric. Han Chinese and Melanesians provide a key exception; migration from China to the Pacific (2.1 × 10^{-3}/generation) has significantly exceeded migration in the opposite direction (1.8 × 10^{-5}/generation) (see 95% confidence intervals in Table 4 in the Additional file).
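The group means quoted above can be re-derived from the pairwise table as a sanity check. The sketch below is illustrative only: it copies the m_{12} and m_{21} columns (in units of 10^{-4} per generation), treats the "<<0.001" entry as zero, and reads the two migration values of the incomplete MAN/HAN row as m_{12} and m_{21}; both choices are assumptions, and the function names are hypothetical.

```python
from statistics import mean

# (pop1, pop2, m12, m21) in units of 1e-4 per generation, copied from the
# table of pairwise estimates. Assumptions: "<<0.001" is entered as 0.0, and
# the MAN/HAN values 2.2 and 0.18 are taken to be m12 and m21.
PAIRS = [
    ("BIA", "MAN", 2.8, 1.9), ("BIA", "SAN", 0.86, 1.9), ("MAN", "SAN", 0.72, 0.020),
    ("BIA", "BAS", 1.1, 0.43), ("BIA", "HAN", 0.30, 0.075), ("BIA", "MEL", 0.48, 0.71),
    ("MAN", "BAS", 5.8, 0.036), ("MAN", "HAN", 2.2, 0.18), ("MAN", "MEL", 2.1, 3.0),
    ("SAN", "BAS", 0.42, 0.0012), ("SAN", "HAN", 0.24, 0.46), ("SAN", "MEL", 0.0, 1.6),
    ("BAS", "HAN", 9.2, 1.4), ("BAS", "MEL", 0.26, 12.0), ("HAN", "MEL", 0.18, 21.0),
]
AFRICAN = {"BIA", "MAN", "SAN"}


def group(pop1, pop2):
    """Classify a population pair by how many of its members are African."""
    n_african = (pop1 in AFRICAN) + (pop2 in AFRICAN)
    return {2: "within Africa", 1: "Africa/non-Africa", 0: "within non-Africa"}[n_african]


def mean_bidirectional_rates(pairs):
    """Mean of m12 + m21 per group, returned in per-generation units."""
    sums = {}
    for pop1, pop2, m12, m21 in pairs:
        sums.setdefault(group(pop1, pop2), []).append(m12 + m21)
    return {name: mean(values) * 1e-4 for name, values in sums.items()}


def overall_mean_bidirectional(pairs):
    """Mean of m12 + m21 over all pairs, in per-generation units."""
    return mean(m12 + m21 for _, _, m12, m21 in pairs) * 1e-4


if __name__ == "__main__":
    for name, rate in mean_bidirectional_rates(PAIRS).items():
        print(f"{name}: {rate:.2e} per generation")
    print(f"all pairs: {overall_mean_bidirectional(PAIRS):.2e} per generation")
```

Under these assumptions the table reproduces the reported means to the stated precision, which suggests the rounded table entries are consistent with the text.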

Population divergence times

Marginal posterior distributions for t indicate that divergence times between African populations all occur ~50 kya, with the largest upper confidence interval for any two African populations at ~140 kya (Table 5 in the Additional file).

Validation of inferred demographic parameters

The inference system employed here has been validated elsewhere. We ran coalescent simulations parameterized with the values of N_{A}, N_{1}, N_{2}, m_{1}, m_{2} and t inferred above. Due to poor estimates of the split proportion, we assumed S = 0.5. To check whether these coalescent simulations return data similar to the empirical loci, we compared observed summary statistics for all twenty X chromosome loci with summary distributions from these parameterized simulation models. We focused on four summaries of the data: i) F_{ST}, which describes the genetic distance between populations; ii and iii) θ_{W} and θ_{π}, which are unbiased estimators of the population mutation rate θ = 3N_{e}μ; and iv) Tajima's D, which summarizes the population site frequency spectrum.

Observed F_{ST} values are correlated with F_{ST} values simulated under these 15 simulation models (Mantel test, r = 0.49). The simulations return F_{ST} values that are just slightly lower than those actually observed (i.e., mean F_{ST} of 0.21 versus 0.25, not significantly different). The simulation models also provide good fits to observed data for the remaining summaries, all of which reflect aspects of the population site frequency spectrum. A Bonferroni correction holding the experiment-wise type-I error rate constant at
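For readers unfamiliar with the Mantel test, the following is a minimal permutation-based sketch of how two pairwise F_{ST} matrices can be compared. It is not the implementation used in the study; the function names, the one-sided p-value and the number of permutations are illustrative assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def upper_triangle(matrix, order):
    """Off-diagonal entries above the diagonal, after relabelling rows/columns."""
    return [matrix[order[i]][order[j]] for i, j in combinations(range(len(order)), 2)]


def mantel(a, b, permutations=999, seed=0):
    """Return (r, one-sided p) for two symmetric distance matrices."""
    n = len(a)
    identity = list(range(n))
    x = upper_triangle(a, identity)
    r_obs = pearson(x, upper_triangle(b, identity))
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute population labels of matrix b
        if pearson(x, upper_triangle(b, order)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


if __name__ == "__main__":
    observed = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
    simulated = [[0, 2, 2, 4], [2, 0, 5, 5], [2, 5, 0, 7], [4, 5, 7, 0]]
    print(mantel(observed, simulated))
```

Rows and columns are permuted together because the entries of a distance matrix are not independent; shuffling individual cells would overstate significance.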

**Observed values (bars) of Tajima's D for 20 X chromosome loci in (A) Biaka pygmies and (B) Melanesians compared to simulated distributions for each population (curves)**. Tajima's D values for the majority of empirical loci are consistent with simulated distributions that are obtained from an isolation-with-migration model parameterized with the inferred demography of these two populations.

Discussion

It is generally appreciated that migration affects many important ecological and evolutionary properties of populations.

As pointed out by Whitlock and McCauley, F_{ST} is limiting because it provides no insight into which historical processes are responsible for the observed genetic differences between populations. It is therefore left up to individual investigators to choose which model of population structure is used to interpret data, and inferences depend heavily upon the assumptions inherent in each model. We set out to directly estimate rates of gene flow between human populations through the use of the isolation-with-migration (IM) model, which incorporates both population splitting and gene flow. For this purpose, we analyzed a large DNA sequence database that was collected with the express purpose of constructing models of human demographic history and hence focuses on intergenic (i.e., putatively neutral) regions on the X chromosome. Values of F_{ST} for this dataset were found to be slightly higher than those estimated from other large X chromosome resequence datasets.

To disentangle the evolutionary processes underlying F_{ST }in real human populations, we inferred N, m and t separately for the six populations in our survey using the IM model. While the mean global value for ancestral population size, ~10^{4}, is consistent with previous estimates of the global population size of modern humans _{0 }≈ 3000) estimated from linkage disequilibrium (LD)

This finding is important because many studies make the simplifying assumption that individual human populations have an effective population size of 10^{4 }[e.g.,

Indeed, we demonstrate here that rates of gene flow between subdivided human populations are non-zero. For unidirectional migration rates (m_{1}, m_{2}), ~87% of pairwise comparisons (i.e., 13 of 15) showed gene flow in at least one direction (i.e., their 95% confidence intervals exclude zero migration; Table 4 in the Additional file). When bidirectional migration rates (m_{1} + m_{2}) are considered, lower bounds of 95% confidence intervals are greater than zero for all 15 pairwise comparisons. Furthermore, mean bidirectional migration rates within Africa (2.7 × 10^{-4}/generation), and between Africans and Eurasians (2.1 × 10^{-4}), are significantly lower than migration rates within Eurasia (1.5 × 10^{-3}).

Correspondingly, estimates of the population migration rate range from 0.2–4.9 (mean Nm = 2.4). This implies that 2–3 X chromosome copies, on average, move between human populations every generation, although this stationary estimate does not explain how migration events are distributed through time. We infer slightly higher population migration rates within Africa and within Eurasia than between continents (Figure

**Correlation between population divergence (F_{ST}) and inter-deme migration (Nm)**. African population pairs are indicated by circles, non-African population pairs by triangles, and African/non-African population pairs by crosses.

**Geographic representation of population migration rates Nm**. Mean and range of Nm are provided for African/non-African population pairs.

Clearly, our results cannot be interpreted under either the pure splitting or island models, which assume no gene flow and no shared ancestry, respectively. In any case, under an island model, a population migration rate Nm greater than ~0.25 would be too high to explain the value of F_{ST} observed here (cf. Figure 2A in the Additional file). F_{ST} depends on interactions between a suite of demographic parameters, including N, m and t (Figure 3 in the Additional file). We therefore cannot expect F_{ST} to yield insights into human demographic processes without further knowledge of population divergence times, effective sizes and migration rates (i.e., the very parameters that we often attempt to infer from F_{ST}). Although F_{ST} is often considered directly proportional to divergence time, where migration is assumed to be absent and all population sizes identical [46:29–30], these assumptions do not hold for the human populations examined here. Thus, caution is warranted when interpreting F_{ST} as a simple proxy for population history [e.g.,

In sum, we have independently inferred effective population sizes, times of divergence and rates of migration under an IM model of population structure based on the analysis of a large X chromosome DNA sequence database. The parameters that we have estimated for six globally distributed populations indicate relatively high levels of migration (e.g., mean Nm = 2.4) (Figure _{ST }(e.g., see Figure 4 in Additional file _{ST }

Finally, the finding of high rates of gene flow among human populations has important implications for how we interpret the distribution of SNPs associated with disease.

Conclusion

While the maximum effective size of modern humans is estimated at ~10,000, individual populations vary substantially in size. African populations tend to be larger (2,300–9,000) than non-African populations (300–3,300). We independently estimate mean rates of bidirectional gene flow at 4.8 × 10^{-4}/generation, and these rates are higher among non-African populations (1.5 × 10^{-3}) than among African populations (2.7 × 10^{-4}). Interestingly, because effective sizes and migration rates are inversely related in African and non-African populations, effective migration rates are similar globally (e.g., mean Nm = 2.4). While significant theoretical challenges remain in disentangling the evolutionary factors that structure human populations, it is clear that migration can no longer be treated as a simple, equilibrium parameter – or ignored – as it often is in reconstructions of human history.

Methods

Genomic data

Our database comprises 20 loci from intergenic regions on the X chromosome. Each region chosen for sequencing spans ~20 kb of primarily single-copy non-coding (i.e., putatively non-functional) DNA in regions of medium or high recombination, which are at least 50 kb away and recombinationally unlinked from the nearest gene.

F_{ST} estimates

F_{ST} can be calculated using several different algorithms. Here, we adopt the approach of Hudson et al.:

F_{ST} = 1 − H_{w}/H_{b}   (1)

where H_{w} (≡ π_{w}) is the mean distance per polymorphic site between sequences sampled from the same population, and H_{b} (≡ π_{b}) is the mean distance per polymorphic site between sequences sampled from both populations. Reported values represent the mean F_{ST} at all segregating sites across all 20 X chromosome loci.
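As a concrete illustration of this estimator, written as F_{ST} = 1 − H_{w}/H_{b}, the sketch below computes it for two small sets of aligned sequences. The function names are hypothetical, and averaging the two within-population terms with equal weight is an assumption of this sketch rather than a detail given in the text.

```python
from itertools import combinations, product


def n_differences(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))


def mean_within(pop):
    """Mean pairwise difference among sequences from one population."""
    pairs = list(combinations(pop, 2))
    return sum(n_differences(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop1, pop2):
    """Mean pairwise difference between sequences from different populations."""
    pairs = list(product(pop1, pop2))
    return sum(n_differences(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """F_ST = 1 - H_w / H_b, with H_w the unweighted mean of both populations."""
    h_w = (mean_within(pop1) + mean_within(pop2)) / 2
    h_b = mean_between(pop1, pop2)
    return 1 - h_w / h_b


if __name__ == "__main__":
    pop1 = ["AAAA", "AAAT"]
    pop2 = ["TTTT", "TTTA"]
    print(round(hudson_fst(pop1, pop2), 3))
```

Because F_{ST} is a ratio, dividing both H_{w} and H_{b} by the number of polymorphic sites (as in the per-site definition above) leaves the result unchanged.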

The expected value of F_{ST} for X chromosome loci under the island model with an infinite number of demes depends only on the product of the effective population size N and the migration rate per generation m [10: 294-5]:

⟨F_{ST}⟩ ≈ (1 + 3Nm)^{-1}   (2)

F_{ST} estimates must be corrected if a finite number of demes d are intended instead

The population-scaled rate of gene flow Nm can be derived by simple rearrangement of equation 3.
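The same kind of rearrangement can be illustrated with the simpler infinite-deme form, ⟨F_{ST}⟩ ≈ (1 + 3Nm)^{-1}. The sketch below covers only that form; it does not include the finite-deme correction, and the function names are hypothetical.

```python
def expected_fst_island_x(nm):
    """Expected X chromosome F_ST under the infinite-island model."""
    if nm < 0:
        raise ValueError("Nm must be non-negative")
    return 1.0 / (1.0 + 3.0 * nm)


def nm_from_fst_island_x(fst):
    """Invert F_ST = 1 / (1 + 3 Nm) to obtain Nm = (1 / F_ST - 1) / 3."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("F_ST must lie in (0, 1]")
    return (1.0 / fst - 1.0) / 3.0


if __name__ == "__main__":
    # An F_ST of 0.2 corresponds to Nm = 4/3 under this form of the model.
    print(nm_from_fst_island_x(0.2))
```

The relationship is steeply nonlinear at small Nm, so modest changes in F_{ST} near zero correspond to large changes in the implied migration rate.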

Correspondingly, the expected value of F_{ST} for X chromosome loci under a divergence model depends only on the divergence time t, in generations, scaled by the effective population size N

Demographic inference

Genetic diversity at twenty X chromosome loci is used to determine the most likely parameterization for a series of paired-population isolation-with-migration models. Seven demographic parameters are inferred from the genomic data under each two-deme IM model: effective population size of the ancestral deme (N_{A}), effective population sizes of the two descendent demes (N_{1} and N_{2}), unidirectional migration rates between descendent populations (m_{1} and m_{2}), proportion of the ancestral population founding the first deme (S), and population divergence time (t). Populations are analyzed in all pairwise combinations using the Markov chain Monte Carlo Bayesian/maximum likelihood framework implemented in the 31 July 2006 version of IM. Model parameters are scaled by the mutation rate μ: θ_{1} = 3N_{1}μ, θ_{2} = 3N_{2}μ, θ_{A} = 3N_{A}μ, m_{1} = m_{1}/μ, m_{2} = m_{2}/μ […] 10^{6} years. Per generation rates assume a mean generation interval of 28 years, as estimated from cross-cultural ethnographic data. Migration rates m_{1} and m_{2} are inferred in the coalescent (i.e., backward in time), and bidirectional migration rates m are simply the sum of m_{1} and m_{2}.
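To make the scaling concrete, the sketch below converts mutation-scaled estimates back to demographic units. It assumes the X chromosome scaling θ = 3Nμ and m_scaled = m/μ, plus the usual IM convention t_scaled = tμ, which is an assumption here because it does not survive in the text above. The mutation rate `mu` (per locus per generation), the input values and all names are hypothetical.

```python
from dataclasses import dataclass

GENERATION_YEARS = 28  # mean generation interval stated in the text


@dataclass
class Demography:
    n_effective: float       # effective population size N
    m_per_generation: float  # migration rate per generation
    t_years: float           # divergence time in years


def unscale(theta, m_scaled, t_scaled, mu, generation_years=GENERATION_YEARS):
    """Convert mutation-scaled parameters to demographic units.

    Assumes theta = 3*N*mu, m_scaled = m/mu and t_scaled = t*mu, with mu
    expressed per locus per generation (hypothetical value supplied by caller).
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    return Demography(
        n_effective=theta / (3.0 * mu),
        m_per_generation=m_scaled * mu,
        t_years=t_scaled / mu * generation_years,
    )


if __name__ == "__main__":
    # Illustrative inputs only; these are not estimates from the study.
    print(unscale(theta=3.0, m_scaled=2.0, t_scaled=0.2, mu=1e-4))
```

With the illustrative inputs shown, the conversion gives N = 10,000, m = 2 × 10^{-4} per generation and t = 56,000 years.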

Because the IM algorithm has most power with perfectly treelike data (an infinite sites implementation), datasets with no evidence of recombination were extracted from each locus using the four-gamete approach of Hudson and Kaplan _{1}, _{2}, _{A }∈ _{1}, _{2 }∈ ^{7 }steps. Chain mixing by Metropolis-Hasting coupling, long run times and multiple independent runs allow us to identify convergence on each parameter's underlying stationary distribution.
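The four-gamete criterion can be sketched as follows: for any pair of biallelic sites, observing all four possible two-site haplotypes is incompatible with a single tree under infinite sites and therefore signals recombination. This minimal example only flags incompatible site pairs; how non-recombining blocks were then extracted from each locus is not shown, and the function name and 0/1 encoding are assumptions.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return pairs of site indices at which all four gametes are observed.

    `haplotypes` is a list of equal-length strings of '0'/'1' characters,
    one per sampled chromosome; every site is assumed to be biallelic.
    """
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(hap[i], hap[j]) for hap in haplotypes}
        if len(gametes) == 4:  # 00, 01, 10 and 11 all present
            violations.append((i, j))
    return violations


if __name__ == "__main__":
    sample = ["000", "011", "100", "111"]
    print(four_gamete_violations(sample))
```

A dataset with an empty list of violations is consistent with a single genealogy, which is the property the inference method relies on.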

Because we observe little variation among multiple independent runs (e.g., Figure 1 in Additional file

Coalescent simulations for demographic parameter validation

The nonlinear relationship between gene flow, divergence time and F_{ST} under the isolation-with-migration model is explored using coalescent simulation with the software ms. Four summary statistics are considered: θ_{W} and θ_{π}, which are unbiased estimators of the population mutation rate; Tajima's D, which summarizes the population site frequency spectrum; and F_{ST}, which summarizes the joint site frequency spectrum. Observed values of these four summary statistics, calculated from the empirical dataset, are compared to the summary statistic distributions returned by coalescent simulation. A Bonferroni correction holding the experiment-wise type-I error rate constant at […] F_{ST}, a between-population (i.e., matrix) test.
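The three within-population summaries can be computed from a set of aligned haplotypes as sketched below, using the standard definitions of Watterson's θ_{W}, nucleotide diversity θ_{π} and Tajima's D. Values are per locus rather than per site, and the function name and 0/1 haplotype encoding are assumptions of this sketch, not details of the study's pipeline.

```python
from itertools import combinations
from math import sqrt


def site_frequency_summaries(haplotypes):
    """Return (theta_W, theta_pi, Tajima's D) for equal-length haplotype strings."""
    n = len(haplotypes)
    # Number of segregating sites: columns with more than one allele.
    seg = sum(len(set(column)) > 1 for column in zip(*haplotypes))
    # theta_pi: mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    theta_pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # theta_W: segregating sites scaled by the harmonic number a1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = seg / a1
    if seg == 0:
        return theta_w, theta_pi, float("nan")  # D is undefined without variation
    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (theta_pi - theta_w) / sqrt(e1 * seg + e2 * seg * (seg - 1))
    return theta_w, theta_pi, d


if __name__ == "__main__":
    # Three singletons in a sample of four: an excess of rare variants.
    print(site_frequency_summaries(["000", "100", "010", "001"]))
```

An excess of rare variants, as in the usage example, pulls θ_{π} below θ_{W} and yields a negative D.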

Authors' contributions

MPC participated in the design of the study, contributed to data collection, ran analyses, and wrote the manuscript. AEW contributed to data collection. JDW advised on data analysis, and provided comments on the manuscript. MFH participated in the design of the study, advised on data analysis, and helped revise the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We are grateful to J Hey (Rutgers University) and F Mendez (University of Arizona) for helpful discussion; and S Kobourov for access to the dispersed-computing cluster in the Department of Computer Science (University of Arizona). The National Science Foundation helped fund genetic data collection and analysis via the grant BCS-0423670 to M.F.H. and J.D.W, as well as providing computational support via the San Diego Supercomputing Center under TeraGrid grant DBS060002T to M.P.C.