Abstract
Background
Recombination rates vary at the level of the species, population and individual. Now recognized as a transient feature of the genome, recombination rates at a given locus can change markedly over time. Existing inferential methods, predominantly based on linkage disequilibrium patterns, return a longterm average estimate of past recombination rates. Such estimates can be misleading, but no analytical framework to infer recombination rates that have changed over time is currently available.
Results
We apply coalescent modeling in conjunction with a suite of summary statistics to show that the recombination history of a locus can be reconstructed from a time series of genetic samples. More usefully, we describe a new method, based on ntuple dataset subsampling, to infer past changes in recombination rate from DNA sequences taken at a single time point. This subsampling strategy can correctly assign simulated loci to constant, increasing and decreasing recombination models with an accuracy of 84%.
Conclusions
While providing an important steppingstone to determining past recombination rates, ntuple subsampling still exhibits a moderate error rate. Theoretical limitations indicated by coalescent theory suggest that highly accurate inference of past recombination rates will remain challenging. Nevertheless, we show for the first time that reconstructing historic recombination rates is possible in principle.
Keywords:
Recombination rate; Temporal; Reconstruction; Statistical inferenceBackground
Meiotic recombination, whereby DNA variants are shuffled between homologous parental chromosomes, is a fundamental process in the evolution of genetic diversity. For many years poorly studied, the mechanisms and effects of recombination are now increasingly well understood [1]. We know that recombination rates are both heritable [2] and variable among individuals [37]. In other words, recombination is a Darwinian evolutionary system [8,9].
Recent studies have demonstrated convincingly that recombination rates at a given locus vary at the level of the species, population and individual. Comparisons between the chimpanzee and human genomes show poor correlation of both hotspot and background recombination rates at orthologous loci [10,11]. Similarly, recombination rates vary between human populations [11], not only at continental scales, but also between close geographical neighbors (e.g., French and Italians) [11]. Recombination rates even vary widely between individuals drawn from the same population [37]. The picture now emerging is one of an extremely dynamic recombination landscape [10], with transient recombination peaks and troughs across the human genome, overlaying the betterknown evolutionary variation in DNA substitution rates [12].
How changes in recombination rate are controlled is less well understood [13]. At some loci, recombination events are determined by nucleotide variation within specific DNA sequence motifs (such as the degenerate 13bp pattern recognized by PRDM9) [14,15]. These cismediated recombination events often show evidence of transmission distortion [16], where biased gene conversion preferentially favors one allelic variant that can rapidly reach fixation [17]. Related mechanisms may also act as a selective force to reduce recombination around functional genomic elements [18]. However, recombination rates at most loci seem to be mediated by trans factors [5], typically controlled by genes that coordinate DNAprotein interactions [19,20], or more generally, by regional chromatin remodeling [1,7]. These studies suggest that transmediated recombination processes dominate genomescale recombination events and are not obviously under the influence of natural selection.
Recombination is typically detected either directly by gamete typing, or indirectly from linkage disequilibrium (LD) patterns [1]. Gamete typing surveys large numbers of recombination events within a single generation, and therefore provides an accurate (albeit costly) estimate of contemporary recombination rates. Conversely, statistical analysis of linkage disequilibrium patterns counts recombination events that have accrued over multiple generations, and therefore returns a longterm average estimate of historic recombination rates. Since recombination rates can change through time, contemporary and historic rate estimates need not agree. Regions of high recombination, as predicted from linkage disequilibrium, may be inactive when surveyed using gamete typing [21], a discrepancy that indicates the extinction of a previously high recombination region. Conversely, gamete typing may reveal regions of high recombination where none are suggested by linkage disequilibrium, thus indicating the birth of new high recombination loci [22].
The main point is that recombination rates at a genomic location can vary substantially through time. Although this fact is now widely appreciated [16], the manner in which recombination rates increase or decrease still remains completely unknown. Do changes in recombination rate occur rapidly, perhaps due to point mutations suddenly altering the action of the recombination machinery? Or are changes more gradual, occurring as regional nucleotide diversity mutates slowly over time? We currently lack any analytical framework to address these sorts of questions. Here, we determine that a suite of summary statistics can track changes in recombination rate through time. We extract temporal information about changing recombination rates, and describe some of the theoretical limitations that constrain this endeavor. More importantly, we develop a novel methodology based on ntuple subsampling that has sufficient statistical power to reconstruct the recombination history of a genetic locus studied at a single time point. This approach is intended as a proofofconcept that past changes in recombination rates can be reconstructed from contemporary data, even if reconstructing historic rates from empirical data remains challenging.
Results
Correlation and sensitivity of summary statistics
We first explored how different summary statistics respond to recombination events. The number of segregating sites S was used as a negative control because mean S does not vary with the recombination rate. We also assembled a suite of eight summary statistics that were designed specifically to detect recombination events – R_{min}, rmmg, the number of haplotypes, haplotype diversity, Wall’s B and Q, Hudson’s C and Z_{nS}. These summaries likely recognize different aspects of recombination, although the relationships between them have not been explored. Certainly none of these summaries capture the entire recombination profile of a genetic sample (i.e., they are not statistically sufficient).
We studied the correlation matrix between summary statistics using an equal mix of datasets with linearly increasing, decreasing and constant recombination rates. A priori, we might expect that many of the summaries detect similar aspects of the overall recombination signal. Indeed, pairwise comparisons indicated that nearly all the summaries were correlated, albeit to different extents (r values range from 0 to 0.982, mean of 0.351) (Figure 1). The smallest correlations involved rmmg, a conservative lower bound on the minimum number of recombination events R_{min}, which showed little variation among datasets under the conditions modeled here. None of the summary statistics were perfectly correlated, thus emphasizing that multiple summaries are needed to capture different aspects of the recombination profile.
Figure 1. Correlations between recombination summary statistics. (Upper diagonal) Scatter plots show pairwise relationships among the summary statistics. (Lower diagonal) Pie charts show the magnitude of the correlation with blue and red indicating positive and negative values (e.g., Pearson’s r = −0.20 for Hudson’s C and Z_{nS}). All nonzero correlations are statistically significant (P ≤ 0.05).
To determine how these summaries respond to different recombination rates, we simulated genetic data under a wide range of constant recombination values (0–10 ρ/kb) (Figure 2). S is shown as a negative control because its mean is invariant to the recombination rate (Figure 2, upper right). Most summary statistics varied nonlinearly across this linear range of recombination values. It follows that the usefulness of any individual summary may change with the underlying recombination rate, but in different ways. Therefore, a combination of some or all of these summaries may be more sensitive for detecting different recombination rates than any one of them alone.
Figure 2. Sensitivity of summary statistics to different constant recombination rates. Black lines show the mean (solid) and 95% confidence intervals (dotted) of summary statistic values. The red line indicates different constant recombination rates (ρ per kb). Note that summary statistics mostly vary nonlinearly with linear change in recombination rates.
Tracking changing recombination rates using time series data
It is less obvious how summary statistics might covary with recombination rates that change over time. To explore this process, we generated coalescent simulations where recombination rates were allowed to vary over many generations. Genetic datasets were simulated using coalescent software [23] modified to allow recombination rates to change through time. We simulated data for a humanlike deme: 10^{4} replicates of 10kb autosomal sequences were drawn from a constant sized population (N_{e} = 10^{4}) [2426] with a mutation rate, μ, and average recombination rate, r, of 3.75 × 10^{8} events/bp/generation [27]. These rates were chosen to mimic regions of very strong recombination in real human populations [11]. The recombination rate was either held constant, or allowed to vary linearly, exponentially or logistically through time for 10^{4} generations (cf. [24,26]). The total amount of recombination was constrained to be identical for all models; only its distribution through time was altered.
A representative example illustrating a logistic decline in recombination rates towards the present is presented in Figure 3. Corresponding plots for constant recombination, together with recombination rates increasing and decreasing linearly, exponentially and logistically, are presented in Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3, Additional file 4: Figure S4, Additional file 5: Figure S5, Additional file 6: Figure S6 and Additional file 7: Figure S7.
Figure 3. Response of summary statistics to recombination rates changing logistically over time. Black lines show mean (solid) and 95% confidence intervals (dotted) of summary statistic values. The red line indicates how the recombination rate changes over time (ρ per kb). Note the lag between changes in the recombination rate and changes in summary statistic values. Compare responses for constant recombination, as well as linearly, exponentially and logistically increasing and decreasing recombination rates, in Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3, Additional file 4: Figure S4, Additional file 5: Figure S5, Additional file 6: Figure S6 and Additional file 7: Figure S7.
Additional 1: Figure S1. Response of summary statistics to constant recombination rates.
Format: PDF Size: 271KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 2: Figure S2. Response of summary statistics to recombination rates decreasing linearly.
Format: PDF Size: 278KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 3: Figure S3. Response of summary statistics to recombination rates increasing linearly.
Format: PDF Size: 280KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 4: Figure S4. Response of summary statistics to recombination rates decreasing exponentially.
Format: PDF Size: 281KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 5: Figure S5. Response of summary statistics to recombination rates increasing exponentially.
Format: PDF Size: 284KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 6: Figure S6. Response of summary statistics to recombination rates decreasing logistically.
Format: PDF Size: 281KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 7: Figure S7. Response of summary statistics to recombination rates increasing logistically. (PDF 279 kb)
Format: PDF Size: 279KB Download file
This file can be viewed with: Adobe Acrobat Reader
Most of the summary statistics tracked the changing recombination profile, albeit with notable differences in accuracy. The variance of many summaries altered with the recombination rate, thus suggesting that different summaries have greatest power to estimate recombination rates at different times. This reinforces the view that using a combination of summary statistics should maximize statistical power, although a simple linear combination may not necessarily be optimal.
Note too that summary values typically lagged changes in the recombination rate. Genetic variation observed in the present was actually laid down in the (sometimes very distant) past [28]. As recombination rates change, it takes time before this change is reflected in the genetic record. This lag effect is perhaps best illustrated in the plot showing recombination rates increasing exponentially into the past (Additional file 5: Figure S5). Although the recombination rate drops quickly, the summaries change far more slowly. Even after the recombination rate falls near zero, existing lineages still retain the signal of recombination events that occurred further back in the past. Only as these recombined lineages are lost through genetic drift is the new low recombination rate finally reflected in the summaries. This time lag places important constraints on the resolution with which recombination rates that have changed through time can be reconstructed.
Reconstructing past recombination rates from data taken at a single time point
Tracking variable recombination rates using time series data may be feasible for some fast evolving systems (e.g., exploring the loss of sexual competency in yeast), but it is not practical for longlived organisms like humans. To explore whether past recombination rates can be reconstructed from genetic data taken at a single time point, we developed a novel bootstrapping methodology that we call ntuple subsampling.
Mutations occur randomly through time. In any given dataset, some polymorphisms will be old and most modern lineages will carry them. Others will be young, and will therefore be found in only one or two individuals. By determining whether recombination events affect young or old polymorphisms, we can theoretically obtain snapshots of recombination rates through time.
This concept is best shown graphically (Figure 4). Imagine we repeatedly subsample a group of four lineages (i.e., an ntuple of four, or a ‘quartet’) from a given dataset. These individuals may be closely related (Figure 4A), in which case they contain information about recombination rates in the recent past. Alternately, the individuals may be only distantly related (Figure 4C), in which case they may carry both young and old recombination events. By repeatedly resampling the dataset, the recombination rate at different times can be inferred. We emphasize that ntuple subsampling has a natural confound. Young ntuples carry information about recent recombination rates, but old quartets contain a mix of information about old and recent recombination events. The statistical power of this approach is therefore unclear and we explore this issue in detail below.
Figure 4. Variable ages of quartets. Randomly selected quartets (black lines) capture information about (A) young, (B) medium and (C) old time depths. For visual clarity, quartets are shown on a nonrecombining genealogy, but the principle holds equally for ancestral recombination graphs.
The use of resampling methods, such as the bootstrap and jackknife [29], to estimate sampling distributions is widespread in statistics in general, but especially so in evolutionary biology [30]. More recently, interest has grown in socalled ensemble methods (also referred to as ‘bagging and boosting’) that seek to improve overall accuracy by combining the predictions of many weak classifiers, each of which is shown a slightly perturbed version of the data [31]. The following implementation of ntuple subsampling differs from the ensemble method as we use only a single classifier. However, it is a related concept in that the classifier is shown many subsets of the data, which in our case is used to develop combinations of summary statistics that capture information about recombination rates over different time periods.
To ascertain whether ntuple subsampling has sufficient power to estimate past recombination rates, we created a test system where datasets derived from only one of three recombination models: recombination that is constant, linearly increasing or linearly decreasing through time (10^{4} datasets each) (Additional file 1: Figure S1, Additional file 2: Figure S2 and Additional file 3: Figure S3). 10^{3} quartets were generated for each dataset, and the suite of summary statistics was calculated for each subsample. The mean, variance and maximum of these summary statistic distributions were recorded.
Although powerful Bayesian and maximum likelihood methods have been developed to perform inference on such datasets [32,33], these approaches are extremely computationally expensive (i.e., tens to thousands of CPU hours to analyze a single dataset) [25,26,34,35]. As we needed to run power analyses for thousands of test cases, a more pragmatic analytical framework was necessarily adopted (cf. [36]). We instead employed discriminant analysis [37], a routine statistical technique for data classification, with jackknife crossvalidation to evaluate the accuracy of the classifier. Discriminant analysis infers the combination of weighted summaries (i.e., the optimal transform) that best distinguishes the recombination models. In jackknife crossvalidation, model labels were removed, and each blinded dataset was instead assigned to a recombination model using the discriminant function. Assignment accuracy was calculated by determining the proportion of datasets that were assigned to the correct recombination model (i.e., datasets derived under a constant recombination model should be assigned back to the constant model). Both linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) were tested. These related methods respond differently to unequal covariance among models, as well as different sizes of sample and training sets [38]. LDA proved to return more accurate classifications in this instance.
LDA was performed on all datasets from all three recombination models. Each dataset was sequentially excluded, the optimal transform inferred by LDA was applied, and each dataset was reassigned back to a recombination model. As we have three models, assignment rates of onethird are expected just by chance. Assignment rates approaching one indicate increasingly accurate assignments.
Table 1 shows observed assignment rates. Because the mean of S is invariant to recombination (Figure 2, upper right), assignments using S alone are presented as a negative control. As expected, the mean, variance or maximum number of segregating sites performed no better than chance. The best individual summary, the mean number of haplotypes, was much more accurate (60%). The best result was obtained by combining the mean, variance and maximum values of all summaries (68%), although 32% of datasets were still placed incorrectly.
Table 1. Assignment accuracy using linear discriminant analysis on quartets
Scaling subsamples by ntuple Age
These assignments were obtained using information about the amount of recombination in each ntuple, but not its age. When recombination rates change over time, the amount of recombination and the age of each ntuple should be correlated. We would therefore prefer to use summaries that capture information about both factors simultaneously. To develop such summaries, we scaled the recombination summaries by S, which is a robust proxy for ntuple age (Additional file 8). Pairwise correlations indicated that most scaled summaries are positively correlated (r values from 0.055 to 1, mean of 0.436) (Additional file 9: Figure S9). The crossvalidation test was repeated, and surprisingly, the scaled summaries often performed more poorly than their unscaled versions. Nevertheless, using the mean, variance and maximum values of both the scaled and unscaled summaries returned the best overall result (71% correct assignment). This suggests two key conclusions. First, genetic datasets do record retrievable information about past changes in recombination rate. Second, scaled and unscaled recombination summaries do capture very slightly different information from the recombination profile.
Additional 8. Determining the optimal scaling factor to capture ntuple age.
Format: PDF Size: 257KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional 9: Figure S8. Correlations between recombination summary statistics scaled by S.
Format: PDF Size: 363KB Download file
This file can be viewed with: Adobe Acrobat Reader
Coalescent theory tells us that the power to detect recombination events should decline exponentially into the past (see details later). Therefore, the linearly increasing and decreasing models are mostly dominated by low and high recombination rates, respectively, while the constant model is intermediate. We were concerned that our crossvalidation test might simply be detecting low, medium and high recombination rates rather than distinguishing constant recombination from recombination rates that change through time. We therefore repeated the crossvalidation test with four recombination models: constant high, constant low, linearly increasing and linearly decreasing recombination rates. Assignment accuracy was only slightly lower than for the threemodel test (64% vs 71%). We conclude that ntuple subsampling can distinguish changing recombination from constant recombination, as well as rates that increase or decrease through time.
Effect of ntuple size on classification accuracy
Thus far, ntuple subsampling has been performed using just four sequences (a quartet). Quartets have found many uses in phylogenetics because this is the minimum number required for unrooted trees to possess distinguishable topologies. However, quartets may not be optimal for reconstructing past recombination rates within a population. We therefore varied the subsample size from 4 to the total sample size, 4 ≤ n ≤ 100 (Figure 5). The threemodel system was used, and the optimal LDA transform was recalculated for each value of n.
Figure 5. Effect of subsample (ntuple) size on assignment accuracy. Subsample sizes range from 4 to the sample size (n = 100). The curve shows a local regression through the data points.
As before, assignment rates started at 71% for quartets, initially improved with increasing n, before declining again as the subsample size approached the total sample size. The best assignment accuracy (84%) was obtained with an ntuple of size 77. While subsampling a large proportion of the dataset seemed to be most informative under this model system, optimal subsample size appears to vary from dataset to dataset in practice. When recombination rates are low, recombination events are recorded by few lineages and subsampling a greater proportion of the dataset improves detection of these rare occurrences. However, we can only detect recombination rates from their effects on DNA polymorphisms, so recombination events will often pass undetected if nucleotide diversity is low. Both low recombination rates and low genetic diversity therefore favor larger subsamples, while smaller subsamples are preferable for highly diverse or recombinationrich regions. Apart from these general guidelines, it seems that optimal subsample sizes must be determined empirically for each dataset.
Assignment accuracy was maximized at 84% across all analyses performed here. Although considerably better than chance, the error rate is still moderate. Because power levels are relatively modest, reconstructing historic recombination rates for real genomic loci is expected to remain difficult even when ntuple subsampling is employed. The highly constrained testing environment used here (e.g., a simple and perfectly known demography) emphasizes this point. In practice, complex demographic processes can produce patterns of genetic variation that might otherwise be attributed to processes of recombination [39]. Still, ntuple subsampling is directly amenable to statistical methods that infer model likelihoods by simulating data across a parameter space (e.g., approximate Bayesian computation [32,33]). Like ntuple subsampling, these methods typically employ a suite of summary statistics, and because they are based on Monte Carlo simulation, they can readily be modified to accommodation the novel bootstrapping process that we propose.
Discussion
We show that information about past changes in recombination rate can be extracted from genomic data using a suite of summary statistics coupled with lineage subsampling to provide proxy information about recombination events at different time depths. Simulated datasets can be correctly assigned to different models of historic recombination with high accuracy (84%).
Why is the power of ntuple subsampling not greater? Coalescent theory suggests several possible reasons. The coalescent describes how pairs of lineages sequentially share a common parent and merge (“coalesce”) until only one ancestral lineage remains. This process is analogous to genetic drift, where lineages are lost by chance over time. The key point is that individuals existing today are represented by fewer and fewer ancestral lineages moving backwards into the past (Figure 6).
Figure 6. Relationship between time and number of lineages under the coalescent. (A) Expected coalescent times for 2–5 lineages in units of N generations for a haploid locus (2 N generations for autosomal loci). (B) Representative coalescent genealogy. Note that many lineages exist to record events in the recent past, while few lineages remain to represent older time points. Only recombination involving an extant lineage (shaded points) can be observed today. The probability that recombination involves an extant lineage is high in the present, where many shaded points exist, but declines exponentially into the past, where shaded points are scarce.
Put more formally, the coalescent times T_{i} of m sampled lineages are exponentially distributed with mean and variance [4042].
In the sampling limit m → ∞, this exponential process implies that coalescence of the final two lineages (T_{2}) takes, on average, half the time to the most recent common ancestor of the sampled dataset (Figure 6A). While these equations only hold for constant sized populations, the general principle is true for most demographic scenarios of relevance to human history. Many lineages record the recent history of a locus, but older time depths are represented by fewer and fewer lineages that still exist today. Indeed, only two lineages will be present for approximately half the total age of the tree. These rootward branches are expected to provide most of the signal for changes in historic recombination rates. However, recombination events from distant times can only be detected in the modern dataset if they directly affect the few old lineages that survive to the present (Figure 6B). Information about historic recombination events therefore declines exponentially into the past.
Further, there is a high probability of observing these two deepest branches, even with very small subsample sizes [43,44].
Given four randomly chosen subsamples (i.e., a quartet), P(root  n = 4) = 3/5 = 0.60. Even with just four quartets, the probability that the oldest two branches are not observed is less than 5%. This equation gives the average expectation across all trees, but the general principle holds even for highly unbalanced trees. At the RRM2P4 locus, which putatively introgressed from archaic hominins into modern humans, 232 individuals are observed on one side of the tree’s basal split, while only 21 are found on the other [35]. The probability of observing the root with a single quartet is still high, P(root  n = 4) = 0.29. Importantly, all ntuples that sample these two oldest branches return exactly the same information about these oldest recombination events (Figure 4C). It follows that simply observing a larger number of ntuples is not sufficient to obtain more information about recombination in the oldest half of the ancestral recombination graph. Conversely, young quartets offer many possible sampling permutations (Figure 4A), and each of these can potentially provide independent information about recent recombination events. All datasets therefore record more information about recent recombination rates, while power to detect old recombination events declines exponentially backwards into the past.
Conclusions
A natural limit places important constraints on our ability to reconstruct past changes in recombination rates. If the change occurred recently, sufficient extant lineages may still record the event, and ntuple subsampling is likely to be an informative technique. Moving further in time from the change, the power to reconstruct the recombination profile decreases exponentially. If the change occurs beyond the coalescent (i.e., the most recent common ancestor of the dataset), it obviously cannot be reconstructed at all. Extremely detailed changes in past recombination profiles, particularly for more distant events or complex genomic loci, will remain challenging. However, we show that ntuple subsampling does have sufficient power to reconstruct some aspects of past changes in recombination rates, especially for relatively recent events.
Methods
Simulations
The coalescent simulation software ms [23] was modified to allow recombination rates to change through time. The C source code of the resulting program, ms_recomb, is available from the authors on request. Simulations focus on the most common transmediated recombination events, and we therefore model changes in recombination without selection (i.e., no transmission distortion).
Genetic datasets were simulated using Kingman’s ncoalescent [40,41]. To ground the simulations in a realistic framework, model parameters were chosen to reflect biologically meaningful values for humans. We purposely simulated a generic humanlike deme rather than any specific population. Simulations were generated for a single WrightFisher deme with a constant effective population size (N_{e} = 10^{4}) (i.e., the estimated global effective population size of modern humans) [2426]. A sample of 10kb autosomal sequences was simulated for 100 individuals with a mutation rate, μ, and average recombination rate, r, of 3.75 × 10^{8} events/bp/generation [27]. Unless otherwise noted, this process was iterated 10^{4} times for each model.
The recombination rate was either held constant, or allowed to vary linearly, exponentially or logistically through time for 10^{4} generations (cf.[24,26]). Linear rates were incremented by the reciprocal of the generation units per generation, exponential rates were fitted to a curve with λ = 5 × 10^{4}, and logistic rates were fitted to a curve with K = 100, N = 1 and r = 9 × 10^{4}. (Note that these curves are for exploratory purposes only. They are not intended to represent real rates of change in human populations). The total amount of recombination was constrained so as to be identical for all models, but was apportioned through time according to the constant, linear, exponential and logistic distributions described above. Overall population recombination rates (i.e., ρ = 4N_{e}r = 15) were chosen to mimic regions of very strong recombination in real human groups [11]. Low and high constant rates were defined as 15% and 85% of the maximum rate under the corresponding linear models.
To infer past recombination rates, samples were taken at a single time point and surveyed using ntuple subsampling (see main text for details). To determine how summary statistics respond to changing recombination rates, variation in summary statistics was tracked over a time span of 10^{4} generations by taking 10^{4} independent coalescent simulations at each of 500 20generation intervals.
Summary statistics
Summary statistics were calculated using functions from the libsequence library [45]. The C++ source code of the resulting program, msstats_recomb, is available from the authors on request. The number of segregating sites, S, controls for the population mutation rate θ (= 4N_{e}μ) and summarizes the total length of the genealogy [46]. A suite of eight additional summary statistics was employed to capture different aspects of the recombination profile: R_{min}, the minimum number of recombination events calculated from observed fourgamete violations [47]; rmmg, a conservative lower bound on R_{min} proposed by Myers and Griffiths (equation four in [48]); nHaps, the number of observed unique sequence haplotypes; HapDiv, the haplotype diversity, expected heterozygosity, or probability that two sequences chosen randomly from the sample are different [49]; Wall’s B and Q, variant estimators of the number of congruent polymorphic sites (i.e., segregating sites in complete linkage disequilibrium) [50]; Hudson’s C, an estimator of the population recombination rate ρ (= 4N_{e}r) estimated from the variance of pairwise sequence differences [51]; and Z_{nS}, the mean pairwise r^{2} estimate of linkage disequilibrium across all polymorphic sites [52].
Statistics
Correlations between scaled and unscaled summary statistics, and discriminant analyses, were calculated using the statistical software R[53]. Local regressions were performed using a polynomial of degree 2, a smoothedparticle hydrodynamics (SPH) kernel, and a 50% nearest neighbor bandwidth with a 10% constant component.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MPC, BRH, MCW and JS conceived and designed the experiments. MPC performed the experiments. MPC, BRH and JS analyzed the data. MPC drafted the manuscript. All authors have read and approved the final manuscript.
Acknowledgments
We thank Richard Hudson (University of Chicago) for suggesting how to simulate changing recombination rates using ms. This research was supported by the Australian Research Council through a Future Fellowship (FT100100031) to BRH, and by the Royal Society of New Zealand through a Marsden Grant (MAU0603) to JS and BRH, and a Rutherford Fellowship (RDF10MAU001) and Marsden Grant (11MAU007) to MPC.
References

Paigen K, Petkov P: Mammalian recombination hot spots: properties, control and evolution.
Nat Rev Genet 2010, 11:221233. PubMed Abstract  Publisher Full Text

Kong A, Barnard J, Gudbjartsson DF, Thorleifsson G, Jonsdottir G, Sigurdardottir S, Richardsson B, Jonsdottir J, Thorgeirsson T, Frigge ML: Recombination rate and reproductive success in humans.
Nat Genet 2004, 36(11):12031206. PubMed Abstract  Publisher Full Text

Calabrese P: A population genetics model with recombination hotspots that are heterogeneous across the population.
Proc Natl Acad Sci USA 2007, 104(11):47484752. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Broman KW, Murray JC, Sheffield VC, White RL, Weber JL: Comprehensive human genetic maps: individual and sexspecific variation in recombination.
Am J Hum Genet 1998, 63(3):861869. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Neumann R, Jeffreys AJ: Polymorphism in the activity of human crossover hotspots independent of local DNA sequence variation.
Hum Mol Genet 2006, 15(9):14011411. PubMed Abstract  Publisher Full Text

Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G: A highresolution recombination map of the human genome.
Nat Genet 2002, 31(3):241247. PubMed Abstract  Publisher Full Text

Khil PP, CameriniOtero RD: Variation in patterns of human meiotic recombination.
Genome Dyn 2009, 5:117127. PubMed Abstract  PubMed Central Full Text

Smukowski CS, Noor MAF: Recombination rate variation in closely related species.
Heredity 2011, 107(6):496508. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Webster MT, Hurst LD: Direct and indirect consequences of meiotic recombination: implications for genome evolution.
Trends Genet 2012, 28(3):101109. PubMed Abstract  Publisher Full Text

Jeffreys AJ, Neumann R: The rise and fall of a human recombination hot spot.
Nat Genet 2009, 41(5):625629. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Laayouni H, Montanucci L, Sikora M, Melé M, Dall’Olio GM, LorenteGaldos B, McGee KM, Graffelman J, Awadalla P, Bosch E: Similarity in recombination rate estimates highly correlates with genetic differentiation in humans.
PLoS One 2011, 6(3):e17913. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Spencer CCA, Deloukas P, Hunt S, Mullikin J, Myers S, Silverman B, Donnelly P, Bentley D, McVean G: The influence of recombination on human genetic diversity.
PLoS Genet 2006, 2(9):e148. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Úbeda F, Wilkins JF: The Red Queen theory of recombination hotspots.
J Evol Biol 2011, 24(3):541553. PubMed Abstract  Publisher Full Text

Myers S, Freeman C, Auton A, Donnelly P, McVean G: A common sequence motif associated with recombination hot spots and genome instability in humans.
Nat Genet 2008, 40(9):11241129. PubMed Abstract  Publisher Full Text

Zheng J, Khil P, CameriniOtero RD, Przytycka T: Detecting sequence polymorphisms associated with meiotic recombination hotspots in the human genome.
Genome Biol 2010, 11(10):R103. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Coop G, Myers SR: Live hot, die young: transmission distortion in recombination hotspots.
PLoS Genet 2007, 3(3):e35. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hellenthal G, Pritchard JK, Stephens M: The effects of genotypedependent recombination, and transmission asymmetry, on linkage disequilibrium.
Genetics 2006, 172(3):20012005. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Brick K, Smagulova F, Khil P, CameriniOtero RD, Petukhova GV: Genetic recombination is directed away from functional genomic elements in mice.
Nature 2012, 485:642645. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Myers S, Bowden R, Tumian A, Bontrop RE, Freeman C, MacFie TS, McVean G, Donnelly P: Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination.
Science 2010, 327(5967):876879. PubMed Abstract  Publisher Full Text

Baudat F, Buard J, Grey C, FledelAlon A, Ober C, Przeworski M, Coop G, de Massy B: PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice.
Science 2010, 327(5967):836840. PubMed Abstract  Publisher Full Text

Kauppi L, Stumpf MPH, Jeffreys AJ: Localized breakdown in linkage disequilibrium does not always predict sperm crossover hot spots in the human MHC class II region.
Genomics 2005, 86(1):1324. PubMed Abstract  Publisher Full Text

Jeffreys AJ, Neumann R, Panayi M, Myers S, Donnelly P: Human recombination hot spots hidden in regions of strong marker association.
Nat Genet 2005, 37(6):601606. PubMed Abstract  Publisher Full Text

Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation.
Bioinformatics 2002, 18(2):337338. PubMed Abstract  Publisher Full Text

Cox MP, Hammer MF: A question of scale: human migrations writ large and small.
BMC Biol 2010, 8:98. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Cox MP, Morales DA, Woerner AE, Sozanski J, Wall JD, Hammer MF: Autosomal resequence data reveal late stone Age signals of population expansion in subsaharan african foraging and farming populations.
PLoS One 2009, 4(7):e6366. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cox MP, Woerner AE, Wall JD, Hammer MF: Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow.

Wall JD, Cox MP, Mendez FL, Woerner A, Severson T, Hammer MF: A novel DNA sequence database for analyzing human demographic history.
Genome Res 2008, 18:13541361. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Schmid J, Cannon RD, Holland BR: A futile act? Thoughts on the reproductive biology of Candida albicans.
Mycologist 2004, 18(4):158163. Publisher Full Text

Efron B, Tibshirani R: An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall; 1993.

Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap.
Evolution 1985, 39(4):783791. Publisher Full Text

Hastie T, Tibshirani R, Friedman J: The elements of statistical learning: data mining, inference and prediction. Berlin: Springer Verlag; 2009.

Beaumont MA, Nielsen R, Robert C, Hey J, Gaggiotti O, Knowles L, Estoup A, Panchal M, Corander J, Hickerson M: In defence of modelbased inference in phylogeography.
Mol Ecol 2010, 19(3):436446. Publisher Full Text

Csilléry K, Blum MGB, Gaggiotti OE, François O: Approximate Bayesian Computation (ABC) in practice.
Trends Ecol Evol 2010, 25(7):410418. PubMed Abstract  Publisher Full Text

Cox MP: Accuracy of molecular dating with the rho statistic: deviations from coalescent expectations under a range of demographic models.
Hum Biol 2008, 80:335357. PubMed Abstract  Publisher Full Text

Cox MP, Mendez FL, Karafet TM, Metni Pilkington M, Kingan SB, DestroBisol G, Strassmann BI, Hammer MF: Testing for archaic hominin admixture on the X chromosome: model likelihoods for the modern human RRM2P4 region from summaries of genealogical topology under the structured coalescent.
Genetics 2008, 178:427437. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hey J, Nielsen R: Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis.
Genetics 2004, 167(2):747760. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fisher RA: The use of multiple measurements in taxonomic problems.
Ann Eug 1936, 7(2):179188. Publisher Full Text

Wu W, Mallet Y, Walczak B, Penninckx W, Massart DL, Heuerding S, Erni F: Comparison of regularized discriminant analysis, linear discriminant analysis and quadratic discriminant analysis, applied to NIR data.
Anal Chim Acta 1996, 329:257265. Publisher Full Text

Johnston HR, Cutler DJ: Population demographic history can cause the appearance of recombination hotspots.
Am J Hum Genet 2012, 90(5):774783. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Stochastic Process Appl 1982, 13:235248. Publisher Full Text

Wakeley J: Coalescent Theory: An Introduction. Greenwood Village, Colorado: Roberts & Company Publishers; 2008.

Kliman RM, Hey J: DNA sequence variation at the period locus within and among species at the Drosophila melanogaster complex.
Genetics 1993, 133:375387. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Saunders IW, Tavaré S, Watterson GA: On the genealogy of nested subsamples from a haploid population.
Adv Appl Prob 1984, 16:471491. Publisher Full Text

Thornton K: libsequence: a C++ class library for evolutionary genetic analysis.
Bioinformatics 2003, 19(17):23252327. PubMed Abstract  Publisher Full Text

Watterson GA: On the number of segregating sites in genetical models without recombination.
Theor Popul Biol 1975, 7(2):256276. PubMed Abstract  Publisher Full Text

Hudson RR, Kaplan NL: Statistical properties of the number of recombination events in the history of a sample of DNA sequences.
Genetics 1985, 111(1):147164. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Myers SR, Griffiths RC: Bounds on the minimum number of recombination events in a sample history.
Genetics 2003, 163(1):375394. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nei M: Molecular evolutionary genetics. New York: Columbia University Press; 1987.

Wall JD: Recombination and the power of statistical tests of neutrality.
Genet Res, Camb 1999, 74:6579. Publisher Full Text

Hudson RR: Estimating the recombination parameter of a finite population model without selection.
Genet Res 1987, 50:245250. PubMed Abstract  Publisher Full Text

Kelly JK: A test of neutrality based on interlocus associations.
Genetics 1997, 146:11971206. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

A Language and Environment for Statistical Computing. 2012.
http://www.rproject.org webcite