Abstract
Background
Descriptive hierarchical Poisson models and populationgenetic coalescent mixture models are used to describe the observed variation in singlenucleotide polymorphism (SNP) density from samples of size two across the human genome.
Results
Using empirical estimates of recombination rate across the human genome and the observed SNP density distribution, we produce a maximum likelihood estimate of the genomic heterogeneity in the scaled mutation rate θ. Such models produce significantly better fits to the observed SNP density distribution than those that ignore the empirically observed recombinational heterogeneities.
Conclusion
Accounting for mutational and recombinational heterogeneities can allow for empirically sound null distributions in genome scans for "outliers", when the alternative hypotheses include fundamentally historical and unobserved phenomena.
Background
Understanding the populationgenetic forces behind the observed variation among human genome sequences is vital to deciphering the genetic causes of phenotypic variation among humans. The phenomena that influence the density of human SNPs include (1) variationintroducing events that are empirically observable, such as, pointmutations, recombinations, and activities of various transposable elements that may result from the counteraction of various DNA damage and repair pathways [[1], for e.g.], as well as (2) genealogyaffecting events that are historical and generally unobserved, such as population dynamics, population structure, and natural selection. A biological understanding of the observed genomic variation in SNP density, by means of explicit populationgenetic models of coalescence in the presence of recombination and mutation, must incorporate any interplay among the heterogeneities in the above phenomena. Here we strive for an empirically sound understanding of the observed human SNP density, as determined by a genomewide alignment of two different consensus sequences, by accounting for the empirically observable mutational and recombinational heterogeneities under the simplest model of population history (selectivelyneutral, constantsized, randommating). The two sequences are the NCBI human genome sequence and the sequence produced by Celera Genomics [2]. Our SNP density data were obtained from first aligning the Celera consensus sequence to the NCBI assembly and then counting the number of SNPs in bins of 100 kb (100,000 base pairs), as was done in section 6 of the above study [2]. Next, we build simple models for the distribution of SNP density from random samples of size 2 from a locus that is 100 kb in length. Our objective is to explain as much of this simple measure of diversity as possible, under empirically sound null hypotheses that include coarsegrained, genomewide measurements of recombinational variation.
Results
Two approaches toward modeling are taken. The first approach is descriptive and employs hierarchical Poisson models to obtain better fits than the homogeneous Poisson distribution used earlier [2]. Insights gained from the first approach inform the second approach. The second approach is nondescriptive and populationgenetic with biologically interpretable parameters. It employs mixtures of SNP densities simulated under the coalescent with different mutation and recombination rates to obtain a better fit to the observed SNP density distribution. This approach introduces heterogeneity into the coalescentbased simulation of SNP density that was shown to produce a poor fit under the assumptions of genomewide homogeneity and equality of mutation and recombination rates [2]. The simple closedform expressions used in the paper are elementary results in coalescent theory [3,4].
Descriptive Hierarchical Poisson Models
Let Λ and T be the parameters in the mass function of a Poisson distribution given by Pr(X = xΛT) = e^{ΛT}(ΛT)^{x}/x!. The random variables Λ and T are generally proxies for relative mutation rate and the sum of branch lengths of the coalescent trees for all the nonrecombining segment(s) of the 100 kb locus, respectively. In other words, T is a proxy for the sum of the branch lengths of the ancestral recombination graph (ARGsize) of our sample of size 2 at a locus that is 100 kb long. The random variable X represents the count of SNPs in contiguous 100 kb intervals from an alignment of two human genomes. In this hierarchical scheme, heterogeneities are modeled by the Gamma and Beta probability density functions (PDFs), G(γ_{1}, γ_{2}) and B (β_{1}, β_{2}), respectively, as described in Methods. We chose the Gamma distribution G(γ_{1}, γ_{2}) to model T for the following reasons. When there is no recombination, the depth of the coalescent tree of two samples is exponentially distributed with rate parameter 1, i.e., G(1,1). And when there are n sites with free recombination in between them, the sum of the n independent and exponentially distributed depths is G(n, 1). Thus, T is only a mathematically convenient proxy for the ARGsize of our sample of size 2, since T ~ G(γ_{1}, γ_{2}) does not explicitly capture the distribution of ARGsize for intermediate levels of intralocus recombination among sites at our locus. We use the relatively flexible Beta family on [0, 1] to model, Λ which is a proxy for relative mutation rate. The Poisson distribution for SNP density follows from the assumption of the infinitelymanysites mutational model under selective neutrality, where mutations hit a site at most once according to the product of the total length of the sitespecific coalescent tree and the sitespecific relative mutation rate. Therefore, such hierarchical Poisson models are merely descriptive, as they are built via mathematically convenient Beta and Gamma distributed random variables Λ and T that act as proxies for the relative mutation rate and the ARGsize, respectively.
The likelihood function for each of the following hierarchical Poisson models was maximized with the Newton's method from several random initial conditions. We use the Akaike information criterion (AIC) [5] to make model comparisons. For a given model AIC = 2 log(ML) + 2K, where ML is the maximum likelihood value and K is the number of parameters in the model. In the hierarchical Poisson model A, we allow T ~ G(γ_{1}, γ_{2}), while Λ is fixed at 1. The fit to the data (Figure 1) improved in comparison to the homogeneous Poisson fit which completely ignores the underlying ancestral recombination process. Thus, when the Gamma distribution is used to approximate the distribution of the sum of all branch lengths of the ancestral recombination graph (ARGsize) of a locus, the observed variance is better explained. Model A is mutationally homogeneous as Λ, the proxy for mutation rate, is fixed. In order to allow variation, a hierarchical Poisson model A' that restricts T to a constant parameter λ while allowing Λ to be Beta distributed (Λ ~ B(β_{1}, β_{2})) was fit to the data. The fit was significantly better than that of model A. Thus, modeling heterogeneity in mutation rates, via the Beta distributed proxy Λ, across the different 100 kb loci gives better fits to the SNP density distribution. When we allowed both Λ to be Beta distributed and T to be Gamma distributed, we get the hierarchical Poisson model B. As shown in Figure 1, the fit is significantly better to the observed data when heterogeneities in both mutation and recombination are approximately accounted for through the proxies in model B. The results of the maximum likelihood (ML) analysis of these four Poisson models are summarized in Table 1. The first and second moments (, and ) under the maximum likelihood estimates are also shown for each model in the Table. Note that the means are almost the same but the variances vary considerably. If one wants a datadescriptive fit to the SNP density distribution, then Model B is a good candidate. With the arrival of more refined data (with counts in lowdensity regions as discussed later) one could consider further generalizations of such hierarchical Poisson models along the zeroinflated class [6], for instance, to obtain better descriptive fits. Unfortunately, the bestfitted parameters of such descriptive models lack any explicit biological interpretability, in terms of standard populationgenetic models of reproduction. Guided by insights from these descriptive hierarchical Poisson models, we analyze the simplest populationgenetic model of the neutral coalescent with an explicit accounting for heterogeneities in both mutation and recombination rates. We use a simulated maximum likelihood framework [7] for parameter estimation.
Figure 1. Fits of the homogeneous Poisson model (large gray dots), hierarchical Poisson model A (black dots) with T ~ G(γ_{1}, γ_{2}) hierarchical Poisson model A' (gray line) with T = λ and Λ ~ B(β_{1}, β_{2}), and hierarchical Poisson model B (black line) with T ~ G(β_{1}, β_{2}) and Λ ~ B(β_{1}, β_{2}) to the observed SNP density distribution (joined gray dots).
Table 1. Maximum likelihood analysis and comparison of Poisson models
PopulationGenetic Coalescent Mixture Models
A panmictic, WrightFisher, neutral coalescent model with a constant effective population size of 10,000 diploid individuals was assumed to simulate the distribution of the number of segregating sites at a locus of 100 kb evolving under an infinitelymanysites mutation model using the C program ms [8]. The scaled product of the effective population size (N_{e}) and the mutation rate per locus per generation (μ) is denoted by θ = 4N_{e}μ. The recombination rate r is the probability of crossover per generation between the ends of the locus being simulated and its scaled product with N_{e }is denoted by ρ = 4N_{e}r.
In the absence of recombination and with constant mutation rates, the distribution of SNPs is known to have an explicit form. The coalescent tree is identical for every nucleotide site in the locus in any given realization of the coalescent process of two samples. Since the rescaled time to the coalescent event and the mutation event are exponentially distributed with rates 1 and θ, respectively, the probability of a mutation event before the coalescent event is θ/(1 + θ). Thus, the probability of observing x mutations at our locus before the coalescent event is (θ/(1 + θ))^{x }1/(1 + θ). In other words, the probability of observing x SNPs at a locus when r = 0 is geometrically distributed with parameter 1/(1 + θ).
It is also known that as the recombination rate at our locus approaches infinity, the distribution of SNPs approaches a Poisson distribution with parameter θ. This can be seen from the following argument. High levels of recombination assures that the coalescent tree at each site is independent of those at other sites. Thus, for a locus with n sites, the probability of observing x SNPs is . For large loci, this binomial mass function is known to approximate e^{θ}θ^{x}/x!, the Poisson mass function, as n → ∞ and .
However, when the recombination rate is some intermediate value between the above two extremes no explicit forms are known for the SNP density. We use empirical estimates of the SNP density from a large number of simulations (typically 100,000). Figure 2 shows how the distribution of SNP density under our assumptions morphs from the geometric distribution (black dots) towards the Poisson distribution (grey dots) as the scaled recombination rate ρ increases from 0 to 1000 in decreasing shades of grey. This behavior is identical for any fixed value of θ except for a scale change.
Figure 2. The distribution of SNP density in 100 kb morphs from the geometric distribution (black dots) towards the Poisson distribution (gray dots) as the scaled recombination rate ρ increases from 0 to 1000 in decreasing shades of gray for θ = 10 (top) and θ = 100 (bottom).
The empirical estimates of the sexaveraged human recombination rates in 1 Mbp intervals based on Genethon [9], Marshfield [10] and deCODE [11] maps were downloaded from [12]. We intrapolated to obtain the estimates over 100 kb segments by assuming rate constancy over the 10 consecutive 100 kb segments that constitute the 1 Mbp segment for which an empirical estimate of the recombination rate were available. The empirical distribution of the sexaveraged human recombination rate in 100 kb intervals, based on Genethon map, as shown in Figure 3, is denoted by . The strategy described in Methods was used to obtain a simulationbased empirical estimate of the SNP density distribution for each scaled mutation rate
Figure 3. The distribution of the empirical estimates of the sexaveraged recombination rate in 100 kb segments of the human genome from the Genethon map (joined black dots) and , the maximum simulated likelihood estimate of the weights on θ_{i }∈ Θ (gray line) for the coalescent mixture model.
when the recombination rate was assumed to be distributed according to . We denote this simulationbased estimate of the SNP density distribution for each θ_{i }∈ Θ by . Note that , the true SNP density distribution, as the number of replicates (N) used to estimate it grows large. In practice, N was set at 100,000. A discretized and rescaled Beta density with parameters α and β was used to find the mixing weights for each θ_{i }∈ Θ. Thus, for every ordered pair (α, β), the shape of the Beta density specified the mixing weights, as follows:
where, Θ = 304 is the cardinality or size of the set Θ. Such (α, β)specified 's were used to weigh the corresponding 's in order to obtain a finite mixture of the form . A simulated likelihood function of α and β was thus constructed for the given SNP data X = (x_{0}, ∈, x_{n}), as follows,
We used the Newton's method to find the maximum simulated likelihood (MSL) estimates = 6.7 and = 14.9 (MSL = 185555). We also did a leastsquares fit of the observed to the predicted densities and found comparable estimates. Empirical estimates of the sexaveraged recombination rates from deCODE, and Marshfield maps were also used in a similar analysis. Comparable estimates were obtained under a reasonably good fit (MSL = 185558) with the deCODE map whose empirical CDF resembles that of the Genethon Map. However, an analysis with the Marshfield map yielded a poorer fit (MSL = 186007). Figure 4 summarizes the fits to the observed SNP data while Figure 3 shows the marginal density of ρ from the Genethon map and the marginal density of θ under the maximum simulated likelihood estimates ( = 6.7, = 14.9) with mean, variance, and standard deviation given by 90.7, 876.1, and 29.6, respectively. Among the three coarsescaled maps of the empirical estimates of the sexaveraged human recombination rates, the Genethon map gave the best fit to our observed SNP density distribution data.
Figure 4. The SNP density distribution (joined gray dots), Poisson distribution with mean 90 (large gray dots), simulated distribution of SNPs with ρ = θ = 90 (gray line), and the Maximum Simulated Likelihood estimate from the coalescent simulations with ρ ~ θ_{i }~ and (black line).
Discussion
Another study [13] claimed to have achieved a good fit to single reads with 0, 1, 2, 3, or 4 SNPs, by accounting for mutational heterogeneity and genealogical variability in a different manner. They partitioned the genome into 200 kb bins, and selected a single read from each bin. They calculated the observed GC content of the bin, and from a regression of GC content on nucleotide diversity across the whole genome, they calculated an expected diversity given the local GC content of each bin induced by the exponentially distributed coalescent time for samples of size 2 in the absence of recombination. However, when the full bin size of 100 kb were used [2], the SNP count ranged to more than 100 per bin. Because many neighboring reads have shared genealogies, the magnitude of variability from bin to bin is much greater, and the power to detect this heterogeneity is far greater. Thus, the latter study [2] found that the coalescent in the presence of recombination fit the observed SNP density better than the coalescent without recombination. The model employed in the former study [13] fits without recombination only because the power is so low to detect a departure and because there are correspondingly fewer recombination events expected within single reads vs. 100 kb bins. Using the data of SNP counts in 100 kb bins in this study, we find that the coalescent with heterogeneities in recombination as well as mutation gives substantially better fits than the coalescent with a constant rate of recombination and mutation. We have shown that by invoking heterogeneities in mutation and recombination rates, one can better explain the observed variation in SNP density across two randomly sampled 100 kb segments of human chromosomes. Descriptive fits by means of hierarchical Poisson models, as well as populationgenetic fits by means of coalescent mixture models, significantly improved when heterogeneities in recombination as well as mutation rates were accounted for. The coalescent mixture model does not completely fit the data in the most interesting region, namely, the segments with the least SNP density. This is partly due to the filtering strategy used to obtain the data. Since there were considerable gaps in the alignments for several bins, there was an overestimation of bins with 0 SNPs. Thus, these bins were ignored from the analysis. Were low SNP counts from such currently ignored bins made available from a highresolution alignment, a similar analysis would reveal the poorer fits of the descriptive hierarchical Poisson models employed here, unless they are further generalized to allow for a larger mass at 0 through the zeroinflated class [6], for instance. If one's objective is to produce a descriptive fit to our observed SNP density distribution, then the hierarchical model B is clearly preferable to all the models considered in this study due to its strikingly high likelihood value. However, if one wanted a populationgenetic model with biologically interpretable parameters to fit the same data, then the best fitted coalescent mixture model with the Genethon recombination map is preferable.
It is important to bear in mind that the distribution of T will be affected not only by recombination rate but also by population structure and demography. Likewise, the distribution of Λ and T will be affected by the complex interaction between various DNA damage and repair pathways that ultimately lead to various types of mutational and recombinational events [[1], for e.g.]. Moreover, the action of selection will simultaneously affect both the distribution of T and Λ about the selected site(s). However, since only a small percentage of the genome is expected to be affected by recent selective sweeps, the overall SNP density distribution should not be significantly affected by such selective events. Thus, our MSL estimate of the genomic variation in θ, based on the Genethon map, is under the standard neutral coalescent that allows for recombinational and mutational rate heterogeneity across the genome. The true genomic variation in θ can also be affected by several other confounded historical factors including selection, population structure, and demography, besides genomic variation in mutation rate. All these confounded historical factors can be seen as alternative hypotheses to the null hypothesis of our coalescent mixture model for the SNP density distribution, i.e., the standard neutral coalescent with genomic heterogeneity in recombination and mutation rates.
Conclusion
As high resolution data for larger samples become available at a genomic scale, one can use such simulated ML methods (with appropriate sample sizes) to get the null distributions of various test statistics while accounting for heterogeneities in recombination rates (from empirical maps or finerscaled inferred maps) and mutation rates (from the informative phylogenomic constraints imposed by additional ape genomes). Such empirically observable phenomena should be incorporated into the null hypothesis when more complex models with unobserved historical phenomena, such as population dynamics, population structure, and/or natural selection are tested in humans at the genomic scale. Current scans of the human genome tend to underestimate the costs of ignoring the empirically observable heterogeneities under the null hypothesis.
Methods
In the hierarchical Poisson scheme, heterogeneities are modeled by the following Gamma and Beta probability density functions (PDFs),
The following strategy was used to obtain a simulationbased empirical estimate of the SNP density distribution for each scaled mutation rate
when the recombination rate was assumed to be distributed according to
.1. for each θ_{i }∈ Θ, repeat N times:
(a) sample a ρ according to
(b) simulate the coalescent according to ρ and θ_{i }[4,8]
(c) record the number of SNPs
2. Obtain the empirical distribution of SNP density for the given θ_{i }when ρ ~
Authors' contributions
AGC posed the question, RS and RTD made simple models, RS implemented the models and all three authors edited the manuscript.
Acknowledgements
RS is a Research Fellow of the Royal Commission for the Exhibition of 1851. RS and RTD are partially supported by the National Science Foundation/National Institutes of Health Grant DMS/NIGMS 0201037. RS thanks Arkendra De, Kevin Thornton, and Russell Zaretzki for insightful discussions and Gilean McVean for comments on an earlier draft. Critically constructive comments of two anonymous referees improved the manuscript.
References

Schärer OD: Chemistry and biology of DNA repair.
Angew Chem Int Ed 2003, 42:29462974. Publisher Full Text

Venter JC, et al.: The sequence of the human genome.
Science 2001, 291:13041351. PubMed Abstract  Publisher Full Text

Stochastic Process Appl 1982, 13:235248. Publisher Full Text

Hudson RR: Properties of a neutral allele model with intragenic recombination.
Theoretical Population Biology 1983, 23:183201. PubMed Abstract  Publisher Full Text

Akaike H: A new look at the statistical model identification.
IEEE Trans Autom Control 1974, 19:716723. Publisher Full Text

Bhöning D: ZeroInflated Poisson models and C.A.MAN: A tutorial collection of evidence.
Biometrical Journal 1998, 40:833843. Publisher Full Text

Pakes A, Pollard D: Simulation and the Asymptotics of Optimization Estimators.
Econometrica 1989, 57:10271057. Publisher Full Text

Hudson RR: Generating samples under a WrightFisher neutral model of genetic variation.
Bioinformatics 2002, 18:337338. PubMed Abstract  Publisher Full Text

Dib C, Faure S, Fizames C, Samson D, Drouot N, Vignal , Millasseau P, Marc S, Kazan J, Seboun E, Lathrop M, Gyapay G, Morissette J, Weissenbach J: A comprehensive genetic map of the human genome based on 5,264 microsatellites.
Nature 1996, 380:152154. PubMed Abstract  Publisher Full Text

Broman KW, Murray JC, Sheffeld VC, White RL, Weber JL: Comprehensive human genetic map: individual and sexspecific variation in recombination.
Am J Hum Genet 1998, 63:861869. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir SJB, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K: A highresolution recombination map of the human genome.
Nature Genetics 2002, 31:241247. PubMed Abstract  Publisher Full Text

UCSC Genome Annotation Database [http://genome.ucsc.edu/goldenPath/gbdDescriptions.html] webcite

The international SNP map working group: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms.
Nature 2001, 409:928933. PubMed Abstract  Publisher Full Text