Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, UT, USA

Department of Anthropology, University of Utah, Salt Lake City, UT, USA

Abstract

Background

Though a variety of linkage disequilibrium tests have recently been introduced to measure the signal of recent positive selection, the statistical properties of the various methods have not been directly compared. While most applications of these tests have suggested that positive selection has played an important role in recent human history, the results of these tests have varied dramatically.

Results

Here, we evaluate the performance of three statistics designed to detect incomplete selective sweeps: LRH, iHS, and ALnLH. To analyze the properties of these tests, we introduce a new computational method that can model complex population histories with migration and changing population sizes to simulate gene trees influenced by recent positive selection. We demonstrate that iHS performs substantially better than the other two statistics, with power of up to 0.74 at the 0.01 level for the variation best suited for full genome scans and a power of over 0.8 at the 0.01 level for the variation best suited for candidate gene tests. The performance of the iHS statistic was robust to complex demographic histories and variable recombination rates. Genome scans involving the other two statistics suffer from low power and high false positive rates, with false discovery rates of up to 0.96 for ALnLH. The difference in performance between iHS and ALnLH did not result from the properties of the statistics, but instead from the different methods for mitigating the multiple comparison problem inherent in full genome scans.

Conclusions

We introduce a new method for simulating genealogies influenced by positive selection with complex demographic scenarios. In a power analysis based on this method, iHS outperformed LRH and ALnLH in detecting incomplete selective sweeps. We also show that the single-site iHS statistic is more powerful in a candidate gene test than the multi-site statistic, but that the multi-site statistic maintains a low false discovery rate with only a minor loss of power when applied to a scan of the entire genome. Our results highlight the need for careful consideration of multiple comparison problems when evaluating and interpreting the results of full genome scans for positive selection.

Background

Until a few years ago, studies of positive selection were limited to sequence data from a single gene covering only a few thousand nucleotides. Now that detailed genetic maps of human variability are available in many populations, it is possible to measure the signature of positive selection on a genomic scale.

Most of the discussion surrounding these genome scans has focused on the similarities of their results, since all indicate that positive selection has been a surprisingly important force in recent human evolution.

While several studies have estimated the power of EHH statistics to infer positive selection, the statistical power of FRC has not yet been explored. To address this gap, we use simulated data to compare the properties of FRC and EHH statistics. We first examine the power of the single-site statistics of each method under explicit null models of neutrality and alternative models of selection. We then estimate the false positive rate, power, and false discovery rate of each test when applied to an empirical distribution of its respective statistic based on a combination of neutral and selected loci.
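The three quantities named above can be illustrated with a minimal sketch. It assumes that larger scores indicate selection, that the critical value is the upper-tail quantile of the pooled (empirical) distribution of neutral and selected loci, and that the function names and the default significance level are hypothetical choices made for this example rather than details of the original analysis.

```python
import numpy as np

def critical_value(reference_scores, alpha=0.01):
    """Upper-tail critical value: the (1 - alpha) quantile of the reference scores."""
    return float(np.quantile(reference_scores, 1.0 - alpha))

def scan_error_rates(neutral_scores, selected_scores, alpha=0.01):
    """Power, false positive rate, and false discovery rate when the critical
    value is taken from the pooled distribution of neutral and selected loci."""
    neutral = np.asarray(neutral_scores, dtype=float)
    selected = np.asarray(selected_scores, dtype=float)
    cutoff = critical_value(np.concatenate([neutral, selected]), alpha)
    true_pos = int(np.sum(selected > cutoff))   # selected loci that are called
    false_pos = int(np.sum(neutral > cutoff))   # neutral loci that are called
    called = true_pos + false_pos
    return {
        "power": true_pos / selected.size,
        "false_positive_rate": false_pos / neutral.size,
        "false_discovery_rate": false_pos / called if called else 0.0,
    }
```

For a test against an explicit null model, the same `critical_value` function would instead be applied to scores from neutral simulations only, and power would be the fraction of selected-locus scores above that cutoff.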

The available computational methods for simulating genealogies cannot easily model complex demographic scenarios combined with the presence of positive selection. Most methods require a single population of constant size. This is problematic when evaluating the statistical power of LD-based tests in the presence of positive selection, as population bottlenecks and subdivision can create LD that mimics that generated by selection. Here, we introduce a new approach for simulating positive selection in complex population histories with subdivision, migration, bottlenecks, and expansions in a coalescent framework. With this approach, we first generate a set of potential allele trajectories for the favored allele using forward-in-time simulations. Then for each backwards-in-time simulation, we select an allele trajectory at random and condition the coalescent simulation on the population sizes and migration history of the favored allele as specified by the allele trajectory (see Methods).

Results

In our analysis, we considered one test derived from the FRC statistic, ALnLH, and two tests derived from the EHH statistic, LRH and iHS.

Throughout the analysis, we calculated two versions of the FRC statistic. As originally presented by Wang et al., FRC is calculated from unphased data using the individuals homozygous for each allele at the focal site _{p }and ALnLH_{u}, with the phased statistic using information from both homozygotes and heterozygotes to infer FRC. As shown in Figure _{u}, the effect should be small given the accuracy of current phase estimation technology and that ALnLH_{u }ignores information from all heterozygote comparisons

**Power to detect selection from single-site statistics with a constant recombination rate**. For all figures, the power was averaged across four population histories: constant size, expansion, expansion with migration, and bottleneck with migration. Both ALnLH_{p} and iHS performed quite well in most models. The power of LRH was consistently lower than that of the other statistics. Neutral simulations for each set of simulation parameters provided the critical values for each statistic. **a.** Power to detect selection for allele frequencies between 0.2 and 0.8 with a simulated region of 1 Mb at a significance level of 0.01. ALnLH_{p} and ALnLH_{u} were equivalent when allele frequencies were close to 0.5, but the power of ALnLH_{u} dropped by 40% at allele frequencies of 0.2 and 0.8. **b.** Power to detect selection in simulated regions of 0.1 Mb to 1 Mb. The power was calculated from an equal proportion of allele frequencies 0.2, 0.4, 0.6, and 0.8 for the favored allele at a significance level of 0.01. The average power increased substantially for ALnLH and iHS out to region lengths of 400 Kb, beyond which there was little improvement. **c.** Power to detect selection for significance levels of 0.005 to 0.05 with a simulated region of 1 Mb and an equal proportion of allele frequencies 0.2, 0.4, 0.6, and 0.8 for the favored allele. The average power of ALnLH_{p} and iHS was over 0.9 for significance levels of 0.01 or greater.

In general, the properties of iHS and ALnLH_{p }were similar when the recombination rate was constant (Figure _{p }was much greater, as shown in Figure _{p }dropped by 46%, while iHS dropped by only 8% for (Figure _{p}, the global recombination rate is based on the observed decay of LD at G6PD and the genome deviation from the G6PD model

**Power to detect selection from single-site statistics for various demographic models with a constant recombination rate**. All statistics performed well under all four population histories. The only statistic notably sensitive to population history was iHS, which performed particularly well in models with expansion and relatively worse in models with bottlenecks and migration. The simulated region was 1 Mb in length with an equal proportion of allele frequencies 0.2, 0.4, 0.6, and 0.8 for the favored allele at a significance level of 0.01.

**Effects of variable recombination rate on the power of selection statistics**. The variable recombination rate reduced the power of iHS by 8% and the power of ALnLH_{p} by 46%. The locus recombination rate for each simulation was set to an exponential random variate with a mean of 1 cM/Mb. For other simulation parameters, see Figure 2.

For the results presented above, we calculated a statistic for each SNP and evaluated the power to detect selection, with the null hypothesis of neutrality and the alternative hypothesis of strong positive selection acting on the SNP in question. This is an appropriate test for positive selection when the investigator has a prior hypothesis about the potential influence of natural selection and when there are a small number of candidate loci. However, as we demonstrate below, when this simple strategy is applied to an uninformed scan across the genome, it introduces a multiple testing problem that heavily weights the significant results toward false positives. The testing methodology that Voight et al.

Figure _{u}, we treat the test statistic for each SNP within a 200 Kb region as a separate (but not independent) test for each gene. While the false positive rate per SNP was 0.016 for ALnLH_{u }

**Power, false positive rates, and false discovery rates for a) iHS and b) ALnLH_{u}**. To obtain critical values, we combined loci from neutral simulations with loci from selection simulations in proportion to the fraction of the genome influenced by positive selection.

Discussion

From our evaluation of false discovery rates, we can estimate the number of false discoveries for each genomic scan. Of the 1799 candidate genes identified by Wang et al.

While the single-site statistics used in these studies perform equally well under simulations with constant recombination rates, several factors inhibited the performance of ALnLH. These factors primarily involve implementation details of the test and not the properties of the FRC statistic itself. Since both the ALnLH and iHS methods measure the long-range LD for each allele at each focal site, it may be possible to design a test based on the FRC statistic that matches or exceeds the performance of iHS using the Voight et al. implementation as a template.

Throughout our analysis of EHH statistics, iHS consistently outperformed LRH. Since specific guidelines are not available for determining the core haplotype region and level of EHH decay for LRH, we may have underestimated the power of LRH. However, we tested four sets of parameter values using examples in Sabeti et al. as a guide.

Our estimates for the power of the iHS test were consistently higher than those reported in Voight et al.

There are two other considerations when comparing the power analysis from Voight et al.

As pointed out by Przeworski et al., empirical scans for selection will miss many selection events when they are applied to genomes that have been heavily influenced by recent positive selection.

Conclusions

In agreement with previous findings, our results demonstrate that the multi-site iHS test is an excellent test for detecting incomplete selective sweeps in full genome scans, with power between 0.33 and 0.74 and false discovery rate between 0 and 0.53 at the 0.01 level. In comparison, the power of the ALnLH test in full genome scans was approximately 25% lower with a false discovery rate between 0.74 and 0.96. However, the statistical properties of the two statistics are quite similar when applied to a single site in a candidate gene test, with power of over 0.8 at the 0.01 level, demonstrating the importance of differences in the adjustments made for multiple tests in full genome scans. Our results highlight the need for careful consideration of multiple comparison problems when evaluating and interpreting the results of full genome scans for positive selection. The algorithm we present for simulating genealogies influenced by positive selection will allow for more thorough exploration of complex demographic scenarios when evaluating methods for detecting positive selection.

Methods

Simulating the allele trajectory

To simulate positive selection, we employed the coalescent framework first proposed by Kaplan et al. ^{2}+n. Pickrell et al.

In the interest of developing a more flexible method, we introduce a new importance-sampling method based on forward Wright-Fisher drift. Consider a sample of n sequences from a single subpopulation, x of which carry a favored allele that originated t generations ago with a selection coefficient of s. We would like to draw randomly from the trajectories that produce x modern copies of the favored allele in a sample of size n. To accomplish this, we simulate the forward trajectory of the favored allele, continuing until the allele is lost, becomes fixed, or until t generations have passed. Let p equal the frequency of the allele in the subpopulation in the final generation. Then the importance weight for our desired distribution is the binomial likelihood function:

$$w = \binom{n}{x} p^{x} (1 - p)^{n - x}$$

Because Wright-Fisher drift is a Markov process, the importance weight depends only on the allele frequency in the final generation. In contrast, Slatkin's method employs a backward process that is only a rough approximation to Wright-Fisher drift, so the sampling weight must be calculated over the entire history of the two alleles with a separate term for each population and for each potential migration path in each generation.
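The forward-simulation and weighting steps can be sketched as follows for a single subpopulation. This is an illustration under stated assumptions, not the implementation used for the results: it assumes genic selection with relative fitness 1 + s, a favored allele that starts as a single copy, a hypothetical `pop_sizes` list giving the number of chromosomes in each generation, and resampling of trajectories in proportion to their weights as one possible way of using the importance weights.

```python
import math
import random

def forward_trajectory(pop_sizes, s, rng):
    """Simulate the favored allele forward in time in one subpopulation.

    pop_sizes[g] is the number of chromosomes in generation g (assumed input
    format). The allele starts as one new copy and the simulation stops early
    if it is lost or fixed. Returns the list of allele frequencies.
    """
    freq = 1.0 / pop_sizes[0]
    trajectory = [freq]
    for size in pop_sizes[1:]:
        # Expected frequency after genic selection, then binomial drift.
        expected = freq * (1.0 + s) / (1.0 + s * freq)
        copies = sum(rng.random() < expected for _ in range(size))
        freq = copies / size
        trajectory.append(freq)
        if freq == 0.0 or freq == 1.0:
            break
    return trajectory

def importance_weight(p, n, x):
    """Binomial likelihood of x favored copies in a sample of n given frequency p."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def sample_trajectory(pop_sizes, s, n, x, n_forward, rng):
    """Draw one trajectory with probability proportional to its importance weight."""
    trajectories = [forward_trajectory(pop_sizes, s, rng) for _ in range(n_forward)]
    weights = [importance_weight(t[-1], n, x) for t in trajectories]
    if sum(weights) == 0.0:
        raise ValueError("no forward trajectory is compatible with the sample")
    return rng.choices(trajectories, weights=weights, k=1)[0]
```

Because the weight uses only the final frequency `t[-1]`, trajectories that were lost or fixed receive zero weight whenever 0 < x < n and are never selected.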

Because the allele trajectory is generated from Wright-Fisher forward simulation, this method can seamlessly model complex demographic scenarios that include bottlenecks, expansions, and population subdivision with migration. The main drawback of this flexibility is the potential for choosing parameter values that rarely result in population allele frequencies near the observed frequency in the sample. This concern must be evaluated when choosing parameter values, as some will require a prohibitive number of forward simulations to cover the sample space. However, all of the backward-time methods are approximations to a forward Wright-Fisher process, and are meant to model natural processes that clearly occur in forward time, so this method is adequate for exploring most relevant models of positive selection and demography. For models where the sample allele frequency is particularly unlikely, Slatkin's method will be preferable since it involves a backward process conditioned on the sample.

For the results presented here, we set s and t to fixed values, though in principle they could be set to random variates in each forward iteration, reflecting uncertainty around estimates of selection strength and allele age. If t is a random variable, each origin generation-subpopulation must be weighted by its respective population size to reflect the probability that a new mutation originates in that generation.

Coalescent simulations

We assumed all recombination events were crossovers, where a crossover occurs with the favored or neutral allele with probability proportional to the frequency of the alleles in the subpopulation.

The trajectory of the favored allele was generated under a model where the migration rates are constant between subpopulations for each epoch. However, since a trajectory is in part a realization of this random process, we could not assume constant migration rates in a coalescent simulation based on a particular trajectory. The number of individuals of each genotype migrating to and from each population in a given generation is determined by the forward simulation and is therefore treated as a constant during the backward simulation. The individual migrants themselves are, however, chosen at random during the backward simulation. To implement this process, we introduced two migration lookup tables. The first table was analogous to the coalescence lookup table, storing the cumulative hazard of migration out of a given subpopulation for each allele. We used the second table to determine the destination subpopulation of a migrant, by storing the conditional probability of migrating from an origin subpopulation to a destination subpopulation given that a migration event occurred out of the origin subpopulation in a particular generation. Expanding on Coop and Griffiths' method, we accessed the coalescence and migration lookup tables with uniform random variates to generate the waiting time until the next event for each subpopulation-allele combination.
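One way such lookup tables could be accessed with uniform random variates is inverse-transform sampling, sketched below. The table layout, the function names, and the restriction to a single lineage are assumptions made for illustration; the sketch does not reproduce the data structures of the original simulator.

```python
import bisect
import math
from itertools import accumulate

def cumulative_hazard_table(event_probs):
    """Cumulative hazard of an event (for example, migration out of a
    subpopulation) by generation, going backward in time. event_probs[g] is
    the per-lineage probability of the event in generation g (assumed input)."""
    hazards = (-math.log1p(-min(p, 1.0 - 1e-12)) for p in event_probs)
    return list(accumulate(hazards))

def waiting_time(table, u):
    """Inverse-transform sampling with u in (0, 1]: the first generation whose
    cumulative hazard reaches -ln(u), or None if no event occurs in the table."""
    g = bisect.bisect_left(table, -math.log(u))
    return g if g < len(table) else None

def destination(cond_probs, u):
    """Choose a destination subpopulation from conditional migration
    probabilities (summing to 1) using a uniform variate u in [0, 1)."""
    cumulative = list(accumulate(cond_probs))
    return min(bisect.bisect_right(cumulative, u), len(cond_probs) - 1)
```

Since `random.random()` can return 0.0, a caller would pass `1.0 - random.random()` to `waiting_time` so that the logarithm is always defined.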

Ascertainment bias

To introduce ascertainment bias to the simulated data, we developed a procedure to model the process in the Perlegen dataset. In their SNP discovery process, they identified all polymorphic sites in a fully sequenced subsample, then genotyped those sites in a larger sample.
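The two-stage discovery process described above can be mimicked on simulated haplotypes with a simple filter. The sketch below assumes a 0/1 haplotype matrix and a hypothetical `discovery_size` parameter for the number of sequenced chromosomes; it illustrates the general idea rather than the exact procedure applied to the simulated data.

```python
import numpy as np

def ascertain_sites(haplotypes, discovery_size, rng):
    """Keep only sites that are polymorphic in a random discovery subsample.

    haplotypes: 2-D array (chromosomes x sites) of 0/1 alleles.
    discovery_size: number of chromosomes in the sequenced subsample
    (hypothetical parameter). Returns the filtered array and the kept indices.
    """
    haplotypes = np.asarray(haplotypes)
    rows = rng.choice(haplotypes.shape[0], size=discovery_size, replace=False)
    derived_counts = haplotypes[rows].sum(axis=0)
    # A site is discovered only if both alleles appear in the subsample.
    keep = np.flatnonzero((derived_counts > 0) & (derived_counts < discovery_size))
    return haplotypes[:, keep], keep
```

A call such as `ascertain_sites(haps, 40, np.random.default_rng(0))` returns the full sample genotyped only at the discovered sites, so rare variants are under-represented relative to the unfiltered simulation.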

Statistics

EHH is defined as the probability that two chromosomes in a sample share the same haplotype for a given set of SNPs.
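This definition can be made concrete with a short sketch that counts identical haplotype pairs among all pairs of chromosomes. The function name and the input format (one allele sequence per chromosome, already restricted to the SNPs of interest) are assumptions made for this example.

```python
from collections import Counter
from math import comb

def ehh(haplotypes):
    """Probability that two chromosomes drawn without replacement carry the
    same haplotype. haplotypes: list of equal-length allele sequences,
    restricted to the SNPs of interest."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(tuple(h) for h in haplotypes)
    # Pairs sharing a haplotype, divided by all possible pairs.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)
```

In practice the statistic is evaluated on the chromosomes carrying a given core allele or haplotype, with the set of SNPs extended outward from the core so that EHH decays with distance.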

Voight et al.

FRC is the fraction of inferred recombinant chromosomes between two sites within a sample

where X is the distance from the focal site.

For a given allele at a focal site, Wang et al. calculate FRC separately for each site within 500 Kb of the focal site with a minor allele frequency greater than 0.1.

Positively selected alleles that are much younger than G6PD will, in general, have larger LD blocks surrounding the selected allele. If the likelihood calculation were left unadjusted, this would result in low likelihood scores for alleles with very low LD, since they would be a poor fit to the G6PD model. This is also an issue for alleles older than G6PD or in regions with higher rates of recombination. Since these are undesired properties, Wang et al.

where

Here, Y_{i} is the FRC at site i, X_{i} is the distance from site i to the focal site, F is the expected value of FRC as a function of the distance from the focal site, N is the number of sites, and σ^{2} is the variance of g over the entire empirical distribution.

They calculate ALnLH for each allele at each site with a homozygote minor allele frequency of greater than 0.05. From the empirical distribution, they determine the average and standard deviation of ALnLH scores. Candidates for positive selection are those SNPs where one allele has an ALnLH score of 2.6 SD above the mean while the other allele has a score of less than 1 SD above the mean. In their 2006 study, these criteria included the top 1.6% of the empirical distribution.
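The candidate criteria can be expressed as a simple filter over per-allele scores. In the sketch below, the pooling of both alleles' scores to form the empirical distribution, the use of strict inequalities, and the function and parameter names are assumptions made for illustration.

```python
import numpy as np

def alnlh_candidates(scores_a, scores_b, high_sd=2.6, low_sd=1.0):
    """Flag SNPs where one allele scores above mean + high_sd * SD of the
    empirical distribution and the other allele scores below mean + low_sd * SD.
    scores_a[i] and scores_b[i] are the ALnLH scores of the two alleles at SNP i."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    pooled = np.concatenate([a, b])      # assumed empirical distribution
    mean, sd = pooled.mean(), pooled.std()
    high, low = mean + high_sd * sd, mean + low_sd * sd
    return ((a > high) & (b < low)) | ((b > high) & (a < low))
```

The second condition excludes sites where both alleles show extended LD, which would be more consistent with low local recombination than with selection on one allele.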

Authors' contributions

CH, HH, and AR designed the study and participated in the data analysis. CH and AR developed the coalescent simulation algorithm. CH implemented the coalescent algorithm and wrote the manuscript with extensive input and feedback from coauthors. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank Jon Seger for his help in designing the coalescent simulation algorithm we present here. We would also like to thank Eric Wang for providing the source code used to calculate ALnLH. This work was supported in part by the Primary Children's Medical Center Foundation and the National Institute of Diabetes and Digestive and Kidney Diseases (DK069513).