Use of trait-dependent sampling designs in whole-genome association studies of sequence data can reduce total sequencing costs with modest losses of statistical efficiency. In a quantitative trait (QT) analysis of data from the Genetic Analysis Workshop 17 mini-exome for unrelated individuals in the Asian subpopulation, we investigate alternative designs that sequence only 50% of the entire cohort. In addition to a simple random sampling design, we consider extreme-phenotype designs that are of increasing interest in genetic association analysis of QTs, especially in studies concerned with the detection of rare genetic variants. We also evaluate a novel sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability. We take differential sampling of individuals with informative trait values into account by inverse probability weighting using standard survey methods which thus generalizes to the source population. In replicate 1 data, we applied the designs in association analysis of Q1 with both rare and common variants in the FLT1 gene, based on knowledge of the generating model. Using all 200 replicate data sets, we similarly analyzed Q1 and Q4 (which is known to be free of association with FLT1) to evaluate relative efficiency, type I error, and power. Simulation study results suggest that the QT-dependent selection designs generally yield greater than 50% relative efficiency compared to using the entire cohort, implying cost-effectiveness of 50% sample selection and worthwhile reduction of sequencing costs.
We assume an existing cohort of unrelated individuals, well phenotyped for a quantitative trait (QT) that is approximately normally distributed. At present, the cost of whole-genome sequencing of an entire cohort to test for association remains prohibitive, and selection of a subset of informative individuals for sequencing is a reasonable strategy to reduce costs. In the absence of relevant genetic information, such as family history or single-nucleotide polymorphism (SNP) association identified through genome-wide association studies, one might hypothesize that individuals at the extremes of the QT distribution would be more informative for the detection of association with sequence variants. For example, individuals with high QT values are more likely to carry high-risk variants, whereas those with low values do not. In contrast, taking a simple random sample from the cohort will tend to select individuals from the central part of the trait distribution and can fail to sample individuals with low-frequency variants. Because ascertainment by the phenotype needs to be taken into account, simple selection from the tails of the trait distribution complicates inference. The resulting trait distribution is bimodal, violating the usual linear regression assumptions and can produce distorted effect estimates and type I errors greater or less than nominal. In addition, the resulting sample is not representative of the original cohort, making it more difficult to design validation studies in nonselected populations. Finally, if both extremely high and extremely low QT values are “pathological” then excluding “normal” individuals would reduce the ability to detect protective variants associated with favorable QT values. As an alternative, we propose a sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability of being selected. We account for differential sampling in the association analysis by inverse probability weighting (IPW) with standard survey methods for variance estimation and hypothesis testing. The purpose of our analyses of the Genetic Analysis Workshop 17 (GAW17) mini-exome data is to compare alternative sampling strategies with respect to relative efficiency. We also examine type I error and power of hypothesis testing.
We consider the analysis of sequencing data for the entire cohort as the ideal against which to compare sampling designs in which only 50% of the cohort is sequenced. Four alternative sampling designs are summarized in Table 1. Extreme-phenotype alternative designs 3 and 4 are systematic and do not actually involve any random selection; they differ only in whether individuals with “normal” values are chosen for sequencing. Design 2 involves simple random sampling (SRS) and thus is expected to be unbiased compared to the entire cohort, but it has reduced power and is subject to sampling variation. Under design 5, which is similarly subject to sampling variation, we specify the sampling probability for each individual according to the value of their quantitative trait Yi. For individual i, we use the distance from the median of the trait distribution (Ymed) in the original cohort:
In particular, we specify the IPW weight as:
where the sampling probabilities under design 5 are:
which we refer to as 5.1 and 5.2, respectively. These probabilities are scaled by the maximum distance from the median, with:
where Ymax and Ymin are the maximum and minimum of the Yi, respectively. In design 5.2 extreme observations are selected more frequently.
Table 1. Sampling designs and analytical methods for QT association analysis
Statistical models and hypothesis testing
For common variants, it is feasible to analyze one SNP at a time. For rare variants, however, methods that accumulate multiple SNP genotypes across a region are necessary to improve the power to detect association. We calculate a rare variant score, defined as the total count of rare alleles within a gene [1,2]. Association analyses of a QT under designs 1 and 2 are conducted using standard linear regression methods with the QT as the dependent variable. To account for unequal sampling probabilities under design 5, we apply IPW in the linear regression model parameter estimation and use the R function svyglm from the survey package to estimate appropriate variances and construct a test statistic . However, because we select observations according to values of QT, in designs 3, 4, and 5 the validity of standard linear regression methods is questionable. We hypothesize that reversing the direction of the regression is a potentially more robust alternative. As described in other contexts, reversing the regression has the advantage of conditioning out the phenotype. When analyzing a common variant, we model the genotype(s) conditional on the QT value using logistic regression, and when analyzing a rare variant score, we model the count conditional on the QT value using Poisson regression.
Application to GAW17 data
We analyze quantitative traits Q1 and Q4 in replicates 1 to 200 of the GAW17 simulated unrelated mini-exome data . With knowledge of the generating model for Q1, we chose the SNPs in the FLT1 gene to compare methods under a true alternative hypothesis of genetic association and use analysis of Q4 for comparisons under the null hypothesis. To reduce heterogeneity and potential for population stratification bias, we chose the Asian subpopulations for study: Chinese + Japanese (n = 321).
We excluded SNPs in the FLT1 gene that were monomorphic within the Asian subpopulation. Of the 20 remaining SNPs, 18 were classified as rare (all minor allele frequencies [MAFs] were less than 3.0%); but according to the generating model, only 6 of these had functional variants. We calculated a rare variant score (the count of rare alleles) from the 18 rare SNPs. The total count of rare alleles per person ranged from 0 to 3 (with corresponding frequencies of 252, 54, 11, and 4). One SNP with a common functional variant, C13S523, was analyzed separately (MAF = 8.72% with no rare homozygotes).
QT-dependent sampling and QT analysis for Q1 and for Q4 were based on residuals from a linear regression on Age, Sex and Smoking status. We conducted an association analysis under each of the five sampling designs listed in Table 1 in all 200 replicates. Results are reported for replicate 1 alone to illustrate application of the methods in a single data set. We summarized distributions of regression coefficients (means and variances), standard error estimates (means), and test statistics (type I error and power) across replicates.
Analysis of Q1 with FLT1 SNPs in replicate 1
Association of Q1 with FLT1 was detected at near genome-wide significance levels for both the common SNP and the rare variant score (Table 2) using the entire Asian subpopulation. The logistic regression test statistics are nearly always smaller than the linear regression test statistics, as expected, because logistic regression is less efficient for a normally distributed trait. Under linear regression analysis, association signals for all of the 50% designs were attenuated, with the most attenuation for designs 2 and 5 and the least attenuation for extreme-phenotype designs 3 and 4. However, the regression coefficients for designs 3 and 4 appear to be inflated. Under logistic regression analysis, design 5 had the least signal attenuation, and under Poisson regression it had less signal attenuation when extreme phenotypes were selected more frequently, as in design 5.2.
Table 2. Results of regression analysis of Q1 with the FLT1 gene in replicate 1
Evaluation of test statistics in replicates 1–200
We evaluated type I error at the nominal 5% level by means of association analysis of Q4 with the FLT1 rare variant score and common causal variant. With 200 replicates, the 95% confidence interval is roughly 0.05 ± 0.03. For the rare variant score, the linear regression type I error tended to be less than 5% (Table 3), except for design 5.2, in which the empirical standard deviation was larger than the mean standard error (SE), yielding an elevated type I error. In Poisson regression, type I error was somewhat elevated, with the generalized linear model SE estimates too small on average. For the common variant (data not shown), type I error was within 5% for both linear and logistic regressions, except for design 5.2. In this case, in which the sampling probabilities had a wide range, the survey variance estimation method applied in the linear regression tended to underestimate the empirical variance, as also observed in the rare variant analysis. For logistic regression, however, design 5.2 was equivalent to designs 3 and 4.
Table 3. Simulation results for analysis of Q4 with the FLT1 rare variant score, replicates 1–200
To graphically convey power differences, we constructed pairwise comparison plots of design 1 test statistics for Q1 and the alternative designs in the 200 replicates. We also calculated mean values of the ratio of the design 1 test statistic over each of the test statistics for the alternative designs that we referred to as the mean ratio of test statistics (MRT). For the rare variant score (Figure 1), loss of power under design 2 is readily apparent in both linear and Poisson regressions. The scatter observed in designs 2, 5.1, and 5.2 can be explained partly by the additional variation associated with the random sampling component. The test statistics for extreme-phenotype designs 3 and 4 were strongly correlated with those for design 1, with high MRT values. Under Poisson regression, MRT values for designs 5.1, and 5.2 were similarly high. For the common variant (data not shown), discrepancies with design 1 test statistics were generally similar to those for the rare variant analysis, with equivalent performance for all designs under logistic regression but substantial loss of power.
Figure 1. Linear (upper panels) and Poisson (lower panels) regression test statistics in Q1 rare variant analysis. MRT, mean ratio of test statistics.
Evaluation of estimation efficiency using replicates 1–200
Under the null analysis of the rare variant score for Q4, all estimates were close to being unbiased (Table 3). For Q1, however, when extreme-phenotype values were oversampled and this was ignored (under designs 3 and 4), the mean coefficients in the linear regression were 50% larger than those under design 1 (Figure 2). In contrast, the mean coefficients for Poisson regression were similar under designs 1, 3 and 4. Notably, when we incorporated the sampling probability of each individual according to the value of their QT, as in sampling design 5, the mean coefficient was attenuated in the linear regression but close in value to that for the entire sample in Poisson regression. Similar findings were obtained for common variant linear and logistic regressions (data not shown).
Figure 2. Linear (upper panels) and Poisson (lower panels) regression coefficients in Q1 rare variant analysis
To compare designs in terms of precision of estimation, we evaluated relative efficiency (RE), defined as 100 times the ratio of the empirical variance of design 1 to that of the alternative design (Table 4). This measure does not depend on SE estimation but may be affected by differences in the expected value of the regression coefficient, so we considered the null case of Q4, in which estimates were approximately unbiased, and the alternative case of Q1. In comparison to analysis of the entire cohort, we observed only 40–50% RE under SRS design 2. Under extreme-phenotype designs 3 and 4, RE was low for linear regression but high for logistic and Poisson regression. In contrast, linear regression REs for Q1 and Q4 were highest for sampling design 5.1, with an RE of 88% for Q4. However, when extreme observations were selected with high probability, as in design 5.2, RE was high for logistic and Poisson regression and comparable to that of designs 3 and 4.
Table 4. Relative efficiencies of regression coefficients for analysis of Q1 and Q4 with FLT1, replicates 1–200
As expected, use of a 50% SRS design is not a cost-effective approach for reducing sequencing costs: The empirical relative efficiency of estimation was less than 50% in the GAW17 simulations we examined. However, designs 3–5, in which RE generally exceeded 50% when conditioning on the QT and could be substantially higher, appeared to offer a net benefit. The more extreme IPW sampling design 5.2, which is similar to designs 3 and 4, behaves similarly, at least for the regression analyses that condition on the phenotype. However, the simple IPW linear regression analysis that we applied here may be less than optimal in finite samples under the alternative hypothesis. Although RE results for design 5 are encouraging, for hypothesis testing the source of parameter underestimation under the alternative hypothesis needs to be addressed and SE estimation improved. Evaluations reported here are limited, and it is quite possible that sampling and/or weighting schemes could be improved.
In our extreme-phenotype implementation of designs 3 and 4, we used the observed QT values, in contrast to more common case-control approaches. Because the linear regression assumption of normal residuals is grossly violated, the observed exaggeration of regression coefficients is not surprising. The lack of elevated type I error may be ascribed to the use of symmetric tail selection, but for rare variant analysis there is a clear loss of efficiency. In the GAW17 simulation, in which rare and common variant effects were generated under an additive model, differences between designs 3 and 4 were trivial. In other settings, design 4 may be more robust.
Although it has not been commonly applied, Poisson regression of allele counts in the rare variant score appears to be useful as a general approach for single-gene analysis in many designs, provided that the tendency toward elevated type I error is resolved. For design 5, Poisson regression avoids the complications of weighting, because the model conditions on the QT value.
There is established literature on optimal design for continuous outcomes in experimental settings and survey samples, and sampling has been well exploited in the context of time-to-event outcomes in epidemiological designs, such as the case-cohort study . In comparison to using the entire cohort, extreme sampling incurs some loss of efficiency and power to detect association, but it is far more efficient than simple random sampling of the same number of individuals. Although our comparisons in the GAW17 data do not conform to a wholly realistic setting in terms of sample size and volume of sequencing data, they do suggest that QT-dependent sampling can be quite effective in reducing sequencing costs with developing prospects for whole-genome sequencing.
The authors declare that there are no competing interests.
YEY and SBB conceived of the study. YEY carried out the analyses and wrote the first draft of the manuscript. Both authors revised and approved the final manuscript.
This work is supported by the Canadian Institutes of Health Research and the MITACS Network of Centres of Excellence in the mathematical sciences. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.