Email updates

Keep up to date with the latest news and content from BMC Proceedings and BioMed Central.

This article is part of the supplement: Genetic Analysis Workshop 16

Open Access Proceedings

Power of selective genotyping in genome-wide association studies of quantitative traits

Chao Xing12* and Guan Xing3

Author Affiliations

1 Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA

2 McDermott Center of Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA

3 Bristol-Myers Squibb Company, 311 Pennington-Rocky Hill Road, Pennington, New Jersey 08534, USA

For all author emails, please log on.

BMC Proceedings 2009, 3(Suppl 7):S23  doi:

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1753-6561/3/S7/S23


Published:15 December 2009

© 2009 Xing and Xing; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The selective genotyping approach in quantitative genetics means genotyping only individuals with extreme phenotypes. This approach is considered an efficient way to perform gene mapping, and can be applied in both linkage and association studies. Selective genotyping in association mapping of quantitative trait loci was proposed to increase the power of detecting rare alleles of large effect. However, using this approach, only common variants have been detected. Studies on selective genotyping have been limited to single-locus scenarios. In this study we aim to investigate the power of selective genotyping in a genome-wide association study scenario, and we specifically study the impact of minor allele frequency of variants on the power of this approach. We use the Genetic Analysis Workshop 16 rheumatoid arthritis whole-genome data from the North American Rheumatoid Arthritis Consortium. Two quantitative traits, anti-cyclic citrullinated peptide and rheumatoid factor immunoglobulin M, and one binary trait, rheumatoid arthritis affection status, are used in the analysis. The power of selective genotyping is explored as a function of three parameters: sampling proportion, minor allele frequency of single-nucleotide polymorphism, and test level. The results show that the selective genotyping approach is more efficient in detecting common variants than detecting rare variants, and it is efficient only when the level of declaring significance is not stringent. In summary, the selective genotyping approach is most suitable for detecting common variants in candidate gene-based studies.

Background

The extreme sampling strategy in quantitative genetics is termed selective genotyping [1], involving phenotyping a large population of individuals, but genotyping only those individuals whose phenotypes deviate substantially from the mean. This approach was first introduced in a linkage context and was readily adopted in association analysis by treating the upper and lower subpopulations as case and control groups, respectively.

This selective genotyping approach was shown to be efficient in association mapping of quantitative trait loci (QTLs) in a statistical framework [2,3]. Since then, there have been several theoretical studies proposing alternative test statistics [4,5] or alternative sampling procedures [6], and extending this approach to haplotype analysis [7]. It is important to note that the selective genotyping approach was initially proposed to increase the power of detecting rare alleles with large effect at QTLs, but all the theoretical work cited assumed an unrealistic assumption of a large sample size even in the subpopulation of each extreme. The selective genotyping design has been applied to identify association between quantitative traits and genetic markers. For example, it was applied in studying genetic association of intelligence scores [8], attention-deficit hyperactivity disorder [9], and body mass index [10]. It is worth noting that in all of the studies mentioned previously that had positive results, the alleles of interest were common (minor allele frequency, or MAF > 0.20).

Both theoretical and applied studies to date have focused on limited number of loci in a candidate gene study scenario, and therefore their test levels were set high. They made no comparison between common and rare variants in a realistic situation in terms of sample size. In this study we investigated the empirical power of selective genotyping using the Genetic Analysis Workshop 16 (GAW16) rheumatoid arthritis (RA) genome-wide association study (GWAS) data from the North American Rheumatoid Arthritis Consortium (NARAC). The genome-wide data provided us an opportunity to investigate the power at different test levels and a full spectrum of allele frequency.

Methods

The NARAC data consists of 868 cases and 1,194 controls. The samples were genotyped using the Illumina Infinium HumanHap550 chip, and in the current study we focused on 531,689 single-nucleotide polymorphisms (SNPs) across the autosomal genome. The data were filtered through genotype call rate (>90% completeness) and the Hardy-Weinberg equilibrium test (p-value > 0.001), and 522,587 SNPs were retained. Note that we did not filter the data through a MAF criterion because the MAF was a primary variable we were investigating. To control for population stratification, we calculated the first two principal components, denoted PC1 and PC2, of 1,344 ancestry-informative SNPs [11] and adjusted for them together with sex in the subsequent association analysis. Because all individuals were genotyped, we chose a proportion of the whole sample to mimic the selective genotyping procedure. Given a quantitative trait, the upper Nth, N ∈ {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, percentiles of cases, and, correspondingly, a random N percent of controls were selected as a study dataset, i.e., there were 10 datasets generated in total. All subjects were included when N = 100. A total of three RA-related traits, including a binary variable of RA affection status and two quantitative phenotypes (anti-cyclic citrullinated peptide, or anti-CCP and immunoglobulin M, or IgM) measured only in cases were used in the analyses.

The statistical testing procedure included three steps. First, we tested association between affection status and genotype using both cases and controls by a logistic regression model, which yielded a statistic T1. Second, we tested association between a quantitative trait and genotype in the selected cases by a linear regression model, which yielded a statistic T2. Because the distribution of a quantitative trait in the selected cases might not be normally distributed, the Box-Cox power transformation [12] was always performed in the linear regression. Covariates PC1, PC2, and sex were adjusted in both tests. Third, because T1 and T2 were independent under the null hypothesis of no association between the traits and genotype, we employed the Fisher's method of combining p-values [13] to form a statistic , where P(Ti) denotes the p-value corresponding to the test statistic Ti. We concentrated on results based on T3 in the subsequent analysis because it was shown to be more powerful than the other two [2].

To investigate the empirical power of selective genotyping in GWAS, we only focused on SNPs that attained a p-value < 0.05 under control of the false-discovery rate (FDR) [14] when using the whole sample. Setting the results using the whole sample as the gold standard, we defined relative power as, at a certain test level α, the total number of significant SNPs using a subset of data (N < 100) divided by the total number of significant SNPs using the whole data (N = 100). We studied the relative power as a function of three sets of parameters: the sampling proportion N%, N ∈ {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}, the test level α = 5.0 × 100, O ∈ {-2, -3, -4, -5, -6, -7, -8}, and the SNP MAF stratified by p, p ∈ {0.05, 0.10, 0.20, 0.30, 0.40}. All analyses were performed using the software PLINK [15] and R [16].

Results

We first performed a genome scan using the whole sample, and were able to detect loci confirmed to be associated with RA such as the major histocompatibility complex on chromosome 6, PTPN22 on chromosome 1, and TRAF1-C5 on chromosome 9 (data not shown), which proved the validity of the current method. To attain a p-value < 0.05 under FDR control, the nominal p-values for anti-CCP and IgM were 2.35 × 10-6 and 2.34 × 10-6, respectively. There were 336 SNPs for anti-CCP and 337 SNPs for IgM that met these criteria. Their distributions across chromosomes and allele frequency categories are summarized in Tables 1 and 2. There were no SNPs with a MAF < 0.01 that met the criteria, and therefore the lowest threshold of MAF to stratify SNPs was 0.05.

Table 1. Genomic distribution of SNPs attaining a p-value < 0.05 under control of the false discovery rate using the whole sample

Table 2. Allele frequency distribution of SNPs attaining a p-value < 0.05 under control of the false discovery rate using the whole sample

The power of selective genotyping strategy depended on the test levels. This approach was efficient only when the level of declaring significance was not stringent. To achieve a relative power greater than 0.8, less than 30% of the sample needed to be genotyped when the test level was equal to or greater than 5 × 10-3; however, more than 50% of the sample needed to be genotyped at the test level of 5 × 10-5; when the test level was equal to or less than 5 × 10-6, the power decreased linearly with the sample size (Figure 1). Stratifying each line in Figure 1 by the MAF, we found the efficiency of selective genotyping increased as the MAF increased regardless of test levels (Figure 2). In other words, this approach was most efficient in detecting common variants. We observed similar results for both traits, and only presented those for anti-CCP.

thumbnailFigure 1. Relative power of selective genotyping at different test levels for trait anti-CCP.

thumbnailFigure 2. Relative power of selective genotyping at different test levels stratified by the minor allele frequency for trait anti-CCP.

Discussion

Selecting individuals with extreme phenotypes increases the narrow sense of heritability of the trait in the sample at the cost of decreasing sample size. Therefore, there is a trade-off between these two factors to achieve a certain level of statistical power. This approach is more efficient in detecting common variants than detecting rare variants because the effective sample size (the number of variants in a sample) decreases more rapidly for rare variants as we decrease the sampling proportion.

Selective genotyping is commonly practiced by genotyping individuals with extreme phenotypes at both upper and lower extremes; however, in the current study we only selected those at the upper extremes in cases and then made random selection in controls. This study design was based on the trait characteristics, not constraints of the data available. Both anti-CCP and IgM are usually present only in cases, i.e., both quantitative traits are zero in unaffected persons. Therefore, the study population approximates a selected sample plus a random sample from the general population, similar to the design proposed by Slatkin [2].

Conclusion

The selective genotyping approach is more efficient in detecting common variants than detecting rare variants even though it was initially proposed to detect the latter [2]. This approach is efficient only when the level of declaring significance is not stringent; therefore, it should not be employed in a GWAS, though it can be used in a replication study to confirm findings of the GWAS. In summary, the selective genotyping approach is most suitable for detecting common variants in candidate gene-based studies.

List of abbreviations used

anti-CCP: Anti-cyclic citrullinated peptide (anti-CCP); FDR: False-discovery rate; GAW16: Genetic Analysis Workshop 16; GWAS: Genome-wide association study; IgM: Immunoglobulin M; MAF: Minor allele frequency; NARAC: North American Rheumatoid Arthritis Consortium; QTLs: Quantitative trait loci; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CX conceived of the study, participated in data analysis, and drafted the manuscript. GX participated in study design, carried out data analysis, and helped to draft the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

This project is supported by NIH/NCRR grant UL1 RR024982 Pilot Award to CX. The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences.

This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.

References

  1. Lander ES, Botstein D: Mapping mendelian factors underlying quantitative traits using RFLP linkage maps.

    Genetics 1989, 121:185-199. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Slatkin M: Disequilibrium mapping of a quantitative-trait locus in an expanding population.

    Am J Hum Genet 1999, 64:1764-1772. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Van Gestel S, Houwing-Duistermaat JJ, Adolfsson R, van Duijn CM, Van Broeckhoven C: Power of selective genotyping in genetic association analyses of quantitative traits.

    Behav Genet 2000, 30:141-146. PubMed Abstract | Publisher Full Text OpenURL

  4. Wallace C, Chapman JM, Clayton DG: Improved power offered by a score test for linkage disequilibrium mapping of quantitative-trait loci by selective genotyping.

    Am J Hum Genet 2006, 78:498-504. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Huang BE, Lin DY: Efficient association mapping of quantitative trait loci with selective genotyping.

    Am J Hum Genet 2007, 80:567-576. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Chen Z, Zheng G, Ghosh K, Li Z: Linkage disequilibrium mapping of quantitative-trait loci by selective genotyping.

    Am J Hum Genet 2005, 77:661-669. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. Xiong M, Fan R, Jin L: Linkage disequilibrium mapping of quantitative trait loci under truncation selection.

    Hum Hered 2002, 53:158-172. PubMed Abstract | Publisher Full Text OpenURL

  8. Plomin R, McClearn GE, Smith DL, Vignetti S, Chorney MJ, Chorney K, Venditti CP, Kasarda S, Thompson LA, Detterman DK, Daniels J, Owen M, McGuffin P: DNA markers associated with high versus low IQ: the IQ quantitative trait loci (QTL) Project.

    Behav Genet 1994, 24:107-118. PubMed Abstract | Publisher Full Text OpenURL

  9. Cornish KM, Manly T, Savage R, Swanson J, Morisano D, Butler N, Grant C, Cross G, Bentley L, Hollis CP: Association of the dopamine transporter (DAT1) 10/10-repeat genotype with ADHD symptoms and response inhibition in a general population sample.

    Mol Psychiatry 2005, 10:686-698. PubMed Abstract | Publisher Full Text OpenURL

  10. Herbert A, Gerry NP, McQueen MB, Heid IM, Pfeufer A, Illig T, Wichmann HE, Meitinger T, Hunter D, Hu FB, Colditz G, Hinney A, Hebebrand J, Koberwitz K, Zhu X, Cooper R, Ardlie K, Lyon H, Hirschhorn JN, Laird NM, Lenburg ME, Lange C, Christman MF: A common genetic variant is associated with adult and childhood obesity.

    Science 2006, 312:279-283. PubMed Abstract | Publisher Full Text OpenURL

  11. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK, Seldin MF: Analysis and application of European genetic substructure using 300 k SNP information.

    PLoS Genet 2008, 4:e4. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Box G, Cox D: An analysis of transformations (with discussions).

    J R Stat Soc Ser B 1964, 26:211-254. OpenURL

  13. Fisher RA: Combining independent tests of significance.

    Am Statist 1948, 2:30. Publisher Full Text OpenURL

  14. Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency.

    Ann Stat 2001, 29:1165-1188. Publisher Full Text OpenURL

  15. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses.

    Am J Hum Genet 2007, 81:559-575. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  16. R Development Core Team: R: A language and environment for statistical computing. [http://www.R-project.org] webcite

    Vienna, R Foundation for Statistical Computing 2008. OpenURL