Abstract
Background
Genetic association studies are currently the primary vehicle for the identification and characterization of disease-predisposing variant(s), and they usually involve multiple single-nucleotide polymorphisms (SNPs). However, SNP-wise association tests raise concerns over multiple testing. Haplotype-based methods have the advantage of being able to account for correlations between neighbouring SNPs, yet their assumption of Hardy-Weinberg equilibrium (HWE) and potentially large number of degrees of freedom can harm their statistical power and robustness. Approaches based on principal component analysis (PCA) are preferable in this regard, but their performance varies with the method of extracting principal components (PCs).
Results
A PCA-based bootstrap confidence interval test (PCA-BCIT), which directly uses the PC scores to assess gene-disease association, was developed and evaluated for three ways of extracting PCs, i.e., cases only (CAES), controls only (COES), and cases and controls combined (CES). Extraction of PCs with COES is preferred to that with CAES and CES. Performance of the test was examined via simulations as well as analyses of data on rheumatoid arthritis and heroin addiction; the test maintained the nominal level under the null hypothesis and showed performance comparable to that of the permutation test.
Conclusions
PCA-BCIT is a valid and powerful method for assessing gene-disease association involving multiple SNPs.
Background
Genetic association studies now customarily involve multiple SNPs in candidate genes or genomic regions and have a significant role in identifying and characterizing disease-predisposing variant(s). A critical challenge in their statistical analysis is how to make optimal use of all available information. Population-based case-control studies have been very popular[1] and typically involve contingency table tests of SNP-disease association[2]. Notably, the genotype-wise Armitage trend test does not require HWE and has equivalent power to its allele-wise counterpart under HWE[3,4]. A thorny issue with testing SNPs individually for linkage disequilibrium (LD) in such a setting is multiple testing; however, methods for multiple testing adjustment that assume independence, such as Bonferroni's[5,6], are known to be conservative[7]. It is therefore necessary to seek alternative approaches which can utilize multiple SNPs simultaneously. The genotype-wise Armitage trend test is appealing since it is equivalent to the score test from logistic regression[8] of case-control status on dosage of disease-predisposing alleles of a SNP. However, testing for the effects of multiple SNPs simultaneously via logistic regression does not cure the difficulties of multicollinearity and the curse of dimensionality[9]. Haplotype-based methods have many desirable properties[10] and could possibly alleviate the problem[11-14], but the assumption of HWE is usually required and a potentially large number of degrees of freedom is involved[7,11,15-18].
It has recently been proposed that PCA can be combined with the logistic regression test (LRT)[7,16,17] in a unified framework so that PCA is conducted first to account for between-SNP correlations in a candidate region, then LRT is applied as a formal test for the association between PC scores (linear combinations of the original SNPs) and disease. Since PCs are orthogonal, this avoids multicollinearity and at the same time is less computer-intensive than haplotype-based methods. Studies have shown that PCA-LRT is at least as powerful as genotype- and haplotype-based methods[7,16,17]. Nevertheless, the power of PCA-based approaches varies with the way in which PCs are extracted, e.g., from genotype correlation, LD, or other kinds of metrics[17], and PCs can in principle be employed in frameworks other than logistic regression[7,16,17]. Here we investigate ways of extracting PCs using the genotype correlation matrix from different types of samples in a case-control study, while presenting a new approach that tests for gene-disease association by direct use of PC scores in a PCA-based bootstrap confidence interval test (PCA-BCIT). We evaluated its performance via simulations and compared it with PCA-LRT and the permutation test using real data.
Methods
PCA
Assume that p SNPs in a candidate region of interest have coded values (X_{1}, X_{2}, ⋯, X_{p}) according to a given genetic model (e.g., additive model) whose correlation matrix is C. PCA solves the following equation,
C l_{i} = λ_{i} l_{i},
where l_{i}' l_{i} = 1, i = 1,2, ⋯, p, λ_{i} is the i^{th} largest eigenvalue of C, and l_{i} = (l_{i1}, l_{i2}, ⋯, l_{ip})' are loadings of PCs. The score for an individual subject is
F_{i} = l_{i1} X_{1} + l_{i2} X_{2} + ⋯ + l_{ip} X_{p},
where cov (F_{i}, F_{j}) = 0, i ≠ j, and var(F_{1}) ≥ var(F_{2}) ≥ ⋯ ≥ var(F_{p}).
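The eigen-decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy genotypes, the function names `pca_from_correlation` and `pc_scores`, and the choice to standardise each SNP before scoring are assumptions made for illustration.

```python
import numpy as np

def pca_from_correlation(X):
    """PCA of the correlation matrix of X (subjects in rows, SNPs in columns)."""
    C = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh, because C is symmetric
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    return eigvals[order], eigvecs[:, order]  # column i holds the loadings l_i

def pc_scores(X, loadings):
    """Scores F = Z L, where Z holds the column-standardised genotypes (assumption)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return Z @ loadings

# Hypothetical toy data: 200 subjects, 5 correlated SNPs coded 0/1/2 (additive model).
rng = np.random.default_rng(0)
shared = rng.binomial(2, 0.3, size=(200, 1))
own = rng.binomial(2, 0.3, size=(200, 5))
X = np.where(rng.random((200, 5)) < 0.8, shared, own).astype(float)

eigvals, loadings = pca_from_correlation(X)
F = pc_scores(X, loadings)
explained = np.cumsum(eigvals) / eigvals.sum()  # cumulative proportion of variance
```

With standardised genotypes the sample covariance matrix of `F` is diagonal with the eigenvalues on the diagonal, which matches the zero-covariance and ordered-variance properties stated above.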
Methods of extracting PCs
Potentially, PCA can be conducted via four distinct extracting strategies (ES) using case-control data, i.e., 0. Calculate PC scores of individuals in cases and controls separately (SES), 1. Use cases only (CAES) to obtain loadings for calculation of PC scores for subjects in both cases and controls, 2. Use controls only (COES) to obtain the loadings for both groups, and 3. Use combined cases and controls (CES) to obtain the loadings for both groups. In a case-control association study, loadings calculated separately from cases and from controls are likely to have different connotations, so scores from strategy 0 are not comparable between groups and we only consider strategies 1-3 hereafter. More formally, let (X_{1}, X_{2}, ⋯, X_{p}) and (Y_{1}, Y_{2}, ⋯, Y_{p}) be p-dimensional vectors of SNPs at a given candidate region for cases and controls respectively, then we have,
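Strategy 1 (CAES):
C_{XX} l_{i} = λ_{i} l_{i},
where C_{XX} is the correlation matrix of (X_{1}, X_{2}, ⋯, X_{p}), and l_{i}' l_{i} = 1, i = 1,2, ⋯, p. The i^{th} PC for cases is calculated by
F_{i}^{X} = l_{i1} X_{1} + l_{i2} X_{2} + ⋯ + l_{ip} X_{p},
and for controls
F_{i}^{Y} = l_{i1} Y_{1} + l_{i2} Y_{2} + ⋯ + l_{ip} Y_{p}.
Strategy 2 (COES):
C_{YY} l_{i} = λ_{i} l_{i},
where C_{YY} is the correlation matrix of (Y_{1}, Y_{2}, ⋯, Y_{p}), and l_{i}' l_{i} = 1. The i^{th} PC for controls is calculated by
F_{i}^{Y} = l_{i1} Y_{1} + l_{i2} Y_{2} + ⋯ + l_{ip} Y_{p}.
And for cases, the i^{th} PC, i = 1,2, ⋯, p, is calculated by
F_{i}^{X} = l_{i1} X_{1} + l_{i2} X_{2} + ⋯ + l_{ip} X_{p}.
Strategy 3 (CES):
C l_{i} = λ_{i} l_{i},
where C is the correlation matrix obtained from the pooled data of cases and controls, and l_{i}' l_{i} = 1. The i^{th} PC of cases is calculated by
F_{i}^{X} = l_{i1} X_{1} + l_{i2} X_{2} + ⋯ + l_{ip} X_{p}.
The i^{th} PC of controls is calculated by
F_{i}^{Y} = l_{i1} Y_{1} + l_{i2} Y_{2} + ⋯ + l_{ip} Y_{p}.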
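The three strategies differ only in which sample supplies the correlation matrix, which the following sketch tries to make concrete. The function names and toy data are hypothetical, and the text does not say how genotypes are scaled before the loadings are applied; here they are standardised with the pooled mean and standard deviation, which is an assumption.

```python
import numpy as np

def loadings_of(reference):
    """Eigenvalues (descending) and loadings of the reference correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(reference, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def extract_pc_scores(cases, controls, strategy):
    """PC scores for both groups with loadings taken from one reference sample."""
    pooled = np.vstack([cases, controls])
    # CAES: cases only, COES: controls only, CES: cases and controls combined.
    reference = {"CAES": cases, "COES": controls, "CES": pooled}[strategy]
    eigvals, L = loadings_of(reference)
    # Assumption: both groups are standardised with the pooled mean and SD.
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0, ddof=1)
    return ((cases - mu) / sd) @ L, ((controls - mu) / sd) @ L, eigvals

def simulate(n, freq, rng, p=5):
    """Hypothetical toy genotypes: p correlated SNPs coded 0/1/2."""
    shared = rng.binomial(2, freq, size=(n, 1))
    own = rng.binomial(2, freq, size=(n, p))
    return np.where(rng.random((n, p)) < 0.8, shared, own).astype(float)

rng = np.random.default_rng(0)
cases, controls = simulate(150, 0.4, rng), simulate(250, 0.2, rng)
scores = {s: extract_pc_scores(cases, controls, s) for s in ("CAES", "COES", "CES")}
```

Whatever the strategy, the same loadings are applied to cases and controls, so the resulting scores are on a common scale and can be compared between groups.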
PCA-BCIT
Given a sample of N cases and M controls with p-SNP genotypes (X_{1}, X_{2}, ⋯, X_{N})^{T}, (Y_{1}, Y_{2}, ⋯, Y_{M})^{T}, and X_{i} = (X_{1i}, X_{2i}, ⋯, X_{pi}) for the i^{th} case, Y_{i} = (Y_{1i}, Y_{2i}, ⋯, Y_{pi}) for the i^{th} control, a PCA-BCIT is furnished in three steps:
Step 1: Sampling
Replicate samples of cases and controls, (X_{1}^{(b)}, X_{2}^{(b)}, ⋯, X_{N}^{(b)})^{T} and (Y_{1}^{(b)}, Y_{2}^{(b)}, ⋯, Y_{M}^{(b)})^{T}, b = 1,2, ⋯, B (B = 1000), are drawn with replacement separately from the observed cases and the observed controls.
Step 2: PCA
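For each replicate sample obtained at Step 1, PCA is conducted and a given number of PCs retained with a threshold of 80% explained variance for all three strategies[16], expressed as F_{k}^{X(b)} for cases and F_{k}^{Y(b)} for controls, where k indexes the retained PCs.
Step 3: PCA-BCIT
3a) For each replicate, the mean of the k^{th} PC in cases is calculated by
\bar{F}_{k}^{X(b)} = (1/N) Σ_{i=1}^{N} F_{ki}^{X(b)},
and that of the k^{th} PC in controls is calculated by
\bar{F}_{k}^{Y(b)} = (1/M) Σ_{i=1}^{M} F_{ki}^{Y(b)}.
3b) Given confidence level (1 - α), the confidence interval of the mean of the k^{th} PC in cases is estimated by the percentile method, with form
( \bar{F}_{k,α/2}^{X}, \bar{F}_{k,1-α/2}^{X} ),
where \bar{F}_{k,α/2}^{X} is the 100(α/2)^{th} percentile of \bar{F}_{k}^{X(b)}, b = 1,2, ⋯, B, and \bar{F}_{k,1-α/2}^{X} is the 100(1 - α/2)^{th} percentile.
The confidence interval of the mean of the k^{th} PC in controls is estimated by
( \bar{F}_{k,α/2}^{Y}, \bar{F}_{k,1-α/2}^{Y} ),
where \bar{F}_{k,α/2}^{Y} is the 100(α/2)^{th} percentile of \bar{F}_{k}^{Y(b)}, b = 1,2, ⋯, B, and \bar{F}_{k,1-α/2}^{Y} is the 100(1 - α/2)^{th} percentile.
3c) Confidence intervals of cases and controls are compared. The null hypothesis is rejected if the two intervals do not overlap, that is, if the mean PC scores of cases and controls are statistically different[19], indicating the candidate region is significantly associated with disease at level α. Otherwise, the candidate region is not significantly associated with disease at level α.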
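The three steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: it tests one PC at a time (index `k`), standardises genotypes with the pooled mean and SD of each replicate, and aligns the arbitrary sign of each replicate's eigenvector with the loadings of the observed sample, a detail the text does not address. All function names and the toy data are hypothetical, and a SNP that becomes monomorphic in a replicate is not handled.

```python
import numpy as np

def sorted_loadings(reference):
    """Eigenvalues (descending) and loadings of the genotype correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(reference, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def n_retained(eigvals, threshold=0.80):
    """Smallest number of PCs whose cumulative explained variance reaches threshold."""
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

def reference_sample(cases, controls, strategy):
    pooled = np.vstack([cases, controls])
    return {"CAES": cases, "COES": controls, "CES": pooled}[strategy]

def mean_scores(cases, controls, strategy, k, anchor):
    """Step 3a: mean of the k-th PC score in cases and in controls."""
    l_k = sorted_loadings(reference_sample(cases, controls, strategy))[1][:, k]
    if l_k @ anchor < 0:      # assumption: resolve the arbitrary eigenvector sign
        l_k = -l_k
    pooled = np.vstack([cases, controls])
    mu, sd = pooled.mean(axis=0), pooled.std(axis=0, ddof=1)  # assumption: pooled scaling
    return (((cases - mu) / sd) @ l_k).mean(), (((controls - mu) / sd) @ l_k).mean()

def pca_bcit(cases, controls, strategy="COES", k=0, B=1000, alpha=0.05, seed=0):
    """Return percentile CIs for cases and controls and the reject decision."""
    rng = np.random.default_rng(seed)
    anchor = sorted_loadings(reference_sample(cases, controls, strategy))[1][:, k]
    N, M = len(cases), len(controls)
    means = np.empty((B, 2))
    for b in range(B):        # Step 1: resample cases and controls separately
        xb = cases[rng.integers(0, N, size=N)]
        yb = controls[rng.integers(0, M, size=M)]
        means[b] = mean_scores(xb, yb, strategy, k, anchor)  # Steps 2 and 3a
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]             # Step 3b
    ci_cases = np.percentile(means[:, 0], q)
    ci_controls = np.percentile(means[:, 1], q)
    # Step 3c: reject when the two intervals do not overlap.
    reject = bool(ci_cases[0] > ci_controls[1] or ci_controls[0] > ci_cases[1])
    return ci_cases, ci_controls, reject

def simulate(n, freq, rng, p=5):
    """Hypothetical toy genotypes: p correlated SNPs coded 0/1/2."""
    shared = rng.binomial(2, freq, size=(n, 1))
    own = rng.binomial(2, freq, size=(n, p))
    return np.where(rng.random((n, p)) < 0.8, shared, own).astype(float)

rng = np.random.default_rng(1)
controls, cases = simulate(300, 0.2, rng), simulate(300, 0.5, rng)
ci_cases, ci_controls, reject = pca_bcit(cases, controls, strategy="COES", B=500)
```

In practice `n_retained` could be applied to the eigenvalues of the observed sample to decide how many PCs to examine, with `pca_bcit` called once for each retained `k`.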
Simulation studies
We examine the performance of PCA-BCIT through simulations with data from the North American Rheumatoid Arthritis (RA) Consortium (NARAC) (868 cases and 1194 controls)[20], taking advantage of the fact that association between protein tyrosine phosphatase non-receptor type 22 (PTPN22) and the development of RA has been established[21-24]. Nine SNPs have been selected from the PTPN22 region (114157960-114215857), and most of the SNPs are within the same LD block (Figure 1). Females are more predisposed (73.85% of the cases) and are used in our simulation to ensure homogeneity. The corresponding steps for the simulation are as follows.
Figure 1. LD (r^{2}) among nine PTPN22 SNPs. The nine PTPN22 SNPs are rs971173, rs1217390, rs878129, rs11811771, rs11102703, rs7545038, rs1503832, rs12127377, rs11485101. The triangle marks a single LD block within this region: (rs878129, rs11811771, rs11102703, rs7545038, rs1503832, rs12127377, rs11485101).
Step 1: Sampling
The observed genotype frequencies in the study sample are taken to be their true frequencies in populations of infinite size. Replicate samples of cases and controls of a given size (N, N = 100, 200, ⋯, 1000) are generated whose estimated genotype frequencies are expected to be close to the true population frequencies while both the allele frequencies and LD structure are maintained. Under the null hypothesis, replicate cases and controls are both sampled with replacement from the controls. Under the alternative hypothesis, replicate cases and controls are sampled with replacement from the cases and controls respectively.
Step 2: PCA-BCIT
For each replicate sample, PCA-BCITs are conducted through the three strategies of extracting PCs as outlined above on association between PC scores and disease (RA).
Step 3: Evaluating performance of PCA-BCITs
Repeat steps 1 and 2 K times (K = 1000) under both the null and alternative hypotheses, and obtain the frequencies (P_{α}) of rejecting the null hypothesis at level α (α = 0.05).
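The simulation loop above can be sketched as a generic rejection-frequency estimator. To keep the example fast and self-contained, the test plugged in here is a simple two-sample z-test on total allele dosage, which is only a placeholder and not PCA-BCIT; any function with the same signature could be substituted. The names `rejection_rate`, `dosage_z_test` and `simulate`, and the toy data, are hypothetical.

```python
import math
import numpy as np

def rejection_rate(cases, controls, test, n, K=1000, null=True, alpha=0.05, seed=0):
    """Fraction of K replicate data sets in which `test` rejects at level alpha."""
    rng = np.random.default_rng(seed)
    # Under the null hypothesis both groups are resampled from the controls.
    source_cases = controls if null else cases
    rejections = 0
    for _ in range(K):
        rep_cases = source_cases[rng.integers(0, len(source_cases), size=n)]
        rep_controls = controls[rng.integers(0, len(controls), size=n)]
        rejections += bool(test(rep_cases, rep_controls, alpha))
    return rejections / K

def dosage_z_test(cases, controls, alpha):
    """Placeholder test (not PCA-BCIT): two-sample z-test on total allele dosage."""
    x, y = cases.sum(axis=1), controls.sum(axis=1)
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value < alpha

def simulate(n, freq, rng, p=5):
    """Hypothetical toy genotypes: p correlated SNPs coded 0/1/2."""
    shared = rng.binomial(2, freq, size=(n, 1))
    own = rng.binomial(2, freq, size=(n, p))
    return np.where(rng.random((n, p)) < 0.8, shared, own).astype(float)

rng = np.random.default_rng(2)
cases, controls = simulate(600, 0.35, rng), simulate(800, 0.2, rng)
type_one_error = rejection_rate(cases, controls, dosage_z_test, n=200, K=400, null=True)
power = rejection_rate(cases, controls, dosage_z_test, n=200, K=400, null=False)
```

The returned fraction plays the role of P_{α}: with `null=True` it estimates the type I error rate, and with `null=False` it estimates power for the chosen replicate size `n`.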
Applications
PCA-BCITs are applied to both the NARAC data on PTPN22 in 1493 females (641 cases and 852 controls) described above and a data set containing nine SNPs near the μ-opioid receptor gene (OPRM1) in Han Chinese from Shanghai (91 cases and 245 controls) with the endophenotype of heroin-induced positive responses on first use[25]. There are two LD blocks in the region of OPRM1 (Figure 2).
Figure 2. LD (r^{2}) among nine OPRM1 SNPs. The nine OPRM1 SNPs are rs1799971, rs510769, rs696522, rs1381376, rs3778151, rs2075572, rs533586, rs550014, rs658156. The triangles mark the LD block 1 (rs696522, rs1381376, rs3778151) and LD block 2 (rs550014, rs658156).
Results
Simulation study
The performance of PCA-BCIT is shown in Table 1 for the three strategies given a range of sample sizes. It can be seen that strategies 2 and 3 both have type I error rates approaching the nominal level (α = 0.05), but those from strategy 1 deviate heavily. When the sample size is larger than 800, the power of PCA-BCIT is above 0.8, and strategies 2 and 3 slightly outperform strategy 1.
Table 1. Performance of PCA-BCIT at level 0.05 with strategies 1-3†
Applications
For the NARAC data, the Armitage trend test reveals none of the SNPs to be in significant association with RA after Bonferroni correction (Table 2), but the results of PCA-BCIT with strategies 2 and 3 show that the first PC extracted in the region of PTPN22 is significantly associated with RA. The results are similar to those from the permutation test (Table 3).
Table 2. Armitage trend test on nine PTPN22 SNPs and RA susceptibility
Table 3. PCA-BCIT, PCA-LRT and permutation test on real data
For the OPRM1 data, the sample characteristics are comparable between cases and controls (Table 4), and three SNPs (rs696522, rs1381376 and rs3778151) show significant association with the endophenotype (Table 5). The results of PCA-BCIT with strategies 2 and 3 and of the permutation test are all significant at level α = 0.01. In contrast, the result from PCA-LRT is not significant at level α = 0.05 with strategy 2 (Table 3). The apparent separation of cases and controls is shown in Figure 3 for PCA-BCIT with strategy 3, suggesting an intuitive interpretation.
Table 4. Sample characteristics of heroin-induced positive responses on first use
Table 5. Armitage trend tests on nine OPRM1 SNPs and heroin-induced positive responses on first use
Figure 3. Real data analyses by PCA-BCIT with strategy 3 and confidence level 0.95. The horizontal axis denotes studies and the vertical axis mean(PC1), the statistic used to calculate confidence intervals for cases and controls. PCA-BCITs with strategy 3 were significant at confidence level 0.95.
Discussion
In this study, a PCA-based bootstrap confidence interval test[19,26-28] (PCA-BCIT) is developed to study gene-disease association using all SNPs genotyped in a given region. There are several attractive features of PCA-based approaches. First of all, they are at least as powerful as genotype- and haplotype-based methods[7,16,17]. Secondly, they are able to capture LD information between correlated SNPs and are easy to compute, without the need to consider multicollinearity and multiple testing. Thirdly, BCIT integrates point estimation and hypothesis testing as a single inferential statement of great intuitive appeal[29] and does not rely on distributional assumptions about the statistic used to calculate the confidence interval[19,26-29].
While there have been several different but closely related forms of bootstrap confidence interval calculations[28], we focus on percentiles of the asymptotic distribution of PCs for given confidence levels to estimate the confidence interval. PCA-BCIT is a data-learning method[29], and is shown to be valid and powerful for a sufficiently large number of replicates in our study. Our investigation involving three strategies of extracting PCs reveals that strategy 1 is invalid, while strategies 2 and 3 are acceptable. From analyses of real data we find that PCA-BCIT compares favourably with PCA-LRT and the permutation test. A practical advantage of PCA-BCIT is that it offers an intuitive measure of difference between cases and controls by using the set of SNPs (PC scores) in a candidate region (Figure 3). As extraction of PCs through COES is more in line with the principle of a case-control study, it will be our method of choice given that its performance is comparable with that of CES. Nevertheless, PCA-BCIT has the limitation that it does not directly handle covariates as is usually done in a regression model.
Conclusions
PCA-BCIT is both a valid and a powerful PCA-based method which captures multi-SNP information in the study of gene-disease association. While extraction of PCs based on CAES, COES and CES all achieved good power, CAES did not maintain the nominal type I error rate, and COES appears to be the most appropriate to use.
Abbreviations
SNP: single-nucleotide polymorphism; HWE: Hardy-Weinberg equilibrium; LD: linkage disequilibrium; LRT: logistic regression test; PCA: principal component analysis; PC: principal component; ES: extracting strategy; SES: separate case and control extracting strategy (strategy 0); CAES: case-based extracting strategy (strategy 1); COES: control-based extracting strategy (strategy 2); CES: combined case and control extracting strategy (strategy 3); BCIT: bootstrap confidence interval test.
Authors' contributions
QQP, JHZ, and FZX conceptualized the study, acquired and analyzed the data and prepared the manuscript. All authors approved the final manuscript.
Acknowledgements
This work was supported by a grant from the National Natural Science Foundation of China (30871392). We wish to thank Dr. Dandan Zhang (Fudan University) and NARAC for supplying us with the data, and the Associate Editor and anonymous referees for comments which greatly improved the manuscript. Special thanks go to the referee for the insightful comment that extraction of PCs with controls is in line with case-control principles.
References

1. Morton NE, Collins A: Tests and estimates of allelic association in complex inheritance. Proc Natl Acad Sci USA 1998, 95:11389-11393.
2. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics 1997, 53:1253-1261.
3. Gordon D, Haynes C, Yang Y, Kramer PL, Finch SJ: Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genet Epidemiol 2007, 31:853-870.
4. Slager SL, Schaid DJ: Case-control studies of genetic markers: Power and sample size approximations for Armitage's test for trend. Human Heredity 2001, 52:149-153.
5. Sidak Z: On Multivariate Normal Probabilities of Rectangles: Their Dependence on Correlations.
6. Sidak Z: On Probabilities of Rectangles in Multivariate Student Distributions: Their Dependence on Correlations. The Annals of Mathematical Statistics 1971, 42:169-175.
7. Zhang FY, Wagener D: An approach to incorporate linkage disequilibrium structure into genomic association analysis. Journal of Genetics and Genomics 2008, 35:381-385.
8. Balding DJ: A tutorial on statistical methods for population association studies. Nature Reviews Genetics 2006, 7:781-791.
9. Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN: Nonparametric tests of association of multiple genes with human disease. American Journal of Human Genetics 2005, 76:780-793.
10. Becker T, Schumacher J, Cichon S, Baur MP, Knapp M: Haplotype interaction analysis of unlinked regions. Genetic Epidemiology 2005, 29:313-322.
11. Chapman JM, Cooper JD, Todd JA, Clayton DG: Detecting disease associations due to linkage disequilibrium using haplotype tags: A class of tests and the determinants of statistical power. Human Heredity 2003, 56:18-31.
12. Epstein MP, Satten GA: Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 2003, 73:1316-1329.
13. Fallin D, Cohen A, Essioux L, Chumakov I, Blumenfeld M, Cohen D, Schork NJ: Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer's disease. Genome Research 2001, 11:143-151.
14. Stram DO, Pearce CL, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and EM estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Human Heredity 2003, 55:179-190.
15. Clayton D, Chapman J, Cooper J: Use of unphased multilocus genotype data in indirect association studies. Genetic Epidemiology 2004, 27:415-428.
16. Gauderman WJ, Murcray C, Gilliland F, Conti DV: Testing association between disease and multiple SNPs in a candidate gene. Genetic Epidemiology 2007, 31:383-395.
17. Oh S, Park T: Association tests based on the principal-component analysis. BMC Proc 2007, 1(Suppl 1):S130.
18. Wang T, Elston RC: Improved power by use of a weighted score test for linkage disequilibrium mapping. American Journal of Human Genetics 2007, 80:353-360.
19. Heller G, Venkatraman ES: Resampling procedures to compare two survival distributions in the presence of right-censored data. Biometrics 1996, 52:1204-1213.
20. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LRL, et al.: TRAF1-C5 as a risk locus for rheumatoid arthritis - A genomewide study. New England Journal of Medicine 2007, 357:1199-1209.
21. Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, et al.: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet 2004, 75:330-337.
22. Carlton VEH, Hu XL, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, et al.: PTPN22 genetic variation: Evidence for multiple variants associated with rheumatoid arthritis. American Journal of Human Genetics 2005, 77:567-581.
23. Kallberg H, Padyukov L, Plenge RM, Ronnelid J, Gregersen PK, Helm-van Mil AHM, Toes REM, Huizinga TW, Klareskog L, Alfredsson L, et al.: Gene-gene and gene-environment interactions involving HLA-DRB1, PTPN22, and smoking in two subsets of rheumatoid arthritis. American Journal of Human Genetics 2007, 80:867-875.
24. Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, et al.: Replication of putative candidate-gene associations with rheumatoid arthritis in > 4,000 samples from North America and Sweden: Association of susceptibility with PTPN22, CTLA4, and PADI4. American Journal of Human Genetics 2005, 77:1044-1060.
25. Zhang D, Shao C, Shao M, Yan P, Wang Y, Liu Y, Liu W, Lin T, Xie Y, Zhao Y, et al.: Effect of mu-opioid receptor gene polymorphisms on heroin-induced subjective responses in a Chinese population. Biol Psychiatry 2007, 61:1244-1251.
26. Carpenter J: Test Inversion Bootstrap Confidence Intervals. Journal of the Royal Statistical Society Series B (Statistical Methodology) 1999, 61:159-172.
27. Davison AC, Hinkley DV, Young GA: Recent developments in bootstrap methodology. Statistical Science 2003, 18:141-157.
28. DiCiccio TJ, Efron B: Bootstrap confidence intervals. Statistical Science 1996, 11:189-212.
29. Efron B: Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 1979, 7:1-26.