Abstract
Background
In genetic association studies, especially genome-wide association studies (GWAS), gene- or region-based methods have become increasingly popular for detecting associations between multiple SNPs and diseases (or traits). Kernel principal component analysis combined with logistic regression test (KPCALRT) has been successfully used in classifying gene expression data. Nevertheless, the purpose of an association study is to detect the correlation between genetic variations and disease rather than to classify the sample, and genomic data are categorical rather than numerical. Recently, a kernel-based logistic regression model has been proposed for association studies; it projects the nonlinear original SNP data into a linear feature space, but it is still affected by multicollinearity between the projections, which may lead to loss of power. We therefore propose a KPCALRT model to avoid this multicollinearity.
Results
Simulation results showed that KPCALRT was always more powerful than principal component analysis combined with logistic regression test (PCALRT) at different sample sizes, significance levels and relative risks, especially at the gene-wide level (1E-5) and lower relative risks (RR = 1.2, 1.3). Application to four gene regions of the rheumatoid arthritis (RA) data from Genetic Analysis Workshop 16 (GAW16) indicated that KPCALRT performed better than the single-locus test and PCALRT.
Conclusions
KPCALRT is a valid and powerful gene- or region-based method for the analysis of GWAS data sets, especially under lower relative risks and lower significance levels.
Background
It is commonly believed that genetic factors play an important role in the etiology of common diseases and traits. With rapid improvements in high-throughput genotyping techniques and the growing number of available markers, genome-wide association studies (GWAS) have become a promising approach for identifying common genetic variants. The first successful wave of GWAS has reproducibly identified hundreds of associations of common genetic variants with more than 100 diseases and traits, including age-related macular degeneration [1], Parkinson's disease [2] and type 2 diabetes [3,4]. Recently, GWAS meta-analysis, which combines the evidence for association from individual studies with appropriate weights, has become an increasingly important method for identifying new loci of complex diseases and traits [5-7]. Although this has improved our understanding of the genetic basis of these complex diseases and traits, and has provided valuable clues to their allelic architecture, there are still many analytic and interpretation challenges in GWAS [8-11]. For both GWAS and GWAS meta-analysis, it is customary to run single-locus association tests across the whole genome to identify causal or associated single nucleotide polymorphisms (SNPs) with strong marginal effects on diseases or traits. However, such a SNP-by-SNP analysis leads to a heavy computational burden and the well-known multiplicity problem, with a highly inflated risk of type I error and a decreased ability to detect modest effects. One way to deal with these and related challenges is to consider higher-level units of analysis, such as genes or regions.
Several studies have shown that treating the gene or region, instead of the SNP, as the unit of association may alleviate the problems of intensive computation and multiple testing [8,10], lead to more stable results and higher interpretability [12,13], serve as a good standard for subsequent replication studies [14], and suit network (or pathway) approaches for interpreting the findings from GWAS [15].
However, once the SNPs have been allocated to genes or regions, the issue of how to evaluate genetic association for each candidate gene or genome region remains. To examine whether multiple SNPs in a candidate gene or region are associated with a disease or trait, several multi-marker analysis methods have been developed, including haplotype-based methods [16,17], Hotelling's T^{2} test [18,19], principal component analysis (PCA)-based methods [20-23], and P-value combination methods [11,24,25]. In particular, the PCA-based methods have been shown to be as powerful as, or more powerful than, standard joint SNP or haplotype-based tests [23]. PCA can capture linkage disequilibrium information within a candidate gene or region, but is less computationally demanding than haplotype-based analysis. It also avoids multicollinearity between SNPs, because the principal components (PCs) are orthogonal.
However, one cannot assert that linear PCA will always detect all structure in a given genomic data set. If the genomic data contain nonlinear structure, PCA will not be able to detect it [26]. Furthermore, it is well known that PCA cannot accurately represent non-Gaussian distributions. To date, many researchers have introduced appropriate nonlinear processes into PCA and developed nonlinear PCA algorithms [27-31]. Among these modified PCA methods, kernel PCA (KPCA) is the best known and most widely adopted [27-30]. It has several advantages over other methods: (1) it does not require nonlinear optimization, but only the solution of an eigenvalue problem; (2) it provides a better understanding of what kind of nonlinear features are extracted: they are principal components in a feature space which is fixed a priori by choosing a kernel function; (3) it comprises a fairly general class of nonlinearities through the possibility of using different kernels.
KPCA has been studied intensively in recent years in the fields of machine learning, face recognition and data classification, and has been reported to be successful in many applications [27-30]. In particular, for classifying tumour samples, Liu et al proposed combining KPCA with logistic regression test (KPCALRT) using gene expression data [30]. Nevertheless, the purpose of an association study is to detect the correlation between genetic variations and disease rather than to classify the sample, and genomic data are categorical rather than numerical. Recently, Wu et al proposed a kernel-based logistic regression model to detect the association between multiple SNPs and disease by projecting the nonlinear original SNP data into a linear feature space [32]. However, the logistic model is still affected by multicollinearity between the projections, which may lead to loss of power. We therefore propose a KPCALRT model to avoid this multicollinearity. The algorithm first conducts KPCA to account for the nonlinear relationship between SNPs in a candidate region, and then applies LRT to test the association between the kernel principal component (KPC) scores and disease. Simulations and a real data application are conducted to evaluate its performance in association studies.
Methods
PCA
As a traditional multivariate statistical technique, PCA has been widely applied in genetic analysis, both for reduction of redundant information and for interpretation of multiple SNPs. The basic idea of PCA is to represent the data efficiently by decomposing the data space into a linear combination of a small collection of bases consisting of orthogonal axes that maximally decorrelate the data. Assume that the M SNPs in a candidate gene or specific genome region of interest have coded values {x_{i} ∈ R^{M} | i = 1, 2, ..., N}, where N is the sample size, under a given genetic model (an additive model is assumed here). PCA diagonalizes the covariance matrix of the centered observations x_{i} (i.e. x_{1} + x_{2} + ... + x_{N} = 0),

$$C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^{T}. \qquad (1)$$

To do this, one has to solve the following eigenvalue problem:

$$\lambda \nu = C \nu, \qquad (2)$$

where ν are the eigenvectors of C, and λ are the corresponding eigenvalues. As

$$C \nu = \frac{1}{N} \sum_{i=1}^{N} (x_i \cdot \nu)\, x_i,$$

all solutions ν with λ ≠ 0 must lie in the span of x_{1}, ..., x_{N}, where the dot product of two vectors a = (a_{1}, a_{2}, ..., a_{N}) and b = (b_{1}, b_{2}, ..., b_{N}) is defined as

$$a \cdot b = \sum_{i=1}^{N} a_i b_i.$$
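As a minimal numerical sketch of this eigen-decomposition, the following NumPy code centers an additive-coded genotype matrix, forms the covariance matrix with the 1/N convention, and projects the samples onto the eigenvectors. The toy genotype matrix, the function name `pca_scores` and the choice of `numpy.linalg.eigh` are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def pca_scores(X):
    """Return eigenvalues (descending) and PC scores of an N x M matrix X."""
    Xc = X - X.mean(axis=0)                  # center each SNP column
    C = Xc.T @ Xc / Xc.shape[0]              # M x M covariance matrix (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # N x M matrix of PC scores
    return eigvals, scores

# Hypothetical toy data: 200 individuals, 5 SNPs coded 0/1/2 (additive model).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5)).astype(float)
eigvals, scores = pca_scores(X)
```

Because the eigenvectors are orthogonal, the columns of `scores` are mutually uncorrelated, which is the property that removes multicollinearity from a downstream regression.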
KPCA
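Given the observations, we first map the data nonlinearly into a feature space F by

$$\Phi : R^{M} \to F, \quad x \mapsto \Phi(x).$$

Again, we make the assumption that our data mapped into feature space, Φ(x_{1}), ..., Φ(x_{N}), are centered, i.e. Φ(x_{1}) + ... + Φ(x_{N}) = 0. To do PCA for the covariance matrix

$$\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i)\, \Phi(x_i)^{T}, \qquad (3)$$

we have to find eigenvalues λ ≥ 0 and eigenvectors ν ∈ F\{0} satisfying

$$\lambda \nu = \bar{C} \nu.$$

By the same argument as above, the solutions ν lie in the span of Φ(x_{1}), ..., Φ(x_{N}). This implies that we may consider the equivalent equation

$$\lambda\, (\Phi(x_i) \cdot \nu) = (\Phi(x_i) \cdot \bar{C} \nu) \quad \text{for all } i = 1, \ldots, N, \qquad (4)$$

and that there exist coefficients α_{i} (i = 1, ..., N) such that

$$\nu = \sum_{i=1}^{N} \alpha_i\, \Phi(x_i). \qquad (5)$$

Substituting (3) and (5) into (4), we arrive at

$$N \lambda K \alpha = K^{2} \alpha, \qquad (6)$$

where α denotes the column vector with entries α_{1}, ..., α_{N}, and K is a symmetric N × N matrix defined by

$$K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)). \qquad (7)$$

It has a set of eigenvectors which spans the whole space, thus

$$N \lambda \alpha = K \alpha \qquad (8)$$

gives all solutions α of equation (6).
Let λ_{1} ≤ λ_{2} ≤ ... ≤ λ_{N} denote the eigenvalues of the matrix K, with α^{1}, α^{2}, ..., α^{N} the corresponding complete set of eigenvectors, and let λ_{p} be the first nonzero eigenvalue. We normalize the solutions α^{p}, ..., α^{N} by requiring that the corresponding vectors in F be normalized, i.e. ν^{k} · ν^{k} = 1 for all k = p, p + 1, ..., N. Based on (5), (6) and (8), this translates into

$$1 = \sum_{i,j=1}^{N} \alpha_i^{k} \alpha_j^{k}\, (\Phi(x_i) \cdot \Phi(x_j)) = (\alpha^{k} \cdot K \alpha^{k}) = \lambda_k\, (\alpha^{k} \cdot \alpha^{k}). \qquad (9)$$

To extract principal components, we need to compute projections onto the eigenvectors ν^{k} in F. Suppose x is the SNP set of an individual within the previously defined gene or genome region, with an image Φ(x) in F; then

$$(\nu^{k} \cdot \Phi(x)) = \sum_{i=1}^{N} \alpha_i^{k}\, (\Phi(x_i) \cdot \Phi(x)) \qquad (10)$$

are its nonlinear principal components corresponding to Φ.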
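The procedure can be sketched numerically as follows: build the kernel matrix, center it, solve its eigenvalue problem, rescale each eigenvector so that the corresponding feature-space eigenvector has unit length, and project the samples. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the toy data, the bandwidth `sigma=2.0`, the RBF parametrization exp(-||x - y||^2 / (2σ^2)) and the eigenvalue cut-off `tol` are all hypothetical choices, and the kernel matrix is centered explicitly because real data are not centered in feature space.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); this parametrization is assumed."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def kpca_scores(K, tol=1e-8):
    """Feature-space eigenvalues and kernel PC scores from an N x N kernel matrix."""
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                      # center the mapped data in feature space
    mu, A = np.linalg.eigh(Kc)          # K alpha = mu alpha, where mu = N * lambda
    order = np.argsort(mu)[::-1]
    mu, A = mu[order], A[:, order]
    keep = mu > tol * mu[0]             # drop numerically zero eigenvalues
    mu, A = mu[keep], A[:, keep]
    A = A / np.sqrt(mu)                 # rescale so that mu * (alpha . alpha) = 1
    scores = Kc @ A                     # projections of the training samples
    return mu / N, scores

# Hypothetical toy data: 200 individuals, 5 SNPs coded 0/1/2.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 5)).astype(float)
lam_rbf, kpc = kpca_scores(rbf_kernel(X, X, sigma=2.0))
lam_lin, pc = kpca_scores(X @ X.T)      # linear kernel: reproduces ordinary PCA
```

With the linear kernel, the nonzero eigenvalues in `lam_lin` coincide with the eigenvalues of the ordinary covariance matrix, which gives a quick sanity check of an implementation.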
Note that neither (7) nor (10) requires Φ(x_{i}) in explicit form; the images are only needed in dot products. We are therefore able to use kernel functions to compute these dot products without actually performing the map Φ: for some choices of a kernel k(x_{i}, x_{j}), it can be shown by methods of functional analysis that there exists a map Φ into some dot product space F (possibly of infinite dimension) such that k(x_{i}, x_{j}) computes the dot product in F. This property is often called the "kernel trick" in the literature.
Theoretically, a proper function can be created for each data set based on Mercer's theorem of functional analysis [29]. The most common kernel functions include the linear kernel, polynomial kernel, radial basis function (RBF) kernel, sigmoid kernel [30], IBS kernel and weighted IBS kernel [32]. In particular, KPCA with a linear kernel is the same as standard linear PCA. It is worth noting that, in general, the above kernel functions show similar performance if appropriate parameters are chosen. In the present work, we chose the RBF kernel owing to its flexibility in choosing the associated parameter [33].
There are two widely used approaches for selecting the parameters of a given kernel function. The first chooses a series of candidate values for the kernel parameter empirically, runs the learning algorithm with each candidate value, and finally assigns to the kernel parameter the value that gives the best performance. The second is cross-validation. However, both approaches are time-consuming and computationally burdensome [34]. For the RBF kernel applied in the present study, a popular way of choosing the bandwidth parameter σ is to set it to the median of all pairwise Euclidean distances ||x_{i} - x_{j}||, 1 ≤ i < j ≤ N, in the set {x_{k} ∈ R^{M} | k = 1, 2, ..., N} [35-37].
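To make the kernel choices concrete, here is a short sketch of four of the kernels named above, written for genotype vectors coded 0/1/2. The parametrizations are assumptions: the polynomial kernel is taken as (x·y + c)^d, the RBF kernel as exp(-||x - y||^2 / (2σ^2)) (other software may parametrize it differently), and the IBS kernel as the proportion of alleles shared identical by state. The function names and example vectors are hypothetical.

```python
import math

def linear_kernel(x, y):
    """Plain dot product; KPCA with this kernel reduces to linear PCA."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree."""
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def ibs_kernel(x, y):
    """Proportion of alleles shared identical by state (genotypes coded 0/1/2)."""
    return sum(2 - abs(a - b) for a, b in zip(x, y)) / (2.0 * len(x))

# Two hypothetical individuals genotyped at four SNPs.
g1 = [0, 1, 2, 1]
g2 = [0, 2, 0, 1]
print(linear_kernel(g1, g2), polynomial_kernel(g1, g2),
      rbf_kernel(g1, g2), ibs_kernel(g1, g2))
```

Unlike the RBF kernel, the IBS kernel has no tuning parameter, which is one reason it is attractive for categorical genotype data.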
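The median heuristic is simple enough to state in a few lines. The sketch below uses only the Python standard library; the function name and the four toy genotype vectors are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import itertools
import math
import statistics

def median_heuristic_sigma(samples):
    """Median of all pairwise Euclidean distances ||x_i - x_j||, i < j."""
    dists = [math.dist(a, b) for a, b in itertools.combinations(samples, 2)]
    return statistics.median(dists)

# Four hypothetical individuals genotyped at three SNPs (additive coding).
genotypes = [[0, 1, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]
sigma = median_heuristic_sigma(genotypes)
```

The cost is quadratic in the sample size, so for large N one might compute the median on a random subsample of individuals instead; that shortcut is a suggestion here, not something the text prescribes.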
Models
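To test the associations between multiple SNPs and disease, the PCALRT and KPCALRT models are defined as follows:

$$\mathrm{logit}\, P(D = 1 \mid PC_1, \ldots, PC_L) = \beta_0 + \sum_{l=1}^{L} \beta_l\, PC_l,$$

$$\mathrm{logit}\, P(D = 1 \mid KPC_1, \ldots, KPC_L) = \beta_0 + \sum_{l=1}^{L} \beta_l\, KPC_l,$$

where D denotes disease status (1 for cases, 0 for controls), the β are regression coefficients, and the PCs and KPCs are the first L linear and nonlinear (kernel) principal component scores of the SNPs, respectively. The value of L can be chosen such that the cumulative proportion of the total variability explained by the first L PCs, (λ_{1} + λ_{2} + ··· + λ_{L})/(λ_{1} + λ_{2} + ··· + λ_{M}), exceeds some threshold. For comparison, we set the same threshold of 80% in both PCALRT and KPCALRT, following Gauderman et al [23].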
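A hedged sketch of the two steps described here: choosing L from the cumulative proportion of variability, then testing the first L component scores jointly in a logistic regression. The text does not spell out how the joint test is computed; the code assumes a likelihood-ratio comparison against the intercept-only model, fitted by Newton-Raphson. All function names and the simulated scores are hypothetical.

```python
import numpy as np

def choose_L(eigvals, threshold=0.80):
    """Smallest L whose leading eigenvalues reach the cumulative proportion."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratio, threshold) + 1)

def logistic_loglik(Z, y, n_iter=50):
    """Maximized log-likelihood of a logistic model with intercept."""
    D = np.column_stack([np.ones(len(y)), Z])        # design matrix
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):                          # Newton-Raphson updates
        p = 1.0 / (1.0 + np.exp(-D @ beta))
        W = p * (1.0 - p)
        step = np.linalg.solve(D.T @ (D * W[:, None]), D.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-D @ beta))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def lrt_statistic(scores, y, L):
    """2 * (log-likelihood with first L scores - intercept-only log-likelihood)."""
    full = logistic_loglik(scores[:, :L], y)
    null = logistic_loglik(np.empty((len(y), 0)), y)
    return 2.0 * (full - null)

# Hypothetical component scores with variances 4, 3, 2, 1; the first affects disease.
rng = np.random.default_rng(2)
eigvals = [4.0, 3.0, 2.0, 1.0]
scores = rng.normal(size=(400, 4)) * np.sqrt(eigvals)
y = (rng.random(400) < 1 / (1 + np.exp(-0.5 * scores[:, 0]))).astype(float)
L = choose_L(eigvals)            # cumulative proportions 0.4, 0.7, 0.9 -> L = 3
stat = lrt_statistic(scores, y, L)
```

Under the null hypothesis the statistic is asymptotically chi-square distributed with L degrees of freedom, so a P-value could be obtained with, for example, `scipy.stats.chi2.sf(stat, L)` if SciPy is available.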
Data simulation
To assess the performance of KPCALRT and compare it with PCALRT, we conduct a simulation based on HapMap data under the null hypothesis (H_{0}) and the alternative hypothesis (H_{1}). The steps of the simulation are as follows:
Step 1. Download the phased haplotype data of a genome region from the HapMap web site (http://snp.cshl.org): we select the protein tyrosine phosphatase, non-receptor type 22 (PTPN22) gene region to generate the simulated genotype data of the CEU population, using the HapMap Phase 1 & 2 full dataset. This region is located at Chr 1: 114168639..114197803 and includes 11 SNPs. Figure 1 shows their pairwise R^{2} structure and minor allele frequencies (MAF).
Figure 1. Pairwise R^{2} among the 11 SNPs in the selected region. The 11 SNPs are: rs7555634, rs2476600, rs1217395, rs2797415, rs1970559, rs1746853, rs2185827, rs1217406, rs1217407, rs3765598, rs1217408. The triangles mark the three haplotype blocks within this region. The value in each diamond is the R^{2} value and the shading indicates the level of LD between a given pair of SNPs. The values to the right of the 11 dbSNP IDs (rs# IDs) are the corresponding minor allele frequencies.
Step 2. Based on the HapMap phased haplotype data, we generate a large sample of 100 000 cases and 100 000 controls as the CEU population using the software HAPGEN [38]. To investigate the performance of the two methods for causal SNPs with different MAFs and different LD patterns, each of the 11 SNPs in turn is defined as the causal variant. We remove the causal SNP from the simulated data so as to assess the indirect association with disease via correlated markers. Under H_{0}, we set the relative risk per allele to 1.0 to assess the type I error. Under H_{1}, different relative risks are set (1.1, 1.2, 1.3, 1.4 and 1.5 per allele) to assess the power. The SNPs in this region are coded according to the additive genetic model.
Step 3. From the remaining SNPs, we sample the simulation data and perform PCALRT and KPCALRT at different sample sizes N (N/2 cases and N/2 controls, N = 1000, 2000, ..., 12000) using the R packages kernlab (http://cran.r-project.org/web/packages/kernlab/index.html) and Design (http://cran.r-project.org/web/packages/Design/index.html). Under H_{0}, we repeat 10 000 simulations at two significance levels (0.05 and 0.01). Under H_{1}, for each model with a given relative risk, we repeat 10 000 simulations at four significance levels (0.05, 0.01, 1E-5 and 1E-7).
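The logic of Steps 2 and 3 can be illustrated with a much simpler single-SNP sketch. It is not a substitute for HAPGEN, which simulates whole haplotypes and preserves LD; it only shows how a per-allele relative risk shifts genotype frequencies among cases, and how type I error or power is estimated as a rejection fraction over replicates. Hardy-Weinberg proportions, the multiplicative risk model, the MAF of 0.2 and all function names are assumptions made for illustration.

```python
import random

def case_genotype_freqs(maf, rr):
    """P(g | case) for g = 0, 1, 2 copies of the risk allele.

    Assumes Hardy-Weinberg proportions and a multiplicative per-allele
    relative risk, so P(g | case) is proportional to P(g) * rr**g.
    """
    q = maf
    pop = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    weights = [p * rr ** g for g, p in enumerate(pop)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_genotypes(freqs, n, rng):
    """Draw n genotypes (0/1/2) from the given genotype frequencies."""
    return rng.choices([0, 1, 2], weights=freqs, k=n)

def empirical_rate(p_values, alpha):
    """Fraction of replicates rejected at level alpha (type I error or power)."""
    return sum(p < alpha for p in p_values) / len(p_values)

rng = random.Random(0)
freqs_h0 = case_genotype_freqs(maf=0.2, rr=1.0)   # null: population frequencies
freqs_h1 = case_genotype_freqs(maf=0.2, rr=1.3)   # alternative: risk allele enriched
cases = sample_genotypes(freqs_h1, 1500, rng)     # N/2 cases for N = 3000
```

In a full replicate loop, each simulated data set would be analyzed by both tests and the resulting P-values passed to `empirical_rate` at each significance level.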
Application
The proposed method is applied to the rheumatoid arthritis (RA) data from GAW16 Problem 1. The data consist of 2062 Illumina 550 k SNP chips from 868 RA patients and 1194 normal controls collected by the North American Rheumatoid Arthritis Consortium (NARAC) [39]. In the present study, only the 1493 females (641 cases and 852 controls) are analyzed, to avoid potential bias arising from the fact that rheumatoid arthritis is two to three times more common in women than in men [40].
To illustrate the performance of PCALRT and KPCALRT, we focus on four regions on chromosome 1, containing the genes PTPN22, ANKRD35, DUSP23 and RNF186, respectively. The reasons are as follows: 1) both the PTPN22 gene (R620W, rs2476601) and the ANKRD35 gene have been reported to be associated with RA [41-43]; 2) DUSP23 can activate mitogen-activated protein kinase kinase [43], which may regulate a pathway in rheumatoid arthritis [44,45]; 3) RNF186 contains an ulcerative colitis-risk locus (rs3806308) [44], and RA may be associated with ulcerative colitis [45].
Results
Data simulation
Type I error
Simulation results under H_{0} are shown in Table 1, which indicates that the type I error rates of both PCALRT and KPCALRT are very close to the nominal values (α = 0.01, α = 0.05) at all sample sizes. This suggests that both models perform well under the null hypothesis.
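As a rough yardstick for how close an empirical rate should be to its nominal value, one can use the binomial standard error of a rejection fraction. Assuming the B = 10 000 replicates are independent (this calculation is an added illustration, not part of the original analysis):

```latex
\mathrm{SE}(\hat{\alpha}) = \sqrt{\frac{\alpha (1 - \alpha)}{B}}, \qquad B = 10\,000,

\alpha = 0.05:\quad \mathrm{SE} = \sqrt{\frac{0.05 \times 0.95}{10\,000}} \approx 0.0022,
\qquad
\alpha = 0.01:\quad \mathrm{SE} = \sqrt{\frac{0.01 \times 0.99}{10\,000}} \approx 0.0010 .
```

Empirical rates within about two standard errors of the nominal level, roughly 0.046 to 0.054 at α = 0.05 and 0.008 to 0.012 at α = 0.01, are therefore consistent with a well-calibrated test.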
Table 1. Type I error of PCALRT and KPCALRT
Power
With the 6^{th} SNP (rs1746853) defined as the causal variant, Figure 2 shows the powers of the two models at different significance levels for a relative risk of 1.3 and a sample size of 3000. It is clear that KPCALRT is always much more powerful than PCALRT, especially at the significance level of 1E-5 (the gene-wide level suggested by Neale and Sham [14]). In the following, only the results at the significance level of 1E-5 are presented.
Figure 2. The powers of PCALRT and KPCALRT under different significance levels at the given relative risk of 1.3 and sample size of 3000. The horizontal axis denotes the significance levels and the vertical axis denotes the powers of PCALRT and KPCALRT.
With the same causal variant as above, Figure 3 shows the powers of the two models at different sample sizes for a relative risk of 1.3, while Figure 4 shows the powers at different relative risks for a sample size of 3000. As expected, for both models the power increases monotonically with the sample size and the relative risk. Furthermore, the power of KPCALRT is much higher than that of PCALRT when the sample size is at least 3000 (Figure 3). Both models have low power when RR is less than 1.2. At higher relative risks, KPCALRT also shows greater power than PCALRT. In particular, at a relative risk of 1.3 the power of PCALRT is close to zero, whereas that of KPCALRT is about 0.6 (Figure 4). Figure 5 shows the powers of both models for a sample size of 3000 and a relative risk of 1.3 when each of the 11 SNPs in turn is set as the causal variant. Notably, KPCALRT is more powerful than PCALRT in every case.
Figure 3. The powers of PCALRT and KPCALRT under different sample sizes at the given relative risk of 1.3. The horizontal axis denotes the sample sizes and the vertical axis denotes the powers of PCALRT and KPCALRT.
Figure 4. The powers of PCALRT and KPCALRT under different relative risks at the given sample size of 3000. The horizontal axis denotes the relative risks and the vertical axis denotes the powers of PCALRT and KPCALRT.
Figure 5. The powers of PCALRT and KPCALRT at the given sample size of 3000 and relative risk of 1.3 when each of the 11 SNPs was set as the causal variant. The horizontal axis denotes the positions of the causal variant and the vertical axis denotes the powers of PCALRT and KPCALRT.
These simulation results indicate that the power of KPCALRT is always higher than that of PCALRT at the given significance levels, sample sizes and relative risks. The advantage of KPCALRT is most pronounced under lower relative risks (1.2 and 1.3) and smaller significance levels (1E-5 and 1E-7).
Application
Table 2 shows the information on the four selected regions and the performance of PCALRT, KPCALRT and the single-locus test. For region 1, statistical significance at the given nominal level (1E-5) was detected by all three methods. For region 2, significance was found by both the single-locus test and KPCALRT, while PCALRT did not identify this region. Only KPCALRT detected significance for region 3, and both PCALRT and KPCALRT identified significance for region 4. These results suggest that KPCALRT performs best among the three methods.
Table 2. The performances of single-locus test, PCALRT and KPCALRT
Discussion
In genetic association studies, especially GWAS, several groups have proposed PCA-based methods in order to avoid collinearity among SNPs and to reduce the false positive rate caused by multiple testing, and have found that these methods are typically as powerful as, or more powerful than, both single-locus tests and haplotype-based tests [20-23]. However, it is not enough to consider only the linear relationship between SNPs, and the PCA-based methods lose power when nonlinear relationships exist in the genome. In this paper, based on the ideas of Wu et al [32] and Liu et al [30], we combined KPCA with LRT to propose the KPCALRT model for detecting the association between multiple SNPs and diseases. The simulation results (Table 1, Figure 2 to Figure 5) showed that KPCALRT performed well under the null hypothesis, and that the power of KPCALRT was always higher than that of PCALRT at the given significance levels, sample sizes and relative risks, especially under lower relative risks (1.2 and 1.3) with smaller significance levels (1E-5 and 1E-7). We set five low levels of relative risk (1.1-1.5) because the great majority of the identified risk marker alleles confer very small relative risks [46]. Our simulation results show that KPCALRT is much more powerful than PCALRT when the sample size is at least 3000 (Figure 3). Both models have low power when RR is lower than 1.2. At higher relative risks, KPCALRT also shows greater power than PCALRT. In particular, at a relative risk of 1.3 the power of PCALRT is close to zero, whereas that of KPCALRT is about 0.6 (Figure 4). To investigate the performance of the two methods for causal SNPs with different MAFs and different LD patterns, each of the 11 SNPs in turn was defined as the causal variant. In each case, KPCALRT was more powerful than PCALRT (Figure 5).
To compare the three methods (single-locus test, PCALRT and KPCALRT), four regions from the RA data in GAW16 Problem 1 (Table 2) are considered in this paper. For region 1, statistical significance at the given nominal level (1E-5) was detected by all three methods. For region 2, significance was found by both the single-locus test and KPCALRT, while PCALRT did not identify this region. There are no previous reports of association for region 3 or region 4, but in this paper the results of KPCALRT suggest that there may be susceptibility loci in these two regions, and the result of PCALRT for region 4 agreed with that of KPCALRT. In conclusion, KPCALRT performed best among the three methods.
The four genes involved in the regions for real data analysis were selected based on prior research and Gene Ontology [47]. The definition of "region" is very broad, and may refer to a single SNP, a haplotype, a gene set, or an interval of constant copy number [8]. For ease of interpretation, genes or genome regions are often defined based on biological knowledge, such as Gene Ontology and KEGG [48]. For large genes or regions, it is hard to fine-map the causal SNPs or associated markers even if an association with the whole gene or region can be detected. Recently, sliding-window scan approaches have been widely used to partition large genes or regions into many overlapping or non-overlapping regions [49,50]. The proposed gene- or region-based methods can then be applied within each region.
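A fixed-size sliding-window partition of this kind might look as follows. This is only a minimal sketch with hypothetical names and parameters; the cited approaches [49,50] use more refined schemes, such as variable-sized windows, which are not reproduced here.

```python
def sliding_windows(n_snps, size, step):
    """Return (start, end) SNP index pairs, end exclusive, covering 0..n_snps.

    step < size gives overlapping windows; step == size gives a
    non-overlapping partition. The last window is truncated at n_snps.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    if n_snps <= 0:
        return windows
    start = 0
    while True:
        end = min(start + size, n_snps)
        windows.append((start, end))
        if end == n_snps:
            break
        start += step
    return windows

# Hypothetical example: 11 SNPs, windows of 5 SNPs moved 3 SNPs at a time.
overlapping = sliding_windows(11, size=5, step=3)
disjoint = sliding_windows(10, size=4, step=4)
```

Each window's SNP columns would then be passed to the region-based test, keeping in mind that testing many windows reintroduces a multiple-testing burden.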
The proposed method has several limitations. First, only one causal SNP is considered in the present work. Second, how to choose the kernel function and its parameters for a given data set is still an open theoretical problem. Third, when the effect size is small (relative risk per allele = 1.1, see Figure 4), both PCALRT and KPCALRT have low power. Fourth, the frequencies of all the causal SNPs are higher than 0.05, so it is hard to decide whether the proposed method is powerful for rare variants. Finally, the proposed KPCALRT is based on logistic regression, so it cannot deal with quantitative traits. To handle such traits, KPCA-based methods could be combined with, for example, multivariate regression analysis or partial least squares (PLS) [51]. Further work to address these problems is warranted.
Conclusions
In the present study, we have proposed a KPCALRT model for testing the association of a candidate gene or genome region with diseases (or traits). Results from both the simulation studies and the application to real data show that KPCALRT with appropriate parameters is always as powerful as, or more powerful than, PCALRT, especially under lower relative risks and lower significance levels.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
QSG, YGH, ZSY, JHZ, BBZ and FZX conceptualized the study, acquired and analyzed the data and prepared the manuscript. All authors approved the final manuscript.
Acknowledgements
This work was supported by a grant from the National Natural Science Foundation of China (30871392). We thank NARAC for providing us with the data.
References

1. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al.: Complement factor H polymorphism in age-related macular degeneration. Science 2005, 308(5720):385-389.
2. Maraganore DM, de Andrade M, Lesnick TG, Strain KJ, Farrer MJ, Rocca WA, Pant PV, Frazer KA, Cox DR, Ballinger DG: High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet 2005, 77(5):685-693.
3. Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PI, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ, et al.: Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007, 316(5829):1331-1336.
4. Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ, Perry JR, Rayner NW, Freathy RM, et al.: Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 2007, 316(5829):1336-1341.
5. de Bakker PI, Ferreira MA, Jia X, Neale BM, Raychaudhuri S, Voight BF: Practical aspects of imputation-driven meta-analysis of genome-wide association studies. Hum Mol Genet 2008, 17(R2):R122-128.
6. Lindgren CM, Heid IM, Randall JC, Lamina C, Steinthorsdottir V, Qi L, Speliotes EK, Thorleifsson G, Willer CJ, Herrera BM: Genome-wide association scan meta-analysis identifies three Loci influencing adiposity and fat distribution. PLoS genetics 2009, 5(6):e1000508.
7. Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, Li Y, Kurreeman FAS, Zhernakova A, Hinks A: Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nature genetics 2010, 42(6):508-514.
8. Beyene J, Tritchler D, Asimit JL, Hamid JS: Gene- or region-based analysis of genome-wide association studies. Genet Epidemiol 2009, 33(Suppl 1):S105-110.
9. Kraft P, Hunter D: Genetic risk prediction - are we there yet? New Engl J Med 2009, 360(17):1701.
10. Buil A, Martinez-Perez A, Perera-Lluna A, Rib L, Caminal P, Soria J: A new gene-based association test for genome-wide association studies. 2009. BioMed Central Ltd: S130.
11. Yang HC, Liang YJ, Chung CM, Chen JW, Pan WH: Genome-wide gene-based association study. BMC Proc 2009, 3(Suppl 7):S135.
12. Lo S, Chernoff H, Cong L, Ding Y, Zheng T: Discovering interactions among BRCA1 and other candidate genes associated with sporadic breast cancer. Proceedings of the National Academy of Sciences 2008, 105(34):12387.
13. Qiao B, Huang CH, Cong L, Xie J, Lo SH, Zheng T: Genome-wide gene-based analysis of rheumatoid arthritis-associated interaction with PTPN22 and HLA-DRB1. BMC Proc 2009, 3(Suppl 7):S132.
14. Neale BM, Sham PC: The future of association studies: gene-based analysis and replication. Am J Hum Genet 2004, 75(3):353-362.
15. Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Hayward NK, Montgomery GW, Visscher PM, Martin NG, et al.: A versatile gene-based test for genome-wide association studies. Am J Hum Genet 2010, 87(1):139-145.
16. Hauser E, Cremer N, Hein R, Deshmukh H: Haplotype-based analysis: a summary of GAW16 Group 4 analysis. Genet Epidemiol 2009, 33(Suppl 1):S24-28.
17. Pryce JE, Bolormaa S, Chamberlain AJ, Bowman PJ, Savin K, Goddard ME, Hayes BJ: A validated genome-wide association study in 2 dairy cattle breeds for milk production and fertility traits using variable length haplotypes. J Dairy Sci 2010, 93(7):3331-3345.
18. Xiong M, Zhao J, Boerwinkle E: Generalized T2 test for genome association studies. Am J Hum Genet 2002, 70(5):1257-1268.
19. Fan R, Knapp M: Genome association studies of complex diseases by case-control designs. Am J Hum Genet 2003, 72(4):850-868.
20. Peng Q, Zhao J, Xue F: PCA-based bootstrap confidence interval tests for gene-disease association involving multiple SNPs. BMC Genet 2010, 11:6.
21. Wang K, Abbott D: A principal components regression approach to multilocus genetic association studies. Genet Epidemiol 2008, 32(2):108-118.
22. Wang X, Qin H, Sha Q: Incorporating multiple-marker information to detect risk loci for rheumatoid arthritis. BMC Proc 2009, 3(Suppl 7):S28.
23. Gauderman WJ, Murcray C, Gilliland F, Conti DV: Testing association between disease and multiple SNPs in a candidate gene.
24. Yang HC, Lin CY, Fann CS: A sliding-window weighted linkage disequilibrium test. Genet Epidemiol 2006, 30(6):531-545.
25. Yang HC, Hsieh HY, Fann CS: Kernel-based association test. Genetics 2008, 179(2):1057-1068.
26. Silva S, Botelho C, De Bem R, Almeida L, Mata M: C-NLPCA: Extracting Non-Linear Principal Components of Image Datasets.
27. Mika S, Schölkopf B, Smola A, Müller K, Scholz M, Rätsch G: Kernel PCA and de-noising in feature spaces. Advances in neural information processing systems 1999, 11(1):536-542.
28. Schölkopf B, Smola A, Müller K: Kernel principal component analysis. Artificial Neural Networks - ICANN'97 1997, 583-588.
29. Schölkopf B, Smola A, Müller KR: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998, 10(5):1299-1319.
30. Liu Z, Chen D, Bensmail H: Gene expression data classification with Kernel principal component analysis. J Biomed Biotechnol 2005, 2005(2):155-159.
31. Kramer MA: Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. Aiche J 1991, 37(2):233-243.
32. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, Lin X: Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet 2010, 86(6):929-942.
33. Nguyen VH, Golinval JC: Fault detection based on Kernel Principal Component Analysis. Eng Struct 2010, 32(11):3683-3691.
34. Zhang DQ, Zhou ZH: Adaptive kernel principal component analysis with unsupervised learning of kernels.
35. Kwok JT, Tsang IW: Learning with idealized kernels. 2003, 400.
36. Jaakkola T, Diekhans M, Haussler D: Using the Fisher kernel method to detect remote protein homologies. 1999, 149-158.
37. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(1):262.
38. Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 2007, 39(7):906-913.
39. Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LRL, et al.: TRAF1-C5 as a risk locus for rheumatoid arthritis - A genomewide study. New Engl J Med 2007, 357(12):1199-1209.
40. Firestein GS: Evolving concepts of rheumatoid arthritis. Nature 2003, 423(6937):356-361.
41. Begovich A, Carlton V, Honigberg L, Schrodi S, Chokkalingam A, Alexander H, Ardlie K, Huang Q, Smith A, Spoerke J: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. The American Journal of Human Genetics 2004, 75(2):330-337.
42. Carlton V, Hu X, Chokkalingam A, Schrodi S, Brandon R, Alexander H, Chang M, Catanese J, Leong D, Ardlie K: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. The American Journal of Human Genetics 2005, 77(4):567-581.
43. Källberg H, Padyukov L, Plenge R, Rönnelid J, Gregersen P, van der Helm-van Mil A, Toes R, Huizinga T, Klareskog L, Alfredsson L: Gene-gene and gene-environment interactions involving HLA-DRB1, PTPN22, and smoking in two subsets of rheumatoid arthritis. The American Journal of Human Genetics 2007, 80(5):867-875.
44. Silverberg MS, Cho JH, Rioux JD, McGovern DPB, Wu J, Annese V, Achkar JP, Goyette P, Scott R, Xu W: Ulcerative colitis-risk loci on chromosomes 1p36 and 12q15 found by genome-wide association study. Nat Genet 2009, 41(2):216-220.
45. Boyer F, Fontanges E, Miossec P: Rheumatoid arthritis associated with ulcerative colitis: a case with severe flare of both diseases after delivery. Ann Rheum Dis 2001, 60(9):901.
46. Manolio T, Brooks L, Collins F: A HapMap harvest of insights into the genetics of common disease. The Journal of clinical investigation 2008, 118(5):1590.
47. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
48. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28(1):27-30.
49. Sha Q, Tang R, Zhang S: Detecting susceptibility genes for rheumatoid arthritis based on a novel sliding-window approach. BMC Proc 2009, 3(Suppl 7):S14.
50. Tang R, Feng T, Sha Q, Zhang S: A variable-sized sliding-window approach for genetic association studies via principal component analysis. Ann Hum Genet 2009, 73(Pt 6):631-637.
51. Wold H: Partial least squares. 1985.