Abstract
Background
The current genomewide association (GWA) analysis mainly focuses on the single genetic variant, which may not reveal some the genetic variants that have small individual effects but large joint effects. Considering the multiple SNPs jointly in Genomewide association (GWA) analysis can increase power. When multiple SNPs are jointly considered, the corresponding SNPlevel association measures are likely to be correlated due to the linkage disequilibrium (LD) among SNPs.
Methods
We propose SNPbased parametric robust analysis of geneset enrichment (SNPPRAGE) method which handles correlation adequately among association measures of SNPs, and minimizes computing effort by the parametric assumption. SNPPRAGE first obtains genelevel association measures from SNPlevel association measures by incorporating the size of corresponding (or nearby) genes and the LD structure among SNPs. Afterward, SNPPRAGE acquires the geneset level summary of genes that undergo the same biological knowledge. This twostep summarization makes the withinset association measures to be independent from each other, and therefore the central limit theorem can be adequately applied for the parametric model.
Results & conclusions
We applied SNPPRAGE to two GWA data sets: hypertension data of 8,842 samples from the Korean population and bipolar disorder data of 4,806 samples from the Wellcome Trust Case Control Consortium (WTCCC). We found two enriched gene sets for hypertension and three enriched gene sets for bipolar disorder. By a simulation study, we compared our method to other gene set methods, and we found SNPPRAGE reduced many false positives notably while requiring much less computational efforts than other permutationbased gene set approaches.
Background
The genomewide association (GWA) studies have been successful to investigate generic variants associated with some targeted phenotypes. In general, many GWA methods only consider association of a single SNP and provide the list of the most significant SNPs or related genes due to computational burden.
However, complex diseases often result from compound action of multiple risk factors and therefore the singleSNPbased analysis may miss the genetic variants that affect risk effects jointly but have scarce individual effects. Also, the locus heterogeneity, which implies that alleles at different loci target the same diseases in different individuals, would increase difficulty in replication of association of a single marker [1]. Furthermore, a large number of statistical tests may result in high false positive associations [2]. To resolve these issues, it was suggested to utilize prior biological knowledge or known pathway information, and thus to incorporate a set of related SNPs, which leads a smaller number of tests. This approach was motivated by the gene set analysis (GSA), widely used in the analysis of microarray data. GSA focuses on gene sets rather than individual genes, and combines weak signals from a number of individual genes in a set, when individual genes are weakly associated with the traits. In this way, GSA increases a power of detecting diseaserelated genes and helps to interpret underlying genetic background and has been popularized.
GSA can be classified into nonparametric or parametric approach. The most popular nonparametric GSA method is gene set enrichment analysis (GSEA) [3]. GSEA uses the enrichment score which represents whether the members of gene set tend to occur toward top or bottom in ranked gene list based on a correlation. It permutes the phenotype label and repeats calculating the enrichment score for the test. This requires very expensive computational efforts.
On the other hands, the parametric GSA can reduce computing time by assuming a specific distribution. A hypergeometric distributionbased test [4,5] is a typical choice for the parametric method, and binomial, normal, and chisquare distributions are also widely used [68].
There are several prior works for applying GSA methods to GWA data [1,2,7,915]. For simplicity, we call all these methods as GSAGWA. We address two issues regarding GSAGWA. The first issue is that there has not been a widely agreed and accepted theory on how to combine the measures of multiple SNPs into one single genelevel measure, and moreover how to combine the genelevel measures into one single geneset level measure. In original GSA, the genelevel measure is typically a foldchange or a correlation to represent the effect of a single gene. In GWA data, however, it is often required to calculate association measures of genes by combining the SNPlevel measures. The SNPlevel measures include pvalues, or chisquare test statistics from the univariate SNPtophenotype association tests. Once the SNPlevel measure is decided, the genelevel summary statistics are then derived as the highest SNPlevel statistics [10], the sum of SNPlevel statistics [9], or the combined pvalue [1].
However, there are some substantial limitations in current GSAGWA methods. First, in deriving the summary statistics the correlation among the SNPlevel association measures has not been taken appropriately into account which is expected to play an important role. The SNPlevel association measures are usually correlated because the linkage equilibrium (LD) exists among SNPs. If this correlation is not correctly adjusted, the resulting genesetlevel measure would be inflated [1]. Unfortunately, many GSAGWA methods have not considered the LD structures adequately.
Second, the computational burden is heavy. Once having the genelevel association measures, it is possible to apply different GSA methods to get various genesetlevel statistics and evaluate their performances. However, as explained later, the majority of GSAGWA methods implement nonparametric permutation to calculate the observed significance, which takes a heavy computing time.
There have been several efforts to resolve these limitations. As the pioneering work of GSAGWA, GSEA [3] was extended to GWA data by Wang et al. [10], which has been implemented in GenGen package [11] It repeats permutation of sample label and calculation of gene set statistics 100~1,000 times [2,10,1214]. This permutationbased testing can preserve a correlation among the SNPlevel measures, but this is very computationally expensive in genomewide scale.
In order to reduce computing time, some GSAGWA studies use a parametric test. Peng et al.[1] used various kinds of the parametric test such as Fisher’s combination test, Sidak’s combination test, Simes’ combination test, and a FDRbased test under the independence assumption of the SNPlevel pvalues. A GLOSSI method developed by Chai et al.[9] used Fisher’s combination test under the assumption of correlated pvalues.
Recently, Nam et al. [15] proposed the Zstatistic method that compares a specific geneset to others. This method is the extension of the parametric analysis of gene set enrichment (PAGE) [7], which is the parametric and competitive GSA for microarray data. PAGE uses the mean of the association measures in a set as a summary measure and assumes that it follows a normal distribution by the central limit theorem when the number of genes is large.
However, these parametric methods including the Zstatistic method do not consider the LD structures adequately and assume no correlation between SNPlevel pvalues. In order to overcome these limitations of current GSAGWA, we propose SNPPRAGE, a SNPbased parametric robust analysis of geneset enrichment, which is based on a simple normality assumption. SNPPRAGE estimates the LD information among SNPs based on haploblockwise covariance structure to consider the correlation among SNPlevel measures without taking the permutation step.
We compare our method to other GSAGWA methods via the simulation study in terms of size, power and computing time. We also demonstrate SNPPRAGE using two GWA data sets: hypertension data of 8,842 samples from the Korean population and bipolar disorder data of 4,806 samples from the Wellcome Trust Case Control Consortium (WTCCC).
Methods
Zstatistic method (GSASNP)
Nam et al. [15] implemented the Z statistic method in their software, GSASNP. The negative logarithm of the mth best pvalue within each gene was used as the gene summary measure. Based on this gene summary measure, the Zscore was then calculated as genesetlevel summary. The Zscore was assumed to follow a normal distribution based on the central limit theorem (CLT).
In order to meet a normal distribution assumption, the genelevel order statistic is assumed to have an identical and independent distribution (i.i.d.). Let n_{ij} be the gene size which is the number of SNPs within the jth gene in the ith gene set. If we assume a pvalue follows an independent uniform distribution, the mth order pvalue p_{(}_{m}_{)} follows a beta distribution with the mean m/(n_{ij}+1) and the variance m(n_{ij}–m+1) /{(n_{ij}+1)^{2}(n_{ij}+2)}. This means that the gene with many SNPs have a lower p_{(}_{m}_{)} than genes with a few SNPs. (See Figure 1(a)). So p_{(}_{m}_{)} is not identically distributed over the gene size. To satisfy the identical distribution assumption, the summary measures need some modifications.
Figure 1. Distribution of genelevel measures over the gene size for hypertension data from Korean population. The xaxis is gene size which is a number of SNPs within the gene and the yaxis is mean of genelevel summaries with same gene size.
The genelevel summary measure is also assumed to have a homogeneous variance. However, the variance of their summary measures also depends on the gene size. When the gene size is large, the variance of the summary measure of the gene tends to be small. This problem can be easily addressed by modifying Welch’s t statistic [20] which is designed to handle for the heterogeneous variance of the two groups.
SNPPRAGE
To address these issues of the Zstatistic method we mentioned above, we multiply p_{(}_{m}_{)} by (n_{ij}+1) to have an approximate identical distribution over the gene size. The moment generating function of (n_{ij}+1)p_{(}_{m}_{)} does not depend on the gene size n_{ij} when pvalues are independent from each other and n_{ij} is large enough. However, the SNPlevel pvalues are not independent from each other because of the LD structure. So (n_{ij}+1)p_{(}_{m}_{)} has a nonidentical distribution over the genes (See Figure 1(b)).
In SNPPRAGE, we propose using the effective gene size instead of gene size n_{ij} to make sure that the genelevel summary measure has an approximate identical distribution over the gene size irrespectively of correlation among pvalues. The effective gene size is computed by using the following equation.
is estimated under the independent covariance structure and under the haploblockwise compound symmetric covariance structure.
Note that SNPlevel measures within a LD block are correlated. The withingene covariance matrix can be estimated by using maximum likelihood (ML) estimation. Among the several candidate covariance structures, the Akaike information criterion is used to choose the most appropriate covariance structure [16]. First, we construct the LD block among SNPs in GWA data so that any pair of SNPs from different LD blocks is independent from each other (r^{2}≤ 0.05) [17]. Second, we obtain the ML estimator of the covariance matrix within the LD block for each gene set. The most appropriate covariance structure is then selected via AIC. In the Korean GWA data analysis, the LDblockwise compound symmetric structure (LDCS) was chosen as the appropriate covariance structure.
Within the gene, the highly ranked pvalues tend to be correlated because of the LD structure. Through the simulation study, we found that the average of the top m pvalues from a gene is a more robust genelevel summary measure than only the mth pvalue (data are not shown). The following is the final genelevel summary measure proposed in SNPPRAGE. In Figure 1(c), we can see this measure has the identical mean over the gene size.
However, our empirical study shows that genelevel measure does not have the common variance over the gene set especially with the small gene set size (See Figure 2). Thus, we assume that the genelevel measure has a heterogeneous variance over the gene sets:
Figure 2. Variance of genelevel measure over the gene sets. In the left plot (a), xaxis is gene set size (= number of genes in the gene set) and yaxis is sample variance of genelevel summaries in the gene set. The right plot (b) shows a boxplot of variance of genelevel measures. A red line represents total sample variance in the data.
The mean of the genelevel measures in a gene set follows a normal distribution by the central limit theorem.
We compute the sample variance distinctly over gene set and derive the following setlevel test statistic:
The degree of freedom (df_{i}) is computed by WelchSatterthwaite equation [20].
Results
Hypertension data from the Korean GWA study
We used canonical pathways from MsigDB database [18]. These canonical pathways are curated from other online database such as BioCarta, KEGG and GO and so on. MsigDB database contains 639 pathways and 4934 genes.
We applied SNPPRAGE to GWA data set from the Korean GWA study which was initiated in 2007 to undertake a largescale GWA analysis among 10,038 participants (aged between 40 and 69) of Ansung (n=5,018) and Ansan (n=5,020) populationbased cohorts [19]. These cohorts, established as part of the Korean Genome Epidemiology Study (KoGES) in 2001 provide extensive phenotypic data for over 260 traits, but here we focus on analyses of hypertension. From the total of 10,038 participants, DNA was available for 10,004, all of whom were genotyped with the Affymetrix GenomeWide Human SNP array 5.0 and the Bayesian Robust Linear Modeling using Mahalanobis Distance (BRLMM) algorithm. Markers with high missing gene call rate (>5%), low MAF (<0.01) and significant deviation from HardyWeinberg equilibrium (P < 1 × 10^{6}) were excluded, leaving 352,228 SNPs. After removing samples with low call rates (< 96%, n = 401), sample contamination (n = 11), gender inconsistencies (n = 41), cryptic relatedness (n = 608) and serious concomitant illness (n = 101), GWA genotypes from 8,842 individuals were included. Hypertension phenotype was defined as a systolic blood pressure (SBP) ≥ 140 mm Hg or a diastolic blood pressure (DBP) ≥ 90 mm Hg. The logistic regression analysis with an additive model (1 d.f.) is conducted after adjustment for age, sex, and recruitment area (i.e. Ansung and Ansan). To correct for stratification, some methods that infer genetic ancestry, such as principal component analysis (PCA) and structured association can be used [21]. In our GWA data, there is no evidence of population stratification.
We obtained the SNP ID, rs ID, position information from dbSNP build 128 and gene ID, gene name, and position information from NCBI build 36. Each SNP is mapped to a gene closest to it. Only SNPs located within 500 Kb upstream or downstream of a gene are considered, because most enhancers and repressors are less than 500Kb away from genes, and most LD blocks are within 500Kb [10]. As a result of mapping it covered 60% of all SNPs in our data. If the mapping range is larger, we could save more SNPs, but the risk of SNP’s mapping to shared region of overlapping genes also increases.
Our proposed SNPPRAGE was used to identify the significant gene sets associated with hypertension in Korean GWA data. We used the pvalue from the logistic regression as SNPlevel association measure for each SNP. We compared three kinds of genelevel measures, , , and
In Figure 1, each plot shows the mean of genelevel measures over the gene size. Figure 1(a) is from the genelevel measure used in the Zstatistic method. This measure tends to increase as the gene size increases. The noncausal gene set which has a larger number of genes tend to be detected as significant. Figure 1(b) shows the minimum pvalues within the gene multiplied by (gene size + 1) over the gene size. Figure 1(c) shows the same plot but uses the effective gene size instead of the actual gene size. Figure 1(c) is most robust to the gene size showing the constant pattern.
Next, we checked the homogeneity assumption of variance of genelevel measure () over the gene sets. Figure 2 represents whether has the homogeneous variance over the gene sets. We can see the sample variances are different over the gene sets especially for the gene sets with a small number of genes. Thus, it would be inappropriate to assume the homogeneous variance assumption for the genelevel measures. SNPPRAGE allows the heterogeneous variance of genelevel measure.
In order to handle multiple testing problems, the false discovery rate (FDR) was controlled [22]. The qvalues were calculated to guard against the cost of multiple hypothesis testing [23]. The qvalue provides an expected proportion of false positives among sets with unadjusted pvalues at least as extreme as the current set of interest. Single SNP association test based on a logistic regression cannot detect SNP whose qvalue is less than 0.05. Minimum SNPlevel pvalue is 2.043 × 10^{6} and corresponding qvalue is 0.4. Even though there is no significant SNPlevel association in terms of qvalues, multiple SNPs with moderate effects could affect the phenotype in the gene setlevel.
Table 1 and Table 2 summarize the top 5 gene sets obtained by using the Zstatistic method and SNPPRAGE, respectively. In Zstatistic method, minimum qvalue is 0.06, which is not significant if we use 0.05 as qvalue cutoff. SNPPRAGE yielded 2 significant gene sets (qvalues: 0.01, 0.03) based on qvalue 0.05 as cut off, while Zstatistic method did not yield any significant gene sets.
Table 1. KARE result: top 5 gene sets with smallest qvalue associated with hypertension phenotype from Zstatistic method
Table 2. KARE result: top 5 gene sets with smallest qvalue associated with hypertension phenotype from SNPPRAGE
The significant gene sets in SNPPRAGE are ST_JNK_MAPK_Pathway and ST_ ERK1_ERK2_MAPK_Pathway. The MAPK signaling pathway is known to ultimately result in the dual phosphorylation and activation of terminal kinases, such as p38, cJun Nterminal kinases (JNKs), and extracellular signalregulated kinases (ERK1/2 and ERK5), which are related to pressureoverload–induced cardiac hypertrophy [24]. Esposito et al. [24] mentioned the potential role of ERK activation in White Blood Cells (WBCs) as a novel molecular marker to identify uncontrolled human hypertension. In their study, JNK1 activation was also significantly induced in uncontrolled hypertension patients.
Bipolar disorder data from the WTCCC GWA study
We also applied SNPPRAGE to bipolar disorder (BD) data from the Wellcome Trust Case Control Consortium (WTCCC) which was established in 2005 to conduct GWA analysis for group of 50 research groups across the UK [25]. In our analysis, 1868 BD cases and 2938 controls were included and markers with high missing gene call rate (>5%), low MAF (<0.05) and significant deviation from HardyWeinberg equilibrium (P < 5.7 x 10^{7}) were excluded, leaving 354,093 SNPs. The logistic regression analysis with an additive model (1 d.f.) was conducted after adjustment for age, sex, region, and age x region.
SNPPRAGE yielded 3 gene sets significantly associated with BD in terms of qvalue at the 5% significance level (Table 3), while Zstatistic method did not detect any significant gene set (Table 4). The significant gene sets detected by SNPPRAGE are AGPCR pathway, DREAM pathway, and CK1 pathway.
Table 3. WTCCC result: top 5 gene sets with smallest qvalue associated with bipolar disorder phenotype from Zstatistic method
Table 4. WTCCC result: top 5 gene sets with smallest qvalue associated with bipolar disorder phenotype from SNPPRAGE
AGPCR pathway is Gprotein coupled receptors (GPCRs) signaling pathway which transduces extracelluar signals across the plasma membrane. In a genomewide linkage survey, the region of chromosome 22q12 containing the GRK3 gene was identified as a susceptibility locus for BD in humans and GRK3 is expected to play an important role in the regulation of any one of many GPCRs [26]. DREAM is a multifunctional Ca^{2+}binding protein that can act as a transcriptional repressor for the prodynorphin gene. Subjects with BD were reported to show reduction of prodynorphin mRNA expression in discrete nuclei of the amygdaloid complex [27]. CK1 pathway is well known to be related to the circadian clock. Deregulation of this clock is involved in several human disorders. As a potent CK1ε inhibitor, a imidazole derivative, PF670462 could be used for therapy of cognitive deficits in mood changes in bipolar disorders [28].
Simulation study
In order to compare the performance of SNPPRAGE with other GSAGWA methods, we conducted the simulation study. Simulation data was generated based on a real GWA data. Using the subset of 5 gene sets from MsigDB canonical pathways, we constituted 5 gene sets so that each set has 20 genes. Over the gene sets, we varied the gene size which is the number of SNPs within a gene in order to study the effect of gene size on the gene set analysis. For example, one gene set consists of a small number of genes and other gene set consists of a large number of genes. The range of gene size is from 9 to 49 SNPs. Among 5 gene sets, we chose one causal gene set and selected 5 causal genes within the causal gene set. 500 individuals are randomly generated. For each causal gene, we selected one causal SNP whose minor allele frequency is about 0.2 for the selected individuals.
Given the genotype information of causal 5 SNPs and effect sizes, the case/control status was generated. Let SNP_{ij} denote jth causal SNP in ith individual and β denotes effect size (=log odds ratio). Effect size of each causal SNP is given as 0, 0.3, or 0.6.
Simulated gene sets and their gene sizes are given in Table 5. Either set 1, set 3, or set 5 is used as the causal gene set. For each causal get set, 1000 simulation datasets were generated for the effect size 0 to compute type I error and 100 simulation datasets for effect size 0.3 and 0.6 to compute powers.
Table 5. Simulated gene set based on MsigDB pathways
In order to determine whether or not the central limit theorem works for relatively small gene set, we obtained a null distribution of setlevel summary for reduced number of genes, say 5 and 10. We randomly chose 5 or 10 genes among 20 genes for each set. Figure 3 shows that the set level summary of small gene set follows a normal distribution when the number of genes is 10 and 20. However, there is a violation of normal approximation when the number of genes is 5. Thus, we expect that SNPPRAGE would work well when the number of genes is at least 10. For practical applications, we recommend discarding the gene sets in the analysis if the number of genes is smaller than 10.
Figure 3. QQ plot of setlevel summary with various set size. Under the assumption there is no causal set effect, Figure3 (a), (b), and (c) show the QQ plot of set summary with set size 5, 10, and 20, respectively.
We compared the performance of SNPPRAGE, Zstatistic method (Nam et al., 2010), modified GSEA method (Wang et al., 2007) and GLOSSI (Chai et al., 2009). We used the GenGen package for GSEA and the R package for other methods. SNPPRAGE, Z statistic method and GLOSSI use parametric test and GSEA method use nonparametric test with 1000 permutations. GLOSSI permute the data 100 times to consider the correlation of pvalues resulting from LD among SNPs.
Type 1 error is defined as the proportion of cases whose pvalues is less than the significance level when the effect size of causal SNP is zero. Power is defined as the proportion of cases whose pvalue is less than the significant level when effect size of causal SNP is 0.3 and 0.6. Tables 6 and 7 summarize the type 1 error and power of the methods compared.
Table 6. Type 1 error (when effect size is 0) in simulation studies
Table 7. Power (when effect size is 0.3 or 0.6) in the simulation studies
Type 1 error and power of the Zstatistic depend largely on the gene size. When the causal gene set consisted of the genes with 9~12 SNPs, the Zstatistic method yielded low type 1 error and power. They tended to decrease, as m increased. We think it is because the genes with the smaller number of SNPs tend to have a larger minimum pvalue and weaker LDs than the genes with a larger number of SNPs. When the causal gene set consists of the genes with 36~49 SNPs, on the other hand, the Zstatistic method yielded very high type 1 error and power. They tended to increase, as m increases. So the results from Zstatistic method can have high false positive errors, especially when the gene set has a larger number of genes.
On the other hand, SNPPRAGE gave the consistent results irrespective of gene size. As m goes from 1 to 5, SNPPRAGE gets a little larger power. Based on these results, it is desirable to use the mean of top m pvalues instead of the minimum pvalue as the genelevel measure. If the top m pvalues are from the SNPs in LD, the method using the top m pvalues can yield larger power than that using only the minimum pvalue. The computed power based on SNPPRAGE with appropriate m was similar but slightly larger compared to one of GLOSSI and GSEA. In SNPPRAGE, type 1 error is near 0.05 at the significance level 0.05. Table 8 summarizes the computing time of each method. Zstatistic method has the fastest computing time, because LD structure between SNPs is not taken into account. SNPPRAGE has the fastest computing time among the methods which consider LD between SNPs, Specifically, our simulation results show that GSEA and GLOSSI methods take 18.5 and 22.1 times, respectively, of computational efforts than SNPPRAGE.
Table 8. Computing time for simulation data analysis
The single SNP analysis for the Korean GWA data requires more than 1000 computing time compared to one set of simulation data. So, it would take a very long period of time if GSEA and GLOSSI are applied to our data, because both methods require permutation process. Thus, in practice it would not be easy to handle a large scale GWA data by GSEA and GLOSSI.
Discussion
The power of SNPPRAGE was computed for the several choice of m. When we choose appropriate m for the genelevel summary, the computed power based on SNPPRAGE was similar but slightly larger compared to one of GLOSSI and GSEA in the simulation study. Then how can we choose the appropriate m for the genelevel summary?
The best choice for the number of the top pvalues used in genelevel summary depends on the LD structure among the SNPs within the causal genes. While we set a fixed m over the genes for the summary in SNPPRAGE, setting different m over the genes according to each effective gene size can be considered in the future study.
Our SNPPRAGE can be extended in several ways. In this study, we assume the gene sets are independent from each other. However, the gene sets often share some common genes because one gene can have multiple biological functions. So handling the overlapped common genes between gene set is another challenging issue.
SNPPRAGE method is based on a normal distribution and similar to ANOVA (Analysis of Variance) model. In fact, SNPPRAGE can be expressed as ANOVA model with some contrast and modified estimation of variance. As an extension, another welldefined parametric model can be applied. A nested ANOVA can be applied to the gene set analysis in terms of that gene effect is nested within geneset effect. A mixed effect model can also be applied by treating the gene specific effects as random effects. Addressing these challenges we expect a more powerful GSASNP method in our near future.
Conclusions
Single SNP analysis in GWAS offers only a limited understanding of complex diseases because the complex disease often arises from the joint action of multiple genetic variants. Single SNP analysis can find only a few most significant SNPs. GSAGWA increases the power to detect the genetic variants which have a weak association but a meaningful biological association with a phenotype .GSAGWA methods test the significance of gene set via permutation by generating permuted data more than hundred times, which requires expensive computational efforts. The use of a parametric test can reduce the computing time, because it needs to calculate the gene set statistic only once.
We compared the performance and computing time of three parametric testbased GSAGWAs (Zstatistic method, GLOSSI, SNPPRAGE) and one nonparametric testbased GSAGWA (GSEA) in simulation study. The Zstatistic method does not consider the LD and has the shortest computing time but may have lots of false positive results because of overestimated gene set statistics when the gene set has many large genes. GLOSSI uses a parametric test but it needs to permute phenotype 100 times for an estimation of the correlation between association measures and GSEA requires much more permutations than GLOSSI. SNPPRAGE reduces computing time much and has comparable performance to GLOSSI and GSEA without going through the permutation step.
We found that consideration of LD blocks between SNPs helps us to deal with the correlation between pvalues more appropriately. The approach based on the mean of top m pvalues provides more consistent and stable result than the approach based on the top mth pvalue. Multiplying the effective gene size to the minimum pvalue for the genelevel summary of SNPPRAGE can reduce the false positive errors when the gene size is large. We expect the SNPPRAGE to play an important role in the parametric gene set analysis of largescale GWA data.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JL designed the summarization algorithm and drafted the manuscript. SA provided general trends of gene set analysis and drafted some of background part. SO and BW critically read the draft and contributed to the design of the algorithm. TP coordinated the work and help to draft the manuscript.
Acknowledgements
This work was supported by the Korea Healthcare Technology R&D Project, Ministry for Health & Welfare, Republic of Korea (A101915). The Korean GWA data analyzed in this study were obtained from the Korean Genome Analysis Project (4845301) and the epidemiologic data including hypertension were provided from the KoGES (4851302) that were funded by a grant from the Ministry for Health and Welfare, Republic of Korea.
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/17520509/5?issue=S2.
References

Peng G, Luo L, Siu H, Zhu Y, Hu P, Hong S, Zhao J, Zhou X, Reveille J, Jin L, et al.: Gene and pathwaybased secondwave analysis of genomewide association studies.
European Journal of Human Genetics 2010, 18(1):111117. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Holden M, Deng S, Wojnowski L, Kulle B: GSEASNP: applying gene set enrichment analysis to SNP data from genomewide association studies.
Bioinformatics 2008, 24(23):27842785. PubMed Abstract  Publisher Full Text

Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, et al.: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proceedings of the National Academy of Sciences 2005, 102(43):1554515550. Publisher Full Text

Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of genetic network architecture.
Nature Genetics 1999, 22:281285. PubMed Abstract  Publisher Full Text

Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression.
Genomics 2003, 81:98104. PubMed Abstract  Publisher Full Text

Jiang Z, Gentleman R: Extensions to gene set enrichment.
Bioinformatics 2007, 23(3):306313. PubMed Abstract  Publisher Full Text

Kim S, Volsky D: PAGE: parametric analysis of gene set enrichment.
BMC bioinformatics 2005, 6:144. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Goeman J, Geer S, Kort F, Houwelingen H: A global test for groups of genes : testing association with a clinical outcome.
Bioinformatics 2004, 20(1):9399. PubMed Abstract  Publisher Full Text

Chai HS, Sicotte H, Bailey K, Turner S, Asmann Y, Kocher J: GLOSSI: a method to assess the association of genetic lociset with complex diseases.
BMC Bioinformatics 2009, 10:102. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Wang K, Li M, Bucan M: Pathwaybased approaches for analysis of genomewide association studies.
The American Journal of Human Genetics 2007, 81(6):12781283. Publisher Full Text

GenGen Package [http://www.openbioinformatics.org/gengen/gengen_ download.html] webcite

Chasman D: On the utility of gene set methods in genomewide association studies of quantitative traits.
Genetic Epidemiology 2008, 32:658668. PubMed Abstract  Publisher Full Text

Chen L, Zhang L, Zhao Y, Xu L, Shang Y, Wang Q, Li W, Wang H, Li X: Prioritizing risk pathways: a novel association approach to searching for disease pathways fusing SNPs and pathways.
Bioinformatics 2009, 25(2):237242. PubMed Abstract  Publisher Full Text

Yu K, Li Q, Bergen A, Pfeiffer R, Rosenberg P, Caporaso N, Kraft P, Chatterjee N: Pathway analysis by adaptive combination of Pvalues.
Genetic Epidemiology 2009, 33(8):700709. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nam D, Kim J, Kim SY, Kim S: GSASNP: a general approach for gene set analysis of polymorphisms.
Nucleic Acids Res 2010, 38:W749W754. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Akaike H: A new look at the statistical identification model.
IEEE Transactions on Automatic Control 1974, 19:716723. Publisher Full Text

Levinson DS, Holmans P: The effect of linkage disequilibrium on linkage analysis of incomplete pedigrees.
BMC Genet 2005, 6(Suppl 1):S6. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

MsigDB Database [http://www.broadinstitute.org/gsea/msigdb/index.jsp] webcite

Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban HJ, Yoon D, Lee MH, Kim DJ, Park M, Cha SH, et al.: A largescale genomewide association study of Asian populations uncovers genetic factors influencing eight quantitative trait.
Nature Genetics 2009, 41(5):527534. PubMed Abstract  Publisher Full Text

Welch BL: The generalisation of student's problems when several different population variances are involved.
Biometrika 1947, 34:2835. PubMed Abstract

Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, Reich D: Principal components analysis corrects for stratification in genomewide association studies.
Nat Genet 2006, 38:904909. PubMed Abstract  Publisher Full Text

Storey JD: Direct approach to false discovery rates.
Journal of the Royal Statistical Sciety: Series B (Statistical Methodology) 2002, 64(3):479498. Publisher Full Text

Storey JD: The positive false discovery rates: a Bayesian interpretation and the qvalue.
Annals of Statistics 2003, 31(6):20132035. Publisher Full Text

Esposito G, Perrino C, Schiattarella GG, Belardo L, di Pietro E, Franzone A, Capretti G, Gargiulo G, et al.: Induction of mitogenactivated protein kinases is proportional to the amount of pressure overload.
Hypertension 2010, 55:137143. PubMed Abstract  Publisher Full Text

The Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature 2007, 447:661678. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Barrett TB, Hauger RL, Kennedy JL, Sadovnick AD, Remick RA, Keck PE, McElroy SL, Alexander M, Shaw SH, Kelrose JR: Evidence that a single nucleotide polymorphism in the promoter of the G protein recepter kinase 3 gene is associated with bipolar disorder.
Mol Psychiatry 2003, 8:546557. PubMed Abstract  Publisher Full Text

Hurd YL: Subjects with major depression or bipolar disporder show reduction of prodynorphin mRNA expression in discrete nuclei of the amygdaloid complex.
Mol Psychiatry 2002, 7:7581. PubMed Abstract  Publisher Full Text

Perez DI, Gil C, Martinez A: Protein kinases CK1 and CK2 as new targets for neurodegenerative diseases.
Med Res Rev 2011, 31:924954. PubMed Abstract  Publisher Full Text