Istituto di Studi sui Sistemi Intelligenti per l'Automazione - CNR, Via Amendola 122/D-I, 70126 Bari, Italy

Ospedale "Casa Sollievo della Sofferenza" IRCCS, Laboratorio di Gastroenterologia, Foggia, Italy

Departments of Statistical Science, Computer Science, Mathematics, Institute for Genome Sciences & Policy, Duke University, Durham, NC, USA

Abstract

Background

The typical objective of genome-wide association (GWA) studies is to identify single-nucleotide polymorphisms (SNPs) and corresponding genes with the strongest evidence of association (the 'most-significant SNPs/genes' approach). Borrowing ideas from microarray data analysis, we propose a new method, named RS-SNP, for detecting sets of genes enriched in SNPs moderately associated with the phenotype. RS-SNP assesses whether the number of significant SNPs, with p-value

Results

We applied RS-SNP to the Crohn's disease (CD) data set collected by the Wellcome Trust Case Control Consortium (WTCCC) and compared the results with GENGEN, an approach recently proposed in the literature. The enrichment analysis using RS-SNP and the pathways in the MSigDB C2 CP collection highlighted 86 pathways rich in SNPs weakly associated with CD. Of these, 47 were also found significant by GENGEN. Similar results were obtained using the MSigDB C5 pathway collection. Many of the pathways found to be enriched by RS-SNP have a well-known connection to CD, and often to inflammatory diseases more generally.

Conclusions

The proposed method is a valuable alternative to other techniques for enrichment analysis of SNP sets. It is well founded from a theoretical and statistical perspective. Moreover, the experimental comparison with GENGEN highlights that it is more robust with respect to false positive findings.

Background

The objective of genome-wide association studies (GWAS) is to identify genetic variants, a subset of single nucleotide polymorphisms (SNPs), associated with the onset and progression of complex disease phenotypes at a genome-wide scale

A new trend has recently emerged in genetics and computational biology, in which groups of genes are analyzed simultaneously for association with a phenotype or disease

The same principle has been recently applied in GWAS for assessing association of sets of SNPs and phenotypes

In this paper we describe a new methodology that assesses the association of gene sets with a trait by simultaneously including strong association signals and SNPs moderately associated with the phenotype. The approach belongs to the general class of Random Set methods

Implementation

Defining a SNP set

Before introducing a detailed description of the method used to perform SNP set analysis, it is important to clarify how a SNP set can be defined.

The first step in defining a SNP set is mapping SNPs to genes. SNPs may fall within coding regions of genes, non-coding regions of genes, or in the inter-genic regions between genes. Each SNP _{i }
_{j}

The second step is mapping genes to pathways. The pathways are pre-defined lists of genes based on a

Random set methods

Random Set (RS) scoring methods were primarily introduced by Efron and Tibshirani

The main idea behind RS methods is that any method for assessing gene sets should compare a given gene set score not only to scores obtained from permutations of the sample labels, but also to scores obtained from sets formed by random selections of genes.

In fact, any approach to gene set analysis begins with the computation of some enrichment score

To clarify the RS approach, let us consider a simplified statement of the gene set problem, proposed by Efron and Tibshirani but adapted here to the SNP data framework.

Let **X **indicate an _{1 }columns of **X **representing healthy control samples and the remaining _{2 }are case samples, _{1 }+ _{2 }= _{i}

• **Permutation Model**. Let **X**
_{S }
**X **corresponding to

• **Randomization Model**. The null hypothesis **X **are drawn at random, giving randomized values

The randomization of the markers and the permutation of the labels can be combined into a method called "Restandardization". Restandardization can be thought of as a method for correcting the permutation values of ES to take into account the overall null distribution of ES in the randomization model. The restandardized enrichment score (RES) used is defined as:

where (^{†}, ^{†}) are the mean and standard deviation of ^{† }and (
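The two null models described in this section can be illustrated with a small sketch. It is not the authors' implementation: the per-marker statistic (`marker_stats`, a difference of mean genotypes) and the set score (`set_score`, a plain mean) are hypothetical stand-ins chosen only to keep the example short. The point is the contrast between drawing random marker sets (rows) and shuffling sample labels (columns).

```python
import random
import statistics


def marker_stats(X, labels):
    """Hypothetical per-marker statistic: mean genotype in cases minus controls.

    X is a list of rows (one row of genotypes per marker);
    labels holds 0 for controls and 1 for cases, one entry per sample.
    """
    stats = []
    for row in X:
        cases = [g for g, y in zip(row, labels) if y == 1]
        ctrls = [g for g, y in zip(row, labels) if y == 0]
        stats.append(statistics.fmean(cases) - statistics.fmean(ctrls))
    return stats


def set_score(z, members):
    """Hypothetical enrichment score: mean statistic over the set members."""
    return statistics.fmean(z[i] for i in members)


def randomization_null(z, set_size, n_draws, rng):
    """Row randomization: scores of random marker sets of the same size."""
    idx = range(len(z))
    return [set_score(z, rng.sample(idx, set_size)) for _ in range(n_draws)]


def permutation_null(X, labels, members, n_perm, rng):
    """Column permutation: scores of the same set after shuffling the labels."""
    scores = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        scores.append(set_score(marker_stats(X, shuffled), members))
    return scores
```

The randomization null keeps the observed statistics fixed and varies the set, while the permutation null keeps the set fixed and recomputes the statistics; restandardization combines the location and scale information of the two.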

Random Set method for SNP data: RS-SNP

RS-SNP is designed for genome-wide SNP data with binary categorical phenotypes, for example cases and healthy controls.

The first step in the method is computing a correlation or association statistic _{i }
_{i}
$\chi^{2}$ test (or Fisher's exact test) on genotype entries to compute association. The multiplicative risk model uses a $\chi^{2}$ test or Fisher's exact test on allelic entries to compute association. The additive risk model uses a Cochran-Armitage test for trend
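As an illustration of the additive risk model, the Cochran-Armitage trend test can be sketched in a few lines. This is a minimal textbook version, not the code shipped with RS-SNP; the function name, the genotype weights (0, 1, 2) and the input layout (genotype counts for cases and for controls) are assumptions made for the example.

```python
import math


def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Trend test on a 2x3 genotype table; returns (chi2, p_value), 1 d.f.

    cases and controls are genotype counts ordered as (AA, Aa, aa).
    """
    n_cols = [r + s for r, s in zip(cases, controls)]
    R, S = sum(cases), sum(controls)
    N = R + S
    sum_tr = sum(t * r for t, r in zip(weights, cases))
    sum_tn = sum(t * n for t, n in zip(weights, n_cols))
    sum_t2n = sum(t * t * n for t, n in zip(weights, n_cols))
    numerator = N * (N * sum_tr - R * sum_tn) ** 2
    denominator = R * S * (N * sum_t2n - sum_tn ** 2)
    if denominator == 0:
        return 0.0, 1.0  # monomorphic marker or a single phenotype class
    chi2 = numerator / denominator
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

With identical genotype proportions in cases and controls the statistic is zero; the further the risk allele count shifts between the two groups, the larger the statistic becomes.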

After computing the single SNP associations, RS-SNP computes the enrichment of these associations in a predefined gene set _{i }

•

•

•

•

RS-SNP assesses whether the number

(1) Permute the labels of the samples Π times. For each permutation

(i) Compute the number of significant SNPs

(ii) Compute the number of significant SNPs belonging to

(iii) Compute the mean

(iv) From the above

(2) Compute the p-value

where
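The overall shape of such a procedure can be sketched as follows. Because the formulas of this section are not fully legible here, the sketch rests on explicit assumptions: the enrichment score is taken to be the count of significant SNPs in the set, centred and scaled by the hypergeometric mean and standard deviation for a random set of the same size; the threshold `alpha`, the callable `assoc_pvalue` and the add-one form of the empirical p-value are all hypothetical choices, not the authors' definitions.

```python
import math
import random


def standardized_count(pvals, members, alpha):
    """Significant SNPs in the set, standardized against a random set.

    Mean and variance are those of a hypergeometric distribution with
    N markers, K significant markers and a set of m markers.
    """
    N, m = len(pvals), len(members)
    K = sum(p < alpha for p in pvals)            # significant SNPs overall
    k = sum(pvals[i] < alpha for i in members)   # significant SNPs in the set
    mean = m * K / N
    var = m * (K / N) * (1 - K / N) * (N - m) / (N - 1)
    return (k - mean) / math.sqrt(var) if var > 0 else 0.0


def rs_snp_pvalue(genotypes, labels, members, assoc_pvalue,
                  alpha=0.05, n_perm=1000, seed=0):
    """Empirical p-value of the set score under label permutation.

    assoc_pvalue(genotype_row, labels) is a user-supplied single-SNP test.
    """
    rng = random.Random(seed)

    def score(lab):
        pvals = [assoc_pvalue(g, lab) for g in genotypes]
        return standardized_count(pvals, members, alpha)

    observed = score(labels)
    exceed = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)           # permute the phenotypic status
        if score(perm) >= observed:
            exceed += 1
    # Add-one estimate (an assumption of this sketch) keeps the p-value above zero.
    return (exceed + 1) / (n_perm + 1)
```

Only the label permutations are simulated; the comparison with random sets of the same size enters through the closed-form mean and variance.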

Since several gene sets are considered in the analysis, the false-discovery rate (FDR) and the family-wise error rate (FWER) are computed as proposed by Wang et al.

FDR, i.e. the expected fraction of false-positive findings, is calculated as:

where
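For orientation, a permutation-based FDR and FWER of the kind used in gene set analysis can be sketched as below. The sketch assumes the GSEA-style definitions: the FDR is the fraction of permuted scores at or above a threshold divided by the fraction of observed scores at or above it, and the FWER is the fraction of permutations whose best score reaches the threshold. The function name and the input layout are hypothetical, and the formulas of this paper may differ in detail.

```python
def fdr_fwer(observed, permuted, threshold):
    """Permutation-based FDR and FWER for sets scoring at least `threshold`.

    observed: one normalized score per gene set.
    permuted: one list per permutation, each with one score per gene set.
    """
    n_sets = len(observed)
    n_perm = len(permuted)
    null_hits = sum(s >= threshold for perm in permuted for s in perm)
    null_frac = null_hits / (n_perm * n_sets)
    obs_frac = sum(s >= threshold for s in observed) / n_sets
    fdr = min(1.0, null_frac / obs_frac) if obs_frac > 0 else 1.0
    fwer = sum(max(perm) >= threshold for perm in permuted) / n_perm
    return fdr, fwer
```

The FDR tolerates some false positives among the reported sets, whereas the FWER bounds the chance of reporting even one, so the FWER is always the more conservative of the two.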

Results and Discussion

Experimental data set

WTCCC data set

The data set provided by WTCCC is composed of 2005 Crohn's Disease (CD) patients and 3004 healthy controls (HC). The control individuals came from two sources: 1504 individuals from the 1958 British Birth Cohort (58C) and 1500 individuals selected from blood donors recruited as part of the WTCCC project (UK Blood Services (UKBS) controls). All 5009 samples were genotyped with the GeneChip 500K Mapping Array set (Affymetrix chip), which comprises 500,568 SNPs. The quality control analysis was carried out following the details specified by WTCCC

• SNPs with Hardy-Weinberg exact p-value $< 5.7 \times 10^{-7}$ in the combined set of 2938 controls;

• SNPs with p-value $< 5.7 \times 10^{-7}$ for either a one- or two-degree-of-freedom test of association between the two control groups;

• SNPs with a

In total,

SNP set construction

Two different collections of gene sets were used, both of which can be downloaded from the MSigDB website

• MSigDB C2 CP collection, composed of pathways collected from various sources such as online databases, biomedical literature in PubMed, and knowledge of domain experts. In particular, the canonical pathways (CP) collection consists of 639 gene sets;

• MSigDB C5 collection, composed of 1454 gene sets derived from Gene Ontology (GO). This collection is composed of 825 GO biological processes, 233 GO cellular components and 396 GO molecular functions. We have considered only those GO terms associated with a specific reference that describes the work or analysis upon which the association between a specific GO term and gene product is based. Each annotation includes an evidence code to indicate how the annotation to a particular term is supported

The mapping of SNPs to genes was carried out using the Affymetrix annotation files Mapping250K Nsp Annotations and Mapping250K Sty Annotations (CSV format, version 26). In this study, a SNP was assigned to a given gene if it lay within 5 kb of it.
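The two mapping steps (SNPs to genes with a 5 kb window, then genes to pathways) can be sketched as follows. The data layout is hypothetical: SNPs are given as `id -> (chromosome, position)`, genes as `name -> (chromosome, start, end)`, and pathways as `name -> list of gene names`. Parsing of the Affymetrix and MSigDB files is deliberately left out.

```python
def snps_in_gene(snps, gene, flank=5000):
    """Ids of the SNPs lying within `flank` bp of the gene on its chromosome."""
    chrom, start, end = gene
    return {
        sid
        for sid, (c, pos) in snps.items()
        if c == chrom and start - flank <= pos <= end + flank
    }


def build_snp_sets(snps, genes, pathways, flank=5000):
    """Map each pathway to the union of the SNPs assigned to its genes.

    Genes listed in a pathway but absent from `genes` contribute no SNPs.
    """
    gene_snps = {name: snps_in_gene(snps, g, flank) for name, g in genes.items()}
    return {
        pw: set().union(*(gene_snps.get(g, set()) for g in members))
        for pw, members in pathways.items()
    }
```

A SNP that falls within the window of several genes is assigned to all of them, and it is counted once per pathway because the pathway-level SNP set is a union.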

Experimental results of RS-SNP and GENGEN

Results on MSigDB C2 CP collection

The association of each SNP to CD was assessed by using the Cochran-Armitage trend test with 1 degree of freedom

Statistical significance and adjustment for multiple hypothesis testing were determined by a permutation-based procedure with Π = 10,000 random permutations of the phenotypic status of the subjects. The FDR and FWER were also computed.

The enrichment analysis highlighted 86 pathways (p-value

Detailed tables, concerning the list of significant pathways in MSigDB C2 collection obtained by RS-SNP and GENGEN methods, are reported in the additional file

**Experimental results of RS-SNP and GENGEN on MSigDB C2 collection**. Tables reporting the experimental results obtained by the proposed method, RS-SNP, and by GENGEN on the MSigDB C2 pathway collection.


Results on MSigDB C5 collection

The association of each SNP with CD was computed using the same methodology as above. Statistical significance and adjustment for multiple hypothesis testing were also estimated using the same procedure as above, with Π = 10,000 random permutations of the phenotypic status of the subjects.

The enrichment analysis performed by RS-SNP on the MSigDB C5 collection highlighted 196 pathways (p-value

Detailed tables, concerning the list of significant pathways in MSigDB C5 collection obtained by RS-SNP and GENGEN methods, are reported in the additional file

**Experimental results of RS-SNP and GENGEN on MSigDB C5 collection**. Tables reporting the experimental results obtained by the proposed method, RS-SNP, and by GENGEN on the MSigDB C5 pathway collection.


Computational complexity evaluation

To evaluate and compare the computational cost of RS-SNP and GENGEN, we used a computer with two quad-core 2.67 GHz processors and 24 GB of RAM, running Linux. The first step, common to both algorithms, was to assess the association between each SNP and the phenotype. Computing the additive trend test statistics on the whole set of markers available in the WTCCC data required 18 s for the actual phenotypic status of the samples and 50 min for the random permutations of their phenotypic status. The second step was to assess the statistical significance of the enrichment score, under both the null and alternative hypotheses, for each of the 639 gene sets of the C2 CP collection. This step required 29 min for RS-SNP and 50 min for GENGEN. These running times indicate that the computational cost of the two approaches is comparable.

Discussion

We conclude with a discussion of the biological and statistical aspects of the RS-SNP approach. The FDR seems the most relevant summary statistic in this type of analysis, as the number of true positives is expected to be a small fraction of the total number of hypotheses tested. More sophisticated scores can be used to measure enrichment instead of the simple indicator function. However, an advantage of the scoring we propose is that it assigns equal weight to markers strongly associated with CD and to markers with moderate association: markers with p-values of $10^{-10}$ and $10^{-3}$ are treated equally. This property ensures that the enrichment of a gene set is due to the simultaneous presence of many associated markers rather than to a few strongly associated ones. The methodology also corrects for gene set size automatically.
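The effect of the indicator scoring can be seen in a toy comparison. The threshold `alpha = 0.05` and the weighted alternative (a sum of -log10 p-values) are assumptions introduced only for contrast; neither is taken from the method itself.

```python
import math


def indicator_score(pvals, alpha=0.05):
    """Number of markers below the threshold; every hit counts once."""
    return sum(p < alpha for p in pvals)


def log_score(pvals):
    """A weighted alternative, shown only for contrast: sum of -log10(p)."""
    return sum(-math.log10(p) for p in pvals)


# Two toy gene sets of four markers each.
one_strong = [1e-10, 0.5, 0.5, 0.5]      # a single very strong signal
many_moderate = [1e-3, 1e-3, 1e-3, 0.5]  # several moderate signals

print(indicator_score(one_strong), indicator_score(many_moderate))
print(log_score(one_strong) > log_score(many_moderate))
```

The indicator score ranks the set with several moderate signals higher, whereas the weighted score is dominated by the single strong marker.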

The linkage disequilibrium (LD) structure is preserved by the proposed method and does not alter the statistical significance of the identified pathways. This is because the method uses random permutations of the phenotypic status of the subjects to assess the significance of the enrichment score. The column permutation procedure does not modify the genotypic profile of the subjects, because it only reassigns phenotypic states to subjects at random. The row permutation procedure has the objective of normalizing the enrichment score. This is done by comparing the actual number of markers in the gene set associated with the phenotype with the number expected by chance. The LD structure of a given gene set therefore remains the same under both the null and the alternative hypotheses. Finally, note that the row permutations are only implicitly performed in our approach, because the number of markers belonging to the gene set and associated with the trait has a hypergeometric distribution. For this reason, the computational complexity of RS-SNP is proportional only to the number of column permutations required, which is equal to the inverse of the minimum observable p-value.
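The remark that row permutations need not be performed explicitly can be checked numerically. In the sketch below, the closed-form hypergeometric mean and variance are compared with estimates obtained by actually drawing random marker sets; the function names and the toy sizes in the usage note are illustrative assumptions.

```python
import random


def hypergeom_moments(N, K, m):
    """Mean and variance of the number of significant markers in a random
    set of m markers, when K of the N markers are significant."""
    mean = m * K / N
    var = m * (K / N) * (1 - K / N) * (N - m) / (N - 1)
    return mean, var


def simulated_moments(N, K, m, n_draws=20000, seed=0):
    """The same quantities, estimated by explicitly drawing random sets."""
    rng = random.Random(seed)
    significant = set(range(K))  # which markers are the hits is arbitrary
    counts = [
        len(significant.intersection(rng.sample(range(N), m)))
        for _ in range(n_draws)
    ]
    mean = sum(counts) / n_draws
    var = sum((c - mean) ** 2 for c in counts) / (n_draws - 1)
    return mean, var
```

For example, with N = 1000 markers, K = 100 significant markers and a set of m = 50 markers, the closed form gives a mean of 5, and the simulated values approach the analytical ones as the number of draws grows. This is why only the label permutations contribute to the running time.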

From a biological point of view, significant associations were highlighted by RS-SNP analysis between CD and key inflammatory

A comparative study of RS-SNP and GENGEN suggests that gene set methods that use both types of null hypotheses may reduce false positives; GENGEN does not randomize with respect to gene set size. It is worth noting that GENGEN found a greater number of significant pathways, but several of these pathways may be false positives. For example, the HSA04810 REGULATION OF ACTIN CYTOSKELETON pathway was found significant by GENGEN. This is a very large pathway composed of 166 genes and 2650 SNPs. Only 32 SNPs are weakly associated (p-value

Conclusions

A new method for detecting association of SNP sets with a trait has been proposed. The approach, named RS-SNP, assesses whether the number of SNPs associated with the phenotype and belonging to a given SNP set is statistically significant. Strong signals as well as SNPs weakly associated with the trait are taken into account simultaneously when assessing the association of a given SNP set. The proposed method, well founded from a theoretical perspective, is a valuable alternative to other techniques for enrichment analysis of SNP sets. When applied to the CD data set collected by the WTCCC, the method highlighted many relevant pathways that play a key role in CD as well as in other inflammatory diseases.

Availability and requirements

The RS-SNP approach has been implemented in Matlab in the compute_rs.m file (see additional file

**RS-SNP package**. The proposed RS-SNP software is contained in this compressed file, together with: • the help documentation, • example files with the SNP-gene mapping and gene-pathway mapping; • example of input files.


To compute the association of each single SNP with the trait, the compute_association.m program is also included in the RS-SNP package. It allows the user to perform sample and marker quality control and then to test the association under the most suitable genetic model.

Authors' contributions

All the authors conceived the study. AD'A, SM and NA designed the algorithms and conducted the experiments and, together with OP, AL and VA, evaluated and compared the experimental results. All the authors contributed to the drafting of the article.

Acknowledgements

This work was supported by grants from Regione Puglia, Progetto Strategico PS_012 and Progetto Reti di Laboratori Pubblici di Ricerca BISIMANE