Abstract
Background
Gene Set Enrichment Analysis (GSEA) is a computational method for the statistical evaluation of sorted lists of genes or proteins. Originally GSEA was developed for interpreting microarray gene expression data, but it can be applied to any sorted list of genes. Given the gene list and an arbitrary biological category, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulated on top or bottom of the list. Usually, significance scores (pvalues) of GSEA are computed by nonparametric permutation tests, a time consuming procedure that yields only estimates of the pvalues.
Results
We present a novel dynamic programming algorithm for calculating exact significance values of unweighted Gene Set Enrichment Analyses. Our algorithm avoids typical problems of nonparametric permutation tests, as varying findings in different runs caused by the random sampling procedure. Another advantage of the presented dynamic programming algorithm is its runtime and memory efficiency. To test our algorithm, we applied it not only to simulated data sets, but additionally evaluated expression profiles of squamous cell lung cancer tissue and autologous unaffected tissue.
Background
Modern highthroughput methods deliver large sets of genes or proteins that can not be evaluated manually. For example, cDNA microarrays are used to measure the expression of a variety of genes under different conditions, e.g. in normal and cancer tissues. Usually, for each gene the expression quotient is computed and the genes are sorted by their expression quotient. The question of interest is whether overexpressed or underexpressed genes accumulate in certain biological categories, as for example biochemical pathways or Gene Ontology categories. To answer this question different approaches can be applied. First, the socalled "OverRepresentation Analysis" (ORA) that compares a reference set to a test set of genes by using either the hypergeometric test or Fisher's exact test. Second, "Gene Set Enrichment Analysis" (GSEA) evaluates the distribution of genes belonging to a biological category in a given sorted list of genes or proteins by computing running sum statistics.
Performing GSEA for a biological category C and sorted list L of m genes of which l belong to C means that a running sum statistic RS is computed for L. RS statistics evaluate whether the genes of C are accumulated on top or bottom of the sorted list or whether they are randomly distributed. Hereby, the sorted list is processed from top to bottom. Whenever a gene belonging to C is detected, the running sum is increased by a certain number, otherwise it is decreased. The value of interest is the running sum's maximal deviation from zero, denoted as RS_{C}. An example is provided in Figure 1 for a list containing 8 genes of which 4 belong to C. The black graph corresponds to all possible running sum statistics. The red pathway represents the example where the first three genes and the seventh gene belongs to C. The RS_{C }value of the red path is 12.
Figure 1. Example of possible running sum statistics. The figure shows all possible running sum statistics for an ordered list of 8 genes of which 4 belong to a functional category. The red labeled running sum statistic has a RS_{C }value of 12 and the corresponding pvalue is . The numbers on the xaxis refer to the index and the number of possible running sum values in the current step.
Usually, the pvalue is computed by nonparametric permutation tests, i.e. RS_{C }is calculated for permuted gene lists. Two approaches to compute these lists exist. First, the sorted gene list is randomly permuted. Second, if L is sorted by the median expression quotient of expression values in one group divided by the median expression value in another group, the samples are randomly assigned to the two groups and thereby permuted gene lists are generated. Notably, these methods do not always yield the same results. The permutation procedure is repeated t times and the running sum statistics together with the corresponding maximal deviations from zero, denoted as RS_{i}, i ∈ {1,...,t}, are computed. Usually, the pvalue computes as the fraction of RS_{i }values that are larger or equal than RS_{C}:
Since its development in 2003 [1,2], Gene Set Enrichment Analysis has been enhanced [3] and integrated in a number of analysis tools [4]. Among the most popular programs are "ermineJ" [5] and "GSEAp" [6]. These two tools estimate the significance values by using nonparametric permutation tests. However, such tests entail three disadvantages:
First, repeated runs of the permutation test algorithm may lead to different significance values because of the random sampling.
Second, the permutation test procedure causes problems if the significance values are small. Given a running sum statistic whose true pvalue is 0.00001. If, as usual, 1000 permutation tests are performed, probably none will have a higher maximal deviation as the original running sum statistics. According to the formula given above, the pvalue would compute as , which may be a bad estimation. Since the next iteration may lead to a higher deviation, a more reasonable estimation would be
Since GSEA is often applied to many biological categories, pvalues have to be adjusted for multiple testing by using Bonferroni Hochberg [7], Benjamini [8], or similar adjustment approaches. However, given the above estimation and the known multiple testing methods, the pvalue cannot be adjusted in an appropriate way.
Third, it is difficult to estimate how many permutations should be performed to obtain a sample of reasonable size. Obviously, if m = 20000 and l = 2000, a sample size of 1000 permutations may be by far too small. Remarkably, the number of possible different running sum statistics amounts to . On the example given above, the number of different running sums adds up to approximately 4·10^{2821}, emphasizing that 1000 permutation represent a very small sample. The example shown in Figure 2 indicates that for a sorted list of length 2000 and a biological category including only 14 genes 1000 permutations do not yield reliable significance values. The required large number of permutation tests leads to an unacceptable computational effort, especially if thousands of biological categories are tested. An alternative, parametric method is the so called Parametric Analysis of Gene Set Enrichment "PAGE" method [9] that calculates a zscore for a given gene set and infers the significance value of this zscore against standard normal distribution.
Figure 2. pvalue as function of permutation test number. For each number of permutation tests we performed 100 runs. The figure shows the mean value of these runs together with the respective standard deviation. The dashed lines represent the maximum and minimum of the 100 computed pvalues.
In this study, we address the exact and efficient pvalue computation for unweighted Gene Set Enrichment Analysis. Unweighted means that the number by which the running sum statistic is increased if a gene of C is found and the number by which the running sum statistic is decreased if the gene does not belong to C are constants. In our case, whenever a gene of C is found the running sum is increased by m  l, and otherwise it is decreased by l. The dynamic programming method is similar to the "DRIM" approach ([10]) that computes the optimal partition of a gene set in a target and a background set.
We integrated our dynamic programing algorithm into the gene set analysis tool "GeneTrail" [11] that is freely available at genetrail.bioinf.unisb.de. GeneTrail tests a wide variety of biological categories, among them Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [12], TRANSPATH pathways [13], transcription factors [14], Gene Ontology GO, [15], granzyme B clevage sites [16], and proteinprotein interactions [1720]. GeneTrail relies on the Biological Information System BN++ [21] that provides easy access to a wide variety of biological data.
Results and Discussion
Dynamic programming algorithm
Before presenting our algorithm, we discuss some important features of the running sum statistic. Given the sorted list L of m genes of which l belong to the considered biological category C, we calculate a running sum statistic as follows: whenever we find one of the l genes of the considered category C, we increase the running sum by m  l leading to a total sum of l·(m  l) over all genes in C. Otherwise, we decrease the running sum statistic by l leading to a total sum of (m  l)·(l) over all genes not in C. Therefore, the running sum's final value will always be zero. Moreover, the running sum's maximal possible value is l(m  l), whereas its minimal possible value is l(m  l).
As mentioned in the Introduction, the value of interest is the running sum's maximal deviation from zero, denoted as RS_{C}. The pvalue can be computed as the probability that a random running sum reaches a maximal deviation greater or equal as RS_{C}. We compute this probability via the complement of the event as:
where X is the number of running sum statistics with a maximum deviation of at most RS_{C } 1 and Y is the number of all possible different running sum statistics which can be obviously computed as . To compute X, we count all running sum statistics that have a maximum deviation of at most RS_{C } 1.
We use a matrix M of dimension (2l(m  l) + 1) × (m + 1), where the different rows represent all possible values of the running sum and the columns represent the indices of the sorted list L from 1,..., m and an initialization column with index 0. Let M(j, i) denote the number of running sum statistics with value j in step i whose maximum deviation of zero is less than RS_{C } 1. The entries of M are computed using dynamic programming, starting with the first column. M(0, 0) is set to 1 and all other values are set to 0.
We fill the matrix column by column, where the matrix entry M(j, i) is recursively computed as:
where the constraint
ensures that only the running sum statistics with maximal deviation of smaller than RS_{C }are counted. The total number of running sum statistics with maximum deviation smaller than RS_{C }can be found at matrix entry M(0, m).
Implementation details
At first glance, the presented algorithm seems to be inefficient concerning both, space requirement and runtime, which are of order O(m^{2}l). For example, if m = 20000 genes and a functional category with l = 2000 genes is considered, M would have about 1.44·10^{12}entries.
We have implemented the above described algorithm in C++ using time and space efficient data structures which will be discussed here.
As the recurrence equation implies, filling the ith column of M only requires the values of the i  1th column. Thus, the dynamic programming approach requires only two columns of the matrix reducing the memory requirements to O(ml).
Obviously, the first column M(·, 0) contains only one number unequal to zero, the second column two numbers, and the lth column l values unequal to zero (see Figure 1 and Figure 3). By using two standard STL hash maps instead of the matrix M, the space requirements can be reduced to O(l) and the expected runtime can be reduced to O(ml).
Figure 3. Dynamic Programming Matrix. The figure shows the dynamic programming matrix for the example provided in Figure 1. Matrix entries that are unequal zero are shaded. The yellow matrix entries do not have to be computed due to the extended side constraints, the number of running sum statistics with a smaller deviation of zero (RS_{C }value) than 12 amounts to 54.
Another important feature of the running sum statistics implies that certain parts of the matrix M do not have to be computed. The running time of the algorithm can be further reduced by adding a second constraint
for each column i to the recurrence equation. The right side of the constraint holds because, for column i, the value j of the running sum can be computed as
where a is the number of genes that belong to C up to index i in the ordered list. Since a can be at most l, the following inequality holds
Equivalently, for the left side of constraint (**) and column i the following equation holds:
where b is the number of genes that do not belong to C up to index i in the ordered list. Since b can be at most m  l,
Although the additional constraint does not lead to an asymptotically improved runtime, an increased performance has been measured, especially for small pvalues.
Additionally, the runtime of the presented algorithm can be improved by computing only the first half of M. Due to a certain "symmetry" of the running sum statistics it suffices to compute either the column in the middle or the two columns in the middle to derive the required number of pathways.
Dependence of runtime on pvalue
The recurrence equation of the dynamic programming algorithm shows that in each step only computations for running sum values in the interval of ]  RS_{C}, RS_{C}[ have to be computed. However, the running sum's maximal deviations from zero and the corresponding pvalues are mutually dependent on each other, i.e., the higher the running sum's maximal deviation, the lower the pvalue. Thus, the running time of the presented algorithm depends on the pvalue (see Figure 4A and 4B). The lower the pvalues are the higher is the runtime of the presented algorithm.
Figure 4. Running time as function of the significance value. A: For a sorted list of 1000 genes and a category containing 100 genes the runtime of our algorithm was computed for a set of discrete pvalues (0, 0.003, 0.006,...,0.996, 1). The shown runtime is the median of 100 runs for each pvalue on a standard 2 GHz PC. The maximal runtime was 0.05 seconds spent for computing the complete matrix. A naive permutation test procedure needed about 0.7 seconds for 1000 permutations. B: Running time as function of the significance value for small pvalues.
As described above, our algorithm is applied to evaluate several thousands of biological categories/hypotheses using the gene set analysis toolkit "GeneTrail". In general, findings are considered to be significant, if the pvalue is smaller than 0.05. Most computation time is spend for small pvalues. However, only few of the considered categories are statistically significant, whereas the others will lead to intermediate and larger pvalues. Since our algorithm is especially fast for intermediate and large pvalues, the complete GSEA analysis is highly efficient and most of its running time is spent for the pvalue calculation of the most significant categories.
Comparison to other approaches
To get maximal performance, we implemented our algorithm in C++. Other available GSEA tools have been implemented in Java or are available as "R" scripts. For this reason, a fair comparison of our tool to other nonparametric permutation tests is not possible. We implemented a permutation test procedure in C++ with expected running time of O(number of permutations·m). We applied the algorithm to the example presented in Figure 4 using 1000 permutations. On average the presented dynamic programming approach was more than ten times faster compared to the permutation test procedure. Please note that the two approaches are not directly comparable. The runtime of both methods depends on the length of the gene list l. In addition, the runtime of our algorithm depends on the pvalue whereas the running time of the permutation test approach depends on the number of performed permutations.
Evaluation of lung cancer expression profiles
We tested our algorithm by evaluating freely available expression profiles of lung cancer tissue and autologous control samples. In detail, we downloaded expression profiles of 5 squamous cell lung cancer patients [22] from the "Gene Expression Omnibus" [23]. Together with the cancer tissue, unaffected tissue of autologous patients was extracted at surgery and 5 control expression profiles were generated. The 10 expression profiles were measured using the Affymetrix HGU133A including more than 22000 transcripts and 13000 genes. In a preprocessing step, the profiles were median normalized. Thereafter, for each transcript a paired ttests was performed in order to detect differentially expressed genes. Paired ttest is applicable here, since the control samples were taken from the normal lung tissues of autologous patients. To generate a sorted list, ttest statistic values were sorted in increasing order such that the top of the resulting list contains the most significantly upregulated genes in lungcancer and the bottom of the list the most significantly downregulated genes.
Finally, the list was analyzed by GeneTrail. A detailed evaluation of all findings would be beyond the scope of this paper. Therefor, we provide a summary of our findings in Table 1 and the complete list of results in the supplemental material. Some significant findings pointing out the advantages of the presented dynamic programming algorithm are given below. Although the evaluation of the data set revealed a variety of significant categories (see Table 1), the complete analysis needed only about one hour. Of 6428 categories, 1744 were statistically significant at an αlevel of 0.05 without adjustment, and 711 with BenjaminiHochberg adjustment. About 300 categories would achieve a pvalue of zero by application of 1000 permutation tests. However, these categories differ extremely in their significance.
Table 1. Evaluation on lung cancer
We detected many significantly downregulated KEGGpathways. Among them the Cell Adhesion Molecules (pvalue of 0.00024). The Cell Cycle is the most significantly upregulated pathway (pvalue of 0.0011). It is very likely that both pathways would achieve a pvalue of zero by permutation tests, however, they are not equally significant as demonstrated above. The upregulated rRNAbinding achieved a pvalue of 0.0488. This category represents an example where permutation tests might define a pathway as significant in one run and as not significant in another run.
Conclusion
We presented a novel dynamic programming algorithm that enables the efficient computation of exact significance values of unweighted "Gene Set Enrichment Analysis" and thus avoids typical problems of nonparametric permutation tests. Additionally, we showed that the runtime of the presented algorithm decreases as the pvalues increase, i.e. our algorithm spends most time for computing small pvalues of significant categories.
We integrated our algorithm in the gene set analysis tool "GeneTrail" that allows for performing a wide variety of statistical analyses efficiently. Using GeneTrail, we evaluated the differential expression of genes in squamous cell lung cancer expression profiles, demonstrating the usefulness of the presented dynamic programming algorithm.
Methods
GMP library for arbitrary numbers
The number of possible running sum statistics increases exponentially, i.e. . On the example given above, a microarray containing m = 20000 genes and a category with l = 2000 genes, the number of different running sums adds up to approximately 4·10^{2821}. In the worst case, the matrix entry M(0, m) amounts to 4·10^{2821}, if all genes of C are either top or bottom ranked. This example shows that the approach must be able to handle very large numbers. Hence, we use the "GNU Multiple Precision Arithmetic Library" (GMP), a numerically stable and fast library that can compute arbitrary large natural numbers and is freely available.
Authors' contributions
AK developed and implemented the dynamic programming algorithm and wrote the manuscript. CB integrated the algorithm in GeneTrail and supported the implementation of the algorithm. HPL is the senior author of the study.
Acknowledgements
This work was supported by "Deutsche Krebshilfe", grant 107342 and by the "Deutsche Forschungsgemeinschaft", grant BIZ 4/1(1,...,4).
References

Mootha V, Lindgren C, Eriksson K, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly M, Patterson N, Mesirov J, Golub T, Tamayo P, Spiegelman B, Lander E, Hirschhorn J, Altshuler D, Groop L: PGC1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes.
Nat Genet 2003, 34:26773. PubMed Abstract  Publisher Full Text

Lamb J, Ramaswamy S, Ford H, Contreras B, Martinez R, Kittrell F, Zahnow C, N P, TR G, Ewen M: A Mechanism of Cyclin D1 Action Encoded in the Patterns of Gene Expression in Human Cancer.
Cell 2003, 114:32332. PubMed Abstract  Publisher Full Text

Jiang Z, Gentleman R: Extensions to gene set enrichment.
Bioinformatics 2007, 23:30613. PubMed Abstract  Publisher Full Text

Rubin E: Circumventing the cutoff for enrichment analysis.
Brief Bioinform 2006, 7:2023. PubMed Abstract  Publisher Full Text

Lee H, Braynen W, Keshav K, Pavlidis P: ErmineJ: tool for functional analysis of gene expression data sets.
BMC Bioinformatics 2005, 9:269. BioMed Central Full Text

Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles.
Proc Natl Acad Sci USA 2005, 102:1554550. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hochberg Y: A sharper Bonferroni procedure for multiple tests of significance.
Biometrika 1988, 75:8002. Publisher Full Text

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.

Kim S, Volsky D: PAGE: parametric analysis of gene set enrichment.
BMC Bioinformatics 2005, 6:144. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Eden E, Lipson D, Yogev S, Yakhin Z: Discovering Motifs in Ranked Lists of DNA Sequences.
PloS Comput Biol 2007., 3(3) PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Backes C, Keller A, Küntzer J, Kneissl B, Comtesse N, Elnakadi Y, Müeller R, Meese E, Lenhof H: GeneTrail – Advanced Gene Set Enrichment Analysis.
Nucleic Acids Research 2007, in press. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kanehisa M, Goto S, Hattori M, AokiKinoshita K, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG.
Nucleic Acids Res 2006, 34:D354D357. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Krull M, Pistor S, Voss N, Kel A, Reuter I, Kronenberg D, Michael H, Schwarzer K, Potapov A, Choi C, KelMargoulis O, Wingender E: TRANSPATH: An Information Resource for Storing and Visualizing Signaling Pathways and their Pathological Aberrations.
Nucleic Acids Res 2006, 34:D546D551. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Matys V, KelMargoulis O, Fricke E, Liebich I, Land S, BarreDirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, LewickiPotapov B, Saxel H, Kel A, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.
Nucleic Acids Res 2006, 34:D108D110. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Consortium TGO: Gene Ontology: tool for the unification of biology.
Nature Genet 2000, 25:259. PubMed Abstract  Publisher Full Text

Backes C, Küntzer J, Lenhof H, Comtesse N, Meese E: GraBCas: a bioinformatics tool for scorebased prediction of Caspase and Granzyme Bcleavage sites in protein sequences.
Nucleic Acids Research 2005, 33:W208W213. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hermjakob H, MontecchiPalazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct an open source molecular interaction database.
Nucleic Acids Res 2004, 32:D452D455. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Peri S: Development of human protein reference database as an initial platform for approaching systems biology in humans.
Genome Research 2003, 13:236371. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Salwinski L, Miller C, Smith A, Pettit F, Bowie J, Eisenberg D: The database of interacting proteins: 2004 update.
Nucleic Acids Res 2004, 32:D449D451. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zanzoni A, MontecchiPalazzi L, Quondam M, Ausiello G, HelmerCitterich M, Cesareni G: MINT: a Molecular INTeraction database.
FEBS Letters 2002, 513:13540. PubMed Abstract  Publisher Full Text

Küntzer J, Blum T, Gerasch A, Backes C, Hildebrandt A, Kaufmann M, Kohlbacher O, Lenhof H: BN++ – A Biological Information System.

Wachi S, Yoneda K, Wu R: Interactometranscriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues.
Bioinformatics 2005, 21(23):42058. PubMed Abstract  Publisher Full Text

Barrett T, Troup D, Wilhite S, Ledoux P, Rudnev D, Evangelista C, Kim I, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profilesdatabase and tools update.
NAR 2006, 35:D760D765. PubMed Abstract  Publisher Full Text  PubMed Central Full Text