Abstract
Background
Researchers using RNA expression microarrays in experimental designs with more than two treatment groups often identify statistically significant genes with ANOVA approaches. However, the ANOVA test does not discriminate which of the multiple treatment groups differ from one another. Thus, post hoc tests, such as linear contrasts, template correlations, and pairwise comparisons, are used. Linear contrasts and template correlations work extremely well, especially when the researcher has a priori information pointing to a particular pattern/template among the different treatment groups. Further, all pairwise comparisons can be used to identify particular, treatment group-dependent patterns of gene expression. However, these approaches are biased by the researcher's assumptions, and some treatment-based patterns may fail to be detected using these approaches. Finally, different patterns may have different probabilities of occurring by chance, importantly influencing researchers' conclusions about a pattern and its constituent genes.
Results
We developed a four-step, post hoc pattern matching (PPM) algorithm to automate single-channel gene expression pattern identification/significance. First, 1-way Analysis of Variance (ANOVA), coupled with post hoc 'all pairwise' comparisons, is calculated for all genes. Second, for each ANOVA-significant gene, all pairwise contrast results are encoded to create unique pattern ID numbers. The number of genes found in each pattern in the data is identified as that pattern's 'actual' frequency. Third, using Monte Carlo simulations, those patterns' frequencies are estimated in random data ('random' gene pattern frequency). Fourth, a Z-score for overrepresentation of the pattern is calculated ('actual' against 'random' gene pattern frequencies). We wrote a Visual Basic program (StatiGen) that automates the PPM procedure, constructs an Excel workbook with standardized graphs of overrepresented patterns, and lists the genes comprising each pattern. The Visual Basic code, installation files for StatiGen, and sample data are available as supplementary material.
Conclusion
The PPM procedure is designed to augment current microarray analysis procedures by allowing researchers to incorporate all of the information from post hoc tests to establish unique, overarching gene expression patterns in which there is no overlap in gene membership. In our hands, PPM works well for studies using from three to six treatment groups in which the researcher is interested in treatment-related patterns of gene expression. Hardware/software limitations and the extreme number of theoretical expression patterns limit utility for larger numbers of treatment groups. Applied to a published microarray experiment, the StatiGen program successfully flagged patterns that had been manually assigned in prior work, and further identified other gene expression patterns that may be of interest. Thus, over a moderate range of treatment groups, PPM appears to work well. It allows researchers to assign statistical probabilities to patterns of gene expression that fit a priori expectations/hypotheses, it preserves the data's ability to show the researcher interesting, yet unanticipated, gene expression patterns, and it assigns the majority of ANOVA-significant genes to non-overlapping patterns.
Background
In DNA microarray and other massively parallel measurement technologies, analysis of data from two-treatment group experimental designs can be viewed as yielding three 'patterns': 1) significantly upregulated, 2) significantly downregulated, and 3) no significant change. Because the third 'pattern' (no significant change) is typically ignored, only the two patterns, 'upregulated' and 'downregulated', are reported. As more treatment groups are added [e.g., [1,2]], pattern assignment becomes more complex. Although a number of pattern recognition techniques are available [3,4], researchers often choose ANOVA for an overall statistical test.
Faced with identifying/discriminating different patterns of expression among the significant genes, researchers typically employ 'directed' pattern discovery. A priori information/assumptions are used to construct templates of expected changes in gene expression across treatment groups [5-10], of which time course-based pattern discovery could be considered a specialized subset [11-26]. These approaches are often applied post hoc to an overall test. Directed pattern discovery has the advantage of identifying the subset of ANOVA-significant results that support the investigator's assumptions. However, disadvantages of this approach include missing unexpected but highly prevalent patterns and not estimating the likelihood of the directed pattern's occurrence by chance. 'Down-weighting' is a unique subset of this directed approach in which the contribution of one or more of the treatment groups is deemphasized [e.g., [1,2]], turning the study into a modified two-group comparison by primarily focusing on differences between only the two 'most important' groups. 'Less important' treatment groups may be used to triage/classify changes between the two important groups, but do not carry equal weight in the overall analysis.
Alternatively, some researchers use 'undirected' pattern discovery approaches, in which patterns of expression are discovered using clustering methodologies that do not take a priori expectations into account [27-30]. These undirected techniques have the advantage of handling highly complex data sets [31,32]. However, estimating the number of clusters is not a trivial process and can dramatically affect the outcome of the analysis [e.g., see [33]], and clusters identified in one study may not be directly relevant to clusters found in another study, although recent advances have been made regarding these determinations [34]. While bootstrapping (e.g., 'Leave-One-Out-Validation') and other techniques can help identify stable clusters [35-37], the likelihood that any given cluster, even a stable one, would have that number of genes by chance can be difficult to assess. Finally, these 'undirected' approaches also can identify important sources of variance that are not associated with treatment. This is a powerful tool for identification of abnormally behaving microarray data and even for the isolation of procedure-related contributions to technical variance, and therefore is critical to microarray analysis and normalization steps. However, this same property can make 'undirected' clustering approaches less desirable for the assessment of treatment-based effects.
All of these approaches are valid and have contributed importantly to microarray-based investigations of biological processes, and many array analysis tools have been developed [reviewed in [38]]. Further, new cutting-edge techniques merge directed and undirected approaches to allow for more powerful analyses [39]. Finally, one of the most highly successful applications (at least in terms of popularity among bench researchers) has been the Significance Analysis of Microarrays (SAM) application [40], which combines multiple testing correction with permutation analysis using classical statistical tests. However, to date no work has been published demonstrating a non-clustering-based approach for treatment-associated, statistically validated gene expression pattern identification within multi-group microarray data.
Here, we developed an algorithm using 1-way ANOVA, followed by all pairwise Fisher's Protected Least Significant Difference (PLSD) testing, to categorize ANOVA-significant genes by their expression patterns (as determined by the results of their post hoc pairwise comparisons). The number of genes falling into each expression pattern is compared to the number of genes that fall into that pattern by chance (using a Monte Carlo-based random number simulation). The patterns of expression are Z-scored according to their Monte Carlo-based chance probability estimates. The algorithm was applied to a previously published microarray dataset [2] and discovered patterns reflecting the major findings of that study, as well as a novel pattern with implications for the neurobiology of aging. Further, results from other pattern detection approaches (support tree hierarchical clustering, K-means support with Figures of Merit cluster number estimation, Pavlidis template matching) are compared.
The PPM analysis technique is useful for identifying significant patterns of gene expression within datasets having 3–6 treatment groups that are initially tested by ANOVA. The PPM approach should allow researchers to group significant genes into expression patterns and to estimate probabilities for each of those patterns' occurrence.
Results
Algorithm
Figure 1 depicts the steps involved in the analysis process (StatiGen-specific instructions are included in the software's help file). Steps are discussed in terms of Affymetrix-derived expression array data, although the algorithm is applicable to any data with similar dimensions.
Figure 1. StatiGen algorithm. Microarray data (e.g., from Affymetrix pivot tables) is parsed into signal intensity and presence/absence calls, and annotation information is appended. Data is then filtered according to user input (i.e., absent and unannotated probe sets are removed). Using the filtered data, StatiGen constructs a Monte Carlo simulation of the data. Both the filtered and Monte Carlo datasets are tested by 1-ANOVA and post hoc all pairwise Fisher's PLSD tests. Results from all pairwise comparisons are used to encode pattern IDs (see Methods). Pattern frequency is given by the number of genes identified in the pattern and is statistically compared (Z-test) to that pattern's frequency in a Monte Carlo simulation. Graphic output of significantly overrepresented patterns, along with a list of member genes and annotation information, is included and can be saved as a separate worksheet for further analysis.
Importing
Signal intensity and presence call p-values are often provided by microarray core facilities as either an integrated pivot table or as separate tables. StatiGen accepts either format and creates two tables, one containing signal intensities and one containing presence call p-values. A third table of annotation information (e.g., gene symbols) should also be provided. The first column of each of these files should contain the same unique identifiers (usually probe_set ID) in the same order (although StatiGen provides some limited protection against misaligned data by checking for an equal number of rows in all import sheets, as well as by sorting on the first column).
Filtering
The user defines a presence call p-value cutoff (default = 0.05), and then establishes the number of chips that must achieve at least this level of presence (default = 1/2 n of the smallest treatment group). We routinely filter out probe sets with no gene symbol annotation [1,2,41,42] as a matter of convenience for subsequent functional grouping analysis. Filtering at this level is also possible (although not required) with StatiGen.
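The presence-call filter described above can be sketched in a few lines of Python. This is an illustration only, not StatiGen's Visual Basic code: the function name `filter_probe_sets`, the dictionary layout of `presence_p`, and the choice to count a chip as 'present' when its p-value is at or below the cutoff are all assumptions made for the example.

```python
def filter_probe_sets(presence_p, group_sizes, p_cutoff=0.05, min_chips=None):
    """Return the IDs of probe sets rated 'present' on enough chips.

    presence_p  : dict mapping probe set ID -> list of presence call
                  p-values, one per chip (hypothetical layout).
    group_sizes : number of chips in each treatment group.
    p_cutoff    : presence call p-value cutoff (default 0.05, as in the text).
    min_chips   : chips that must reach the cutoff; defaults to half the
                  n of the smallest treatment group, as in the text.
    """
    if min_chips is None:
        min_chips = min(group_sizes) / 2
    kept = []
    for probe_id, p_values in presence_p.items():
        # Assumption: a chip counts as 'present' when p <= cutoff.
        n_present = sum(1 for p in p_values if p <= p_cutoff)
        if n_present >= min_chips:
            kept.append(probe_id)
    return kept
```

With a smallest group of nine chips, the default threshold is 4.5, so a probe set needs five or more present calls to be retained.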
Monte Carlo simulation
A table of random numbers matching the filtered data table's dimensions is created. The random numbers themselves can be regenerated/tested multiple times. Both filtered and random data are run through the following steps (the random data may be run through these steps thousands of times, depending on the iterations selected by the user).
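As a rough illustration of this step, the sketch below builds random tables with the same dimensions as a filtered data table. It is written in Python rather than the Visual Basic/Excel environment used by StatiGen; `random.Random.random()` stands in for Excel's RAND, and the names `random_table` and `monte_carlo_tables` are hypothetical.

```python
import random


def random_table(n_rows, n_cols, rng):
    """One table of uniform [0, 1) values with the requested dimensions."""
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


def monte_carlo_tables(filtered, iterations=1000, seed=None):
    """Yield `iterations` random tables shaped like the filtered data.

    filtered : list of rows (probe sets), each a list of per-chip values.
    The default of 1000 iterations mirrors the default stated in the text.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(filtered), len(filtered[0])
    for _ in range(iterations):
        yield random_table(n_rows, n_cols, rng)
```

Each yielded table would then be passed through the same gene-level and pattern-level steps as the real data.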
Gene level statistics
Omnibus test
Numerous studies have demonstrated the utility of the Analysis of Variance (ANOVA) approach for microarray studies [43-45]. Here, we apply a basic one-way ANOVA (1-ANOVA, see Methods) approach. The mean squared error within (MSE_{within}) calculated during the ANOVA is used again in post hoc testing.
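For readers who want to see the arithmetic, the following Python sketch computes a one-way ANOVA F statistic and MSE_{within} from sums of squared deviations. It is a textbook formulation, not a transcription of StatiGen; the helper `devsq` is named after the Excel function of the same purpose, and `one_way_anova` is a hypothetical name. Converting F to a p-value requires an F distribution lookup, which is left out here.

```python
def devsq(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def one_way_anova(groups):
    """Return (F, df_between, df_within, MSE_within) for one gene.

    groups : list of lists, one list of expression values per treatment group.
    """
    all_values = [v for g in groups for v in g]
    ss_total = devsq(all_values)
    ss_within = sum(devsq(g) for g in groups)
    ss_between = ss_total - ss_within      # total = between + within
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    mse_within = ss_within / df_within     # reused by the post hoc test
    f_stat = (ss_between / df_between) / mse_within
    return f_stat, df_between, df_within, mse_within
```

For example, the groups `[1, 2, 3]`, `[2, 3, 4]`, and `[5, 6, 7]` have a within sum of squares of 6 and a between sum of squares of 26, giving F = 13 on 2 and 6 degrees of freedom.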
Pairwise comparisons
In the present work, we chose Fisher's Protected Least Significant Difference (PLSD) test. In general, the PLSD test is less conservative than other post hoc all-pairwise tests. Therefore, if a significant ANOVA result is found, then Fisher's PLSD is more likely than some other tests to identify a significant pairwise comparison. The p-value cutoff for the PLSD test defaults to 0.05, although users can alter this.
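A minimal Python sketch of the pairwise step follows, using the standard Fisher's LSD t statistic built on MSE_{within} from the omnibus test. Several details are assumptions made for the example rather than statements about StatiGen: the function name `plsd_calls`, the convention that a higher mean in the later group is called an 'increase', and the use of a caller-supplied critical t value (for the within degrees of freedom and chosen cutoff) in place of a p-value lookup.

```python
from itertools import combinations
from math import sqrt


def plsd_calls(groups, mse_within, t_crit):
    """Call each pairwise comparison +1 (increase), -1 (decrease), or 0.

    groups     : list of lists, one list of values per treatment group.
    mse_within : mean squared error within, from the one-way ANOVA.
    t_crit     : two-tailed critical t for df_within (supplied by the caller).
    Comparisons are ordered (0 v 1), (0 v 2), ..., (k-2 v k-1).
    """
    calls = []
    for i, j in combinations(range(len(groups)), 2):
        gi, gj = groups[i], groups[j]
        diff = sum(gj) / len(gj) - sum(gi) / len(gi)
        se = sqrt(mse_within * (1 / len(gi) + 1 / len(gj)))
        t = diff / se
        if t > t_crit:
            calls.append(1)
        elif t < -t_crit:
            calls.append(-1)
        else:
            calls.append(0)
    return calls
```

Because the test is 'protected', these calls would only be made for genes that already passed the omnibus ANOVA.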
Pattern level statistics
Number of pairwise comparisons
The number of pairwise comparisons is given by the formula 'k choose 2' as:
c = k(k - 1)/2
where c is the number of pairwise comparisons and k is the number of treatment groups. Thus, a study with 3 treatment groups would have 3 pairwise comparisons, one with 4 treatment groups would have 6 comparisons, one with 5 groups would have 10 comparisons, etc. Each comparison generates three potential results (r): 'significant increase', 'significant decrease', and 'not significant'. For each ANOVA-significant gene, the results from all of the pairwise comparisons are encoded into a single 'pattern ID' (see Methods).
Pattern ID
We combine results from all of the pairwise comparisons for each probe set, creating a pattern ID. Pattern IDs are constructed using logic gates that use 'increase', 'no significant change', or 'decrease' results from each pairwise comparison. The first pairwise comparison is assigned 1, 0, or -1; the second is assigned 10, 0, or -10; the third is assigned 100, 0, or -100; and so on. In this way, the sum of each combination of pairwise comparisons for a given probe set creates a pattern ID encoding that pattern's statistically defined shape, allowing researchers to easily group different genes that belong to the same pattern. Further, two patterns of opposite sign and the same absolute value will be mirror reflections of one another, which may have value for assessing opposing actions in single pathways [46].
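The encoding can be illustrated with a short Python sketch. It assumes that each comparison result is coded +1 for a significant increase, 0 for no significant change, and -1 for a significant decrease, and that successive comparisons are weighted by successive powers of ten starting at 1, following the description in this section; the function names are hypothetical and the weights used in the released software may differ.

```python
from collections import defaultdict


def pattern_id(calls):
    """Sum the pairwise calls, weighting comparison n by 10**n (n from 0)."""
    return sum(call * 10 ** position for position, call in enumerate(calls))


def group_by_pattern(gene_calls):
    """Map each pattern ID to the list of genes that share it.

    gene_calls : dict mapping gene ID -> list of +1/0/-1 pairwise calls.
    """
    groups = defaultdict(list)
    for gene, calls in gene_calls.items():
        groups[pattern_id(calls)].append(gene)
    return dict(groups)
```

Under this scheme a gene with calls `[0, 1, 1]` receives ID 110 and a gene with calls `[0, -1, -1]` receives ID -110, which makes the mirror-image relationship between the two patterns explicit. Each gene receives exactly one ID, so pattern membership does not overlap.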
Actual and estimated frequencies
Some patterns are statistically more difficult to generate from random data. For instance, patterns in which all pairwise comparisons are significant have a much lower probability of occurring by chance than any other pattern. Therefore, each pattern found in the actual data is assigned its own probability (Z-score) based on that pattern's frequency within the random Monte Carlo simulation (estimated frequency; see Methods).
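One generic way to express such a comparison is sketched below in Python. The exact statistic used by PPM is defined in the Methods, which are not reproduced here, so this block should be read as an assumption: it scores the actual gene count for a pattern against the mean and sample standard deviation of that pattern's counts across Monte Carlo iterations. The name `pattern_z_score` is hypothetical.

```python
from statistics import mean, stdev


def pattern_z_score(actual_count, random_counts):
    """Z-score of an observed pattern count against Monte Carlo counts.

    actual_count  : number of genes with this pattern in the real data.
    random_counts : number of genes with this pattern in each random
                    iteration (at least two iterations are needed).
    Returns None when the random counts do not vary (for example, the
    pattern never appeared in random data), since no Z-score is defined.
    """
    mu = mean(random_counts)
    sd = stdev(random_counts)
    if sd == 0:
        return None
    return (actual_count - mu) / sd
```

A pattern that appears in the real data but never in the simulation has no usable variance estimate; returning `None` leaves it to the caller to flag such cases.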
Output
Expression levels for each gene are standardized (so that each gene has a mean of 0 and individual measures are expressed in standard deviations), allowing genes of the same pattern but different signal intensities to be averaged and plotted together. Genes are grouped by pattern and patterns are ranked by overrepresentation significance. Graphs of the mean standardized expression levels for all of the genes in each pattern, along with a list of that pattern's genes, are displayed and can be saved to individual worksheets for further analysis (Fig. 4).
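The standardization and averaging can be sketched as follows in Python. The text does not state whether the sample or population standard deviation is used, so the use of the sample standard deviation here is an assumption, as are the function names.

```python
from statistics import mean, stdev


def standardize(values):
    """Express one gene's measures in standard deviations from its mean."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    return [(v - m) / s for v in values]


def pattern_profile(genes):
    """Average the standardized values of several genes, chip by chip.

    genes : list of per-gene lists, all with one value per chip in the
            same chip order.
    """
    standardized = [standardize(g) for g in genes]
    return [mean(column) for column in zip(*standardized)]
```

Two genes with the same shape but very different signal intensities, such as `[1, 2, 3]` and `[10, 20, 30]`, standardize to the same profile and can therefore be averaged and plotted together.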
Limitations
Algorithm
The number of different pattern IDs can be calculated by PID = r^{c}, where PID is the number of different pattern IDs, c is the number of pairwise comparisons, and r is the number of possible results. The PID value rises exponentially as the number of treatment groups increases (Fig. 2). Because of this, we feel this method is not useful for studies with more than 6 treatment groups, where the number of patterns rivals the number of genes on the chip, obviating the tool's usefulness for reducing complexity. Further, this exponential rise depends on the assumption that the comparisons are independent, when they are actually conditional. Therefore, some patterns (e.g., A v B increase, A v C decrease, B v C increase), while predicted by the independent calculation, are not possible in the conditionally dependent data, reducing the number of possible patterns (Fig. 2).
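The growth described above is easy to tabulate. The short Python sketch below computes the number of pairwise comparisons (k choose 2) and the theoretical number of pattern IDs under the independence assumption, with r = 3 possible results per comparison; the function names are hypothetical.

```python
from math import comb


def n_comparisons(k):
    """Number of pairwise comparisons among k treatment groups."""
    return comb(k, 2)


def n_pattern_ids(k, r=3):
    """Theoretical number of pattern IDs, r ** c, assuming independence."""
    return r ** n_comparisons(k)


for k in range(3, 7):
    print(k, n_comparisons(k), n_pattern_ids(k))
```

For three to six treatment groups this gives 3, 6, 10, and 15 comparisons and 27, 729, 59,049, and 14,348,907 theoretical patterns, respectively. These are upper bounds, because patterns that are impossible in conditionally dependent data are still counted.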
Figure 2. Complexity increases with number of treatment groups. The quadratic increase in the number of pairwise comparisons (right Y axis) with the number of treatment groups (X axis) is shown in black. The exponential increase in the number of post hoc patterns (left Y axis) is shown in gray. Note that by the time there are seven treatment groups, we predict nearly a billion different patterns. However, due to the lack of independence among the comparisons, the actual number of possible comparisons generated in 100 iterations of a 10,000 gene, 30 array model data system (dotted gray line) is considerably less.
To address this issue, only the union of patterns found within the actual data and/or Monte Carlo simulations is tested. This avoids testing for 'impossible patterns'. Presently, we have restricted the test to identification of overrepresented, rather than underrepresented, patterns.
Selecting the number of iterations
Exceedingly rare patterns that occur in the real data may not be detected by Monte Carlo. In these cases, the pattern is included as significant, but is flagged. This failure of the Monte Carlo to detect identified patterns is strongly dependent on the number of iterations chosen, the number of treatment groups, and the number of observations within each treatment group. To determine an appropriate number of iterations, we repeat the analysis and observe detected pattern stability. If the pattern detection is stable, then the number of iterations is at least sufficient. If the pattern detection is not stable, then a rule of thumb would be to double the iterations and recheck for stability. The algorithm and software default to one thousand iterations.
Excel
Statistical calculations in Excel have been reported to be inaccurate in some cases [e.g., [47,48]]. Thus, in the present work, ANOVA calculations were broken down into individual calculations of Total, Within (which was also used for the Fisher's LSD calculation), and Between/Residual sum of squared errors using Excel's DEVSQ function. From these results, F statistics were calculated and the FDIST function was used to look up p-values. These values agree with output in SigmaStat (v. 3.01A, Systat). Finally, the Monte Carlo simulation uses Excel's RAND function, generating evenly distributed values between 0 and 1 (15 decimal places). This does not generate a normal distribution (much like the roll of a single die does not); however, the combined results of multiple RAND calculations do closely approximate a Gaussian distribution (Fig. 3).
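The behavior of averaged uniform random numbers can be checked with a small simulation. The Python sketch below uses `random.Random.random()` as a stand-in for Excel's RAND (an assumption; the two generators are not the same) and averages ten draws at a time; the function names are hypothetical.

```python
import random
from math import sqrt


def simulate_rand_means(n_means, n_draws=10, seed=None):
    """Return n_means averages, each over n_draws uniform [0, 1) values."""
    rng = random.Random(seed)  # stand-in for Excel's RAND
    return [sum(rng.random() for _ in range(n_draws)) / n_draws
            for _ in range(n_means)]


def summarize(values):
    """Sample mean and sample standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, sqrt(var)
```

By the central limit theorem, the averages should cluster around 0.5 with a standard deviation close to sqrt(1/(12 × 10)), roughly 0.091, and their histogram should look approximately Gaussian even though single draws are uniform.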
Figure 3. Normality of Excel's random number generator. The average of 10 random numbers generated using Excel's RAND function was recalculated one million times, and the % of observations (Y axis) is plotted against the reported mean (X axis). The generated data (black dots) were fit in SigmaPlot (v. 9.0, Systat Software) using a Gaussian model (gray lines; see R^{2} value in graph).
Data reanalysis
Blalock et al., 2003 (Gene Expression Omnibus ID: GSE 854)
Nine to ten chips per treatment group, and three treatment groups (Young, MidAge, and Aged), were used (see Methods for description). Of the 8799 probe sets, 5865 were rated present (having 5 or more chips with 'present' calls) and 673 were significant by 1-ANOVA (p < 0.05). Of these ANOVA-significant probe sets, 2 probe sets did not have any significant post hoc Fisher's PLSD comparisons; 138 were significant between MidAge and Aged; 353 between Young and MidAge; and 497 between Young and Aged. A Venn diagram (Fig. 5) shows the relative overlap among the three pairwise comparisons. Nearly three quarters of all genes found significant by ANOVA were also significant by the Young vs. Aged comparison. The Young vs. MidAge comparison was the second strongest comparison and MidAge vs. Aged had the fewest significant comparisons (statistically, each of these pairwise comparisons has the same probability of identifying genes).
Figure 4. StatiGen output example showing a significantly overrepresented pattern. 1. The pattern summary box gives details regarding the pattern's significance, rank among significant patterns, and number of genes in the pattern. 2. Below the pattern summary box, there is a drop-down menu allowing users to rapidly switch their view to other significant patterns. 3. The export button allows users to export their list to another Excel worksheet for further analysis. 4. A graph of the average (± SEM, standard error of the mean) of the standardized expression values from all genes in the pattern (Y axis) across the treatment groups in the study (X axis) is displayed. 5. The list of this pattern's member genes is presented. The first column is probe set ID, and the last column is hyperlinked to the National Cancer Institute's database of Affymetrix probe set IDs. The intermediate columns are provided by the user at the 'annotation information' stage of the algorithm (see Fig. 1).
Figure 5. Venn diagram of pairwise comparisons. Among ANOVA significant (p ≤ 0.05) genes from the Blalock et al., 2003 study, all possible pairwise comparisons were applied using Fisher's PLSD. The Venn diagram shows the number of ANOVA significant genes that were significant in at least one pairwise comparison, and notes overlap (direction of change was ignored). Interestingly, although all pairwise comparisons had an equal probability of detecting genes, the Young vs. Aged comparison was clearly the strongest comparison, and MidAged vs. Aged was clearly the weakest.
When pairwise comparisons are considered in concert using StatiGen's post hoc pattern matching algorithm (Fig. 1), interesting patterns emerge. Five of twenty-four patterns are significantly overrepresented (Fig. 6 and Table 1). The list of significant genes contained within each pattern (Additional File 1) was 1) uploaded to the DAVID [49] website and compared with a custom background list containing all probe sets in the study rated present and annotated, and 2) uploaded to Onto-Express [50] and contrasted with the RG-U34A chip as a background. Some selected functional categories that agreed between the two analyses and appeared to represent biological processes of the individual patterns are listed (Fig. 6).
Additional file 1. Complete list of probe sets for the five identified patterns.
Figure 6. Significantly overrepresented patterns. Five significantly overrepresented patterns are plotted, along with highlighted functional categories found to be overrepresented in each pattern (complete lists of each pattern's probe sets in Additional file 3).
Table 1. PPM-Defined Expression Patterns
The most significant pattern (1110) was not the pattern with the most genes, but one among six possible patterns with the least likelihood of occurring by chance (all three pairwise comparisons significant). Many of the genes in this pattern reflect a well-characterized and robust increase in inflammatory markers seen in our and other researchers' microarray-based studies of the aging brain [51-53]. Also note that pattern 100 reflects a weaker, but significant, monotonic rise with aging that appears to contain genes associated with similar functional categories.
The second and third most prevalent patterns are mirror reflections of one another (110 and -110), and highlight genes whose expression levels were significantly different in two comparisons (Young vs. Aged, and Young vs. MidAged), but no different in the third (MidAged vs. Aged). Genes downregulated in Aged relative to Young (pattern -110) are enriched in immediate early genes (e.g., transcription factor activity) and genes associated with intracellular signaling cascades (e.g., lipocalcin). Genes upregulated in Aged relative to Young (pattern 110) included functional categories associated with stress response (e.g., Metabolism of Xenobiotics and Glucuronosyl transferase activity).
Finally, although there were relatively few genes that were significantly changed from MidAge to Aged, a subset of genes (pattern 1100: no significant difference from Young to MidAge, a significant difference from Young to Aged and from MidAge to Aged) was rated as significantly overrepresented by StatiGen and included genes related to calcium binding and antigen presentation.
Comparison to other approaches
The PPM algorithm was developed to assign statistical probabilities to patterns identified post hoc to 'per gene' statistical testing in a multi-treatment group setting, and shares some features with other approaches. Therefore, in this section we compare PPM output to two popular clustering approaches that use resampling techniques to assess stability (Support Trees and K-Means Support), as well as a template matching approach (Paul Pavlidis' Template Matching, PTM; [6]), using TIGR's MeV software [54]. Standardized gene expression data for the probe sets previously identified as present and annotated were imported into MeV. Because the goal of StatiGen is identification of patterns present among the ANOVA-significant (and therefore heavily treatment group-biased) data, these other approaches were also applied to the ANOVA-significant genes.
Support Trees (Figure 7)
Figure 7. Support Trees. Hierarchical clustering using MeV software with 100 bootstrap iterations on samples and experiments was used to generate a dendrogram. For display purposes, the most stable Upper Node (left) and Lower Node (right) are separated. Branches in the dendrogram are color-coded according to reliability (as a percentage of all bootstrap iterations in which the cluster was identified: Black 100%, Gray 90–100%, Blue 80–90%, Green 70–80%, Yellow 60–70%, Orange 50–60%, Fuchsia 0–50%, Red 0%, and Pink unrecovered), and shaded triangular areas are used to consolidate subjects/genes for display. Beneath each section of the dendrogram is a graph depicting mean standardized intensities for highlighted groups of subjects that were reliably clustered with one another. Below the graphs are representative Gene Ontology overexpressed Biological Processes for Upper and Lower nodes. Gene expression intensities are expressed in standard deviations from the mean for each gene (see scale bar).
Support Trees is a version of hierarchical clustering that uses bootstrap methods to establish branch stability. Here, we used Pearson correlation as a distance metric, average linkage as a linkage method, and clustered on both genes and experiments using one hundred bootstrapping iterations. Branches are color-coded according to stability (see caption text). Genes could be reliably divided into two groups (left and right panels) but showed highly unstable branching patterns at lower levels (mainly red, 0% support), while experiments were more stable, with a majority of aged chips being separated from their young and middle-aged counterparts. Further, subsets of chips formed highly stable experimental clusters (from left to right): [A3, A4, A8, M8, Y4: a mix of different age groups], [M6, A1, A5, A9, A7, A10, A6, M3, M7, M9: the majority of Aged subjects], [A2, Y3, Y6, Y2, Y7, Y8, Y9: the majority of Young subjects], and [M2, M4, M5, M10, M1, Y5, Y10: the majority of MidAged subjects]. Finally, it appears that the two experimental clusters most specifically enriched in aged vs. young subjects (the middle two experimental clusters) in large part drove the discrimination of the genes into the left and right panels, with the outer two experimental clusters contributing relatively little information at this level of branching.
K-Means Support (KMS; Figure 8)
Figure 8. K-Means Support (KMS). The KMS algorithm, with 100 iterations and Pearson correlation for distance, established eight 'consensus clusters' that appeared in at least 80 iterations. For each panel, individual observations are plotted along the X-axis and standardized signal intensity, averaged for genes in the cluster, is plotted along the Y-axis. For reference, a '0 line' has been added indicating baseline (scale bar = one standard deviation). Within each panel, the members of the cluster are indicated. Panels A, B, E) These clusters show an apparent discrimination between young subjects and all others. Panel C) The cluster most analogous to a monotonic increase. Panels D & F-H) These clusters appear to be isolating patterns based on non-treatment effects. Inset: Figure of Merit (FOM) procedures were used to estimate number of clusters for K-means but results were difficult to interpret.
KMS uses the K-Means clustering algorithm run multiple times (here, 100 times) to establish 'consensus clusters' that appear in at least 80% of the iterations, again demonstrating clustering stability. The Pearson correlation metric was used for distance, and Figure of Merit (FOM) procedures were used to estimate number of clusters for K-means. FOM analysis was difficult to interpret, showing that more than one cluster was present, but indicating a relatively flat line effect out to 20 clusters (graph in inset truncated at 10 clusters). Using a combination of information from FOM and previous analyses by StatiGen, we selected five clusters as a starting point for KMS. In the resulting procedure (Fig. 8), KMS ran 100 five-cluster iterations and reserved the genes that were clustered together in at least 80 of those iterations. The resultant set of genes fell into eight clusters, which are depicted in Figure 8. However, the majority of ANOVA-significant genes (93%) failed to be assigned to a cluster.
Pavlidis Template Matching (PTM; Figure 9)
Figure 9. Pavlidis Template Matching (PTM). Four PTM templates were created. For each template, the standardized signal intensity average of all genes that fit the template is plotted by subject along with a superimposed template to which they were correlated. A. The 'monotonic increase' template matches two StatiGen-identified patterns, 1110 and 100 (Fig. 6). B. Young to MidAge increase corresponds to pattern 110 (Fig. 6). C. Young to MidAge decrease corresponds to pattern -110 (Fig. 6), the mirror reflection of pattern 110. D. MidAge to Age increase corresponds to pattern 1100 (Fig. 6). Genes in each pattern were tested for overrepresentation of Biological Processes in the Gene Ontology using DAVID (see Methods). Selected overrepresented categories are listed in each panel. Inset: High degree of overlap between A, B, and D, with the three templates identifying similar sets of genes (by overrepresentation analysis, humoral immune response genes could not be distinguished among the three templates).
The PTM approach allows researchers to construct a 'template' expression pattern and use Pearson's correlation to identify genes that significantly correlate with that template. Here, we chose to apply PTM to those genes that were significant by ANOVA, effectively turning the PTM procedure into a post hoc test. PTM investigates one user-defined pattern at a time, and we used StatiGen-identified patterns to establish templates for PTM.
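The core idea of template matching can be illustrated with a brief Python sketch. This is not the MeV implementation: it selects genes by a correlation cutoff (`r_cutoff`) rather than by the significance of the correlation, which is a simplification, and the function names and data layout are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sxx * syy)


def match_template(genes, template, r_cutoff):
    """Return IDs of genes whose profile correlates with the template.

    genes    : dict mapping gene ID -> list of per-chip expression values.
    template : list with one expected value per chip, in the same order.
    r_cutoff : minimum Pearson r to count as a match (simplified criterion).
    """
    return [gene for gene, values in genes.items()
            if pearson_r(values, template) >= r_cutoff]
```

A template such as `[0, 0, 1, 1]` over four chips describes a step increase; a gene that rises steadily across those chips correlates positively with it, while a gene that falls correlates negatively and is not selected. Because a gene can correlate well with more than one template, template lists can overlap.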
The two monotonic patterns found by StatiGen (Fig. 6, patterns 1110 and 100) were fit by the same template in PTM (Fig. 9A), and, among the age-upregulated patterns identified by StatiGen, this monotonic increase template found the largest number of genes in PTM. However, other upregulated patterns (Panels B and D) also found a large number of highly overlapping genes in PTM, as evidenced in the Venn diagram (inset), as well as by the highly analogous Biological Processes found to be overrepresented among genes identified in panels A, B, and D. Decreased expression patterns from Young to MidAge, and sustained through Age, using PTM (Fig. 9C) revealed a completely non-overlapping set of genes that were related to cellular catabolism and neuronal plasticity, supporting previous work suggesting that neuronal involution may play a critical role in cognitive deficits seen with aging.
Discussion
Pre-statistical filtering
Researchers often triage microarray results with metrics that are blind to treatment groups, such as spot quality, signal intensity, and/or Microarray Suite 4 or 5 (MAS4 or MAS5) derived 'presence' calls. Such approaches can dramatically improve statistical performance and reduce the error associated with multiple testing [e.g., [55,56]]. There are a number of methods for such filtering. Observations that fall below some criterion could be treated as missing values, artificially brought up to a minimum intensity value, weighted according to the strength of the quality control measure, or the number of chips for which a given probe set exceeds some threshold value can be calculated. In the present work, we have opted for the latter approach.
Statistical tests
A number of different approaches could be used to assign significant results to the data (e.g., fold change, coefficient of variation, volcano plot-style combinations of p-values and fold changes). In theory, any of these approaches would work as an initial step for the detection of patterns within the data. In the present work, we focus on predicted reliability, rather than magnitude, of change.
Interestingly, studies in which magnitude of change, irrespective of variance, is applied (i.e., fold change) require an a priori assumption on the part of the investigator that some level of change is necessary for a biological effect to be exerted and, further, that such a level of change is the same across all expressed genes. Moving to statistical criteria ignores potential biological effects, instead focusing on the degree of variance and the likelihood that such a difference in means, given the variance of the measures, could have occurred by chance. Thus, the statistical results indicate the relative reliability of findings, but it is still up to the investigator to ascertain the biological meaning (or lack thereof) of any change. A change in gene X may be very reliable, yet epiphenomenal with regard to the biological process under investigation.
Other pairwise comparisons would be appropriate post hoc to the ANOVA (e.g., Scheffe's, Tukey's). In the present work, we chose Fisher's Protected Least Significant Difference (PLSD) test. In general, the PLSD test is less conservative than other tests. Therefore, if a significant ANOVA result is found, Fisher's PLSD is more likely than some other tests to identify at least one significant pairwise comparison. By assembling genes into their post hoc defined patterns, the statistical reliability of the pattern may 'protect' statistically weaker findings. This approach has been used to great effect in functional grouping analysis of microarray data [see [41,50,57-59]].
Patterns found in Blalock et al., 2003
The PPM method applied with StatiGen confirmed and extended the work of the original paper, finding that a majority of genes had changed by Mid-Age, and identifying upregulated inflammatory genes and downregulated genes related to neuronal function. The number of genes significant by each pairwise comparison (Fig. 5) alone is often useful, helping researchers determine which comparisons show the largest number of significant results (suggesting treatments with the most powerful effects on the transcriptome). Here, the expected result, that the greatest age-dependent difference in transcriptional profile would be between the Young and the Aged groups, was clearly supported by this analysis. However, such approaches are limited in their ability to assess a particular comparison's effects on the transcriptome while simultaneously appreciating the effects of other comparisons.
Further, StatiGen identified a significantly overrepresented pattern associated with a selective, Mid-Age to Aged change, and many of the genes in this pattern are associated with calcium dysregulation, a well-supported hypothesis of neuronal dysfunction and cognitive deficit in aging [60-62]. Thus, this approach not only identified age-related and possibly precipitating causes of age-related cognitive deficits in an animal model, but was also able to isolate a pattern of expression that directly and temporally correlated with that cognitive decline.
Four of the five identified patterns (1110, 100, 110, and 110) strongly validate conclusions of the original study [2] that transcriptional levels in the Mid-Age group are generally intermediate between Young and Aged groups, or are similar to the Aged group. Further, the Mid-Age animals, although they had yet to show a statistically significant cognitive deficit, generally had transcriptional profiles more similar to Aged than to Young hippocampal CA1 regions. The identification of these patterns by StatiGen highlights the unambiguous manner in which patterns can be defined and examined, and further highlights, at least in the example shown, that the conclusions of the researchers regarding transcriptional changes were largely supported by the data.
The genes comprising the fifth pattern (1100) may be of particular interest as their expression levels inversely correlate with behavioral deficits observed with age (a moderate and non-significant decrease by Mid-Age, followed by a significant drop-off in the Aged group). Interestingly, many of the genes found here represent inflammatory (e.g., Lps, S100 A1 and A9, Rt1Dmb) and astrocyte/oligodendrocyte processes (e.g., Gfap, Mobp, Mag), suggesting that these potential biomarkers may influence, or be influenced by, cognitive status changes with age.
The potential interactions among oligodendrocytic, myelin, and inflammation-related genes were a key, novel proposition in the original work. The finding here supports that interaction's potential role in cognitive deficits with age. Importantly, perturbed calcium homeostasis seen here has been a longstanding hypothesis of brain aging [reviewed in [60-63]] supported by numerous studies [e.g., [64-67]]. In the present context, it suggests that calcium signaling perturbations are common to many cell types in the brain. Further, altered calcium and inflammatory changes together suggest that other popular aging hypotheses [e.g., reactive oxygen species, see [68,69]] may all play a role in altered cognition with aging. This pattern's discovery therefore highlights the PPM algorithm's second strength, discovery of patterns that were not anticipated (based on the results of the previous work).
Other methods
As expected, support trees applied post hoc to the ANOVA showed a strong tendency to group subjects according to treatment, as the ANOVA selection should heavily bias this procedure towards treatment-based clustering. However, expression pattern identification among genes was not as refined, with a relatively stable discrimination between up- and down-regulated genes among two of the four experimental clusters, and other patterns of expression showing poor replication.
K-means support, in conjunction with Figure of Merit estimation of cluster number, reliably identified eight clusters but was unable to assign more than 90% of the ANOVA-significant genes. This suggests that some KMS parameters may need further adjustment, the data may need further transformation, the ANOVA criterion is inappropriate, or that this approach is not adequate for this data set.
Pavlidis template matching (PTM) clearly identified sets of genes using statistical Pearson's correlation probabilities. However, because each template is fitted in isolation, there is a high degree of overlap between different, but related patterns of expression. One way to reduce the degree of overlap would be to increase the p-value stringency criterion for inclusion in each template. However, increased stringency would also reduce the proportion of the ANOVA-significant data set identified by the procedure. Interestingly, the PTM approach does point to a potential improvement of the PPM strategy employed by StatiGen. Presently, the PPM procedure considers each unique combination of pairwise contrast results as a separate pattern. However, it is possible that, like the PTM procedure, two patterns that completely correlate with one another in PPM (e.g. Fig. 6, patterns 1110 and 100) could be merged, reducing the complexity of pattern output in PPM.
Conclusion
The PPM algorithm was born of necessity in our microarray research dealing with multiple group studies and the relatively large amount of data generated using arrays [70]. Although newer methodologies are greatly improving undirected approaches at both the gene expression and functional analysis levels ([71-73]), cluster number estimation, statistical likelihood of a cluster's occurrence, and gene membership across iterations are still important issues. Directed approaches are unable to detect unexpected patterns, as the discriminating features of the patterns must be determined a priori by the investigator. The PPM algorithm's implementation in StatiGen skirts these issues: pattern number and statistical likelihood are defined and estimated, and gene-to-pattern assignments are stable. However, these improvements come at the cost of limited complexity reduction. Large numbers of treatment groups (e.g., > 6) are inadequately handled by this process because the number of patterns increases with increasing number of treatment groups (Fig. 2).
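The growth in pattern number can be illustrated with a short calculation. This sketch assumes that each of the k(k - 1)/2 pairwise comparisons contributes one of three outcomes, which gives an upper bound on the number of distinct pattern codes; it does not exclude logically inconsistent combinations or the 'no change anywhere' code, so the counts in Fig. 2 may differ. The function names are hypothetical.

```python
from math import comb


def n_pairwise_comparisons(n_groups):
    """Number of 'all pairwise' comparisons among n_groups treatment groups."""
    return comb(n_groups, 2)


def max_pattern_codes(n_groups):
    """Upper bound on distinct pattern codes: three outcomes per comparison."""
    return 3 ** n_pairwise_comparisons(n_groups)


# Upper bounds for designs with two to seven treatment groups.
growth = {k: max_pattern_codes(k) for k in range(2, 8)}
```

Under this assumption, three groups give at most 27 codes, whereas six groups give more than 14 million, which is why complexity reduction becomes limited as groups are added.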
Applied to a published microarray experiment, the StatiGen program successfully flags patterns that had been manually assigned in prior work, and further identifies other gene expression patterns that may be of interest. Thus, over a moderate range of treatment groups, PPM appears to work well, allowing researchers to assign statistical probabilities to patterns of gene expression that fit a priori expectations/hypotheses while still preserving the data's ability to show the researcher interesting, yet unanticipated gene expression patterns.
Important future work with this approach will include adding the option to identify and merge highly similar patterns, converting the software to R, and providing options for noise reduction/outlier removal prior to analysis.
Methods
PPM algorithm
The post hoc pattern matching algorithm was created stepwise in Excel (v. 2003, SP2, Microsoft). All statistical calculations were verified in SigmaStat (v. 3.0, SyStat) on representative probe sets.
Statistics
Here, we apply a basic one-way ANOVA approach, in which each probe set is tested individually and the total sum of squares is partitioned into variance attributable to treatment, with the remainder considered residual.
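In standard one-way ANOVA notation, this partition can be written as follows (the symbols $n_i$, $\bar{y}_i$, and $\bar{y}$ are notation assumed here for the size of group i, the mean of group i, and the grand mean):

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2 \;=\; \sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2 \;+\; \sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2$$

The first term on the right is the between-group (treatment) sum of squares and the second is the within-group (residual) sum of squares.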
Here, y is an observation, i indexes the k groups, and j indexes the observations within a group. 'Between sum of squared error' degrees of freedom equals the number of treatment groups minus 1 (k - 1) and 'Within sum of squared error' degrees of freedom equals the total number of observations minus the number of groups (N - k). The summed errors are divided by their respective degrees of freedom to produce their mean squared errors. A ratio of between/within mean squared error generates the F-statistic, which, along with the degrees of freedom for the numerator and denominator of the F-statistic (k - 1 and N - k, respectively), is used to generate a p-value for each probe set. The mean squared error within (MSE_{within}) is used again in post hoc testing.
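The calculation for a single probe set can be sketched as follows. This is an illustrative pure-Python version, not the StatiGen code; the function name and the example intensities are hypothetical, and the p-value step is left out because it requires the F distribution.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, mse_within) for one probe set.

    `groups` is a list of lists, one list of signal intensities per
    treatment group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Treatment contribution: group size times squared deviation of
        # the group mean from the grand mean.
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Residual contribution: squared deviations from the group mean.
        ss_within += sum((y - group_mean) ** 2 for y in g)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    mse_within = ss_within / df_within
    return ms_between / mse_within, df_between, df_within, mse_within


# Hypothetical intensities for one probe set in three groups.
f_stat, df1, df2, mse = one_way_anova(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
)
```

If SciPy is available, the p-value would then be the upper tail of the F distribution, for example `scipy.stats.f.sf(f_stat, df1, df2)`.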
Post hoc to a significant ANOVA, Fisher's PLSD follows the form:
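A form consistent with the definitions that follow is sketched here as an assumption. It is written with the critical F value for 1 and N - k degrees of freedom at significance level $\alpha$, which equals the square of the critical t value more commonly used to state the LSD:

$$\mathrm{LSD} = \sqrt{F_{\alpha;\,1,\,N-k}\cdot MSE_{within}\cdot\frac{2}{n}}$$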
Here, MSE_{within} is taken from the ANOVA calculation above, the F-statistic is based on k - 1 numerator degrees of freedom (in this case equal to 1, because only two groups are being contrasted), and n is the geometric mean of the n's in the two groups being compared. The LSD then represents the minimum value of the difference between two means in order for their difference to be considered significant.
Pattern ID
We combine results from all of the pairwise comparisons for each probe set, creating a pattern ID. Pattern IDs are constructed using logic gates that use 'increase', 'no significant change', or 'decrease' results from each pairwise comparison. The first pairwise comparison is assigned 1, 0, or -1; the second is assigned 10, 0, or -10; the third is assigned 100, 0, or -100; and so on. In this way, the sum of each combination of pairwise comparisons for a given probe set creates a pattern ID that encodes that pattern's statistically defined shape and allows researchers to easily group different genes that belong to the same pattern. Further, two patterns of opposite sign and the same absolute value will be mirror reflections of one another, which may have value for assessing opposing actions in single pathways [46].
Z Score (probability of pattern's chance occurrence)
The distance, in standard deviations, of each pattern's prevalence in the real data compared to its prevalence in the Monte Carlo simulation is calculated using the Z-score as follows:
Here, γ is the number of times a pattern appears in the real data, Σγ is the total for all unique patterns in the real data, R is the number of times the pattern appears in the Monte Carlo simulation, and ΣR is the total for all unique patterns in the Monte Carlo simulation.
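One standard statistic that uses exactly these four quantities is the pooled two-proportion Z-score, shown here as an assumption about the form of the comparison:

$$Z = \frac{\dfrac{\gamma}{\Sigma\gamma}-\dfrac{R}{\Sigma R}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{\Sigma\gamma}+\dfrac{1}{\Sigma R}\right)}},\qquad \hat{p}=\frac{\gamma+R}{\Sigma\gamma+\Sigma R}$$

A minimal sketch of that calculation follows; the function and argument names are hypothetical, and the counts in the example are invented for illustration.

```python
import math


def pattern_z_score(actual_count, actual_total, random_count, random_total):
    """Pooled two-proportion Z-score for one pattern (assumed form).

    Positive values indicate that the pattern is more prevalent in the
    real data than in the Monte Carlo simulation.
    """
    p_actual = actual_count / actual_total
    p_random = random_count / random_total
    pooled = (actual_count + random_count) / (actual_total + random_total)
    std_error = math.sqrt(
        pooled * (1 - pooled) * (1 / actual_total + 1 / random_total)
    )
    if std_error == 0:
        raise ValueError("Z-score undefined: pattern absent or universal")
    return (p_actual - p_random) / std_error


# Hypothetical counts: 50 of 200 significant genes show the pattern in the
# real data, against 100 of 1000 in the simulated data.
z = pattern_z_score(50, 200, 100, 1000)
```

With equal prevalence in real and simulated data the score is zero, so only patterns enriched in the real data receive large positive values.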
Software
Description
StatiGen is written in Visual Basic using the .NET 1.1 architecture and recapitulates all of the steps described above for the PPM algorithm. StatiGen also standardizes gene expression levels, allowing multiple genes in a single pattern to be averaged together and plotted. Graphic displays of this output are provided, along with lists of identified genes. StatiGen also creates hyperlinks for probe set ID information (these are relevant for Affymetrix-based data only). The program runs on Windows 2000/XP operating systems and its performance is generally improved by increased RAM. The installation file is Additional File 2 and source code is Additional File 3. Importantly, the code 'passes' information across Excel worksheets; therefore, Excel must be installed for the program to work. Up-to-date versions of the software will be linked through our Departmental [74] and Microarray Core [75] websites.
Additional file 2. StatiGen installation file.
Format: ZIP Size: 3.2MB
Additional file 3. StatiGen source code.
Format: ZIP Size: 8.9MB
User input
Users are required to provide signal intensity, presence/absence call, and annotation data as either text files or Excel worksheets. The unique identifier (probe set ID) should always be in the same order in the leftmost column of all sheets, and the top row should contain title information, although StatiGen provides some protection from misaligned inputs by sorting by the first column and making sure the number of rows matches across worksheets. A number of different signal intensity algorithms are available [76-78] (e.g., PLIER, MAS4, MAS5, RMA, gcRMA, DCHIP, GLA), and, within each of these, there are multiple settings. Therefore, StatiGen makes no assumptions regarding signal intensity transformation. Users should run their transformations (e.g., log transformation) prior to running StatiGen. Presence/absence calls (P, M, A) or p-values (derived from Affymetrix-based algorithms), as well as user-defined spot quality flags, can be used to pre-statistically filter ('triage') data prior to statistical analysis. Because annotation is based on current knowledge and is therefore a 'moving target' [79], no attempt at annotation is provided with StatiGen. Instead, users provide their own annotation file. Importantly, this annotation file can have as many columns as the user would like (up to the 255 column limit in Excel worksheets, less the leftmost unique ID column and the StatiGen-generated hyperlink column).
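The alignment safeguard described above (sorting by the first column and confirming matching row counts) can be sketched as follows. The function name and data layout are hypothetical, header rows are assumed to have been removed already, and the additional check that the sorted IDs are identical goes beyond what is stated for StatiGen.

```python
def align_sheets(*sheets):
    """Sort each sheet by its first column and check that the sheets align.

    Each sheet is a list of rows whose first element is the probe set ID.
    Returns the sorted sheets or raises ValueError on a mismatch.
    """
    if not sheets:
        raise ValueError("at least one sheet is required")
    sorted_sheets = [sorted(sheet, key=lambda row: row[0]) for sheet in sheets]
    reference_ids = [row[0] for row in sorted_sheets[0]]
    for number, sheet in enumerate(sorted_sheets[1:], start=2):
        if len(sheet) != len(reference_ids):
            raise ValueError(
                f"sheet {number} has {len(sheet)} rows, "
                f"expected {len(reference_ids)}"
            )
        # Extra check (an assumption): IDs must match after sorting.
        if [row[0] for row in sheet] != reference_ids:
            raise ValueError(f"sheet {number} has different probe set IDs")
    return sorted_sheets


# Hypothetical signal and presence-call sheets supplied in different orders.
signal = [["probe_b", 7.1, 7.3], ["probe_a", 5.2, 5.0]]
calls = [["probe_a", "P", "P"], ["probe_b", "A", "P"]]
aligned_signal, aligned_calls = align_sheets(signal, calls)
```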
Original data
Data from one of our laboratory's earlier microarray studies [2] are provided to highlight StatiGen's functionality. Raw (.cel files), MAS4 signal intensity, and presence data are available through the Gene Expression Omnibus (GSE 854), and signal intensity (Additional File 4), presence call (Additional File 5) and annotation (Additional File 6) files are also provided with this manuscript. In that study, male Fischer 344 rats of three ages (3-month Young, 12-month Mid-Age, and 24-month Aged; n = 9–10/group) were behaviorally characterized on two hippocampus-dependent cognitive tasks. Their hippocampi were removed, the CA1 regions dissected, and each animal's isolated RNA was hybridized to its own microarray (RGU34A, Affymetrix; one chip per animal). Microarray analysis included a 1-Way ANOVA followed by post hoc Pearson's correlation between signal intensity and pre-mortem behavioral scores.
Additional file 4. Signal Intensity input file for StatiGen.
Format: XLS Size: 2MB
Additional file 5. Presence call input file for StatiGen.
Format: XLS Size: 3.9MB
Additional file 6. Annotation file for StatiGen.
Format: XLS Size: 1MB
Authors' contributions
RH wrote StatiGen and associated help files, and EMB devised the algorithm and constructed prototype Excel files depicting its use.
Acknowledgements
The authors thank Anjali Chhabra for excellent technical assistance, Drs. Phil Landfield, Kuey-Chu Chen, Susan Kraner, Nada Porter, Jerzy Jaromcyk and Chris Norris for helpful comments and discussion, Chris Saunders and Dr. Arnie Stromberg for statistical advice, and Dr. Robert Hadley for generously allowing his graduate student to participate in this project. This work was supported by NIA AG10836 and startup funding from the University of Kentucky.
References

1. Norris CM, Kadish I, Blalock EM, Chen KC, Thibault V, Porter NM, Landfield PW, Kraner SD: Calcineurin triggers reactive/inflammatory processes in astrocytes and is upregulated in aging and Alzheimer's models. J Neurosci 2005, 25(18):4649-4658.
2. Blalock EM, Chen KC, Sharrow K, Herman JP, Porter NM, Foster TC, Landfield PW: Gene microarrays in hippocampal aging: statistical profiling identifies novel processes correlated with cognitive impairment. J Neurosci 2003, 23(9):3807-3819.
3. Perelman S, Mazzella MA, Muschietti J, Zhu T, Casal JJ: Finding unexpected patterns in microarray data. Plant physiology 2003, 133(4):1717-1725.
4. Valafar F: Pattern recognition techniques in microarray data analysis: a survey. Annals of the New York Academy of Sciences 2002, 980:41-64.
5. Cavallaro S, D'Agata V, Manickam P, Dufour F, Alkon DL: Memory-specific temporal profiles of gene expression in the hippocampus. Proc Natl Acad Sci USA 2002, 99(25):16279-16284.
6. Pavlidis P, Noble WS: Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001, 2(10):RESEARCH0042.
7. Reid R, Dix DJ, Miller D, Krawetz SA: Recovering filter-based microarray data for pathways analysis using a multipoint alignment strategy. Biotechniques 2001, 30(4):762-766, 768.
8. Genter MB, Van Veldhoven PP, Jegga AG, Sakthivel B, Kong S, Stanley K, Witte DP, Ebert CL, Aronow BJ: Microarray-based discovery of highly expressed olfactory mucosal genes: potential roles in the various functions of the olfactory system. Physiol Genomics 2003, 16(1):67-81.
9. Hutton JJ, Jegga AG, Kong S, Gupta A, Ebert C, Williams S, Katz JD, Aronow BJ: Microarray and comparative genomics-based identification of genes and gene regulatory regions of the mouse immune system. BMC Genomics 2004, 5(1):82.
10. Li H, Wood CL, Liu Y, Getchell TV, Getchell ML, Stromberg AJ: Identification of gene expression patterns using planned linear contrasts. BMC Bioinformatics 2006, 7:245.
11. Liu H, Tarima S, Borders AS, Getchell TV, Getchell ML, Stromberg AJ: Quadratic regression analysis for gene discovery and pattern recognition for non-cyclic short time-course microarray experiments. BMC Bioinformatics 2005, 6:106.
12. Balasubramaniyan R, Hullermeier E, Weskamp N, Kamper J: Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics 2005, 21(7):1069-1077.
13. Conesa A, Nueda MJ, Ferrer A, Talon M: maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments. Bioinformatics 2006, 22(9):1096-1102.
14. Eckel JE, Gennings C, Chinchilli VM, Burgoon LD, Zacharewski TR: Empirical bayes gene screening tool for time-course or dose-response microarray data. J Biopharm Stat 2004, 14(3):647-670.
15. Guo X, Qi H, Verfaillie CM, Pan W: Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003, 19(13):1628-1635.
16. Liang Y, Kelemen AG: Hierarchical Bayesian neural network for gene expression temporal patterns. Stat Appl Genet Mol Biol 2004, 3(1):Article20.
17. Liang Y, Tayo B, Cai X, Kelemen A: Differential and trajectory methods for time course gene expression data. Bioinformatics 2005, 21(13):3009-3016.
18. Luan Y, Li H: Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 2003, 19(4):474-482.
19. Peddada SD, Lobenhofer EK, Li L, Afshari CA, Weinberg CR, Umbach DM: Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 2003, 19(7):834-841.
20. Schliep A, Schonhuth A, Steinhoff C: Using hidden Markov models to analyze gene expression time course data. Bioinformatics 2003, 19(Suppl 1):i255-263.
21. Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW: Significance analysis of time course microarray experiments. Proc Natl Acad Sci USA 2005, 102(36):12837-12842.
22. Wentzell PD, Karakach TK, Roy S, Martinez J, Allen CP, Werner-Washburne M: Multivariate curve resolution of time course microarray data. BMC Bioinformatics 2006, 7(1):343.
23. Whiteside MA, Chen DT, Desmond RA, Abdulkadir SA, Johanning GL: A novel time-course cDNA microarray analysis method identifies genes associated with the development of cisplatin resistance. Oncogene 2004, 23(3):744-752.
24. Wu FX, Zhang WJ, Kusalik AJ: Dynamic model-based clustering for time-course gene expression data. J Bioinform Comput Biol 2005, 3(4):821-836.
25. Xu XL, Olson JM, Zhao LP: A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Hum Mol Genet 2002, 11(17):1977-1985.
26. Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21(1):71-79.
27. Kim JH, Ha IS, Hwang CI, Lee YJ, Kim J, Yang SH, Kim YS, Cao YA, Choi S, Park WY: Gene expression profiling of anti-GBM glomerulonephritis model: the role of NF-kappaB in immune complex kidney disease. Kidney Int 2004, 66(5):1826-1837.
28. Byrne KA, Wang YH, Lehnert SA, Harper GS, McWilliam SM, Bruce HL, Reverter A: Gene expression profiling of muscle tissue in Brahman steers during nutritional restriction. J Anim Sci 2005, 83(1):1-12.
29. Poulsen CB, Penkowa M, Borup R, Nielsen FC, Caceres M, Quintana A, Molinero A, Carrasco J, Giralt M, Hidalgo J: Brain response to traumatic brain injury in wild-type and interleukin-6 knockout mice: a microarray analysis. J Neurochem 2005, 92(2):417-432.
30. Yagil C, Hubner N, Monti J, Schulz H, Sapojnikov M, Luft FC, Ganten D, Yagil Y: Identification of hypertension-related genes through an integrated genomic-transcriptomic approach. Circ Res 2005, 96(6):617-625.
31. Mertens BJ: Microarrays, pattern recognition and exploratory data analysis. Statistics in medicine 2003, 22(11):1879-1899.
32. Boutros PC, Okey AB: Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Briefings in bioinformatics 2005, 6(4):331-343.
33. Bolshakova N, Azuaje F: Estimating the number of clusters in DNA microarray data. Methods Inf Med 2006, 45(2):153-157.
34. Kapp AV, Tibshirani R: Are clusters found in one dataset present in another dataset? Biostatistics 2006.
35. Patel S, Lyons-Weiler J: caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer. Appl Bioinformatics 2004, 3(1):49-62.
36. McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R: Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 2002, 18(11):1462-1469.
37. Datta S, Datta S: Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 2003, 19(4):459-466.
38. Steinhoff C, Vingron M: Normalization and quantification of differential expression in gene expression microarrays. Briefings in bioinformatics 2006, 7(2):166-177.
39. Yuan M, Kendziorski C: A unified approach for simultaneous gene clustering and differential expression identification. Biometrics 2006, 62(4):1089-1098.
40. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98(9):5116-5121.
41. Blalock EM, Chen KC, Stromberg AJ, Norris CM, Kadish I, Kraner SD, Porter NM, Landfield PW: Harnessing the power of gene microarrays for the study of brain aging and Alzheimer's disease: Statistical reliability and functional correlation. Ageing Res Rev 2005, 4(4):481-512.
42. Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW: Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci USA 2004, 101(7):2173-2178.
43. Pavlidis P: Using ANOVA for gene selection from microarray studies of the nervous system. Methods 2003, 31(4):282-289.
44. Chen D, Liu Z, Ma X, Hua D: Selecting genes by test statistics. J Biomed Biotechnol 2005, 2005(2):132-138.
45. Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. J Comput Biol 2000, 7(6):819-837.
46. Saxena V, Orgill D, Kohane I: Absolute enrichment: gene set enrichment analysis for homeostatic systems. Nucleic Acids Res 2006.
47. Cabaleiro DR, Stockl D, Thienpont LM: Error messages when calculating chi-square statistics with microsoft EXCEL. Clin Chem Lab Med 2004, 42(2):243.
48. Stockl D, Cabaleiro DR, Thienpont LM: Peculiarities and problems with the EXCEL F-test. Clin Chem Lab Med 2004, 42(12):1455.
49. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3.
50. Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 2003, 31(13):3775-3781.
51. Jiang CH, Tsien JZ, Schultz PG, Hu Y: The effects of aging on gene expression in the hypothalamus and cortex of mice. Proc Natl Acad Sci USA 2001, 98(4):1930-1934.
52. Lee CK, Weindruch R, Prolla TA: Gene-expression profile of the ageing brain in mice. Nat Genet 2000, 25(3):294-297.
53. Lu T, Pan Y, Kao SY, Li C, Kohane I, Chan J, Yankner BA: Gene regulation and DNA damage in the ageing human brain. Nature 2004, 429(6994):883-891.
54. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al.: TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003, 34(2):374-378.
55. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6(2):R16.
56. Blalock EM: A Beginner's Guide to Microarrays. Boston: Kluwer; 2003.
57. Mirnics K, Middleton FA, Lewis DA, Levitt P: Analysis of complex brain disorders with gene expression microarrays: schizophrenia as a disease of the synapse. Trends Neurosci 2001, 24(8):479-486.
58. Mirnics K, Pevsner J: Progress in the use of microarray technology to study the neurobiology of disease. Nat Neurosci 2004, 7(5):434-439.
59. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2003, 4(1):R7.
60. Khachaturian ZS: Calcium hypothesis of Alzheimer's disease and brain aging. Annals of the New York Academy of Sciences 1994, 747:1-11.
61. Landfield PW, Eldridge JC: The glucocorticoid hypothesis of age-related hippocampal neurodegeneration: role of dysregulated intraneuronal calcium. Annals of the New York Academy of Sciences 1994, 746:308-321, discussion 321–306.
62. Landfield PW, Eldridge JC: Evolving aspects of the glucocorticoid hypothesis of brain aging: hormonal modulation of neuronal calcium homeostasis. Neurobiology of aging 1994, 15(4):579-588.
63. Landfield PW, Eldridge JC: The glucocorticoid hypothesis of brain aging and neurodegeneration: recent modifications. Acta endocrinologica 1991, 125(Suppl 1):54-64.
64. Disterhoft JF, Oh MM: Pharmacological and molecular enhancement of learning in aging and Alzheimer's disease. Journal of physiology, Paris 2006, 99(2–3):180-192.
65. Kelly KM, Nadon NL, Morrison JH, Thibault O, Barnes CA, Blalock EM: The neurobiology of aging. Epilepsy research 2006, 68(Suppl 1):S5-20.
66. Toescu EC, Verkhratsky A: Ca2+ and mitochondria as substrates for deficits in synaptic plasticity in normal brain ageing. Journal of cellular and molecular medicine 2004, 8(2):181-190.
67. McEwen BS: Stress and the aging hippocampus. Frontiers in neuroendocrinology 1999, 20(1):49-70.
68. de Magalhaes JP: Open-minded scepticism: inferring the causal mechanisms of human ageing from genetic perturbations. Ageing Res Rev 2005, 4(1):1-22.
69. de Magalhaes JP, Church GM: Cells discover fire: employing reactive oxygen species in development and consequences for aging. Experimental gerontology 2006, 41(1):1-10.
70. Hess KR, Zhang W, Baggerly KA, Stivers DN, Coombes KR: Microarrays: handling the deluge of data and extracting reliable information. Trends in biotechnology 2001, 19(11):463-468.
71. Azuaje F, Al-Shahrour F, Dopazo J: Ontology-driven approaches to analyzing data in functional genomics. Methods Mol Biol 2006, 316:67-86.
72. Dopazo J: Functional interpretation of microarray experiments. Omics 2006, 10(3):398-410.
73. Dudoit S, Fridlyand J: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 2003, 19(9):1090-1099.
74. Department Molecular and Biomedical Pharmacology in the College of Medicine at the University of Kentucky [http://www.mc.uky.edu/pharmacology/]
75. University of Kentucky Microarray Core Facility [http://www.mc.uky.edu/ukmicroarray/bioinformatics.htm]
76. Dudoit S, Gentleman RC, Quackenbush J: Open source software for the analysis of microarray data. Biotechniques 2003, (Suppl):45-51.
77. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80.
78. Zhou L, Rocke DM: An expression index for Affymetrix GeneChips based on the generalized logarithm. Bioinformatics 2005, 21(21):3983-3989.
79. Ouzounis CA, Karp PD: The past, present and future of genome-wide reannotation. Genome Biol 2002, 3(2):COMMENT2001.