Email updates

Keep up to date with the latest news and content from BMC Bioinformatics and BioMed Central.

Open Access Highly Accessed Methodology article

Graph-based iterative Group Analysis enhances microarray interpretation

Rainer Breitling12*, Anna Amtmann1 and Pawel Herzyk23

Author Affiliations

1 Plant Science Group, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ, United Kingdom

2 Bioinformatics Research Centre, Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom

3 Sir Henry Wellcome Functional Genomics Facility, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ, United Kingdom

For all author emails, please log on.

BMC Bioinformatics 2004, 5:100  doi:10.1186/1471-2105-5-100

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/5/100


Received:27 April 2004
Accepted:23 July 2004
Published:23 July 2004

© 2004 Breitling et al; licensee BioMed Central Ltd.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

One of the most time-consuming tasks after performing a gene expression experiment is the biological interpretation of the results by identifying physiologically important associations between the differentially expressed genes. A large part of the relevant functional evidence can be represented in the form of graphs, e.g. metabolic and signaling pathways, protein interaction maps, shared GeneOntology annotations, or literature co-citation relations. Such graphs are easily constructed from available genome annotation data. The problem of biological interpretation can then be described as identifying the subgraphs showing the most significant patterns of gene expression. We applied a graph-based extension of our iterative Group Analysis (iGA) approach to obtain a statistically rigorous identification of the subgraphs of interest in any evidence graph.

Results

We validated the Graph-based iterative Group Analysis (GiGA) by applying it to the classic yeast diauxic shift experiment of DeRisi et al., using GeneOntology and metabolic network information. GiGA reliably identified and summarized all the biological processes discussed in the original publication. Visualization of the detected subgraphs allowed the convenient exploration of the results. The method also identified several processes that were not presented in the original paper but are of obvious relevance to the yeast starvation response.

Conclusions

GiGA provides a fast and flexible delimitation of the most interesting areas in a microarray experiment, and leads to a considerable speed-up and improvement of the interpretation process.

Background

Microarray experiments can provide a comprehensive picture of gene expression levels in biological samples. In a typical application they compare expression of several thousand genes under two different conditions (e.g. healthy vs. diseased tissue, wild type vs. mutant animals, drug-treated vs. control cells), using a small number of replicate experiments. Various techniques have been developed to rank genes according to their expression changes, e.g. based on the t-statistic [1] or the strong non-parametric RankProducts [2]. The resulting list of genes can then be restricted to those genes that fulfill a certain statistical criterion, usually an arbitrarily chosen maximum accepted false discovery rate.

The main challenge to the biologist is contained in the next step of the analysis. It consists in identifying the biologically relevant expression changes, the "big picture" of the experiment. As microarray experiments tend to generate unexpected observations in areas outside the specialized expertise of the experimentalist, this can be quite difficult and time-consuming. A principled mechanism to identify the significant higher-level features of the experimental results would therefore be very useful.

The biological interpretation process consists to a large extent of finding evidence connecting certain genes that are differentially expressed. This evidence can consist, e.g., of joint participation in some physiological process, physical interaction at the protein level, reported co-expression in earlier microarray experiments, a shared functional annotation, etc. This kind of evidence can intuitively be represented as a graph, and this feature is regularly used to visualize biological data, in the form of metabolic or signaling pathways or protein interaction maps. The task can then be described as the identification of subgraphs that as a whole show a statistically significant expression change. This would allow the biologist to focus her analysis on the most promising areas, without prior bias, while at the same time presenting the relevant evidence underlying each association for critical evaluation.

Results and Discussion

The Algorithm

We have recently developed an approach, iterative Group Analysis (iGA) that identifies significantly changed functional classes of genes in a microarray experiment [3]. In contrast to similar approaches such as [4-8], the iGA method does not require a previous delimitation of a set of "differentially expressed genes", but uses an iterative calculation of p-values to determine the subset of class members that is most likely to be changed. Due to this feature, the iGA method is more sensitive in identifying functional classes that are slightly but consistently regulated, and works well on noisy data with small numbers of replicates, where the delimitation of gene lists can be overly restrictive.

Here we extend this approach to the analysis of "evidence graphs", which offers much larger flexibility of the annotations that can be used and allows substantially improved visualization. Evidence graphs can be represented as bigraphs with two types of nodes, one for genes and one for the associated "evidence" (Fig. 1A). For evaluation purposes we focused on two types of networks, one where the evidence consists of GeneOntology annotations (GO network) and one where the evidence comprises enzyme substrates (metabolic network). The construction of these networks from gene annotation files is fast and simple, but more unconventional networks are also straightforward (e.g. regulatory networks inferred from previous gene expression experiments, or "literature networks" based on co-occurrence of genes in publications). Before the calculation the bigraph is converted to a simple graph, eliminating the evidence nodes and introducing pairwise edges between all nodes that were connected to a common evidence node (Fig. 1B).

thumbnailFigure 1. Principle of Graph-based Iterative Group Analysis. A Evidence network. Genes are associated with their annotation in the form of a bigraph (two types of nodes). B The same evidence represented as a simple network. Genes that share an annotation are connected. C-H Example of a GiGA analysis using fictitious microarray results. C Genes are assigned their ranks based on observed expression changes. D Local minima are found, i.e. genes that have no connection to genes with a better rank. E-H Iterative expansion of subgraphs from one of the local minima, gene 2 (rank 1). E The neighboring node with the smallest rank is included (gene 4, rank 4), which leads to the additional inclusion of genes 5 (rank 3) and 6 (rank 2). F Gene 3 (rank 5) is included). G Gene 7 (rank 7) is included, leading to the inclusion of gene 8 (rank 6). H The last gene reachable from this local minimum, gene 1 (rank 8), is included and the process terminates. For each of the subgraphs a p-value can be calculated (see text) and the subgraph with the smallest p-value is declared the "regulated neighborhood" of the local minimum. In the example, genes 2, 4, 5, and 6 form a regulated neighborhood (p = 0.014). The graph expansion process would then be repeated for the remaining two local minima.

In addition to the graph, a complete list of genes sorted by differential expression is provided. All nodes without corresponding expression data are eliminated from the network. In the first step of the analysis, each gene node is assigned its rank in the list of genes, such that the node for the most strongly changed gene is labeled "1" (Fig. 1C), the second most changed gene is labeled "2", etc. Then, local minima are identified in the graph, i.e. those nodes that have a lower rank than all their direct neighbors in the graph (Fig. 1D). In the next step, subgraphs are iteratively extended from each of those local minima by including the neighboring node with the next highest rank (m) and, if present, all adjacent nodes of ranks equal or smaller than m. Hence, at each step of the extension process the newly extended subgraph is not adjacent to any outside node with a rank lower than m (Fig. 1E,1F,1G,1H). At each extension step we thus obtain a subgraph with n members with a maximum rank m and can calculate the p-value for observing all n of n genes at rank m or better in a list of N total genes in the graph, , which follows easily from inserting these values in the cumulative hypergeometric equation, which is also used for iGA [3].

The extension process is continued until all nodes reachable from the local minimum are included or the subgraph reaches an arbitrary maximum size. After extending the subgraphs, for each local minimum the subgraph yielding the smallest p-value is selected as its "regulated neighborhood" and all local minima are sorted by increasing p-value of these regulated neighborhoods. The subgraphs at the top positions of the resulting list should contain the most relevant regions of the total evidence network.

Comparison with Previous Approaches

The method devised by Ideker et al. [9] for the determination of signaling and regulation circuits from a combination of protein interaction information and expression data could easily be extended to cover the more general case of microarray interpretation addressed by GiGA. This would only require the extension of "interaction" to include, e.g., participation in a shared cellular process or shared functional annotation. Their approach uses aggregate z-scores to evaluate the quality of each subgraph. This requires a relatively complex parameter estimation procedure followed by simulated annealing. In contrast, the rank-transform of the data that is the basis of GiGA allows non-parametric p-value calculations and is thus much faster (and computationally less demanding) than the method described by Ideker et al. A disadvantage of both methods is that they both do not guarantee to find the optimally scoring subgraph.

The recently released commercial Pathway Analysis software from Ingenuity Systems http://www.ingenuity.com webcite seems to be based on a similar concept as GiGA, i.e. the determination of regulated subgraphs in annotation networks. However, it just classifies genes as changed ("focus genes") or unchanged based on an arbitrary selected significance cut-off and thus discards most of the relative-change information (gene ranks) used by GiGA http://www.ingenuity.com/products/Circadian_Application_Note0104.pdf webcite. Therefore, this method is difficult to apply to very noisy or unreplicated experiments where a reliable delimitation of the "changed" genes becomes impossible.

Experimental Case Study

To validate the GiGA approach, we used the yeast diauxic shift experiment by DeRisi et al. [10]. In this classical study the authors examined the response of yeast cells to glucose depletion in the growth medium. As the biology of this process is extremely well understood and the functional annotation of the yeast genome is very comprehensive, we could use this dataset to test the ability (and reliability) of GiGA to identify the relevant subgraphs of interest.

In their original publication, DeRisi et al. highlight the following changes during starvation: Rechanneling of metabolites into the tricarboxylic acid (TCA) and glyoxylate cycle, increase in aerobic respiration (cytochrome c oxidase and reductase), gluconeogenesis, and carbohydrate storage (glycogen and trehalose biosynthesis). In contrast, 95% of ribosomal proteins, as well as tRNA synthetases and translation elongation/initiation factors were strongly down-regulated. Twenty hours after the initial inoculation of the sample about 20% of all genes showed at least a two-fold change in expression.

Table 1 and 2 show the result of an iterative Group Analysis of these expression data. Seven time points after inoculation were examined. Significant expression changes become apparent at 13.5 hours. It can be seen that all the processes identified manually by DeRisi et al. are already apparent at this level of analysis. In addition, iGA highlights the up-regulation of sugar transporters at the early stages of starvation, obviously a desperate attempt by the cells to take up the last remaining sugars from the medium. It also highlights the induction of heat shock proteins and the repression of ribosome biosynthesis processes in the nucleolus, which were not discussed in the original paper. What is missing in these lists, however, is the automatic identification of the interrelationships between the identified processes, which would be particularly informative in more realistic applications with a less well understood biology.

Table 1. Iterative Group Analysis of gene expression during the yeast diauxic shift. Up-regulated groups.

Table 2. Iterative Group Analysis of gene expression during the yeast diauxic shift.

This connectivity between genes and functional classes is provided by GiGA. Table 3 summarizes the results for the 20.5 hour time point, using two different networks, one for GeneOntology classes, and one for enzyme substrates, extracted from the SwissProt catalytic activity descriptors of yeast proteins. For both up- and down-regulated genes, the most significant subgraph is widely separated from the next best one, contains the largest number of genes, and comprises almost all the processes detected by iGA and the original authors. (We here restricted the size of subgraphs to a maximum of 40 genes to keep navigation of the results simple.) Figures 2 and 3 show the automatically generated visualization of the corresponding most significant subgraphs and the associated annotation. It is obvious how GiGA highlights the functional connections between different enzymes. In Fig. 2, the association between small and large ribosomal subunits, nucleolar rRNA processing and translational elongation is faithfully reconstructed. In Fig. 3, which uses the enzyme substrate network (which we considered to be more challenging for the algorithm), the interplay between the TCA cycle, the overlapping glyoxylate cycle, and all the relevant protein complexes of the respiratory chain are readily apparent. The agreement with the manual interpretation by DeRisi et al. extends down to the single gene level, while at the same time adding additional, obviously relevant connections, e.g. from the TCA cycle to the ATP synthase complex. Fig. 3 also shows the second best subgraph, which contains the cytochrome c oxidase subunits together with two catalase genes that may be involved in the detoxification of the hydrogen peroxide generated by the respiratory burst induced by starvation.

thumbnailFigure 2. Visualization of the most significant "down-regulated neighborhood" identified by GiGA using a GeneOntology-based network. The expression data are taken from the 20.5 h timepoint of the yeast diauxic shift (DeRisi et al., 1997). The layout was generated from the output of GiGA by the free software aiSee http://www.aisee.com webcite using a force-directed algorithm with default parameters. The same software can also be used for the versatile real-time navigation of the network. Colored boxes show the regulated genes (darker shading indicates stronger regulation), white boxes show the evidence linking the genes (in this case GeneOntology numbers and terms). Several important components of this regulatory neighborhood are indicated (small and large ribosomal subunit proteins, rRNA processing/snRNP, nucleolar proteins, translation elongation factors). These components were also identified in the original publication after manual analysis. GiGA finds them automatically, and also detects the – biologically obvious – connections between them. As all the evidence is included in the same picture, the biologist can then use her expertise to assess the relevance of each link, without having to make the connections ad hoc by tedious literature studies.

thumbnailFigure 3. Visualization of the two most significant "up-regulated neighborhoods" identified by GiGA using a metabolic network derived from Swissprot annotations. The expression data are taken from the 20.5 h timepoint of the yeast diauxic shift (DeRisi et al., 1997). The layout was generated as in Fig 2. Colored boxes show the regulated genes (darker shading indicates stronger regulation), white boxes show the substrates that are in common between genes. Important components of this regulatory neighborhood are indicated (TCA cycle and glyoxylate cycle enzymes, and the various respiratory chain complexes). Here it can be seen that GiGA not only detects protein complexes (such as ribosomes or the respiratory chain complexes), but also "linear" metabolic pathways such as TCA cycle and glyoxylate cycle (and potentially signal transduction pathways or regulatory cascades etc.). Almost all the enzymes discussed by DeRisi et al. (1997) are included in these two subgraphs, plus the relevant enzymatic information necessary to assess the relevance of each link, without the danger of missing some genes (unless the annotation is incomplete).

Table 3. Graph-based iterative Group Analysis of gene expression during the yeast diauxic shift.

The performance of GiGA (as well as iGA) is best appreciated when compared to the results of an extensive expert interpretation of the same data. Table 1 to 3, and Fig. 2 and 3 show how both techniques succeed in detecting and condensing exactly the genes and processes that were considered relevant by the expert biologists when first interpreting the same data [10]. GiGA effectively summarizes the original publication in three subgraph pictures (Fig. 2 and 3). This is even more astonishing when considering that these results are achieved for each single time-point separately (see, e.g., Tab. 1 and 2). This reveals the stability of the approach towards the measurement variance inherent in any unreplicated microarray experiment.

It is important to be aware that each of the highlighted subgraphs has to be carefully evaluated for its biological relevance. On the one hand, the sheer number of possible subgraphs in an evidence network creates a major multiple-testing problem, which means that some of the detected associations may be due to chance. Random permutations of the expression data – which can be generated by the GiGA software – can give an idea of the expected false-discovery rate. On the other hand, functional annotations are at present notoriously unreliable and spurious edges may affect the details of the results. Also, not all genes within a detected subgraph will necessarily show a strong expression change, because sometimes less strongly affected genes may connect those genes that do change. Such a relation is for example expected for many transcription factors and their targets [9]. Nonetheless, GiGA is able to direct the user to the most interesting areas in the evidence network.

The GiGA method is not restricted to use with exhaustively annotated genomes. It can work on a wide variety of "evidence" to build the necessary network, including hypothetical predicted functions or associations. It is even possible to apply GiGA to metabolomics results, which are characterized by the absence of any significant amount of annotation – usually only exact molecular masses and their differential abundance are known. In that case, an evidence network can be built from the measured masses themselves, linking compounds m1 and m2 whenever their mass difference (Δm = |m1 - m2|) can be explained by a common biochemical transformation (e.g. dehydrogenation: Δm = 2* mass(hydrogen)). The set of relevant transformations can easily be collected from any biochemistry textbook. In addition, one can introduce edges for condensation reactions between observed masses, i.e. if m1 + m2 = m3 + mass(H2O) then edges between m1 and m3, and m2 and m3 are added to the evidence network. We are currently developing the application of GiGA to this kind of data in the Sir Henry Wellcome Functional Genomics Facility at the University of Glasgow http://www.gla.ac.uk/functionalgenomics/ webcite; data not shown).

Conclusions

The present analysis of a biologically well-understood test case demonstrates the reliable performance of GiGA. The method automatically identifies all relevant physiological processes, puts them into context, summarizes them in an intuitive format, and associates them with the underlying evidence (Fig. 2 and 3). It can be applied to experiments with very small numbers of replicates (a single time point in the diauxic shift test case) and can be used with any available functional annotation, including protein interaction networks, co-expression data or literature mining results, as well as in areas beyond microarray analysis. For visualization, we have used the graph-layout software aiSee, but output files suitable for a variety of graphical tools can easily be generated by slight modifications in the implementation. GiGA can be used as a stand-alone tool, but we expect that it will be most useful when integrated into existing microarray analysis software, and for that reason the GiGA algorithm is freely available without restrictions.

Methods

Yeast gene expression data for the diauxic shift experiment were obtained from the Stanford Microarray Database http://genome-www5.stanford.edu/ webcite. GeneOntology annotations http://www.geneontology.org webcite were obtained from Affymetrix http://www.affymetrix.com/analysis/download_center.affx webcite. The enzyme substrate networks were built based on information contained in the annotation of the yeast proteome in the SwissProt database http://us.expasy.org/sprot/ webcite. The GiGA algorithm has been implemented as a Perl script and compiled as a Windows command line executable. These files are available (together with a manual and example files) as 1 to 8.

Additional File 1. GiGA program. For use from the Windows command line.

Format: EXE Size: 671KB Download fileOpen Data

Additional File 2. GiGA source code.

Format: PL Size: 15KB Download fileOpen Data

Additional File 3. GiGA manual. Describes the use of GiGA applied to the example data (Additional files 4 to 6).

Format: PDF Size: 104KB Download file

This file can be viewed with: Adobe Acrobat ReaderOpen Data

Additional File 4. Gene expression data. Sorted list of genes, based on expression during the yeast diauxic shift.

Format: TXT Size: 339KB Download fileOpen Data

Additional File 5. Evidence network. List of gene pairs connecting genes whenever their gene products are enzymes that share a common substrate. Based on annotation derived from SwissProt.

Format: TXT Size: 957KB Download fileOpen Data

Additional File 6. Genenames file. Contains descriptive names of the yeast genes contained in Additional file 4.

Format: TXT Size: 294KB Download fileOpen Data

Additional File 7. Example output in text format. List of significantly affected subgraphs detected in the experimental data (Additional file 4) using GiGA with default settings.

Format: TXT Size: 5KB Download fileOpen Data

Additional File 8. Example output in graph-description language format. Contains the same results as Additional file 7, but in a format that can be visualized and explored using graph-layout software, e.g. aiSee, which is freely available for academics at http://www.aisee.com webcite.

Format: GDL Size: 24KB Download fileOpen Data

Authors' contributions

RB devised and implemented the GiGA technique and drafted the manuscript. AA and PH supervised the project. All authors read and approved the final manuscript.

Acknowledgements

We thank Dr. Mike Barrett for stimulating discussions on the application of GiGA to metabolomics data. We also thank the referees for their very constructive evaluation of the software. This work was supported by BBSRC grants 17/GG17989 and 17/P17237.

References

  1. Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.

    Statistica Sinica 2002, 12:111-139. OpenURL

  2. Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: A simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments.

    FEBS Letters, in press. OpenURL

  3. Breitling R, Amtmann A, Herzyk P: Iterative Group Analysis (iGA): A simple tool to enhance sensitivity and facilitate interpretation of microarray experiments.

    BMC Bioinformatics 2004, 5:34. PubMed Abstract | BioMed Central Full Text OpenURL

  4. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data.

    Genome Biol 2003, 4:R7. PubMed Abstract | BioMed Central Full Text OpenURL

  5. Hosack DA, Dennis G., Jr., Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE.

    Genome Biol 2003, 4:R70. PubMed Abstract | BioMed Central Full Text OpenURL

  6. Kim CC, Falkow S: Significance analysis of lexical bias in microarray data.

    BMC Bioinformatics 2003, 4:12. PubMed Abstract | BioMed Central Full Text OpenURL

  7. Provart NJ, Zhu T: A Browser-based Functional Classification SuperViewer for Arabidopsis Genomics.

    Currents in Computational Molecular Biology 2003, 2003:271-272. OpenURL

  8. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data.

    Genome Biol 2003, 4:R28. PubMed Abstract | BioMed Central Full Text OpenURL

  9. Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks.

    Bioinformatics 2002, 18:S233-S240. PubMed Abstract | Publisher Full Text OpenURL

  10. DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale.

    Science 1997, 278:680-686. PubMed Abstract | Publisher Full Text OpenURL