DNA microarray data and contextual analysis of correlation graphs

Rougemont, Jacques; Hingamp, Pascal

doi:10.1186/1471-2105-4-15

Research article
Open access
Published: 29 April 2003

DNA microarray data and contextual analysis of correlation graphs

Jacques Rougemont¹ &
Pascal Hingamp¹

BMC Bioinformatics volume 4, Article number: 15 (2003) Cite this article

12k Accesses
26 Citations
Metrics details

Abstract

Background

DNA microarrays are used to produce large sets of expression measurements from which specific biological information is sought. Their analysis requires efficient and reliable algorithms for dimensional reduction, classification and annotation.

Results

We study networks of co-expressed genes obtained from DNA microarray experiments. The mathematical concept of curvature on graphs is used to group genes or samples into clusters to which relevant gene or sample annotations are automatically assigned. Application to publicly available yeast and human lymphoma data demonstrates the reliability of the method in spite of its simplicity, especially with respect to the small number of parameters involved.

Conclusions

We provide a method for automatically determining relevant gene clusters among the many genes monitored with microarrays. The automatic annotations and the graphical interface improve the readability of the data. A C++ implementation, called Trixy, is available from http://tagc.univ-mrs.fr/bioinformatics/trixy.html.

Background

Measurements of gene expression levels by microarray experiments create a high-throughput of data, the interpretation of which increasingly requires novel and efficient dimensionality reduction strategies. Many clustering methods have been proposed (see for example [1–5] and the more comprehensive reviews [6, 7]) and are widely used. These algorithms group genes and/or samples into clusters of similar expression profiles, in order to suggest possible functional relationships between them. The importance of graphical representations and of automatic cluster annotations stands out from many recent publications [1, 8–12] devoted to gene functions prediction, prognosis or diagnosis of cancer subtypes for instance.

Similar problems arise in the analysis of large interaction networks [13–18] where one tries to extract sub-networks satisfying some significance criteria. The problem of finding web pages dedicated to the same topic is an example that will appeal to the experience of every reader (in this case the network's nodes are the URLs, with HTML links).

We propose a new method which combines one of these network analysis techniques with the classical correlation-based clustering for studying DNA microarray data. It provides a novel graphical representation, a cluster forming rationale and cluster annotations through correlation with gene or sample keywords. The algorithm relies on only two user-controlled parameters, therefore sensitivity of the results to a particular choice of parameters can be checked effectively.

The algorithm is based on the notion of curvature introduced in [13] (this is the same as the clustering coefficient of [19]), which we apply to the network of co-expressed genes where nodes are genes (or samples) and links symbolize co-expression. We define clusters as connected regions of the graph with high curvature, which is the local density of triangular relations. The gene or sample clusters are the densest regions of the corresponding correlation graph, which we will show has biological relevance as intuitively expected. We must emphasize that curvature is typically extremely low in random graphs that have small average degree compared to the number of nodes (which is usually the case in biological networks [14, 20]). Clusters of high curvature are thus highly non-random structures.

We have implemented these concepts in the freely available program Trixy. It is a graphical interface for visualising the graph, the clusters and the automatic annotations providing a straightforward tool for exploring microarray data. The C++ source code and sample Perl parsers are freely available from http://tagc.univ-mrs.fr/bioinformatics/trixy.html. We also provide the data files adapted from the original yeast [1] and lymphoma [21] sets as examples. We have compiled and used the program on both Linux and Windows platforms. Compiling on other platforms has not been attempted but is theoretically possible.

On the performance side, clustering and display with Trixy requires CPU time and memory size comparable to hierarchical clustering as performed in [1].

Algorithm

Curvature on Graphs

The discussion below focuses on the problem of clustering genes. The symmetric question of clustering samples can be treated similarly.

A DNA microarray data set consists of expression levels of N genes in M different experimental conditions (M different RNA samples). This is organised in an N × M matrix X_i,j, i = 1,...,N; j = 1,...,M each row of which contains the expression profile of a given gene across all samples. We are interested in patterns of co-expression, namely groups of genes with parallel or anti-parallel profiles. We measure co-expression of genes g_kand g_ℓ by the (Pearson) correlation cor(k, ℓ) between their profiles:

where μ_iand σ_idenote the mean and the standard deviation of row i. This creates a correlation matrix which is an N × N symmetric matrix (because cor(k, ℓ) = cor(ℓ, k)).

We construct a correlation graph as follows. We first make a node n for each gene. We then choose a threshold T_cor ∈ [0,1] and draw a link between genes g_kand g_ℓ if cor(k, ℓ) ≥ T_cor. This can be understood as follows: a graph with N nodes is defined by its adjacency matrix A (the N × N matrix such that A_i,j= 1 if i and j are joined, 0 otherwise [22]). We obtain A from the correlation matrix by binarisation: we replace cor(k, ℓ) by 0 if it lies between -T_cor and T_cor and by 1 otherwise.

We next introduce the concept of curvature (or clustering coefficient) on a graph [13, 19]. Each node n has a curvature which is a function of the number v of neighbours (nodes to which it is linked) and the number t of triangles (pairs of adjacent neighbours, see Figure 1), given by the formula

Remark that v(v - 1)/2 is the maximum number of triangles that can be drawn on v neighbours hence curv(n) lies between 0 and 1 if v > 1 and is undefined otherwise (see Figure 1 for examples of graphs and curvature).

There is a natural notion of distance between nodes in a graph [22]: it is the number of links in the shortest path connecting them (distance is infinite if there is no such path). Let d_n(i, j) be the distance between the i th and j th neighbours of n: either d_n(i, j) = 1 (these two neighbours are linked) or d_n(i, j) = 2 (they are not, the shortest path goes through n). A simple computation shows that

where <d_n> is the average distance between pairs of neighbours of n. Hence one can picture high curvature as high local density (low average distances).

Given a curvature threshold T_curv ∈ [0; 1] we select a subgraph by deleting all nodes with a curvature below T_curv as well as their links. This splits the graph into several connected components. Each such connected component will be interpreted as a coherent cluster of co-expressed genes (see Figure 2). Varying T_curv adds or removes nodes and links and thus modifies the clusters displayed (possibly merging or splitting some of the clusters). The reader can think of the graph as a sea bed, curvature being height. The curvature threshold T_curv is the sea level and the clusters are the emerging islands. Changing the correlation threshold T_cor changes the landscape while changing the curvature threshold T_curv only moves the sea level up and down.

Consider the Internet analogy of a University web site: the index page has many links to all department's web pages. It is unlikely that, for instance, the biology department's page provides a link to the literature department's page. Therefore, the index page will surely have a small curvature (few of its neighbours have links between them). However, the home page of the biology department has external links to biology departments in other universities with which it has common interests. These other pages will certainly also have external links to many of the same pages, again because they share similar interests. Therefore, a cluster of high curvature will emerge, comprising all the biology departments web pages. This reasoning applies to virtually any communication network and we demonstrate below that it can also be usefully applied to correlation graphs of gene expression profiles.

The program Trixy implements the algorithm described above in a user-friendly graphical interface. It is written in C++ using the free Qt graphical library. It uses embedded Perl for parsing data inputs, which has the advantage that loading data saved under a new format only requires rewriting a Perl script which can be picked at run time. We have mostly used Trixy for clustering genes, but sample clustering can also be performed simply by using a modified parser which rotates the matrix. Similarly, an appropriate Perl script could simply fetch gene annotations from web servers such as http://www.geneontology.org/ rather than read them from a local file.

Normalisation and Parameters

A few simple data processing tools are provided in Trixy: log transform, samples centering (by subtracting the mean or the median) and samples reduction (division by the standard deviation). After these operations have been performed, the correlation matrix is computed and the curvature of each node is deduced from it. At this point, the user can view (using the "Eisengram" standard colour representation of the matrix, such as Figure 4) or save the resulting data set as a flat file. The graph is then built and displayed (as in Figure 2). Although the correlation threshold T_cor is set before loading the data, the curvature threshold T_curv can be varied as the graph is displayed. Starting from an initially high value of T_curv and lowering it progressively unveils new nodes and new clusters. It increases the size of existing clusters, sometimes merging several of them (Figure 2). This gives a feeling for the robustness of the clustering and for the closeness of clusters.

Our advice for the choice of T_cor is to set it to a value which retains only links significantly stronger than expected by pure chance (this depends on the particular data set and can be determined by bootstrapping, see e.g. [5]). The parameter T_curv is different. We have observed that the best value is often cluster-dependent. We have a more dynamic view on this parameter: the way clusters change as T_curv moves is informative. A good way of picking the best threshold is by maximising the annotation scores (see below).

Automatic Annotation

Trixy allows the user to provide annotation files for sample and genes. They consist of a list of keywords associated with each of the gene and/or sample names.

On the one hand, a cluster of genes can be associated with an over-represented gene keyword by giving a score to each annotation equal to its frequency in the cluster.

On the other hand, for sample annotations, a correlation score is computed. Suppose a cluster consists of genes g₁,..., g_K. For each sample keyword W, we create a discriminating vector g₀ which takes the value 1 on each sample associated with W and -1 otherwise. The annotation score is the average absolute correlation with keyword .

Both scores yield numbers between 0 and 1, the closer to 1 the more significant the annotation. We discard annotations that were not present for at least 10% of the samples and 2 of the genes in each cluster.

Visualisation

The graph is displayed with a different colour code for links representing positive or negative correlations (in Figure 2, negative correlations are shown in green, see also Figure 7).

Each cluster can be selected and the corresponding data subset viewed (as a colour-coded table such as Figure 4). If annotations were provided, those with the highest scores are listed and the cluster can be saved as a data file, gene list or colour picture.

Results

Yeast Gene Expression Data

We have applied our algorithm to the data set of gene expression of the budding yeast Saccharomyces cerevisiae available from the website http://rana.lbl.gov/EisenData.htm and described in [1]. We have used Gene Ontology gene annotations from the Saccharomyces Genome Database (SGD) [23]. The sample keywords were extracted from the original expression data file and in this case do not yield interpretable annotations (see the lymphoma section below for a more convincing example of the usefulness of sample annotations).

Even with threshold values as high as T_cor = 0.90 and T_curv = 0.70, we get clearly delineated clusters (see Figure 3). Note that only 263 out of a total of 6221 genes have positive curvature at this value of the correlation threshold, yielding 2075 links.

Most of the clusters obtained appear biologically coherent. For example the chromatin assembly cluster contains all the 9 histone genes for T_cor = 0.80 and T_curv = 0.64 (Figure 3, cluster A: it only shows 7 of the genes at this level of correlation). It is disconnected from the rest of the graph and extremely robust with respect to changes in the parameters. The ubiquitin dependent protein catabolism (Figure 4) cluster appears at a much lower curvature but is extremely coherent with 17 out of 17 proteolysis genes. For the sake of comparison, a proteasome cluster of similar size obtained using hierarchical clustering contains 3 genes unrelated to proteolysis.

Much larger clusters are also visible, such as the protein biosynthesis cluster (Figure 3, cluster B and Figure 2). However, it is very sensitive to variations of the thresholds and can include over 900 genes.

B-cell Lymphoma

We have also used the data set of the lymphoma study available from the website http://llmpp.nih.gov/lymphoma/ and published in [21]. Sample and gene annotations were extracted from the names included in the data file. We were in particular interested in the classification of the tumor subtypes called chronic lymphocytic leukaemia (CLL) and diffuse large B-cell lymphoma (DLCL).

Using a threshold T_cor = 0.80 we obtain nicely discriminating clusters, for example Figure 5 which separates the CLL tumors and Figure 6 which sorts the DLCL tumors. The latter cluster eventually merges with a cluster of Interferon-induced genes (data not shown) as the curvature threshold decreases: they are like two hills on the same island.

We also have a good example of a property that is often observed with our graphical representation: negative correlations are much rarer than positive ones and are carried by just a few nodes, which are almost certainly represser genes. Figure 7 shows a small part of the graph where 7 nodes are mostly anti-correlated with the rest of a cluster of 307 genes. However, let us emphasize that the graphs shown here do not represent gene interaction networks per se, they are merely a means of clustering genes co-expressed within the selected samples.

This data set provides a good test bed for sample clustering. In this case, the graph's nodes are the samples and links denote a correlation between samples (columns of the expression level matrix X). As shown in Figure 8, the result is clear cut: all clusters are associated with a single sample subtype, be it B-cells, T-cells, FL (follicular lymphoma) or CLL.

Statistical Validation

As stated earlier, mathematical theory shows that random graphs have a very low number of triangles [14, 20]. We can check this statement numerically by using random permutations of a real data set. We have randomly reordered the yeast data, gene by gene (a random permutation in each line), computed the curvature of each node, and repeated this operation 1000 times. At T_cor = 0.6 we deduced a probability of positive curvature of 5 × 10^-5, the probability of having a degree larger than 1 was 2 × 10^-4 (maximum degree observed: 6). At T_cor = 0.7 the probability of positive curvature was 10^-6. The comparison with the curvature distribution of the real biological data is displayed in Figure 9.

Statistical validation of an annotation by a particular keyword can be performed with similar methods. For example in the case of the clusters shown in Figure 5,6, we have performed 10,000 random permutations of the sample keywords. We have obtained a maximum annotation score of 0.315 for the CLL cluster and the mean score plus two standard deviations was equal to 0.177, whilst the annotation score of the original cluster was 0.510. Similarly for the DLCL cluster, the maximum score after 10, 000 permutations was 0.367, the mean score plus two standard deviations 0.172, whilst the original score was 0.729.

Permutations of gene keywords can be computed explicitely. For example 66 of the 6221 yeast genes have the GO annotation ubiquitin dependent protein catabolism. The probability of having 17 of them in the same cluster of size 17 (see Figure 4) is of the order of 10^-35.

Discussion and Conclusions

We have described an algorithm for visualising and analysing large microarray data sets. It combines traditional correlation distances and new graph-theoretical ideas. We have implemented this algorithm in a convenient graphical interface and evaluated its performance on well established data sets.

Curvature thresholds split the graph into clusters which appear to be biologically meaningful. An automatic annotation procedure associates keywords with clusters, which are consistent with previous publications [1, 21].

Our approach uses a local analysis of the correlation graph as opposed to global properties such as small-world [19] or scale-free properties [24]. Hubs [25] are not seen as relevant in our approach since they usually make little contribution to the information carried by the graph (like index pages in the worldwide web). The closest approach to ours is that of [14, 15] in the sense that triangles are network motifs, but again, we only draw conclusions from their local density not their frequency at the level of the whole graph. This also emphasizes the difference between our method and the more commonly used hierarchical clustering [1]: because we focus on triangular relations (rare motif in graphs) rather than simple links (very common motif), we obtain a drastic dimensional reduction (see Figure 3) whereas hierarchical clustering retains all the data and does not in itself delineate clusters. Furthermore the stronger constraint offered by triangular links as opposed to single link methods ensure more coherent clusters.

Further Developments

Future development should include finer statistical analysis tools to validate the automatic annotations. In particular a bootstrap validation of a discriminant score [9] would be more accurate that the correlation score explained in the methods, which detects consistency with the annotation rather than actual discrimination. Also, more sophisticated methods for determining optimal annotations exist in the literature and could be applied to our clusters (see e.g. [11]).

A method for determining a natural correlation threshold T_cor would be most welcome (such methods have been discussed in [26, 27]). It would leave only one free parameter, the curvature threshold T_curv Again a bootstrap calculation could provide an estimate of a significant deviation from average random correlation. It was also suggested to use a hierarchical construction of the graph: first use the strongest links (largest correlations) to build small clusters, then link clusters with weaker links and continue until all nodes belong to the same cluster. Varying T_curv would subsequently split this unique cluster into significant parts. Memory and speed limitations may hamper these developments.

References

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc NatL Acad Sci USA 1998, 95: 14863–14828. 10.1073/pnas.95.25.14863
Article PubMed Central CAS PubMed Google Scholar
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999, 96: 2907–2912. 10.1073/pnas.96.6.2907
Article PubMed Central CAS PubMed Google Scholar
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics 1999, 22: 281–285. 10.1038/10343
Article CAS PubMed Google Scholar
Gasch AP, Eisen MB: Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 2002, 3: Research0059. 10.1186/gb-2002-3-11-research0059
Article PubMed Central PubMed Google Scholar
Herrero J, Valencia A, Dopazo J: A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 2001, 17: 126–136. 10.1093/bioinformatics/17.2.126
Article CAS PubMed Google Scholar
Shamir R, Sharan R: Algorithmic approaches to clustering gene expression data. In Current Topics In Computational Molecular Biology (Edited by: Jiang T, Xu Y, Smith T). 2002, 269–300.
Google Scholar
Tamames J, Clark D, Herrero J, Dopazo J, Blaschke C, Fernandez JM, Oliveros JC, Valencia A: Bioinformatics methods for the analysis of expression arrays: data clustering and information extraction. J Biotechnol 2002, 98: 269–283. 10.1016/S0168-1656(02)00137-2
Article CAS PubMed Google Scholar
Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D: Support vector machine classification and validation of cancer tissue samples using microarray data. Bioinformatics 2000, 16: 906–914. 10.1093/bioinformatics/16.10.906
Article CAS PubMed Google Scholar
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531
Article CAS PubMed Google Scholar
Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics 2002, 31: 255–265. 10.1038/ng906
Article CAS PubMed Google Scholar
Pe'er D, Regev A, Tanay A: Minreg: inferring an active regulator set. Bioinformatics 2002, 18: s258-s267.
Article PubMed Google Scholar
Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci USA 2002, 99: 12783–12788. 10.1073/pnas.192159399
Article PubMed Central CAS PubMed Google Scholar
Eckmann JP, Moses E: Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc Natl Acad Sci USA 2002, 99: 5825–5829. 10.1073/pnas.032093399
Article PubMed Central CAS PubMed Google Scholar
Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli . Nature Genetics 2002, 31: 64–68. 10.1038/ng881
Article CAS PubMed Google Scholar
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs: Simple building blocks of complex networks. Science 2002, 298: 824–827. 10.1126/science.298.5594.824
Article CAS PubMed Google Scholar
Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Gene 2001, 38: A21–28. 10.1038/88213
Google Scholar
Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 2000, 97: 12182–12186. 10.1073/pnas.220392197
Article PubMed Central CAS PubMed Google Scholar
Yanai I, DeLisi C: The society of genes: networks of functional links between genes from comparative genomics. Genome Biology 2002, 3: Research0064.
PubMed Central PubMed Google Scholar
Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 440–442. 10.1038/30918
Article CAS PubMed Google Scholar
Collet P, Eckmann JP: The number of large graphs with a positive density of triangles. J Stat Phys 2002, 109: 923–943. 10.1023/A:1020489507547
Article Google Scholar
Alizadeh AA, Eisen MB, Davis RE, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403: 503–522. 10.1038/35000501
Article CAS PubMed Google Scholar
Van Lint JH, Wilson RM: A course in combinatorics. Cambridge, Cambridge University Press 2001.
Google Scholar
Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, et al.: Saccharomyces genome database (SGD) provides secondary gene annotation using the gene ontology (GO). Nucleic Acids Res 2002, 30: 69–72. 10.1093/nar/30.1.69
Article PubMed Central CAS PubMed Google Scholar
Jeong H, Tombor B, Albert R, Oltval ZN, Barabasi AL: The large-scale organization of metabolic networks. Nature 2000, 407: 651–654. 10.1038/35036627
Article CAS PubMed Google Scholar
Featherstone DE, Broadie K: Wrestling with pleiotropy: Genomic and topological analysis of the yeast gene expression network. Bio Essays 2002, 24: 267–274. 10.1002/bies.10054
CAS Google Scholar
Domany E: Cluster analysis of gene expression data. J Stat Phys 2003, 110: 1117–1139. 10.1023/A:1022148927580
Article CAS Google Scholar
Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf lntell Syst Mol Biol 2000, 8: 93–103.
CAS Google Scholar

Download references

Acknowledgments

We thank D. Gautheret for critically reading a first version of the manuscript. J.R. is grateful to J.-P. Eckmann and E. Moses for useful comments on earlier versions of the program and for detailed explanations of their results. Part of this research was supported by the Temblor project EU grant QLRT-2001-00015.

Author information

Authors and Affiliations

TAGC, INSERM-ERM 206, Parc Scientifique de Luminy Case 906, 13288, Marseille, Cedex 09, France
Jacques Rougemont & Pascal Hingamp

Authors

Jacques Rougemont
View author publications
You can also search for this author in PubMed Google Scholar
Pascal Hingamp
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jacques Rougemont.

Additional information

Authors' Contributions

JR conceived of the study and carried out all programming as a postdoc in the group of PH. PH supervised the study and provided valuable input for the analyses and biological interpretations. Both authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Rougemont, J., Hingamp, P. DNA microarray data and contextual analysis of correlation graphs. BMC Bioinformatics 4, 15 (2003). https://doi.org/10.1186/1471-2105-4-15

Download citation

Received: 23 December 2002
Accepted: 29 April 2003
Published: 29 April 2003
DOI: https://doi.org/10.1186/1471-2105-4-15

DNA microarray data and contextual analysis of correlation graphs