Email updates

Keep up to date with the latest news and content from BMC Proceedings and BioMed Central.

This article is part of the supplement: Selected Proceedings of Machine Learning in Systems Biology: MLSB 2007

Open Access Proceedings

Identification of functional modules based on transcriptional regulation structure

Etienne Birmelé1*, Mohamed Elati2*, Céline Rouveirol2 and Christophe Ambroise1*

Author Affiliations

1 Laboratoire Statistique et Génome, UMR CNRS 8071, INRA 1152, Tour Evry 2, F-91000 Evry, France

2 LIPN – UMR 7030 CNRS – Université Paris 13, 99 Av. J.B. Clément, F-93430 Villetaneuse, France

For all author emails, please log on.

BMC Proceedings 2008, 2(Suppl 4):S4  doi:

The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1753-6561/2/S4/S4


Published:17 December 2008

© 2008 Birmelé et al; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Identifying gene functional modules is an important step towards elucidating gene functions at a global scale. Clustering algorithms mostly rely on co-expression of genes, that is group together genes having similar expression profiles.

Results

We propose to cluster genes by co-regulation rather than by co-expression. We therefore present an inference algorithm for detecting co-regulated groups from gene expression data and introduce a method to cluster genes given that inferred regulatory structure. Finally, we propose to validate the clustering through a score based on the GO enrichment of the obtained groups of genes.

Conclusion

We evaluate the methods on the stress response of S. Cerevisiae data and obtain better scores than clustering obtained directly from gene expression.

Background

An important step in analyzing gene functions is to cluster genes according to their expression patterns. Such clusters can then be analyzed in several ways, for example by assigning unannotated genes to the majority function of each cluster's genes (see [1] for a review).

However, this approach has several limitations. On the one hand, genes of similar expression patterns may not necessarily have the same or similar functions; on the other hand, genes with related functions may not show close correlation in their expression patterns. For example, a transcription factor can activate some genes and repress others in the same pathway.

The principal assumption of this paper is that unsupervised clustering of genes on the basis of similar regulators (activators/inhibitors) should assemble functional co-regulated groups of genes. To compute a similarity measure between genes as a function of inferred regulators of these genes, we use the output of a data mining algorithm called LICORN [2], that infers cooperative regulation relations from expression data only. The resulting similarity matrix between genes is considered as the adjacency matrix of a weighted graph. Clustering is then performed to find functional modules of genes in the network.

To objectively evaluate clustering, we use Gene Ontology to determine if the obtained clusters can be associated with terms of the Biological Process ontology. The strength of such an association is given by a p-value from an Hypergeometric test. We compare different clusterings by calculating a score based on the p-values which becomes greater when the significant p-values are smaller and more numerous.

In section Methods, we introduce our model of gene regulation and briefly describe a data mining algorithm for inferring large-scale cooperative gene regulation. We then propose a similarity measure for the genes based on the inferred regulator sets and define the score of a clustering. Finally, in section Results and discussion, we evaluate our system on a yeast data set.

Methods

Cooperative regulation networks

Let us denote by <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M1">View MathML</a> the set of genes with a known or putative regulation activity and <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M2">View MathML</a> as the set of genes without such an activity. The input of the mining method is a discretised expression matrix for genes of <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M3">View MathML</a>. Each expression value can take the value -1 (under-expressed), 0 (normal), or 1 (over-expressed). A gene regulatory network (GRN) associated with a target gene g is a pair (A, I), where <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M4">View MathML</a> is a co-activator set, and <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M5">View MathML</a> is a co-inhibitor set. The set of GRNs for all target genes can also be seen as a bipartite graph where the top layer contains regulators, the bottom layer contains target genes, and edges code for a regulatory interaction between regulators and target genes, each edge being labelled with a regulatory mode (i.e., activator or inhibitor). The regulation relations we are interested in are combinatorial: each target gene has a number of activators and/or inhibitors. Activators on one side and inhibitors on the other side are aggregated in our model through an extended logical AND, i.e., a regulator set S (activator or repressor) is over-expressed (resp. under-expressed) if and only if all the regulators in S are over-expressed (resp. under-expressed). Finally, we describe in Figure 1 a discrete function called Regulatory Program RP, which, given the combined states of activators A and inhibitors I of g in a sample s computes <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6">View MathML</a>s(A, I), the estimated state of g in s. The main features of our regulation model are therefore the explicit representation of activation and repression relationships for a given target gene, and the representation of co-operative transcriptional regulation.

thumbnailFigure 1. The regulatory program. Definition of the regulatory program RP, which can be interpreted as follows: i) If GRN contains co-activators only, <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6">View MathML</a>(A, I) corresponds to the aggregated status of these co-activators. ii) If GRN contains co-inhibitors only, <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6">View MathML</a>(A, I) is the inverse of the aggregated status of these co-inhibitors. iii) Otherwise, <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6">View MathML</a>(A, I) depends on a combination of the statuses of co-activators and co-inhibitors, as described by the matrix on the right. For example, <a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M6">View MathML</a>(A, I) = 1 when the co-activators are over-expressed and the co-inhibitors are not.

Learning algorithm

We have recently proposed [2] an original, scalable technique called LICORN for deriving co-operative regulations, in which many co-regulators act together to activate or repress a target gene. LICORN uses an original heuristic approach to accelerate the search for an appropriate structure for the regulation network. It first computes frequent co-regulator sets, i.e., regulator sets that frequently occur together as over (1) or under (-1)-expressed in the discretised expression matrix. This is done by using an extension of the Apriori algorithm [3] to handle both 1 and -1 supports (The x-support of a co-regulator C in the three-valued expression matrix is the set of samples that include all the regulators of C with the state x).

From this representation, a limited subset of candidate co-regulator sets is then associated with each gene. The learning algorithm looks for each gene for regulator sets which have a high "overlap" with the target gene. Intuitively, the overlap constraint checks the size of the intersection between supports of the target gene and a given candidate co-regulator set. A candidate activator set for a target gene g is frequently over-expressed when g is over-expressed or frequently under-expressed when g is under-expressed. On the opposite, a candidate repressor set for a target gene g is frequently over-expressed when g is under-expressed and vice-versa. This search can be efficiently performed because of the property of anti-monotonicity of the overlap constraint with respect to set inclusion. Then, once a limited number of candidate activator and inhibitor sets have been obtained, exhaustive search for the best gene regulatory network can be performed. Finally, a permutation-based procedure is used for selecting statistically significant regulation relations. We have shown in [2] that the co-operative regulation patterns inferred by LICORN cannot be identified by clustering or pairwise methods, and are only partly revealed by constrained Bayesian or decision tree-based techniques, such as those used in previous studies [4,5].

Identification of functional co-regulation modules

Partial overlap of the regulator sets for a set of target genes can be used as an alternative measurement of the distance between genes.

Computation of the co-regulation matrix

We design the co-regulation matrix by using a similarity measure defined as follows: let λ ∈ [0, 1] and (g1, g2) be two genes. The similarity between g1 and g2 is defined by

<a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M7">View MathML</a>

where |A| and |I| are respectively the number of activators and inhibitors of both g1 and g2, |AI| is the number of regulators which activate one gene and inhibit the other and |TF| is the number of transcription factors regulating at least one of the genes. This similarity considers two genes as being far appart (ϕ(g1, g2) = 0) if they do not share any regulators. Two genes are considered most similar if their set of activators and inhibitors are exactly the same (ϕ(g1, g2) = 1). In intermediate situations, λ represents the weight given to common regulators which have opposite effects.

Clustering

To cluster genes from the similarity matrix, we use the MCL algorithm [6,7]. That algorithm, based on the fluxes in a graph, is well suited to weighted graphs and does not require any prior knowledge about the number of clusters. Moreover, it does not require any initial conditions and is therefore reproducible. The inflation parameter of the algorithm is fixed to 1.8, as suggested in [8].

Mapping to GO-terms

To assess the functional significance of obtained clusters, and suggest putative functions for genes with unknown functions, we calculate the enrichment of gene ontology (GO) [9].

To determine the over-represented GO terms in each cluster, we apply the R package GOstats [10] with a p-value cut-off of 5% and the biological process ontology. For each cluster C, we obtain a set TC of GO terms over-represented in C with a rate of 5% and a set of associated p-values {pt, t TC}.

We define the score of the clustering by

<a onClick="popup('http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.biomedcentral.com/1753-6561/2/S4/S4/mathml/M8">View MathML</a>

where Ct is the set of genes of C associated with the GO term t. Parameters cmin and cmax allow us to avoid clusters that are too small, which don't have a biological meaning, as well as too big ones, which don't have any functional unity.

Results and discussion

As a proof of concept, we used gene expression data sets for S. Cerevisiae. The Gasch data set [11] measures the response of yeast to 173 stress conditions for 6152 genes. We used a set of 237 known and putative transcription factors.

We applied LICORN and retained only those GRNs (gene regulatory networks) identified as significant with a 5% FDR level (see [2] for details). 2041 GRNs (of 5703 GRNs) were identified as significant. The structural organization of the learned GRNs has been shown to be consistent with recent advances [12] concerning the characterization of topological transcriptional network features in yeast and provide the first evidence of the relevance of inferred GRNs.

In order to choose the parameter λ for the similarity matrix, we computed the matrices and the associated clusterings for several values of λ and compared their scores with parameters cmin = 5 and cmax = 200. Figure 2 shows that the best one is obtained for λ = 0.1.

thumbnailFigure 2. Score and number of p-values for λ varying from 0 to 1.

For λ = 0.1, the clustering gives 30 clusters among which one is too big to be considered (407 genes) and 3 have less than 5 genes. Table 1 gives the best GO term association for the 10 best of them when ranked according to their best p-value. The biological evaluation of these clusters is ongoing.

Table 1. GO-enrichment of the clusters obtained for S. Cerevisiae.

The cluster number 15 that appears on the second line of Table 1 is in fact associated to 32 GO terms with a p-value lower than 1e – 07, most of those terms being related to phosphorylation or triphosphate metabolic process. Moreover, five genes of that cluster belong to the 167 genes having no associated GO term, namely the genes YLR296W, YDR215C, YBL044W, YIR040C and YPR027C. All of them appear in the Entrez gene database but without known functions.

We have finally validated our method by comparing clustering performances based on other similarity matrices. We therefore have computed from the original expression data matrices of euclidian distance, partial correlation [13] and mutual information [14]. To compare clustering results with the same number of clusters, we used the hierarchical clustering method AGNES [15] to cluster the genes in 20, 30, 40, and 50 groups. Figure 3 shows the scores for those three methods as well as for ours with λ = 0.1. It clearly shows that inferring the regulatory network from LICORN preprocessing improves the score of the clustering and provide more biologically relevant clusters.

thumbnailFigure 3. Comparison of the clustering based on LICORN with existing methods. Figure of the scores obtained for hierarchical clustering into 20, 30, 40 and 50 clusters. The red circles are the scores obtained for the similarity matrix given by LICORN and λ = 0.1. The similarity measures which are compared to are euclidian distance, partial correlation and mutual information.

Conclusion

The problem of discovering functional modules from expression data is both biologically important and computationally challenging. From a biological perspective, identifying members of functional modules is the first step toward understanding the regulatory network of the cell. We provide here an alternative way for constructing gene modules: genes are clustered in the same module if they share many regulators, as they have been inferred by LICORN from gene expression data. We expect this way of clustering will discover modules that are out the scope of classical co-expression clustering techniques. From a computational perspective, one of the key challenges is dealing with over-fitting as the number of data samples is so small.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ME and CR designed the inference and participated in experimentations and drafting the manuscript. EB and CA proposed the clustering and evaluation methods.

Acknowledgements

Te authors would like to thank Monique Bolotin-Fukuhara for her helpful comments.

This article has been published as part of BMC Proceedings Volume 2 Supplement 4, 2008: Selected Proceedings of Machine Learning in Systems Biology: MLSB 2007. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/2?issue=S4.

References

  1. Armstrong N, Wiel M: Microarray data analysis: from hypotheses to conclusions using gene expression data.

    Cellular Oncology 2004., 26 PubMed Abstract | Publisher Full Text OpenURL

  2. Elati M, Neuvial P, Bolotin-Fukuhara M, Barillot E, Radvanyi F, Rouveirol C: LICORN: learning co-operative regulation networks from gene expression data.

    Bioinformatics 2007, 23:2407-2414. PubMed Abstract | Publisher Full Text OpenURL

  3. Agrawal R, Imielinski T, Swami A: Mining Association Rules between sets of items in large databases.

    Proceedings of the International Conference on Management of Data 1993, 207-216. OpenURL

  4. Pe'er D, Regev A, Tanay A: Minreg: inferring an active regulator set.

    Bioinformatics 2002, 18(Suppl 1):S258-S267. PubMed Abstract | Publisher Full Text OpenURL

  5. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data.

    Nature Genetics 2003, 34:166-176. PubMed Abstract | Publisher Full Text OpenURL

  6. van Dongen S: Graph Clustering by Flow Simulation. PhD thesis. University of Utrecht; 2000. OpenURL

  7. Enright A, Ouzounis C, van Dongen S: An efficient algorithm for large-scale detection of protein families.

    Nucleic Acids Research 2002, 30(7):1575-1584. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  8. Brohée S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks.

    BMC Bioinformatics 2006, 7:488. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  9. Cherry J, Adler C, Ball C, Chervitz S, Dwight S, Hester E, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database.

    Nucleic Acids Res 1998, 26:73-79. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association.

    Bioinformatics 2007, 23(2):257-258. PubMed Abstract | Publisher Full Text OpenURL

  11. Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes.

    Mol Biol Cell 2000, 11(12):4241-57. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Guelzim N, Bottani S, Bourgine P, Képès F: Topological and causal structure of the yeast transcriptional regulatory network.

    Nature Genetics 2002, 31:60-63. PubMed Abstract | Publisher Full Text OpenURL

  13. Opgen-Rhein R, Strimmer K: Inferring gene dependancy networks from genomic longitudinal data: a functional data approach.

    REVSTAT 2006, 4:53-65. OpenURL

  14. Meyer P, Kontos K, Lafitte F, Bontempi G: Information-theoretic inference of large transcriptional regulatory networks.

    EURASIP Journal of Bioinformatics and Systems Biology 2007. OpenURL

  15. Kaufman L, Rousseeuw P: Finding Groups in Data: an Introduction to Cluster Analysis. Wiley, New-York; 1990. OpenURL