Abstract
Background
microRNAs (miRNAs) are a class of small noncoding RNAs which have been recognized as ubiquitous posttranscriptional regulators. The analysis of interactions between different miRNAs and their target genes is necessary for the understanding of miRNAs' role in the control of cell life and death. In this paper we propose a novel data mining algorithm, called HOCCLUS2, specifically designed to bicluster miRNAs and target messenger RNAs (mRNAs) on the basis of their experimentallyverified and/or predicted interactions. Indeed, existing biclustering approaches, typically used to analyze gene expression data, fail when applied to miRNA:mRNA interactions since they usually do not extract possibly overlapping biclusters (miRNAs and their target genes may have multiple roles), extract a huge amount of biclusters (difficult to browse and rank on the basis of their importance) and work on similarities of feature values (do not limit the analysis to reliable interactions).
Results
To overcome these limitations, HOCCLUS2 i) extracts possibly overlapping biclusters, to catch multiple roles of both miRNAs and their target genes; ii) extracts hierarchically organized biclusters, to facilitate bicluster browsing and to distinguish between universe and pathwayspecific miRNAs; iii) extracts highly cohesive biclusters, to consider only reliable interactions; iv) ranks biclusters according to the functional similarities, computed on the basis of Gene Ontology, to facilitate bicluster analysis.
Conclusions
Our results show that HOCCLUS2 is a valid tool to support biologists in the identification of contextspecific miRNAs regulatory modules and in the detection of possibly unknown miRNAs target genes. Indeed, results prove that HOCCLUS2 is able to extract cohesivenesspreserving biclusters, when compared with competitive approaches, and statistically confirm (at a confidence level of 99%) that mRNAs which belong to the same biclusters are, on average, more functionally similar than mRNAs which belong to different biclusters. Finally, the hierarchy of biclusters provides useful insights to understand the intrinsic hierarchical organization of miRNAs and their potential multiple interactions on target genes.
Background
miRNAs are posttranscriptional regulators which represent one of the major regulatory gene families in animals, plants and viruses. They bind to complementary sequences on target mRNAs, resulting in negative regulation (transcript degradation and sequestering, translational suppression) or positive regulation (transcriptional and translational activation) [1,2]. Studies on interactions between miRNAs and their target genes are of the utmost importance to understand the role of miRNAs in controlling cell processes and metabolic pathways [3] as well as to discover unknown functional synergies.
This work aims to contribute to the elucidation of miRNAs' complex biological functions by proposing a method for biclustering miRNAs and mRNAs. Biclustering is a data mining task whose goal, similar to the classical clustering, is to group together similar objects. The difference is that objects that fall in the same cluster are of two different types. Furthermore, objects of one type are clustered together according to their relationships with objects of the other type (symmetrically).
The method we propose identifies (possibly unknown) highly connected networks of miRNAs and mRNAs, that is, regulatory networks/modules. Thus, the aim is to provide the biologists with a tool which can support them in two challenging tasks: the identification of contextspecific miRNAs regulatory modules and the detection of (possibly previously unknown) miRNAs target genes.
As recognized in [4], the problem of discovering regulatory modules that control gene transcription in biological model systems can be solved by applying biclustering algorithms. Consequently, several papers in the literature apply biclustering in the biological domain [59]. However, they work on gene expression data and not on miRNA:mRNA interactions. In order to work properly on miRNA:mRNA interactions, some important issues have to be considered. In particular, extracted biclusters should be:
• Possibly overlapping, since mRNAs and miRNAs can be involved in multiple regulatory networks [10]. Ignoring this aspect would lead to the identification of incomplete interaction networks.
• Hierarchically organized. This organization facilitates the interpretation of results, even when a high number of biclusters is extracted. Moreover, it opens the opportunity to consider an intrinsic hierarchical organization of miRNAs, where it is possible to distinguish between miRNAs involved in many signaling pathways (universe miRNAs) and pathwayspecific miRNAs (intrapathway miRNAs). The latter aspect has recently been considered an important issue that deserves deeper investigation [11].
• Highly cohesive. This means that miRNAs and mRNAs in the same bicluster should be highly related and show (only) reliable interactions. This is different from what biclustering methods specifically designed for gene expression data do, that is, grouping together genes and conditions with similar (both high and low) expression values.
We propose an algorithm for the efficient discovery of overlapping, hierarchically organized and highly cohesive biclusters. Biclusters are extracted from a dataset of experimentally verified miRNA:mRNA interactions, i.e. miRTarBase [12], as well as from miRNAs target prediction datasets extracted from mirDIP [11]. In the latter case, the integration of different miRNA target prediction algorithms contributes to reducing the impact of noise (i.e., false positives) on the significance of the resulting biclusters.
Besides the extraction and evaluation of potential regulatory modules (expressed as biclusters), this paper provides a way to systematically assess the actual role of miRNAs in biclusters in the control of biological processes [3] in which their target mRNAs are involved. This analysis is performed by exploiting a statistical significance test, whose goal is to evaluate the hypothesis that mRNAs which belong to the same biclusters are, on average, more functionally similar than mRNAs which belong to different biclusters. In this test, the functional similarity is evaluated according to the Gene Ontology (GO) [13] classification.
Furthermore, we provide a ranking of biclusters on the basis of an additional statistical test which compares intraand interfunctional similarity of each bicluster with respect to the GO classification. This ranking aims to simplify the identification of the most significant biclusters.
Related works
The research reported in this paper has its roots in works which study biclustering (/coclustering) algorithms for biological data mining, as well as in works which study the role of miRNA:mRNA regulatory modules. Regarding the first research line, we only concentrate on algorithms which extract overlapping biclusters, since in our context, as previously stated, extracting nonoverlapping biclusters is too limitative.
Extraction of overlapping biclusters for biological data analysis
There are several papers in the literature that deal with the extraction of overlapping biclusters. Most of them are applied or specifically designed for gene expression data analysis. In this setting, gene expression data are organized as matrices/tables, where rows represent genes, columns represent various samples such as tissues or experimental conditions, and values in each cell characterize the expression level of the particular gene in the particular sample. According to this setting, biclustering methods typically group together rows (columns) with similar (both high and low) expression values, which, as previously stated, is different from our goal of maximizing the cohesiveness (see Figure 1). In the following, we describe these methods.
Figure 1. Similaritybased vs. cohesivenessbased biclustering approaches. (a) Biclustering obtained by means of (most of) similaritybased approaches. (b) Biclustering obtained by means of cohesivenessbased approaches.
One of the pioneering works on this topic [8] proposes a greedy heuristic search to generate arbitrarily positioned, overlapping biclusters, based on a homogeneity constraint. In this case, biclustering is based on iterative insertions and deletions of genes and conditions asymmetrically (i.e. insertions and deletions of conditions depend on insertions and deletions of genes). Since biclustering is guided by only one dimension, rows and columns are not interchangeable. Moreover, as pointed out in [9], this iterative algorithm is computationally expensive, since it identifies individual biclusters sequentially rather than all at once. The algorithm also causes random perturbations to the data since it inserts random values instead of deleting rows and columns corresponding to the previously discovered bicluster. This process, although allowing overlapping, can reduce the biclustering quality.
In [14], the authors propose initializing (possibly overlapping) biclusters with random rows and columns and, then, iteratively moving rows/columns among them. Each "move" operation aims to minimize the mean residue which indicates the degree of coherence of a cell value with the remaining values in the bicluster. This approach has the disadvantage that it is significantly dependent on the initial random biclustering. As in [8], this approach is not deterministic and does not extract hierarchically organized biclusters. Contrary to [8], it discovers biclusters all at once, thus improving computational efficiency.
A different solution is proposed in [7], where genes and conditions are represented according to a binary matrix, which is recursively divided into two smaller (possibly overlapping) submatrices, after a rearrangement of columns/rows. Since rearrangement is computationally expensive [15], the proposed solution is impractical for large datasets.
In [5], biclustering is guided by a probabilistic process by means of which objects are assigned to hierarchically organized clusters on a single dimension. Clustering on this dimension determines clustering on the other dimension (where it is possible to have overlapping). This means that the hierarchy is defined only on the first dimension and overlapping is supported only on the second dimension.
In [9], the authors propose the algorithm ROCC which rearranges columns and rows in order to identify the most coherent biclusters (expressed as submatrices). Subsequently, it works in a bottomup fashion and iteratively merges pairs of "closest" biclusters until a stopping criterion is satisfied. ROCC bases the merging process on "relationships" (i.e. submatrices) and not directly on objects (i.e. rows and columns), with the consequence that it may encounter problems when processing datasets affected by "relational" imbalance (i.e. when objects of different types participate with significantly different cardinalities in the interactions) [16]. Although this algorithm is, in principle, able to extract a hierarchy of biclusters, it only returns the set of biclusters obtained at the last iteration.
In the literature there have been a few attempts to work on miRNA:mRNA interactions [6,10,17]. These works will be introduced and described in the next subsection.
miRNA:mRNA regulatory modules
Several works in the literature have studied different facets of the interactions among miRNAs, genes and proteins. In particular: [1], [18] and [19] study the global miRNA regulation in cellular networks; [20] and [21] study the combinatorial miRNA regulation in cellular pathways; [22] and [23] study the correspondence between regulatory networks extracted from transcriptional and miRNA data; [24] studies and proves that miRNAs tend to target highly connected genes or proteins in cellular networks; [11] combines multiple miRNA prediction databases to identify signaling pathwayassociated miRNAs.
However, approaches for fullscale analysis of the regulatory networks spanned by miRNAs are only now getting under way [10]. These approaches have their roots in studies which, aiming to identify a modular organization of biological networks (see, for instance [25]), have pointed out that such networks have greatly advanced our understanding of complex cellular systems. As recognized in [10], identifying functional miRNA:mRNA regulatory modules is a challenging task for several reasons: (i) one mRNA can be regulated by multiple miRNAs and one miRNA can regulate a large number of mRNAs. (ii) miRNA:mRNA specific interactions often differ in a celltype and cellphase dependent manner. (iii) although miRNAs physically interact with mRNAs, ultimately miRNA regulation affects the quantity of proteins in cells rather than the quantity of mRNAs. Thus, the expression levels of miRNAs are not always exactly anticorrelated with those of their target genes. While (i) and (ii) motivate the use of biclustering approaches which extract overlapping biclusters, (iii) suggests the use of miRNA target predictions (in alternative to the experimentally verified interactions) extracted by appropriate algorithms.
Following this stream of research, in [6] the authors have proposed an algorithm to identify miRNA:mRNA regulatory modules based on predicted miRNA:mRNA target information. This algorithm extracts maximal bicliques (complete bipartite graphs) which represent candidate biclusters. From candidate biclusters, only those for which the range of scores of miRNA:gene interactions are in a userdefined interval are returned. Consequently, this algorithm suffers from the problem of manually setting the interval and from the problem that the extraction of bicliques prevents the algorithm from identifying noncompletely connected interaction networks, which results in a high number of (redundant) small biclusters. Moreover, since this algorithm is based on a method specifically designed for gene expression data [26], it does not extract highly cohesive biclusters. Finally, extracted biclusters are not hierarchically organized. These limitations can also be found in [17], where the method is similar to that proposed in [6]. Here, however, the extraction of bicliques also takes into account coherent expression patterns between miRNAs and genes, or the (anti)correlations between each miRNAtarget gene pair.
In [10], the proposed solution aims to extract biclusters by solving a nonnegative matrix factorization problem. The peculiarity of this approach is that it takes into account additional information coming from proteinprotein interaction networks and from gene expression data. Also in this case, high cohesion is not guaranteed and extracted biclusters are not hierarchically organized.
Contributions
Taking into account all the considerations reported so far, we propose an algorithm, called HOCCLUS2 (Hierarchical Overlapping CoCLUStering2), which provides a solution to the issues raised by the specific task in hand and effectively deals with the "relational" imbalance problem. Moreover, it does not require as input the number of desired biclusters, i.e. it is able to automatically determine the optimal number of biclusters, by exploiting information about the underlying data distribution. The algorithm starts from an initial set of biclusters which express bicliques (fully connected bipartite graphs) and, then, iteratively defines the hierarchical organization of biclusters according to a bottomup strategy.
This paper is based on the preliminary work in [16], where only the system HOCCLUS is presented. However, this paper significantly extends and upgrades the work presented there:
• We propose a novel algorithm for the construction of the initial biclusters which are now expressed as overlapping bicliques. This is different from what is done in HOCCLUS, where the system METIS [27] is adapted to extract (nonoverlapping) biclusters. This difference is crucial, since METIS can extract biclusters that do not represent fully connected subgraphs. Consequently, it is possible that very specific biclusters are lost. Moreover, biclusters discovered by METIS depend on a userdefined parameter: the number of initial biclusters. Manual tuning of this parameter is an open problem in METIS.
• We revise the method in order to consider the possible presence of "noise" objects. This is coherent with the basic principle of some wellestablished and wellknown clustering algorithms such as DBSCAN [28].
• We report a theoretical analysis of the time complexity of the learning algorithm.
• We report an extended experimental analysis of experimentally verified miRNA:mRNA interactions and miRNA target prediction datasets (miRTarBase and mirDIP, respectively). This is different from [16], where the analysis is only performed on miRNAMap 2.0 [29].
• We use statistical tests to evaluate the hypothesis that mRNAs which belong to the same biclusters are more functionally similar (according to GO) than mRNAs which belong to different biclusters.
• We provide a ranking of biclusters on the basis of a statistical analysis.
Methods
The method we propose is based on three main steps:
1 Extraction of a set of initial nonhierarchically organized biclusters.
2 An iterative process in which, at each iteration, two phases are performed, that is, overlap identification and merging. In the former, some objects (miRNAs or mRNAs) belonging to a bicluster can be added to another bicluster. In the latter, biclusters are merged when some heuristic criteria are satisfied. It is noteworthy that at each iteration several pairs of biclusters can be merged. Moreover, at each iteration, depending on whether merging is performed, an additional level of the hierarchy may or may not be added. This process stops when neither overlaps nor merges are performed in the last iteration.
3 A ranking of extracted biclusters. Ranking takes into account a preference function which exploits the intraand interfunctional similarities (according to GO) of objects in each bicluster.
It is noteworthy that the iterative merging process (Step 2) can be applied to biclusters consisting of a single miRNA:mRNA interaction. Although this solution would make Step 1 useless, it would lead to the construction of a very large set of meaningless hierarchy levels. The construction of an initial set of biclusters guarantees the significance of the results, even from the first level of the hierarchy.
Before formally introducing the problem we intend to solve, some useful definitions are necessary: let V_{r }and V_{c }be the sets of mRNAs and miRNAs, respectively (subscripts r and c stand for row and column, respectively. Here rows refer to mRNAs and columns refer to miRNAs. They are actually interchangeable in HOCCLUS2). Let A^{n × m }be an adjacency matrix, where n = V_{r}, m = V_{c} and A_{r}_{(x)},_{c}_{(y) }is a score associated to the interaction between x ∈ V_{r }and y ∈ V_{c}, where r : V_{r }→ [1, n](c : V_{c }→ [1, m]) is a function that maps a row (column) object to the corresponding row (column) index of the matrix A. Without loss of generality, we impose that ∀x ∈ V_{r}, ∀y ∈ V_{c }: A_{r}_{(x),}_{c}_{(y) }∈ [0,1], where 0 means no interaction and 1 means the most reliable interaction.
Formally, the problem is defined as follows:
Given:
• the set of mRNAs V_{r}, the set of miRNAs V_{c};
• the adjacency matrix A^{n×m};
• a minimum interaction score β;
• a cohesiveness function is the set of possible biclusters);
• a cohesiveness threshold α for ;
Find: a ranked list of biclusters L_{j }for each level j = 1, . . . , k such that:
a) for each list L_{j}, j = 2, . . . , k we have that ∀ C' ∈ L_{j }∃ C" ∈ L_{j}_{−1}, such that C" ⊆ C' (hierarchy);
b) biclusters at the same level can share objects in V_{r }and in V_{c }(overlapping);
c) for each bicluster C' ∈ L_{1}; ∀x ∈ V_{r }∩ C' , ∀y ∈ V_{c }∩ C' , A_{r}_{(x),}_{c}_{(y) }> β;
d) for each bicluster C' ∈ L_{j }obtained after merging, q(C' , A) ≥ α (cohesiveness);
e) for each pair of biclusters C' , C" ∈ L_{j }, p(C' ) ≥ p(C" ) iff C' precedes C" in L_{j }.
It is noteworthy that, at this stage, we do not impose additional conditions on the cohesiveness function q(·,·) and on the preference function p(·) which will be defined later. Moreover, L_{k }does not necessarily contain a single bicluster, meaning that a forest of biclusters is actually returned. This is coherent with the task in hand, where some sets of miRNAs could be totally unrelated to some sets of mRNAs. Moreover, α implicitly influences the number k of the levels and the number of biclusters at each hierarchy level.
Algorithm reported in Figure 2 solves the considered problem. Single steps will be detailed in the following subsections.
Figure 2. Hierarchical and overlapping biclustering algorithm. High level description (in pseudocode) of the proposed hierarchical and overlapping biclustering algorithm.
Building the initial biclusters
We consider two different alternatives for this task. The first one consists in exploiting an existing biclustering algorithm. For this purpose, we use the algorithm METIS [27]. METIS is a good candidate for working with miRNA:mRNA interactions, since it aims at minimizing the socalled edgecut of the graph and, consequently, at maximizing the intracluster cohesiveness. METIS, although originally designed for classical clustering problems, can extract miRNA:mRNA biclusters by forcing node weights such that both miRNAs and mRNAs must appear in the same cluster (http://glaros.dtc.umn.edu/gkhome/node/685 webcite). However, METIS, as most of biclustering algorithms, requires as input the desired number of biclusters. Although in experiments this issue is not perceived, since they are often performed on real/synthetic datasets where the number of biclusters is already known, it is a relevant problem in real contexts, such as in the analysis of gene expression data or miRNA:mRNA interactions. Moreover, METIS is exhaustive, i.e. each object (miRNA and mRNA, in this case) is always assigned to a bicluster. This characteristic leads to lowquality biclusters when some mRNAs (resp. miRNAs) do not share with other mRNAs (resp. miRNAs) a significant number of strong interactions with miRNAs (resp. mRNAs). According to the considerations provided in [28], these objects can be considered as noise objects, since located in lowdensity areas of the space, and should be automatically discarded.
The second alternative consists in the use of a new algorithm which overcomes these limitations. The only parameter the proposed algorithm requires is β ∈ [0,1], whose value can be easily chosen by experts, since it represents the minimum score for miRNA:mRNA interactions. Interactions with score values less than β are ignored, thus β implicitly defines a sort of filter on the reliability of the interactions.
The algorithm builds biclusters in the form of bicliques by analyzing interactions in two directions, i.e. from miRNA to mRNA and from mRNA to miRNA. Once a set of bicliques is obtained for each direction, they are merged together to obtain the final set of bicliques. Since the algorithm works in a symmetrical way, we here describe only the extraction of the initial bicliques in the "miRNA to mRNA" direction.
The algorithm (see Figure 3) works by taking into account some statistical properties, that is:
Figure 3. Initial biclustering algorithm. High level description (in pseudocode) of the initial biclustering algorithm (miRNA to mRNA direction).
• avg_mirna: the average number of miRNAs which target each mRNA, with a score greater than β;
• abs_min_mrna and min_mrna: the absolute and the outlierproof (respectively) minimum number of mRNAs which are targeted by each miRNA, with a score greater than β.
The min_mrna value (outlierproof) is computed by assuming that the number of mRNAs which are targeted by each miRNA follows a Normal distribution. In particular, we take the minimum number of targeted mRNAs (with a score greater then β) by discarding the lowest 0.15% values, which are possibly outliers, according to the 99.7 (or threesigma) rule. Symmetrically, avg_mrna, abs_min_mirna and min_mirna are calculated for the "mRNA to miRNA" direction.
Once these simple statistics are computed, an initial set of bicliques is built. Each initial biclique consists of a single miRNA and the set of mRNAs it targets with a score greater than β, so that we have at most V_{c} initial bicliques (Figure 3, line 1). The algorithm, then, iteratively aggregates two biclusters C' and C" into a new bicluster C''' as follows (see Figure 3, lines 923, and Figure 4):
Figure 4. Aggregation. An example of aggregation of two bicliques (C' and C") into a new biclique (C'"), according to Equation (1).
where C_{r }= V_{r }∩ C and C_{c }= V_{c }∩ C represent row and column objects in C, respectively.
Aggregation is based on the property that the number of miRNAs is antimonotonic with respect to the number of mRNAs in a biclique. The necessary conditions for aggregating are (Figure 3, lines 4 and 15):
The basic idea is that a good biclique should contain approximately avg_mirna miRNAs, while keeping the highest possible number of mRNAs (at least min_mrna). Moreover, as the goal of the algorithm is to obtain a set of highly cohesive bicliques, among the possible aggregations of pairs of bicliques 〈C', C"〉 we select the one for which the following measure is maximized (Figure 3, line 10):
where , A is the adjacency matrix and q(·,·) is a cohesiveness function. The cohesiveness function that we consider in this work is defined as follows:
This function measures the weighted (i.e. by considering the score of the interactions) percentage of interactions in a bicluster, normalized by the maximum number of possible interactions. Intuitively, the function q measures the intracluster cohesion (also known as "compactness" in classical clustering).
The iterative process stops when there are no additional candidates for aggregation, i.e. there is no pair of biclusters which satisfies the conditions of the inequalities in (2).
The whole process is also performed in the "mRNA to miRNA" direction and the two sets of biclusters are then merged by simply removing biclusters which appear more than once and biclusters which are a subset of others. The algorithm then starts a pruning phase whose goal is to remove noise objects. Coherently with the definition of noise objects provided before, each bicluster containing less than abs_min_mirna miRNAs or less than abs_min_mrna mRNAs is removed. We recall that both abs_min_mirna and abs_min_mrna are computed according to a statistical analysis of the data.
Overlap identification
The basic assumption behind the overlap identification is that two nonoverlapping biclusters should be (linearly or not) separable in the space. According to this assumption, we identify objects belonging to one bicluster that can be added to another bicluster. In particular, given two biclusters C' and C" (belonging to the same level of the hierarchy), C' ≠ C", we identify two optimal separating hyperplanes between C' and C" by learning an SVM model for each dimension (miRNAs and mRNAs). Since our goal is not to build a good predictive classification model, but to evaluate the separability of objects belonging to different biclusters, the objects in C' and C" are used as both the training set and the testing set. Misclassified objects are those which possibly belong to both the considered biclusters. Intuitively, the separating hyperplane can be interpreted as delineating the changes of the underlying data distribution between C' and C". This is coherent with studies that exploit SVMs for solving clustering tasks [30].
When learning SVMs, each row (column) object is represented as its corresponding row (column) vector of A. The use of SVMs as discriminative methods is motivated by their recognized peculiarity in dealing with sparse data [31], that is a common situation in a miRNAs:mRNAs adjacency matrix.
More formally, we build two binary classifiers: and . Once the classifiers are built,
• for each row object , we consider its corresponding row vector A_{r}_{(x),*}
• for each column object , we consider its corresponding column vector A_{*,c(y)}
In this way we obtain overlapping biclusters, where the common objects are those that cannot be correctly classified by the separating hyperplane (Figure 5(a)).
Figure 5. Overlapping and merging phases. (a) Overlapping between two clusters along one dimension. The red marked object (misclassified) is added to the other cluster. (b) An example of the object distribution of the row dimension of the biclusters C' and C". In this case, C' and C" are candidates for merging.
It is noteworthy that SVMs have to be constructed on each pair of biclusters for each level. In order to obtain a result which is independent of the order in which pairs of biclusters are analyzed, the misclassified objects are added at the end of the overlap identification process.
In Figure 2, overlapping(L_{k}, A) is in charge of identifying possible overlaps. It returns the number of objects that have been added to biclusters and the updated set of biclusters with added objects. In our implementation, the algorithm used for learning SVMs is SMO [32] with the default kernel (linear).
Merging
Once a set of overlapping biclusters has been obtained, we can analyze them to evaluate if some pairs of biclusters can be reasonably merged. A naïve approach would consider only the distance or the number of common objects, neglecting their statistical distribution. Here, we assume that row (column) objects in a bicluster are normally distributed in the space [0,1]^{m }([0,1]^{n}), that is, in the space in which their row (column) vectors are represented. We consider the distance between pairs of biclusters in order to merge those for which a defined percentage of (possibly unknown) objects can statistically be in common.
Formally, two biclusters C', C" are candidates for merging if:
where dist(w, z) is the Euclidean distance between the centroids of the clusters w and z and σ(w) is the standard deviation of the cluster w (see Figure 5(b)). Since row and column objects are represented as vectors, the centroid of a cluster is computed in the classical way. The standard deviation for row and column objects is computed as and respectively.
Intuitively, conditions in (5) state that two biclusters are candidates for merging if they are close enough according to at least one dimension. By considering the factor 2 for σ(w), we include in each sphere about 95.4% of the objects of the corresponding cluster, as a consequence of Chebyshev's inequality.
If a pair of biclusters C', C" is a candidate for merging, a further quality constraint should be satisfied on the bicluster C''' obtained by merging them. In particular, merging is performed as follows:
and the quality constraint that must be satisfied is q(C''', A) >α, where α allows the user to decide the minimum cohesiveness value that each bicluster obtained after a merging step has to satisfy. Low values of α facilitate merging at the price of low cohesive biclusters.
As in the overlap identification step, in order to obtain a result which is independent of the order in which pairs of biclusters are analyzed, merging is actually performed at the end of the procedure. Obviously, a bicluster could be a candidate for more than one merging. In this case, we actually perform the merging whose resulting bicluster has the maximum cohesiveness (see Figure 6).
Figure 6. Merging procedure. High level description (in pseudocode) of the iterative merging phase of the proposed algorithm.
It is noteworthy that the combination of our overlap identification and merging procedures allow us to consider both the density of biclusters and the distance among the objects, thus combining the major properties which classical clustering algorithms are based on.
Ranking biclusters
Ranking of biclusters is based on the pvalues of a statistical test which aims to evaluate the hypothesis that the mRNAs which belong to a specific bicluster are, on average, more functionally similar to other mRNAs in the same bicluster than to mRNAs which belong to other biclusters.
The functional similarity between two genes is evaluated by means of the SimGIC measure [33], which is a semantic similarity measure computed according to the genes' annotations in GO. As in [34], we consider the similarity as a special case of relatedness that is tied to the likeness (in the shape or form, e.g. on the basis of isa relations) of concepts. SimGIC is proved to show high values of "resolution", that is, the relative intensity with which variations in the sequence similarity scale are translated into the semantic similarity scale. Moreover, as recognized in [33], in the case of GO (as in many other biological ontologies) Information Contentbased approaches (like SimGIC), which estimate a term's specificity from its usage frequency within a given corpus, are more adequate than alternative approaches that typically infer a term's specificity from its depth in the graph. In fact, in the case of GO, the specificity is poorly related with the depth in the graph. For instance, the terms binding and translation regulator activity are at the same depth, but the latter is both semantically more complex and biologically more specific. SimGIC is defined according to the following formula:
where GO(x) represents the set of GO terms which x is associated to and IC(t)= − log p(t) is the negative loglikelihood of the term t computed on the basis of the prior probability p(t) of t. p(t) is estimated as the percentage of genes associated to the term t, according to the UniProt Homo sapiens GO annotations. It is noteworthy that, although we used UniProt Homo sapiens GO annotations, in HOCCLUS2 other sets of annotations can be used.
The statistical test we consider is the classical onetailed Student's t test that allows us to evaluate the null hypothesis H_{0 }: µ_{0}(C) = µ(L, C) against the alternative hypothesis H_{1 }: µ_{0}(C) >µ(L, C), where µ_{0}(C) is the mean of the intrabicluster functional similarities for the bicluster C and µ(L, C) is the mean of the interbicluster functional similarities between the bicluster C and the list L \{C}, i.e. the other biclusters belonging to the same hierarchy level of C (see Figure 7(a)). µ_{0}(C) and µ(L, C) are defined as:
Figure 7. Statistical tests for ranking and biclusters evaluation. (a) Ranking: Average intrabicluster similarity for bicluster A (on red edges) vs. average interbicluster similarity (on blue dashed edges). (b) Biclustering Evaluation: Average intrabicluster similarity (on red edges) vs. average interbicluster similarity (on blue dashed edges).
This test is used to identify the pvalue associated to the hypotheses to be verified. In particular, the lower the pvalue, the lower the probability that H_{0 }is rejected when H_{0 }is true. Therefore, the lower the pvalue, the higher the difference between the average intrafunctional similarity and the average interfunctional similarity. These considerations make the pvalue appropriate to be used in ranking biclusters in L, thus simplifying the identification of the most significant biclusters.
Since we compute SimGIC according to two different hierarchies of GO, that is, Molecular Function and Biological Process, we are able to obtain two different rankings of biclusters.
Time complexity
The time complexity of the algorithm depends on the time complexity of each single phase.
As regards the initial biclustering, we first consider the miRNA to mRNA direction. The time complexity of get_set_of_one_miRNA_bicliques(Vr,Vc, A, β) is O(m * n) where m is the number of miRNAs and n is the number of mRNAs. At the first iteration, we have average time complexity:
where (a) is due to the pairwise comparison of onemiRNA bicliques, (b) is the cost of the union of miRNAs and intersection of mRNAs in two bicliques (we have 1 miRNA and avg_mrna mRNAs in one biclique), (c) is due to the computation of the cohesiveness function q and (d) is due to the identification of the best pair to be considered for aggregation (after sorting). Similarly, for the remaining iterations, we have:
where (a) represents the maximum number of iterations, (b) represents the cost of updating the candidate pairs of bicliques for aggregation at the light of the biclique added in the previous iteration and (c) represents the cost of adding the newly created candidates in aggregationCandidates (see Figure 3).
This analysis indicates that the cost of identifying the initial set of biclusters (by considering both directions) is O(2 * m * n + (m^{2 }+ n^{2}) * avg_mirna * avg_mrna), that is, since n ≫ m:
In the overlap identification, at the first iteration (worst case), SMO is applied m * (m − 1)/2 times for each dimension (we assume that m is the number of biclusters after the initial biclustering). On average, the number of objects involved in each execution of SMO for row (column) objects is avg_mirna (avg_mrna). Therefore the cost of the overlap identification is: O (m * (m − 1) * (avg_mirna * l_{1 }+ avg_ mrna * l_{2})) = O (m^{2 }* (avg_mirna * l_{1 }+ avg _mrna * l_{2})) where avg_mirna*l_{1}+avg_mrna*l_{2 }is the cost of SMO and l_{1 }and l_{2 }are the number of candidate support vectors during the training phase (l_{1 }≤ avg_mirna, l_{2 }≤ avg _mrna).
The time complexity of the merging phase is: O (m * (m  1)/2 * avg_mirna * avg_mrna) = O (m^{2 }* avg_mirna * avg_mrna) where avg_mirna * avg_mrna is due to the cohesiveness function q.
Since the computation of the cohesiveness function q is exactly avg_mirna * avg_mrna, whereas the cost of the execution of SMO on both dimensions, in the worst case (l_{1 }= avg_ mirna and l_{2 }= avg_mrna), is O(avg_mirna^{2 }+ avg_mrna^{2}), we approximate the complexity of a single iteration of the algorithm in Figure 2 to:
The time complexity of the ranking phase is:
where n * (n − 1) * S/2 is the cost for computing all the possible SimGIC values (S is the cost for a single SimGIC value) and k * m log m is the cost of sorting all extracted biclusters for all the hierarchy levels (k).
By combining (10), (11) and (12), the time complexity of the complete algorithm is: O [n^{2 }* (avg_mirna * avg_mrna) + u * m^{2 }* (avg_mirna^{2 }+ avg_mrna^{2}) + S * n^{2}] where u >0 is the number of iterations of the algorithm in Figure 2. Since in the experiments we observed that the main cycle requires much more time than initial biclustering and ranking, it is reasonable to say that the actual time complexity is O(u * m^{2 }* (avg_mirna^{2 }+ avg_mrna^{2})). This complexity significantly depends on avg_mirna and avg_mrna, i.e. on the density of the matrix A. Moreover, due to the worst case assumptions, the analysis appears to be too pessimistic with respect to the actual times measured during the experiments.
Results and discussion
In order to evaluate HOCCLUS2, we have considered as data sources a set of experimentally verified miRNA:mRNA interactions, i.e. miRTarBase [12], as well as the set of miRNAs target predictions in mirDIP [11]. These data sources have been used to obtain six distinct datasets (described later).
The main goal of this experimental evaluation is twofold: we empirically prove that the extracted biclusters preserve high values of cohesiveness and we evaluate extracted biclusters in order to empirically assess their biological significance. Moreover, we show the ability of HOCCLUS2 in ranking extracted biclusters.
Experiments have been performed with different values of α and β in order to evaluate their effect on the obtained biclusters. In the case of miRTarBase, we compare HOCCLUS2 with HOCCLUS (which uses METIS for the initial biclustering) [16] and ROCC [15]. In the case of mirDIP, we compare HOCCLUS2 with METIS [27] and ROCC. We cannot compare HOCCLUS2 with HOCCLUS on mirDIP because of the huge amount of nonlinearly separable biclusters returned by METIS (with many similar miRNAs and mRNAs) that inhibit HOCCLUS from completing the mining process in reasonable running time. For fair comparison, METIS is asked to return the same number of biclusters returned by our initial biclustering algorithm.
Datasets
miRTarBase (ver. 2.5) [12] contains 4,270 experimentally verified miRNAtarget interactions between 669 miRNAs and 2,533 target genes among 14 species. In our study, we only consider the human dataset. From this dataset (hereafter miRTarBase), we have generated an additional dataset (miRTarBase_{filt}), that contains only mRNAs which are annotated in GO, according to both Molecular Function (MF) and Biological Process (BP) hierarchies, and miRNAs that, once mRNAs have been filtered, are still connected to the remaining mRNAs. For both miRTarBase and miRTarBase_{filt}, interaction scores are binary (yes = 1/no = 0). Table 1 provides additional details on miRTarBase and miRTarBase_{filt}.
Table 1. Properties of considered datasets
mirDIP [11] integrates twelve miRNA prediction datasets from seven miRNA prediction databases. In this study, we consider only predictions extracted using TargetScan Conserved, PITA Top Hits and picTar 5way which, according to [11], provide a relatively low number of false positives (when compared to others) without affecting recall. We have not included additional prediction algorithms in order to minimize collinearity problems [35] in the combined predictions. Indeed, in order to obtain interaction scores, we (linearly) combine the standardized scores returned by single algorithms. In this combination, the consideration of the same features (e.g. conservation, site accessibility, free energy of duplex) multiple times may negatively affect the final score. This means that we have only considered the best prediction algorithms with the smallest overlap in the considered characteristics. From the original mirDIP we have generated 4 datasets: mirDIP, FmirDIP, mirDIP_{filt }and FmirDIP_{filt}. mirDIP contains all the predictions obtained by at least one of the considered prediction algorithms. In this case, interaction scores are obtained as the average of the standardized scores returned by each algorithm. FmirDIP is similar to mirDIP, except in the fact that the interaction scores are obtained according to a weighted average, where weights correspond to Fscore values reported in [11] which represent a degree of "reliability" of the predictions of each algorithm. mirDIP_{filt }and FmirDIP_{filt }have been obtained from mirDIP and FmirDIP (respectively), by filtering out mRNAs whose genes are not included in GO. A summary of all considered datasets is reported in Table 1.
Evaluation measures
Biclusters are evaluated on the basis of the average biclustering cohesiveness, which measures the average strength of the intrabiclusters connections: , where L_{j }is the set of biclusters obtained at the jth hierarchy level. In addition to µ_{q}(), we also use an evaluation measure which is based on statistical properties of the obtained biclusters. In particular, we use the independent twosample Student's t test to evaluate the null hypothesis against the alternative hypothesis , where is the average intrabicluster functional similarity and is the average interbicluster functional similarity defined as follows:
The lower the pvalue, the higher the difference between the average intrafunctional similarity and the average interfunctional similarity (see Figure 7(b)). As in the ranking phase, we use both GO Biological Process (BP) and GO Molecular Function (MF) hierarchies to compute SimGIC. Henceforth we will refer to the pvalues computed on BP and MF as p_{BP }and p_{MF }, respectively.
In the following, we report results for the first level of the hierarchy (i.e. the result of the initial biclustering), for the last level of the hierarchy (i.e. L_{k}) and for the "best level" of the hierarchy. The best level is the one for which we have the minimum value of . When more than one level has the same value, the one with the highest µ_{q }is considered. Experiments are run on a 4 Intel CPUs @4Ghz system.
miRTarBase
Tables 2 and 3 report results obtained with HOCCLUS2, HOCCLUS and ROCC on miRTarBase datasets. From these tables, it is possible to see that at the highest levels of the hierarchy, results significantly depend on the values of the parameter α. As expected, the higher the α value, the higher the µ_{q }value. An important result comes from the very low values of p_{MF }and p_{BP}. This statistically confirms (at a confidence level of 99%) that mRNAs which belong to the same biclusters are, on average, more functionally similar than mRNAs which belong to different biclusters. Moreover, p_{MF }and p_{BP }results are monotonic with respect to µ_{q }(and α). This confirms that µ_{q }is a valid measure to evaluate the quality of extracted biclusters. Another result is that at higher levels of the hierarchy, HOCCLUS2 is generally able to significantly reduce (especially for high values of α) the number of biclusters without a significant loss in the p_{MF}, p_{BP }and µ_{q}.
An additional important result is the number of biclustered miRNAs and mRNAs , which gives an indication of the number of isolated objects in miRTarBase that are considered noise by HOCCLUS2. Finally, by comparing results in the two tables, it is possible to say that considering only filtered data improves both the significance of the statistical analysis (p_{MF }and p_{BP }) and the cohesiveness of biclusters.
Regarding comparison with other systems, HOCCLUS2 performs significantly better than ROCC, which, as previously stated, is not designed to work with miRNA:mRNA interactions. Moreover, HOCCLUS2 always outperforms HOCCLUS in terms of cohesiveness and, at the best hierarchy level, in terms of p_{MF }and p_{BP}. In the last hierarchy level (max level), better performances of HOCCLUS with α = 0.1 and α = 0.2 can be motivated by the huge number of extracted biclusters (e.g. 52 vs. 5 extracted by HOCCLUS2, with α = 0.1). As regards the number of noise objects, HOCCLUS2 has biclustered less mRNAs and more miRNAs than ROCC. In this respect, the behavior of HOCCLUS2 is preferable, since the number of miRNAs is generally significantly lower than the number of mRNAs (HOCCLUS2 better deals with the relational imbalance). Moreover, while the possibility to discard noise objects helped HOCCLUS2 to achieve a very high value of cohesiveness, ROCC obtained poor results (µ_{q }= 0.01).
Running times show that HOCCLUS2 outperforms both HOCCLUS and ROCC. This confirms that the reported complexity analysis is too pessimistic and that HOCCLUS2 extracts biclusters in a reasonable time.
mirDIP
Tables 4, 5, 6 and 7 report results obtained on the mirDIP datasets. All the considerations about the monotonicity between α and µ_{q }and the capability of HOCCLUS2 to extract cohesivenesspreserving hierarchies reported for miRTarBase datasets are valid also for the mirDIP datasets. However, in this case, p_{MF }and p_{BP }are monotonic with respect to the cohesiveness only in the case of filtered datasets. This can be explained by the high number of missing GO annotations (for about 40% of biclustered mRNAs) in mirDIP and FmirDIP datasets which makes p_{MF }and p_{BP }not completely reliable indicators of the biclusters' quality. In these cases, i.e. when the algorithm cannot calculate reliable values of p_{MF }and p_{BP}, µ_{q }should be considered the primary indicator for the evaluation.
Table 4. mirDIP results
Table 5. FmirDIP results
Table 6. mirDIP_{filt }results
Table 7. FmirDIP_{filt }results
By observing the differences between mirDIP and FmirDIP (or between mirDIP_{filt }and FmirDIP_{filt}) it is possible to say that, coherently with results reported in [11], HOCCLUS2 benefits from the Fscorebased weighting of the interactions. Furthermore, when compared with other algorithms, HOCCLUS2 performs significantly better, in terms of cohesiveness, than ROCC and METIS. Additionally, ROCC and METIS are not able to extract significant biclusters in terms of p_{BP }and p_{MF}, whereas HOCCLUS2 is almost always able to extract actual functional biclusters for at least one level of the hierarchy.
As regards the number of noise objects, while ROCC has biclustered a very low number of miRNAs and mRNAs (9 and 101 respectively in mirDIP), obtaining a poor value of choesiveness (0.01), HOCCLUS2 has biclustered a reasonable number of objects for every considered values of β.
Running times show that HOCCLUS2 is always faster than ROCC. Moreover, although METIS requires significantly lower time, a detailed analysis reveals that the time for completing our initial biclustering step is comparable to that of METIS (we remind that METIS returns nonhierarchically organized biclusters). Similar to miRTarBase, also these results confirm that the reported worst case analysis is too pessimistic. Here, in addition, we demonstrate that HOCCLUS2 scales well also for huge datasets.
Biological evaluation of extracted biclusters
In order to evaluate the ability of HOCCLUS2 to detect meaningful miRNAs:mRNAs functional relationships, we have first analyzed the results obtained from miRTarBase datasets and then compared them with the results obtained from mirDIP. In this analysis, we have selected biclusters to be analyzed only according to the ranking values returned by the algorithm. In this paper we focus the analysis on some of the biclusters which group miRNAs belonging to the miR1792 gene cluster, also known as oncomir1 [36], and to its paralogs, miR106b25 and miR106a363. Table 8 reports the complete list of these biclusters, together with their properties. Moreover, some other examples of biclusters are provided to elucidate the usefulness of HOCCLUS2 in supporting biologists in the detection of multiple miRNAs functional interactions and in the identification of new potential targets. Functional analysis has been carried out by considering: i) structural and functional properties of miRNAs; ii) pathways mapping and statistical significance of gene enrichment in pathways; iii) the biological context of target genes. The main resource used for mapping gene in pathways is Reactome [37]. GeneCards [38] has been used for retrieving gene function information. Studies reported in the literature have been considered i) for retrieving information on miRNAs and validated miRNA:mRNA interactions and ii) for the discussion of the results.
Table 8. Biclusters containing the miR1792 gene cluster family
Structural and functional properties of miR1792, miR106b25 and miR106a363
miR1792, miR106b25 and miR106a363 belong to a family of highly conserved miRNA gene clusters and have potent effects on many type of human cancers [39]. They are located on chromosome 13, 7 and X, respectively, and derive from duplications and mutations of a unique gene and from the loss of some miRNAs occurred during the early evolution of vertebrates. Each cluster is transcribed as polycistronic primary transcript that ultimately yields six mature miRNAs in miR1792 and miR106a363 (miR17, miR18a, miR19a, miR20a, miR19b1, miR92a1 in miR1792; miR106a, miR18b, miR20b, miR19b2, miR92a2 and miR363 in miR106a363) and three miRNAs in miR106b25 (miR106b, miR93, miR25). The high degree of conservation across different species suggests an important role of this miRNAs cluster family for coordinated regulation and function in many pathways and cellular processes.
The miR1792 gene cluster acts pleiotropically during both normal development and cancer progression. Depending on both cell type and physiological context, miR1792 can promote proliferation, increase angiogenesis, and sustain cell survival through the posttranscriptional repression of a number of target mRNAs [39]. Different types of experimental evidences suggest the intriguing hypothesis that miRNAs in the miR1792 cluster may perform specific functions, either individually or in combination, in a coordinated rather than in an additive manner. A key feature of miR1792 is its property of being a potent inhibitor of the transforming growth factorβ (TGFβ) signaling. Ligands of the TGFβ superfamily are essential for the development and the adult tissue homeostasis, and the inactivation of TGFβ tumor suppression pathway is a main step in the development of a variety of human tumors [40]. Indeed, the miR1792 cluster is often activated in cancer cells and overexpression studies in gastrointestinal and other tumors reveal that both miR1792 and miR106b25 are able to inactivate the TGFβ tumor suppression pathway by interfering with the cell cycle arrest and apoptosis [40]. Although recent studies have greatly contributed to the elucidation of the miR1792 gene cluster family function and mechanism, the identity of all its targets remains still elusive and much work is still necessary to clarify miRNAs cooperative effects on signaling pathways. Moreover, the role of miR106a363 remains still obscure.
Validation of the functional consistency of extracted biclusters
The large amount of literature available on the miR1792 gene cluster family constitutes a reliable resource to verify the ability of our algorithm to discover actual biological functional interactions among miRNAs and their target genes belonging to the same bicluster. The rationale is that, if the results obtained on experimentally verified datasets are confirmed, there exists a real possibility that our biclustering algorithm is effective and that it could also work well on large datasets produced by prediction algorithms. This would allow us to uncover new potential gene functions and targeting interactions.
The aim of the analysis reported in this section is not to give a complete and exhaustive picture of all the possible discovered interaction networks, that would be impossible to report and that does not fit the aim of the present paper, but only to demonstrate that the system shows to be effective.
We have identified and analyzed a series of highlyranked biclusters containing the miR1792 cluster family. Table 8 reports the list of miRNAs and relevant target genes for each of these biclusters. Biclusters at level number 1 are biclusters where all included genes are targeted by all the miRNAs grouped in the bicluster. At higher levels of the hierarchy, other miRNAs and targets are included at different values of cohesiveness suggesting miRNAs alternative multiple interactions. The identification of specific and yet overlapping functions of each component of the miR1792 cluster, can be obtained by comparing targets in each bicluster with those belonging to other related biclusters.
Reactomebased mapping of biclusters 6 (hierarchy level 1), 672 (hierarchy level 2) and 6722270 (hierarchy level 3)(Figure 8), well matches the known functions of miR1792. Indeed, the most overrepresented events are cell cycle and signal transduction. In particular, as for cell cycle, the mitotic transition from the G1 to the S phase is represented with the lowest pvalue (1.5 E13) with 9 (E2F1, E2F3, RB1, CDKN1A, RBL2, CCND2, WEE1, RBL1, CCND1) out of 23 of the target genes involved in this pathway. Signal transduction pathway, with 11 (THBS1, PTEN, TGFBR2, VEGFA, CDKN1A, RBL1, KAT2B, APP, BMPR2, SMAD4, MYC) out of 23 genes involved, is represented with the lowest pvalue (2.5 E10) in the TGFβ signaling pathway including TGFBR2, SMAD4, MYC and RBL1. Other related signaling pathways with significant pvalues are signaling by BMP (bone morphogenetic protein), AKT (Protein Kinase B), PDGF (plateletderived growth factor); signaling by EGFR (epidermal growth factor receptor) in cancer and FGFR (fibroblast growth factor receptors) in disease. EGFR and FGFR signaling pathways depend on PTEN PIP3 activation in the AKT signaling, promoting cell survival and opposing apoptosis by a variety of routes. Gene expression and DNA replication pathways are also overrepresented because influenced by two main downstream effectors of the TGFβ antiproliferative signaling pathway, SMAD4 and CDKN1A [40], and by RB1. SMAD4 is the common signaling transducer of TGFβ at nuclear level, while CDKN1A and RB1 are master regulators of cell cycle progression (negatively regulate the mitotic G1S checkpoint). The analysis of biclusters that contribute to the bicluster 6722270 clarifies the individual contribution of miRNAs and target genes in the general picture above. The main difference between 6 (hierarchy level 1), 672 (hierarchy level 2) and 6722270 (hierarchy level 3) is the miRNA component. Biclustering at level 1 (i.e., bicluster 6) indicates that all the genes in the bicluster are unique targets of miR17 and miR20a, suggesting that only these two genes have a universal role whereas the others may have a pathwayspecific activity. This observation contributes to clarify the general model that, in the attempt to explain the pleiotropic effect of miR1792, proposes that the entire gene cluster gives rise to a moderate downregulation of a large number of mRNAs in each cell type, which collectively mediates its biological functions [39].
Figure 8. Pathway mapping of the bicluster 6722270 in Reactome. The figure shows the pathway mapping of the bicluster 6722270 in Reactome. Results in a tabdelimited format as provided by Reactome are reported in the Additional file 1. The functional mapping of the biclusters 6 and 672 is very similar to that of the bicluster 6722270. The difference is in the fact that KAT2B is only present in the bicluster 6722270.
Additional file 1. Reactome results on bicluster 6722270.
Format: PDF Size: 40KB Download file
This file can be viewed with: Adobe Acrobat Reader
As for the targets component of these biclusters, KAT2B is the unique gene that is only present in bicluster 6722270 but not in bicluster 6 and 672. Looking for other biclusters containing KAT2B at level 1 of the hierarchy (Table 8), it is possible to see that it is present in the bicluster 22 and is cotargeted by miR93 and miR106b. KAT2B has not been included in the biclusters 6 because, differently from the other genes, it is not target of miR17 and miR20a. This is confirmed by a study on multiple myeloma pathogenesis [41] which demonstrates that, among over expressed miRNAs, miR106b25, but not miR1792, is able (through KAT2B) to indirectly control the tumor suppressor protein p53 in the multiple myeloma. Indeed, KAT2B is a histone acetyltransferase involved in the reversible acetylation of various transcriptional regulators, including the tumor suppressor protein p53. Activation of p53 mediated by KAT2B activates CDKN1A (a direct target and master effector of p53) that in turn induces the arrest of the cell cycle at the G1/S transition, and a series of other p53dependent events such as DNA repair and apoptosis. Futhermore, this specific function of KAT2B would be mediated by the coordinate cotargeting of miR181a, miR181b and miR32 [41]. The cotargeting of these last miRNAs on KAT2B is not included in biclusters 22, 2270 and 6722270 but is included in biclusters 41 and 65 at level 1 and in bicluster 1665 at level 2. These biclusters, although not statistically supported by GO, help to disclose new interaction networks. Indeed, in these biclusters (see Table 9), other important regulators of key steps of the cell cycle, TGFβ signaling pathway, cell growth, differentiation and apoptosis, are associated with KAT2B and with the cotargeting of miR25, miR32, miR19a, miR19b, miR181a and miR181b. Moreover, these biclusters, as they also contain BCL2, PTEN, BMPR2 and TGFBR2 (also present in bicluster 6722270), suggest that complex interaction networks involving miR25, miR32, miR181a and miR181b, may account for the diverse and multiple role of miR1792 gene cluster in the maintenance of cell homeostasis. In particular, in bicluster 65, KAT2B is associated, under the direct control of miR25, miR32 and miR19a, with BCL2L11(BIM), the master downstream effector of TGFβdepend apoptosis, and with PRMT5, a protein arginine methyltransferase that negatively regulates cell proliferation by epigenetic control of the RB family of tumor suppressor genes (RB1, RBL1 and RBL2), and that it is regulated by miR19a, miR25, miR32, miR92b and miR96 [42]. The RB family members are known to regulate the expression of genes involved in G1/S transition through their interaction with the E2F transcription factors [43]. However, while transcription of RB1 is repressed in a cell cycledependent manner, the PRMT5mediated inhibition of RBL1 and RBL2 appears to be associated, in leukemia and lymphoma cells transformation, with the deregulation of specific miRNAs [42]. RB1, RBL1 and RBL2 are all present in biclusters 6, 672 and 6722270 and, as shown in bicluster 6, they are all direct targets of miR17 and miR20a. However, as shown in biclusters 70 and 72, RB1 is cotargeted by miR106a, whereas RBL1 and RBL2 are cotargeted by miR106b. This suggests for miR106a and miR106b a functional specificity that could be responsible for the contextdependent response of RBs and of the other genes in these biclusters. Indeed, also E2F1 and E2F3, which are functionally related to RB1 and RBL1/RBL2 [44] (respectively), are coherently biclustered in biclusters 70 and 72. This indicates that functional relationships between E2Fs and RBs [44], as well as the different responses of the RB components (see [42]), may be due to a complex network of transcriptional machineries and regulatory negative feedbacks [39]. This complex network involves transcriptional factors (e.g., E2F1, E2F3, p53, VEGFA) regulating, and in turn regulated by, different components of the miR1792 cluster family in a cell type and contextdependent manner.
Table 9. Biclusters containing cotargeting on KAT2B
Bicluster 41 associates cotargeting of miR181a and miR181b on KAT2B with a series of other transcription factors involved in cell fate determination (by different routes like NLK and BCL2) and differentiation (i.e., CDX2, GATA6, PLAG1). This suggests that the cooperation of miR181a and miR181b with miR1792 may be more specifically related with cell growth and differentiation.
In bicluster 1665, KAT2B is grouped together with genes which are coordinately regulated by miR25, miR32, miR19a and miR19b. The enrichment of these genes in pathways (Table 9) suggests that the cooperation of miR32 with the multiple associations of miR25 (miR106b25 cluster), miR19a and miR19b (miR1792 cluster) is specifically related to the TGFβ signaling.
Among the biclusters containing genes of miR1792, the bicluster 66 is the top ranked (according to GO). Pathway mapping of bicluster 66 returns significant results in the TGFβ/BMP pathway, which regulates embryonic and adult cell proliferation and differentiation, and that is implicated in a great number of human diseases. The transduction of the signal depends on the activation state of different nuclear transcriptional coactivators/corepressors which can positively or negatively regulate different effectors, so that the interpretation of a signal depends on the celltype and cross talk with other signaling pathways such as Notch, MAPK and Wnt (^{© }Reactome). Bicluster 66 includes BMPR2 (bone morphogenetic protein receptor type II), TGFBR2 (TGFβ receptor type II) and SMAD4. While BMPR2 and TGFBR2 are key factors for the activation of TGFβ/BMP receptor complexes and for the transduction of the signal from the cell surface to the cytosol, SMAD4 is essential for the transduction to the nucleus for transcriptional regulation (^{© }Reactome). miRNAs grouped in bicluster 66 indicate that the regulation of TGFβ/BMP signaling at nodal checkpoints of the signal cascade is modulated by the miR1792 gene cluster, namely, miR17, miR19a, miR20a and miR92a. Moreover, as stated before, the presence of BMPR2 and TGFBR2 in the bicluster 1665 suggests that they may also be functional targets of miR25 and miR32. This supports the hypothesis that the activation of the TGFβ receptor is under a complex control mediated by multiple associations of constitutive regulators (e.g. miR17 and miR20a), with diverse members of the same cluster, i.e. miR19a and miR92, and with miR25 (miR106b25) and miR32, in a contextspecific manner. This also suggests that, among the components of miR106b25, miR25 is the one that contributes to the control of the transmission of the TGFβ signaling from the cell surface to the nucleus.
Genes in biclusters 70 and 72, although different, are enriched in cell cycle regulation. Bicluster 70 shows a significant overrepresentation of genes in the G1 phase (pvalue = 8.0 E07) and G1/Sphase transition (pvalue = 1.9 E5), whereas bicluster 72 specifically maps in the G1/Sphase transition (pvalue = 7.8 E08). As for miRNAs, biclusters 70 and 72 share miR17 and miR20a, but bicluster 70 contains miR106a and bicluster 72 contains miR106b. These observations provide useful insights: first, they confirm experimental evidences that demonstrate that miR17 is a key regulator of cell cycle progression by targeting more than 20 genes involved in the G1/Sphase transition [43]; second, the cotargeting of miR20a (always associated with miR17a) underlines that it also cooperate to this pathwayspecific role of miR17. Third, the association of miR106b in bicluster 70 suggests that miR106b25 influences the effects of miR1792 in the cell cycle progression by controlling regulatory circuits involving E2F3, RBL1 and RBL2, while the association of miR106a in bicluster 72 suggests that miR106a363 influences the effects of miR1792 on cell cycle progression by keeping under control regulatory circuits involving CDKN1, E2F1 and RB1. This last aspect may help to shed light on the role of miR106a363 in the general function of the miR1792 cluster family [40].
Discussion on potential applications of extracted biclusters
A general conclusion of the analysis reported in the previous subsection is that our results match with validated experimental results reported in the current literature, demonstrating that HOCCLUS2 is able to provide valuable clues for the understanding of miRNAs functions and mechanisms. In this subsection, we mainly discuss potential uses of biclusters extracted by HOCCLUS2.
As shown for biclusters 41, 65 and 1665, neither GObased ranking nor the analysis of gene enrichment in pathways provide complete understanding on the quality of discovered interaction networks. Nevertheless, we have proved that these biclusters provide important insights for the clarification of functions and interaction networks involving miR1792 components. This example clarifies both the usefulness and effectiveness of HOCCLUS2, even when results are not supported by statistical confirmations on existing resources. It is noteworthy that the statistical ranking of target genes in GO depends on the completeness of annotations available and on the gene classification in the GO tree. Thus, although GO ranking is used to score biclusters, it has not to be intended as an exclusive criterion for the evaluation of the quality of biclusters and for the consequent analysis of data. Rather, it has to be considered as an indicator of potential functional correlations which depend on annotation systems and, as such, it could inevitably fail because of poor, wrong or incomplete annotations.
As regards possible applications, results obtained by HOCCLUS2 on miRTarBase can be used to retrieve all the significant multiple interactions that a miRNA (or a set of miRNAs) of interest may have. Performing this task manually on the source database (miRTarBase in this case) would require to execute a large set of queries and to analyze and aggregate tens of thousands of results. Nevertheless, all this effort would not provide any information on significant gene correlations and miRNA contextspecific multiple interactions.
Furthermore, HOCCLUS2 (which is freely available) can be easily applied to the analysis of other collections of data, e.g. to the analysis of data obtained in specific physiological and pathological conditions which may greatly contribute to the elucidation of miRNAs functions in the relative context. Similarly, HOCCLUS2 could be applied to the analysis of miRNA:mRNA interactions in other organisms annotated in miRTarBase as well as in organisms and plants for which still poor annotations on validated targets are available. In this last case, combining microarrays data with target predictions would allow the researchers to easily detect potentially significant multiple interactions which are worth to be experimentally validated.
In addition to the possibility to extract multiple and significant cotargeting of miRNAs, HOCCLUS2 is able to give new clues in the identification of still unknown miRNA targets. This possibility is due to its ability to associate objects, by iteratively merging pairs of biclusters, that are apparently not related. By observing the biclusters analyzed in the previous subsection, the bicluster 1665 appears to be a good candidate for suggesting the validation of still unknown targets of miR25, miR32, miR19a and miR19b. The cohesiveness value of this bicluster, q = 0.639, indicates that around the 64% of all the possible interactions between its miRNAs and mRNAs are in the dataset and, since they come from miRTarBase, are experimentally validated. This means that the hypothesis that the remaining 36% of possible interactions could actually exist is based on the 64% of validated interactions. This observation confirms that the cohesivenesspreserving strategy adopted by HOCCLUS2 is effective, since, intuitively, the higher the cohesiveness of a bicluster, the higher the probability that the discovered (but not present in the database) interactions actually exist. Indeed, a deep analysis of the interactions of bicluster 1665 revealed that, in miRTarBase, all the genes are validated targets of miR19a and miR19b with the exception of PRMT5, which is a validated target of miR25, miR32 and miR19a but not of miR19b. Moreover, KAT2B and BCL2L11 are validated targets of all the miRNAs in the bicluster. Looking at prediction data in mirDIP, it is possible to find some predictions which support the hypothesis that the remaining interactions actually exist. In particular, at least one algorithm predicted: ESR1, PTEN, ATXN1, BMPR2, KAT2B, TGFBR2 and BCL2L11 as targets of miR19a, miR19b, miR25 and miR32; PRMT5 as target of miR19a, miR25 and miR32; SOCS1 as target of miR19a and miR19b, but not of miR25 and miR32. These observations lead to the conclusion that, in addition to KAT2B and BCL2L11, it would be worth to experimentally validate the hypothesis that ESR1, PTEN, ATXN1, BMPR2 and TGFBR2 are targets of miR25 and miR32.
Comparison of results on miRTarBase with results on mirDIP
Conversely by results of the analysis carried out on miRTarBase, results on biclusters extracted from mirDIP datasets cannot be considered impressive. Although some biclusters have presented enough enrichment in genes that fall in the same or related biological processes and although validated miRNAs interactions have been detected, the much larger number of genes involved has not allowed us a so detailed analysis as for miRTarBase data. On the other hand, Reactome, as other similar resources, still misses pathway mapping annotations for many genes, thus negatively affecting statistical enrichment analysis. In particular, searching for biclusters of the miR1792 gene cluster family in mirDIP has led to identify a few biclusters which were not so well defined as those extracted from miRTarBase, even though functional characterization by pathways mapping has returned a picture that well matches with functional properties of miR1792.
In the attempt of motivating this different behavior, we have searched for predictions of validated targets of miR1792 components in mirDIP. We have found that the difference in the quality of the results obtained on miRTarBase and on mirDIP were mainly due to the performance of prediction algorithms in detecting actual targets. For example, TargetScan Conserved predictions present very low standardized scores for those genes that have been largely confirmed as targets of miR1792 (e.g., E2F1, E2F3 and CDKN1A are predicted targets of mir17 and mir20a with a standardized score ranging from 0.337 to 0.392; CDKN1A is a predicted target of mir106b with a standardized score of 0.358).
Conclusions
In this work, we tackle the problem of biclustering miRNAs and mRNAs on the basis of their interactions. In order to solve this problem, by taking into account specific issues raised by this task, we propose the algorithm HOCCLUS2 which extracts hierarchically organized and overlapping biclusters by maximizing biclusters cohesiveness and exploiting statistical distribution of the data.
The performance of our method is evaluated in terms of execution time and bicluster cohesiveness on a dataset of experimentally verified miRNA:mRNA interactions, i.e. miRTarBase, as well as on miRNA target predictions extracted from mirDIP. A comparative analysis shows that HOCCLUS2 is able to extract a set of (hierarchically organized) biclusters with significantly higher cohesiveness values than ROCC, in a comparable execution time, which proves the inappropriateness of the application of gene expression biclustering algorithms to discover meaningful biclusters from miRNA:mRNA interactions.
The effectiveness of the algorithm in extracting biologically related biclusters is automatically tested and confirmed on the basis of the GO classification. Furthermore, an indepth biological analysis proves that functional relationships among miRNAs and mRNAs in the same bicluster (at different levels of the hierarchy) find large confirmation in the literature. This indicates that the algorithm is able to extract valuable knowledge and that its application in the biological domain may provide us good insights in the study of complex miRNA mechanisms and functions. Moreover, it also proves that the algorithm could be considered as a valid tool for the detection of candidate new miRNAs target genes.
Current results of HOCCLUS2 on miRTarBase human dataset may already be used to easily map differentially expressed miRNAs from microarrays experiments in miRNA:mRNA interacting modules. On the other hand, the application of HOCCLUS2 on very large datasets of predicted targets of differentially expressed miRNAs, although in some way impaired by the poor effectiveness of the prediction algorithms, may significantly help in suggesting potential significant interactions among the huge amount of results they produce. For future work, we intend to use HOCCLUS2 for multilabel classification purposes, according to the predictive clustering framework [45].
Availability of supporting data
Project Name: HOCCLUS2
Project Home Page: http://www.di.uniba.it/~ceci/micFiles/systems/HOCCLUS2/index.html webcite
Available resources: HOCCLUS2 software and user manual, all the datasets and all the obtained results.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MC and GP contributed to the definition of the method. DD contributed to the conception of the biological investigation. GP and MC contributed to design the software. GP and DD took care of the review and selection of bioinformatics resources. GP implemented the system and ran the experiments. DD performed the biological analysis and validation of the results. GP and MC performed the analysis of the results, from the computer science point of view. MC, GP, DD and CL contributed to the manuscript drafting. MC, GP, DD and DM contributed to the manuscript finalization. DM and MC supervised the study. All the authors read and approved the final manuscript.
Declarations
The work presented in this paper is funded by the FAR project "MBLab: The Molecular Biodiversity LABoratory Initiative".
Acknowledgements
The authors thank Lynn Rudd for reading the final version of this paper.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 7, 2013: Italian Society of Bioinformatics (BITS): Annual Meeting 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S7
References

Cui Q, Yu Z, Purisima E, Wang E: Principles of microRNA regulation of a human cellular signaling network.

Place R, Li L, Pookot D, Noonan E, Dahiya R: MicroRNA373 induces expression of genes with complementary promoter sequences.
Proc Natl Acad Sci USA 2008, 105(5):160813. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wilfred Bernard R, WangXia W, Nelson PT: Energizing miRNA research: A review of the role of miRNAs in lipid metabolism, with a prediction that miR103/107 regulates human metabolic pathways.

Cordero F, Pensa RG, Visconti A, Ienco D, Botta M: OntologyDriven Coclustering of Gene Expression Data. In Proc of AI*IA 2009. LNCS; 2009:426435.

Caldas J, Kaski S: Hierarchical Generative Biclustering for MicroRNA Expression Analysis.
Research in Computational Molecular Biology, Volume 6044 of LNCS 2010, 6579.

Yoon S, De Micheli G: Prediction of regulatory modules comprising microRNAs and target genes.

Prelic A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data.
Bioinformatics 2006, 22(9):11221129. PubMed Abstract  Publisher Full Text

Cheng Y, Church GM: Biclustering of Expression Data.
Proc of ISMB'00 2000, 93103. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Deodhar M, Gupta G, Ghosh J, Cho H, Dhillon IS: A scalable framework for discovering coherent coclusters in noisy data.
Proc of ICML'09 2009, 31. PubMed Abstract  PubMed Central Full Text

Zhang SH, Li Q, Liu J, Zhou XJ: A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNAgene regulatory modules.
Bioinformatics [ISMB/ECCB] 2011, 27(13):401409. Publisher Full Text

Shirdel EA, Xie W, Mak TW, Jurisica I: NAViGaTing the Micronome  Using Multiple MicroRNA Prediction Databases to Identify Signalling PathwayAssociated MicroRNAs.
PLoS ONE 2011, 6(2):e17429. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hsu SD, Lin FM, Wu WY, Liang C, Chan WL, Tsai WT, Chen GZ, Lee CJ, Chiu CM, Chien CH, Wu MC, Huang CY, Tsou AP, Huang HD: miRTarBase: a database curates experimentally validated microRNAtarget interactions.

Ashburner M, et al.: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium.
Nature Genetics 2000, 25:2529. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Yang J, Wang H, Wang W, Yu PS: An Improved Biclustering Method for Analyzing Gene Expression Profiles.
International Journal on Artificial Intelligence Tools 2005, 14(5):771790. Publisher Full Text

BenDor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the orderpreserving submatrix problem.
Proc of RECOMB '02 2002, 4957. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pio G, Ceci M, Loglisci C, D'Elia D, Malerba D: Hierarchical and Overlapping CoClustering of mRNA: miRNA Interactions. [http://ebooks.iospress.nl/publication/7048] webcite
ECAI, Volume 242 of Frontiers in Artificial Intelligence and Applications IOS Press; 2012, 654659.

Peng X, Li Y, Walters KA, Rosenzweig E, Lederer S, Aicher L, Proll S, Katze M: Computational identification of hepatitis C virus associated microRNAmRNA regulatory modules in human livers.
BMC Genomics 2009, 10:373. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Liang H, Li W: MicroRNA regulation of human protein protein interaction network.

Hsu CW, Juan HF, Huang HC: Characterization of microRNAregulated proteinprotein interaction network.
PROTEOMICS 2008, 8(10):19751979. PubMed Abstract  Publisher Full Text

Gusev Y, Schmittgen TD, Lerner M, Postier R, Brackett D: Computational analysis of biological functions and pathways collectively targeted by coexpressed microRNAs in cancer.

Xu J, Wong C: A computational screen for mouse signaling pathways targeted by microRNA clusters.
RNA 2008, 14:12761283. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shalgi R, Lieber D, Oren M, Pilpel Y: Global and Local Architecture of the Mammalian microRNATranscription Factor Regulatory Network.
PLoS Comput Biol 2007, 3(7):e131. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhou Y, Ferguson J, Chang JT, Kluger Y: Inter and intracombinatorial regulation by transcription factors and microRNAs.
BMC genomics 2007, 8:396. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Yuan X, Liu C, Yang P, He S, Liao Q, Kang S, Zhao Y: Clustered microRNAs' coordination in regulating proteinprotein interaction network.
BMC Systems Biology 2009, 3:65. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Qi Y, Ge H: Modularity and Dynamics of Cellular Networks.
PLoS Comput Biol 2006, 2(12):e174. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Califano A, Stolovitzky G, Tu Y: Analysis of Gene Expression Microarrays for Phenotype Classification. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. AAAI Press; 2000:7585.

Karypis G, Kumar V: A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.
SIAM J Sci Comput 1998, 20:359392. Publisher Full Text

Ester M, Kriegel HP, Sander J, Xu X: A DensityBased Algorithm for Discovering Clusters in Large Spatial Databases with Noise.

Hsu SD, Chu CH, Tsou AP, Chen SJ, Chen HC, Hsu PW, Wong YH, Chen YH, Chen GH, Huang HD: miRNAMap 2.0: genomic maps of microRNAs in metazoan genomes.
Nucleic Acids Res 2008, 36:D165D169. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

BenHur A, Horn D, Siegelmann HT, Vapnik V: Support Vector Clustering.

Joachims T: Optimizing search engines using clickthrough data.

Platt JC: Fast training of support vector machines using sequential minimal optimization. Cambridge, USA: MIT Press; 1999:185208.

Pesquita C, Faria D, Bastos HP, Ferreira AEN, Falcão AO, Couto FM: Metrics for GO based protein semantic similarity: a systematic evaluation.

Pedersen T, Pakhomov SVS, Patwardhan S, Chute CG: Measures of semantic similarity and relatedness in the biomedical domain.
J of Biomedical Informatics 2007, 40(3):288299. Publisher Full Text

Draper N, Smith H: Applied regression analysis. Wiley series in probability and mathematical statistics, New York [u.a.]: Wiley; 1966.

He L, Thomson JM, Hemann MT, HernandoMonge E, Mu D, Goodson S, Powers S, CordonCardo C, Lowe SW, Hannon GJ, Hammond SM: A microRNA polycistron as a potential human oncogene.
Nature 2005, 435(7043):828833. PubMed Abstract  Publisher Full Text

Haw R, Hermjakob H, D'Eustachio P, Stein L: Reactome pathway analysis to enrich biological discovery in proteomics data sets.
Proteomics 2011, 11(18):35983613. PubMed Abstract  Publisher Full Text

Stelzer G, Dalah I, Stein T, Satanower Y, Rosen N, Nativ N, OzLevi D, Olender T, Belinky F, Bahir I, Krug H, Perco P, Mayer B, Kolker E, Safran M, Lancet D: Insilico human genomics with GeneCards.
Hum Genomics 2011, 5(6):70917. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Olive V, Jiang I, He L: mir1792, a cluster of miRNAs in the midst of the cancer network.
Int J Biochem Cell Biol 2010, 42(8):134854. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Petrocca F, Vecchione A, Croce CM: Emerging role of miR106b25/miR1792 clusters in the control of transforming growth factor beta signaling.
Cancer research 2008, 68(20):81918194. PubMed Abstract  Publisher Full Text

Pichiorri F, Suh SS, Ladetto M, Kuehl M, Palumbo T, Drandi D, Taccioli C, Zanesi N, Alder H, Hagan JP, Munker R, Volinia S, Boccadoro M, Garzon R, Palumbo A, Aqeilan RI, Croce CM: MicroRNAs regulate critical genes associated with multiple myeloma pathogenesis.
Proc Natl Acad Sci USA 2008, 105(35):1288590. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wang L, Pal S: Protein Arginine Methyltransferase 5 Suppresses the Transcription of the RB Family of Tumor Suppressors in Leukemia and Lymphoma Cells.
Molecular and Cellular Biology 2008, 28:62626277. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hossain A, Kuo MT, Saunders GF: Mir175p regulates breast cancer cell proliferation by inhibiting translation of AIB1 mRNA.
Molecular and cellular biology 2006, 26(21):81918201. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lapenna S, Giordano A: Cell cycle kinases as therapeutic targets for cancer.
Nat Rev Drug Discov 2009, 8(7):547566. PubMed Abstract  Publisher Full Text

Stojanova D, Ceci M, Appice A, Dzeroski S: Network regression with predictive clustering trees.
Data Mining and Knowledge Discovery 2012, 25(2):378413. Publisher Full Text