, UPMC, UMR7238, Génomique Analytique, 15 rue de l’Ecole de Médecine, F-75006 Paris, France

, CNRS, UMR7238, Laboratoire de Génomique des Microorganismes, F-75006 Paris, France

Abstract

Background

Searching for similarities in a set of biological data is intrinsically difficult because some data points should not be clustered at all, while others should belong to several clusters. Under these hypotheses, hierarchical agglomerative clustering is not appropriate. Moreover, if the dataset is not well characterized, as is often the case, supervised classification is not appropriate either.

Results

CLAG (for CLusters AGgregation) is an unsupervised, non-hierarchical clustering algorithm designed to cluster a large variety of biological data and to provide a clustered matrix together with numerical values indicating cluster strength. CLAG clusters correlation matrices for residues in protein families, gene-expression and miRNA data related to various cancer types, sets of species described by multidimensional vectors of characters, and binary matrices. It does not require all data points to cluster, and it converges, yielding the same result at each run. Its simplicity and speed allow it to run on reasonably large datasets.

Conclusions

CLAG can be used to investigate the cluster structure present in biological datasets and to identify its underlying graph. It proved to be more informative and accurate than several known clustering methods, such as hierarchical agglomerative clustering,

Background

Clustering of biological data often requires looking for the proximity of a few data points within a large dataset, with the purpose of grouping together only those that satisfy the same set of constraints, possibly resulting from the same functional origins, or that have undergone the same evolutionary pressures. This is the case for amino acids in proteins, where one expects a few of the residues to account for the structural stability of the protein or for its functional activity. For these biological problems, the number of expected clusters is unknown, and classification approaches, known as unsupervised, are expected to unravel hidden structures in the data.

A common approach to clustering is the simple unsupervised

CLAG, for CLusters AGgregation, is an unsupervised, non-hierarchical clustering algorithm that handles non-uniform distributions of values in order to zoom in on dense sets of character values, parameterizes the proximity of data points, and outputs a graph of similarity between data points as well as a clustered matrix.

Important work on clustering a restricted number of datapoints

Results and discussion

Clustering algorithm and aggregation

Let us consider a set

Entries distributions and grids

The minimum e_1 and the maximum e_2 entries within the matrix define the range |e_2 − e_1|.

We discretize the entries distribution with the help of two shifted grids of intervals that will be used to easily define entries closeness. Namely, a

Entries distribution and grids

**Entries distribution and grids.** Toy example based on a matrix of 100 entries. **A:** the distribution of entries, partitioned with a 0-grid (solid lines) and with a 1-grid (dashed lines); alternated grey and green colors are used to identify quantile regions. **B:** elements and their environmental scores S_env. **C:** contrary to B, here the 5 pairs of entries and their S_env scores.

We say that a distribution of scores is

Closeness between entries

CLAG clusters the elements of the input set. Let e_1, e_2 be two entries within the matrix such that e_1 < e_2. We say that e_1 and e_2 are close if e_2 belongs to a grid interval containing e_1.

Notice that for distributions of scores that are not heterogeneous, the definition of closeness can be greatly simplified: two entries e_1, e_2, with e_1 < e_2, are close if e_2 belongs to the 0-grid interval containing e_1. For distributions that are possibly heterogeneous, the notion of grid turns out to be crucial, but it should be observed that the concept of closeness could be stated by using the 0-grid only. The usage of the second grid (that is, the 1-grid) is redundant in this case.
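The two-grid construction can be sketched as follows. This is a minimal illustration, not CLAG's implementation: the interval width delta * (max − min), the half-interval shift of the 1-grid, and the rule that two entries are close when they share an interval of either grid are all assumptions filling in the elided definition.

```python
# Sketch of closeness via two shifted grids (illustrative assumptions:
# intervals of width delta*(max-min); the 1-grid is the 0-grid shifted
# by half an interval; entries are close when they share an interval
# of either grid).

def grid_index(value, origin, width):
    """Index of the grid interval containing `value` for a grid starting at `origin`."""
    return int((value - origin) // width)

def are_close(e1, e2, entries, delta=0.1):
    lo, hi = min(entries), max(entries)
    width = delta * (hi - lo)                      # interval width of both grids
    same_0 = grid_index(e1, lo, width) == grid_index(e2, lo, width)  # 0-grid
    shifted = lo - width / 2                       # origin of the shifted 1-grid
    same_1 = grid_index(e1, shifted, width) == grid_index(e2, shifted, width)
    return same_0 or same_1

entries = [0.02, 0.05, 0.07, 0.40, 0.45, 0.90, 1.0]
print(are_close(0.02, 0.05, entries))   # nearby entries fall in a common interval
print(are_close(0.02, 0.90, entries))   # distant entries do not
```

The shifted 1-grid avoids the artefact of two very close values being separated only because they straddle a 0-grid boundary.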

Environmental score

For a pair of elements, we compute the environmental score S_env by counting the number of characters on which the two elements behave similarly, and normalizing S_env to the interval [−1,1]. A high environmental score reflects the fact that
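Only the range [−1,1] is fixed by the text above; one plausible form of such a score can be sketched as follows. The agreement criterion (a tolerance `tol`) and the linear map 2f − 1 are illustrative assumptions, not the paper's elided formula.

```python
def environmental_score(row_x, row_y, tol=0.1):
    """Fraction of characters on which x and y take close values,
    mapped linearly to [-1, 1]. `tol` and the map 2*f - 1 are
    illustrative assumptions, not CLAG's actual definition."""
    agree = sum(1 for a, b in zip(row_x, row_y) if abs(a - b) <= tol)
    return 2.0 * agree / len(row_x) - 1.0

# Two elements described over 5 characters; they agree on 4 of them:
print(environmental_score([0.1, 0.5, 0.9, 0.3, 0.7],
                          [0.15, 0.5, 0.2, 0.35, 0.7]))  # ~0.6
```

With this convention, a score near 1 means the two elements behave alike on almost all characters, and a score near −1 means they almost never do.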

Clusters and affine clusters

To define a cluster in a matrix, we fix an element

a. _{env
}(_{env
}(

b.

If no such

From the definition, it follows that a cluster is a subset of elements in _{env
}(_{env
}(_{env
}(_{env
}(

By increasing

Matrices with

If the matrix is symmetric, one can also define a symmetric score S_sym of pairs of elements

Symmetric score

In order to evaluate the symmetric score of a pair of entries, notice that S_sym is defined for close entries only; for all other pairs it is undefined. With no loss of generality,

The definition of symmetric score for two close entries

1. If _{
sym
}(_{sym
}(

2. If _{sym
}(_{sym
}(

3. If

where

4. If _{sym
}(_{sym
}(

The symmetric score of a pair of elements

Clusters and affine clusters taking into account symmetricity

We fix an element

a. _{sym
}(_{ sym
}(

b. _{env
}(_{env
}(

c.

If no such

For a cluster _{env
}(_{sym
}(_{env
}(_{sym
}(

CLAG algorithm: the clustering step

CLAG is structured along two steps: a clustering step and a cluster aggregation step (Figure

1. it computes environmental scores for all pairs of elements in

2. it clusters

3. it identifies clusters and affine clusters.

4. it outputs a list of ranked affine clusters with respect to their environmental (and symmetric) scores and other numerical properties, and pdf images of the clustered matrix.

CLAG flowchart

**CLAG flowchart.** Illustration of the different steps of the clustering method. The algorithm’s inputs are the matrix,

Notice that the input matrix is automatically renormalized to [0,1] if its values do not already belong to [0,1]. The advantage of using renormalized values is that they allow the user to visualize affine clusters with the R script developed for this purpose. Also, notice that when
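The renormalization described above is a standard min-max rescaling; a minimal sketch (not CLAG's actual code) is:

```python
def renormalize(matrix):
    """Linearly rescale all matrix entries to [0, 1] (min-max normalization),
    leaving matrices that already lie in [0, 1] untouched."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    if lo >= 0.0 and hi <= 1.0:
        return matrix                      # already in [0, 1]
    span = hi - lo or 1.0                  # guard against a constant matrix
    return [[(v - lo) / span for v in row] for row in matrix]

print(renormalize([[-2.0, 0.0], [2.0, 6.0]]))  # [[0.0, 0.25], [0.5, 1.0]]
```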

The higher the environmental score, the closer the behavior of the elements grouped in a cluster (with respect to the environment). This information is helpful to understand the structure of the set

Notice that in the clustering step, the algorithm identifies the set of clusters generated by all elements of

Clusters might share common elements, and we wish to derive non-overlapping sets of elements while keeping track of element proximity. We do so for affine clusters and, possibly, for clusters with scores greater than a fixed positive threshold. We iteratively aggregate clusters in a graph as follows:

1. for any clusters C_1, C_2, …, C_n having the same (symmetric score, if it exists, and) environmental score, iteratively fuse together those clusters that share a common element, and associate to the resulting cluster the same (symmetric score and) environmental score. Apply this step until no more clusters can be fused together. Rank the list of resulting clusters by the (highest symmetric score, if it exists, and secondly the) highest environmental score.

2. remove the two clusters C_1, C_2 from the top of the ranked list; if C_1, C_2 share an element, then construct a graph whose labelled nodes are the elements of C_1, C_2 and whose edges are defined between all elements of C_1, and between all elements of C_2; we color the nodes of the graph with a unique color and call the resulting graph an aggregate. If C_1, C_2 do not have any element in common, construct a clique associated to each cluster and color them differently; the two labelled cliques are aggregates.

3. remove the first cluster C from the top of the ranked list; if C shares elements with the aggregates A_1, …, A_k, where possibly k = 1, add the “new” nodes of C (those not already in the A_i’s) and all edges between all nodes in C. If

The resulting graph is called

In the following, without loss of generality, the term “key aggregate” will also be used to refer to the set of elements labeling the nodes of the key aggregate subgraph. Using sets, we present a toy example to illustrate the aggregation step. Let C_1 = {1,2,3}, C_2 = {3,4,5}, C_3 = {6,7,8}, C_4 = {8,9,10}, C_5 = {5,10,11,12} be five affine clusters issued from the first step of the algorithm. Let s_1, s_1, s_2, s_3, s_4 be their respective decreasing scores. By step 1, C_1 and C_2 are fused together in a set C_{1,2} = {1,2,3,4,5} because they have the same score and they share a common element. The set C_{1,2} has score s_1. In step 2, the algorithm selects C_{1,2} and C_3, that is, the two clusters with highest score; it verifies that they share no common element and it labels C_{1,2}, C_3 with two different colors. Then, it selects C_4 (in step 3), since it has the highest score among those clusters not yet examined. Cluster C_4 shares an element with C_3 and it is fused with C_3 into a new set C_{3,4}, keeping the color label of C_3. By iterating step 3, C_5 is considered. It shares an element with C_{1,2} and one with C_{3,4}. The new set C_6 = {11,12} is constructed by subtracting C_{1,2} ∪ C_{3,4} from C_5, and it is labelled by a new color. The three sets C_{1,2}, C_{3,4} and C_6 are the resulting key aggregates. Strictly speaking, the algorithm provides a colored graph structure that traces the relations between the different key aggregates (Additional file
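The toy example above can be replayed with a small sketch of the aggregation logic on element sets. This is a simplification that ignores colors and the graph structure; the cluster names and the union/subtraction rule follow the toy example, not CLAG's implementation.

```python
# Affine clusters from the first step, ranked by decreasing score
# (the first two share the same score s1, as in the toy example).
clusters = [({1, 2, 3}, "s1"), ({3, 4, 5}, "s1"),
            ({6, 7, 8}, "s2"), ({8, 9, 10}, "s3"), ({5, 10, 11, 12}, "s4")]

# Step 1: fuse equal-score clusters that share an element.
fused = []
for elems, score in clusters:
    for f in fused:
        if f[1] == score and f[0] & elems:
            f[0] |= elems
            break
    else:
        fused.append([set(elems), score])

# Steps 2-3 (set view only): a cluster overlapping exactly one aggregate is
# fused into it; a cluster overlapping several aggregates contributes only
# its "new" elements as a fresh aggregate.
aggregates = []
for elems, _ in fused:
    overlapping = [a for a in aggregates if a & elems]
    if len(overlapping) == 1:
        overlapping[0] |= elems
    else:
        covered = set().union(*overlapping) if overlapping else set()
        aggregates.append(elems - covered)

print(aggregates)  # [{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, {11, 12}]
```

The three resulting sets match the key aggregates C_{1,2}, C_{3,4} and C_6 of the toy example.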

CLAG instructions and Figures issued from the datasets analysis. A list of instructions for running CLAG and extra figures for the analysis of the four datasets discussed in the article are given.

Click here for file

It might be useful to rank aggregates with respect to the strength of the clusters that form them. This can be done by associating to an aggregate two S_env (S_sym) scores: the first is the S_env (S_sym) score of the first cluster entering the aggregate, and the second is the S_env (S_sym) score of the last cluster entering the aggregate.

Algorithmic complexity

The construction of the

Application to biological data

We analyze four datasets

Breast tumor miRNA expression data

A panel of 20 different breast cancer samples was chosen to represent three common phenotypes and was blindly analyzed for miRNA expression levels by microarray profiling

Application on breast tumor samples data

**Application on breast tumor samples data.** A panel of 20 different breast cancer samples. **A**: matrix of key aggregates computed with CLAG. **B**: zoom on the matrix in A where the three aggregation graphs in C are indicated. **C**: aggregation graph produced by CLAG where three main clusters (produced by the first step of the algorithm and colored red, green and violet) are connected among each other by grey edges. Notice that the three clusters are indicated on the top of the zoomed matrix in B. Numbers labelling the nodes of the graph correspond to samples, that is, columns in the matrix. **D**: dendrogram produced from the data clustered in A with a hierarchical clustering algorithm based closely on the average-linkage method of Sokal and Michener and developed in

When CLAG is applied to the dataset, it classifies all patients at a probability of the order of 10^{−4} and a sum of the probabilities of unusual tables of 0.025. These probabilities improve on the ones computed over the tree organization in the Figure, with a probability of the order of 10^{−3} and a sum of probabilities of unusual tables of 0.066. In both cases, the probabilities of unusual tables are small enough to reject the null hypothesis. (See Additional file

CLAG on breast cancer data: clustering analysis

**CLAG on breast cancer data: clustering analysis.** Curves counting classified elements (**A**) and key aggregates **(B)** for increasing

On this dataset, k-means, c-means and MCLUST fail at clustering, proposing one or several clusters made of single elements (see Additional file

CLAG executions on the breast cancer dataset. CLAG executions on the breast cancer dataset are detailed with respect to parameters variation. Executions of other clustering tools (k-means, c-means, MCLUST) are also reported.

Click here for file

Brain cancer gene expression data

The expression levels of more than 7000 genes for 42 patients have been monitored and classified into 5 different brain cancer diagnoses by an

For

CLAG on brain cancer gene expression data: error analysis

**CLAG on brain cancer gene expression data: error analysis.** Error analysis of CLAG clustering for gene expression data on brain cancer **A:** count of errors and key aggregates at increasing **B:** number of clustered patients evaluated on aggregation of clusters having scores greater than a fixed threshold. **C:** number of PNET patients aggregated at increasing **D:** number of patients that are correctly classified together.

While

With

In

CLAG executions on the brain cancer dataset. CLAG executions on the brain cancer dataset are detailed with respect to parameters variation. Executions of other clustering tools (k-means, c-means, MCLUST) are also reported.

Click here for file

Coevolved residues in protein families data

A large number of coevolution analysis methods investigate evolutionary constraints in protein families via the correlated distribution of amino acids in sequences. Given a protein family, they produce a square matrix of coevolution scores between pairs of alignment positions in the sequence alignment associated to the protein family

We applied CLAG to the coevolution score matrix produced by the coevolution analysis method MST

CLAG on a coevolution scores matrix

**CLAG on a coevolution scores matrix.** Clustering of the MST matrix of coevolution scores for the globin protein family. **A**: slices of the clustered matrix associated to all key aggregates obtained with **B**: hierarchical clustering of the dataset where key aggregates of

CLAG on the globin dataset: clustering analysis

**CLAG on the globin dataset: clustering analysis.** Curves counting classified elements **(A)** and key aggregates **(B)** for increasing

Notice that for this dataset,

Agglomerative hierarchical clustering

When we compare

CLAG executions on the globin dataset. CLAG executions on the globin dataset are detailed with respect to parameters variation. Executions of other clustering tools (k-means, c-means, MCLUST) are also reported.

Click here for file

CLAG in synthetic datasets

We ran CLAG on six different synthetic datasets with Gaussian clusters, each consisting of 1024 vectors organized in 16 clusters and defined in 32, 64, 128, 256, 512 and 1024 dimensions, respectively. CLAG succeeds in correctly clustering all datasets for
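Synthetic sets of this kind can be generated in a few lines. This is a hedged sketch, not the downloaded benchmark generator: the cluster spread, the placement of centres, and the random seed are illustrative choices.

```python
import random

def gaussian_dataset(n_clusters=16, points_per_cluster=64, dim=128,
                     spread=0.05, seed=0):
    """Generate labelled points around randomly placed cluster centres.
    Parameter values mirror the 128-dimensional dataset described in the
    text (16 clusters x 64 points = 1024 vectors); `spread` is illustrative."""
    rng = random.Random(seed)
    centres = [[rng.random() for _ in range(dim)] for _ in range(n_clusters)]
    data = []
    for label, centre in enumerate(centres):
        for _ in range(points_per_cluster):
            data.append(([rng.gauss(mu, spread) for mu in centre], label))
    return data

data = gaussian_dataset()
print(len(data))        # 1024 points
print(len(data[0][0]))  # 128 dimensions
```

The ground-truth labels make it straightforward to count, as in the figures, how many points a clustering method assigns to the correct cluster.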

CLAG executions on all multi-dimensional datasets and best models computed by MCLUST. CLAG executions on all multi-dimensional datasets are detailed with respect to parameters variation. BIC values for best model selection are reported for MCLUST.

Click here for file

CLAG on the 128-dimensional synthetic dataset

**CLAG on the 128-dimensional synthetic dataset.** **A:** the 128-dimensional dataset contains 1024 points and 16 clusters generated with a Gaussian distribution. **B:** curves, associated to different score thresholds, describing the number of elements that are clustered by CLAG while varying the parameters; k-means **(C)**, c-means **(D)**, MCLUST **(E)**. k-means and c-means were run with 16 clusters, and MCLUST with “ellipsoidal, equal variance with 9 components” as best model (note the 8 grey clusters). For

Also, we generated 2D sets of points with different shapes and degrees of density and checked the performance of k-means, c-means, MCLUST and CLAG on these datasets (Additional file

CLAG executions on synthetic datasets G4, G5, G6. CLAG executions on synthetic datasets G4, G5, G6. Executions of other clustering tools (k-means, c-means, MCLUST) are also reported.

Click here for file

CLAG’s parameterization

CLAG is based on two parameters,

Conclusions

CLAG is an unsupervised non-hierarchical and deterministic clustering algorithm applicable to

An important feature is that CLAG does not try to cluster all data points, but combines just those that are sufficiently similar to be clustered together. Because of this relaxed clustering constraint, after the clustering step, the user learns which data points drove the clustering with respect to

The cluster structure present in biological datasets can be systematically investigated with CLAG. This underlying structure between data points is typically not a tree but a graph, and CLAG provides an aggregation graph describing it.

Known clustering methods require a data point to belong to at most one cluster. For certain applications, this is a limitation. For instance, in coevolution score matrices, a fixed alignment position in a protein family could be subject to more than one evolutionary constraint and therefore might play several roles for the protein. Unlike other approaches, CLAG allows a position to belong to several clusters. Hence, the user can extract useful information from the clustering step and eventually use the outcomes of this step as a clustering result.

For the user, scores are relevant to evaluate cluster strength and to decide whether clusters should be considered important or not for their analysis. This numerical feature is missing in hierarchical clustering, where it becomes hard, at times, to choose among subtrees based on their height. The globin analysis is an example of this (Figure

CLAG’s second step (producing key aggregates) is applied only to affine clusters, that is, clusters with positive environmental (and possibly symmetric) score(s). Notice that the general notion of affinity, asking for S_env(

We should warn potential users that the definitions of environmental score and affine cluster implicitly assume that all the

CLAG has been compared to various clustering approaches on four biological datasets, and it proved to be more informative and accurate than hierarchical agglomerative clustering and

Methods

Implementation

CLAG takes as input a matrix and a

CLAG is freely available under the GNU GPL for download at

Comparative tools and data

Hierarchical clustering,

Six multi-dimensional synthetic datasets were downloaded from

The exact contingency table computation has been realized on the website

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AC and LD designed the algorithm, selected and analyzed the four experiments illustrating the applicability and the performance of the algorithm. LD implemented the tool. Both authors read and approved the final manuscript.

Acknowledgements

We thank Martin Weigt for running SCAP on our datasets. LD was supported by a doctoral fellowship and a teaching assistantship from the Ministère de l’Enseignement Supérieur et de la Recherche.