Biochemistry Department, UT Southwestern Medical Center, Dallas, TX, USA

Howard Hughes Medical Institute, UT Southwestern Medical Center, Dallas, TX, USA

Department of Computer Science, Amrita Vishwa Vidyapeetham University, Amritapuri Campus, Kerala, India

School of Biotechnology, Amrita Vishwa Vidyapeetham University, Amritapuri Campus, Kerala, India

Abstract

Background

Numerous clustering methods, such as single linkage and K-means, have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable to problems that demand high stringency.

Methods

Our method is self consistency grouping (SCG).

Results

Our tests, in which we introduced errors into the distance measurements, demonstrated that SCG yields very few false positives. Clustering of protein domain representatives by structural similarity showed that SCG could recover homologous groups with high precision.

Conclusions

SCG has potential for finding biological relationships under stringent conditions.

Background

Grouping related objects into clusters has been one of the most widely used tools in many disciplines, including the biological sciences.

We developed a method that we call self consistency grouping (SCG).

Methods

Terminology

A set of

The total number of objects in the data set is n.

We call an algorithm

The input consists of a rank matrix,

Algorithms

We present three algorithms: A1, A2, and A3.

Each of

The asymptotic worst-case time complexity of all three algorithms is the same. In practice, from fastest to slowest, they run in the order A2, A3, A1 (see the execution time measurements below).

Subcluster management

We use a tree data structure to manage the subcluster structure. Initially, every node representing an object is a root (a root is either a singleton or a node that has children). When a set of objects forms a cluster, their roots become children of a new root representing that cluster. If a larger cluster later subsumes a smaller one, the additional elements (w.r.t. the smaller cluster) may form a subcluster; to determine this, we check whether all of these elements share a common root. Hence, this method is efficient and does not increase the overall time complexity.
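A minimal sketch of this root-tracking tree, assuming (as an illustration, not the paper's exact implementation) that each new cluster creates a fresh root over the current roots of its members:

```python
class Node:
    """A node in the subcluster tree; objects are leaves, clusters are internal roots."""
    def __init__(self, label):
        self.label = label
        self.parent = None

    def root(self):
        # Walk up parent pointers to the current root.
        n = self
        while n.parent is not None:
            n = n.parent
        return n

def form_cluster(members, label):
    """Make the roots of `members` children of a new root representing the cluster.

    If several members already share a common root, that root is kept as a
    single child, preserving it as a subcluster of the new, larger cluster.
    """
    roots = {id(m.root()): m.root() for m in members}  # distinct current roots
    new_root = Node(label)
    for r in roots.values():
        r.parent = new_root
    return new_root

# Usage: objects a..d; {a, b} forms a cluster first, then {a, b, c, d}.
a, b, c, d = (Node(x) for x in "abcd")
ab = form_cluster([a, b], "ab")
abcd = form_cluster([a, b, c, d], "abcd")
assert a.root() is abcd and ab.parent is abcd  # {a,b} survives as a subcluster
```

Because subcluster membership is resolved by following parent pointers to a shared root, no extra pass over the distance data is needed, consistent with the claim that the bookkeeping does not raise the overall complexity.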

A1 algorithm

At the start of an iteration, all the objects are marked as valid for that iteration. For every valid object

The algorithm examines all clusters from the smallest to the largest size and hence finds all clusters and their subclusters. Only clusters that contain invalid objects (hence already examined) are not examined. In O(n^2) time one can check whether a candidate set forms a cluster; summing over candidate sizes k = 2, …, n−1 gives O(n^3) per object, and a total of O(n^4) for all the objects.
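The check itself is only partially legible above; one natural reading of self-consistency, sketched here purely as an illustration (the predicate name and the exact rule are our assumptions, not the paper's verbatim definition), is that a candidate set is a cluster when every member ranks the other members ahead of all non-members:

```python
def is_self_consistent(ranks, candidate):
    """Check whether `candidate` (a set of object indices) is mutually consistent:
    every member must rank all other members ahead of every non-member.

    `ranks[i]` lists all objects ordered from nearest to farthest from object i.
    This is an illustrative reading of self-consistency, not the paper's exact rule.
    """
    k = len(candidate)
    for i in candidate:
        # The k-1 nearest neighbours of i (excluding i itself) ...
        nearest = [j for j in ranks[i] if j != i][: k - 1]
        # ... must be exactly the other members of the candidate set.
        if set(nearest) != candidate - {i}:
            return False
    return True

# Rank matrix for 4 objects: 0 and 1 are mutual nearest neighbours, as are 2 and 3.
ranks = [
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
]
assert is_self_consistent(ranks, {0, 1})
assert not is_self_consistent(ranks, {0, 2})
```

Checking one candidate set this way touches up to k rows of length n, which is within the O(n^2) per-check budget used in the complexity accounting above.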

A2 algorithm

For every object ^{i}^{th}^{i}^{i}^{i}^{i}^{i}^{th}

Let the first two objects in the ^{th}^{4}). However, in practice, the average number of candidate clusters for

We note that when an object belongs to two clusters C1 and C2, then either C1 ⊂ C2 or C2 ⊂ C1. This is true because all the members of the smaller cluster have smaller ranks with respect to that object.

**Lemma 1. **

**Proof.** We prove this by contradiction. Let

A3 algorithm

^{i}^{i}^{th}^{i}^{th}

**Lemma 2. **

**Proof.** We prove this by contradiction. Let

Execution Trace

Consider

Execution traces for SCG algorithms

**Execution traces for SCG algorithms.** The two matrices

We trace all three algorithms on these matrices.

Execution time comparisons of A1, A2, and A3

The three algorithms described above were timed on datasets of increasing size (see the table below). A quadratic fit to the measured running times yields a coefficient of determination of 0.9999.

Execution time measurements of SCG algorithms

| Number of objects | A1 | A2 | A3 |
|---|---|---|---|
| 8 | 0.0011 | 0.00055 | 0.0009 |
| 16 | 0.0061 | 0.0025 | 0.0042 |
| 32 | 0.025 | 0.0039 | 0.01 |
| 64 | 0.131 | 0.009 | 0.051 |
| 128 | 0.594 | 0.016 | 0.276 |
| 256 | 2.584 | 0.049 | 1.295 |
| 512 | 12.54 | 0.118 | 4.34 |
| 1024 | 58.429 | 0.27 | 25.47 |

Comparisons of execution times of SCG algorithms

**Comparisons of execution times of SCG algorithms.**
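The measurements in the table above can be condensed into rough empirical growth exponents (the log of the time ratio over the log of the size ratio between the smallest and largest runs); a quick sanity check using only the reported numbers:

```python
import math

# (seconds at N=8, seconds at N=1024) pairs taken from the execution-time table above.
times = {
    "A1": (0.0011, 58.429),
    "A2": (0.00055, 0.27),
    "A3": (0.0009, 25.47),
}
n_lo, n_hi = 8, 1024

def exponent(t_lo, t_hi):
    # Empirical order of growth between the two endpoint measurements.
    return math.log(t_hi / t_lo) / math.log(n_hi / n_lo)

exps = {name: exponent(*t) for name, t in times.items()}
# A2 grows far more slowly than A1 and A3 over this range, matching its
# position as the fastest algorithm in practice.
assert exps["A2"] < exps["A3"] < exps["A1"]
```

On these endpoints A1 and A3 scale slightly above quadratically, while A2 scales well below quadratically, consistent with the ranking A2, A3, A1 from fastest to slowest.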

Results

We compare SCG to well-known agglomerative clustering algorithms: complete linkage (CL), single linkage (SL), and average linkage (AL). We compared the methods on simulated data and on SCOP datasets. Among these methods, CL is the most similar to SCG.

CL clusters a pair of objects if the distance between them is less than a specified cut-off. CL merges a pair of clusters if the maximum distance between any pair of objects, one from each cluster, is at most the chosen cut-off. Thus, it is stringent, akin to SCG.

In contrast to CL, SL, another popular method, groups objects aggressively: when forming hierarchies of clusters, SL uses the minimum distance between clusters. AL is a third popular method that balances the approaches of CL (very conservative) and SL (very aggressive). Thus, CL, AL, and SL capture a wide gamut of behavior, and comparing SCG to them yields a fair assessment of SCG.
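The three linkage criteria differ only in how they combine pairwise distances between two clusters; a minimal hand-rolled sketch (not the implementation used in the paper):

```python
def inter_dist(dist, c1, c2, mode):
    """Inter-cluster distance between clusters c1 and c2 under a linkage rule.

    dist[i][j] is the pairwise distance; mode is 'complete', 'single', or 'average'.
    """
    ds = [dist[i][j] for i in c1 for j in c2]
    if mode == "complete":    # CL: most distant cross-cluster pair (conservative)
        return max(ds)
    if mode == "single":      # SL: closest cross-cluster pair (aggressive)
        return min(ds)
    return sum(ds) / len(ds)  # AL: average over all cross-cluster pairs

# Usage on a toy 4-object distance matrix with two natural pairs {0,1} and {2,3}.
dist = [[0, 1, 4, 5],
        [1, 0, 3, 6],
        [4, 3, 0, 1],
        [5, 6, 1, 0]]
c1, c2 = [0, 1], [2, 3]
assert inter_dist(dist, c1, c2, "single") == 3
assert inter_dist(dist, c1, c2, "complete") == 6
assert inter_dist(dist, c1, c2, "average") == 4.5
```

With a cut-off of 4, SL would already merge the two clusters (3 ≤ 4) while CL would not (6 > 4), illustrating the aggressive/conservative contrast described above.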

SCG uses the rank matrix (obtained from a distance metric) and the inherent consistency in the ranks, whereas CL directly uses the distance matrix along with an explicit parameter (the cut-off). Let the distance cut-off for CL be δ. We note that SCG can yield a (sub)cluster in which the maximum distance between a pair of objects is greater than δ; that is, the distance between o1 and o2 can be greater than δ, where o1 and o2 are objects of the same cluster.

After SCG produces the clusters, one can rank the clusters by the ascending value of the maximum distance between any pair of objects within a cluster. This will rank the natural clusters in the data from the best (first) to the worst (last).
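This post-hoc ranking amounts to sorting clusters by their diameter (maximum intra-cluster distance); a small sketch:

```python
from itertools import combinations

def rank_clusters(dist, clusters):
    """Sort clusters by their diameter (max pairwise distance), best first.

    dist[i][j] is the pairwise distance between objects i and j.
    Singletons have diameter 0 and therefore sort to the front.
    """
    def diameter(c):
        return max((dist[i][j] for i, j in combinations(c, 2)), default=0.0)
    return sorted(clusters, key=diameter)

# Usage: the tight pair {0,1} ranks ahead of the loose triple {0,1,2}.
dist = [[0, 1, 9],
        [1, 0, 9],
        [9, 9, 0]]
assert rank_clusters(dist, [[0, 1, 2], [0, 1]]) == [[0, 1], [0, 1, 2]]
```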

We perform a balanced comparison of clustering methods by measuring the number of clusters and the number of incorrect pairs. If the goal of clustering is to find robust clusters, AL is the most suitable. However, if the goal is to minimize false positives in the presence of a large number of random errors, SCG is a good candidate. This suits many problems in the biological sciences.

Comparison of methods on simulated data

We compare SCG to other popular agglomerative clustering methods. We generated 100 random test datasets. As shown in Fig

Effect of random errors introduced in distance measurements

**Effect of random errors introduced in distance measurements**. (a) Dataset: We randomly generated a dataset of 80 points around four centers (0,8), (0,-8), (8,0) and (-8,0), 20 points for each center. Each point was offset from the center in both the X and Y directions by a random amount following a normal distribution (µ = 0 and SD = 1). (b) Effect of random error on average cluster sizes: For the given dataset of 80 points, the Euclidean distances were calculated. We then perturbed each pairwise distance with a random value following a Gaussian distribution with µ = 0 and the SD shown on the X axis. Note that SD = 0 implies that there are no perturbations. These distances were used to build clusters using SCG (cyan line), complete linkage (CL, blue line), average linkage (AL, green line) and single linkage (SL, red line). Since CL, AL and SL require score cut-offs, we measured the clustering with distance cut-off values of 2 (conservative) and 4 (less conservative), denoted by the numbers following “/” (solid and dotted lines, respectively). Finally, the number of clusters was measured per method per cut-off (this includes singletons). Thus, the maximum possible value is 80 (all singletons) and the minimum possible value is 1. The ideal number is 4 by design (Fig 3(a)). The error bars shown at different points of the curves (each representing a method) are derived from 100 perturbations for a given SD. Note that SCG shows the steepest rise in the number of clusters. (c) Effect of random error on cluster qualities: Legends and the unit of the X-axis are the same as in (b). After each method identifies clusters, we enumerate all pairs within a cluster,
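The simulated dataset and the perturbation scheme described in the legend can be reproduced along these lines (the function names and seeds are ours, for illustration):

```python
import math
import random

def make_dataset(seed=0):
    """80 points: 20 around each of four centres, with N(0, 1) offsets per axis,
    as described in the figure legend above."""
    rng = random.Random(seed)
    centres = [(0, 8), (0, -8), (8, 0), (-8, 0)]
    return [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
            for cx, cy in centres for _ in range(20)]

def perturbed_distances(points, sd, seed=1):
    """Pairwise Euclidean distances, each perturbed by N(0, sd) noise;
    sd = 0 reproduces the unperturbed distances."""
    rng = random.Random(seed)
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            base = math.dist(points[i], points[j])
            d[i][j] = d[j][i] = base + rng.gauss(0, sd)
    return d

pts = make_dataset()
assert len(pts) == 80
# With sd = 0, the "perturbed" distance is exactly the Euclidean distance.
assert perturbed_distances(pts, 0.0)[0][1] == math.dist(pts[0], pts[1])
```

Repeating `perturbed_distances` with 100 different seeds for each SD value reproduces the error-bar protocol in the legend.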

Ideally the number of clusters should be 4, denoting 4 groups of points in the plane as shown in Fig

Comparison of methods in clustering protein structure

In general, protein structures were considered hard to cluster with conventional clustering methods without human intervention

Comparisons of clusters built by different methods to the reference SCOP fold classification (total # of domains clustered: 9528)

| | SCG | CL | AL | SL |
|---|---|---|---|---|
| Total number of clusters* | 4965 | 4965 | 4965 | 4965 |
| Number of non-singleton clusters | 1926 | 1561 | 1263 | 975 |
| Number of incorrect pairs | 102 | 214 | 2952 | 6440 |
| Percentage of incorrect pairs | (0.2) | (0.4) | (3.5) | (3.7) |
| Number of correct pairs | 46938 | 50948 | 81280 | 166386 |
| Percentage of correct pairs | (99.8) | (99.6) | (96.5) | (96.3) |

*Total number of clusters was fixed at the number of clusters determined by SCG for a fair comparison of different methods.

The SCG clustering of SCOP domains shows that many of the clusters are very small: ~1/3 of the protein domains form singleton clusters (3039 domains), and only a few domains form relatively bigger clusters (see Fig

SCG, CL, AL, and SL clustering results of SCOP domains based on structural similarity score

**SCG, CL, AL, and SL clustering results of SCOP domains based on structural similarity score**. The same color scheme was used as in Fig

Similarities of clusters built by different methods

| | SCG | CL | AL | SL |
|---|---|---|---|---|
| SCG | 1.0 | 0.78 | 0.59 | 0.36 |
| CL | | 1.0 | 0.74 | 0.44 |
| AL | | | 1.0 | 0.65 |
| SL | | | | 1.0 |

F-measures are used to quantify the similarity between clusterings. The F-measure is formally defined as the harmonic mean of precision and recall.
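Using a common pairwise formulation (counting within-cluster pairs shared between two clusterings; this particular formulation is our assumption where the text is truncated), the F-measure can be computed as:

```python
from itertools import combinations

def pair_set(clusters):
    """All unordered within-cluster pairs of a clustering."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def f_measure(found, reference):
    """Harmonic mean of pairwise precision and recall of `found` vs `reference`."""
    found_pairs, ref_pairs = pair_set(found), pair_set(reference)
    if not found_pairs or not ref_pairs:
        return 0.0
    tp = len(found_pairs & ref_pairs)          # pairs grouped together in both
    precision = tp / len(found_pairs)
    recall = tp / len(ref_pairs)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

reference = [[1, 2, 3], [4, 5]]
assert f_measure(reference, reference) == 1.0
# Splitting {1,2,3} keeps precision at 1 but halves recall: F = 2/3.
assert abs(f_measure([[1, 2], [4, 5], [3]], reference) - 2 / 3) < 1e-9
```

Identical clusterings score 1.0, matching the diagonal of the similarity table above.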

We would expect results similar to the clustering done on simulated data if the scores were perfect and the grouping was done objectively. Note that we used Euclidean distance in simulation (Fig

SCG method in iteration

In some cases, a few false positives can be tolerated in order to increase the average cluster size. Here, the stringency of SCG becomes an issue, so we designed iterative SCG (iSCG). The iteration is similar to other agglomerative methods: in the first iteration, SCG finds all independent clusters; subsequently, each independent cluster is considered an object and the rank matrix is updated for the next iteration. The new ranks can be determined based on one of the following strategies: i. the most similar relationships, ii. the average similarities between two groups, or iii. the most distant relationships, roughly corresponding to SL, AL, and CL, respectively. iSCG demonstrated a correlation between the number of false positives and the number of iterations.
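The rank-matrix update between iterations can be sketched as follows; the function names and matrix layout are illustrative, with the three strategies mapped to 'min' (~SL), 'avg' (~AL), and 'max' (~CL):

```python
def group_distances(dist, groups, strategy="max"):
    """Distance matrix between groups, under one of the three update strategies
    named above: 'min' (~SL), 'avg' (~AL), or 'max' (~CL)."""
    combine = {"min": min, "max": max,
               "avg": lambda ds: sum(ds) / len(ds)}[strategy]
    g = len(groups)
    out = [[0.0] * g for _ in range(g)]
    for a in range(g):
        for b in range(a + 1, g):
            ds = [dist[i][j] for i in groups[a] for j in groups[b]]
            out[a][b] = out[b][a] = combine(ds)
    return out

def rank_matrix(d):
    """Ranks for the next SCG iteration: row i lists groups nearest-first."""
    return [sorted(range(len(d)), key=lambda j: d[i][j]) for i in range(len(d))]

# Usage: two first-iteration clusters become objects for the next iteration.
dist = [[0, 1, 4, 5],
        [1, 0, 3, 6],
        [4, 3, 0, 1],
        [5, 6, 1, 0]]
d2 = group_distances(dist, [[0, 1], [2, 3]], "max")
assert d2[0][1] == 6                 # CL-like: most distant cross-group pair
assert rank_matrix(d2)[0] == [0, 1]  # each group is nearest to itself
```

Running SCG on the new rank matrix and repeating gives the iterative scheme; the choice of strategy controls how quickly clusters coalesce, and hence how many false positives accumulate per iteration.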

Concluding remarks

We studied a clustering method that has no restriction on the size or the number of clusters. We designed two improvements

Comparing SCG to other clustering methods demonstrated that SCG is very conservative. In the simulation results, SCG did not yield any false positives. Because of its stringency, SCG can be used to validate the correctness of a distance metric in addition to clustering objects into groups. Moreover, SCG formed very accurate groups of protein structures, indicating its potential applicability to other biological data, such as microarray expression data and genome-wide association studies.

Competing interests

There are no competing interests.

Authors' contributions

NVG conceived and supervised the study; BK and BC designed the algorithms; BK conducted the computational experiments and analyzed the results; all authors participated in the preparation of this manuscript.

Appendix

Pseudocode

Acknowledgements

The authors thank Jeremy Semeiks and Dr. R. Dustin Schaeffer for their helpful suggestions.

This article has been published as part of