Automatic classification of protein structures relying on similarities between alignments

Santini, Guillaume; Soldano, Henry; Pothier, Joël

doi:10.1186/1471-2105-13-233

Research article
Open access
Published: 14 September 2012

Guillaume Santini¹, Henry Soldano¹,² & Joël Pothier²

BMC Bioinformatics volume 13, Article number: 233 (2012)

Abstract

Background

Identification of protein structural cores requires isolating sets of proteins that all share the same subset of structural motifs. In the context of an ever-growing number of available 3D protein structures, standard automatic clustering algorithms must be adapted to identify such sets of proteins efficiently.

Results

Two 3D structures are stated as similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented in a graph of similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Classifying proteins into structural families can therefore be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may include in the same cluster a subset of 3D structures that do not share a common substructure. To overcome this drawback, we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three protein structures involved do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and yields a reduced graph in which no ternary constraint is violated. Our approach is thus first to build a graph of similarities, then to reduce the graph according to the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. This method was used for classifying ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignments.

Conclusions

We show that filtering similarities by applying ternary similarity constraints prior to a standard graph-based clustering process i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph-based clustering algorithms with respect to the reference classification SCOP.

Background

During the past decade, the databases of protein sequences have grown exponentially, reaching several million entries, while 3D protein structure databases grew quadratically so as to reach, for the Protein Data Bank (PDB) [1], ∼30000 non-redundant structures sharing less than 90% sequence identity. Various methods exist for assigning a structure, and then a function, to as many new sequences as possible. When a sequence is similar enough to the sequence of one or more known 3D structures, methods based on homology modeling give satisfying results. When sequence similarity falls in the “twilight zone”, i.e. under 30% of sequence identity, one has to resort to other methods. Among those, threading methods take advantage of available 3D structures to infer a 3D structure from a new sequence. Using statistical filters parametrized on a library of structural cores, i.e. a bank of invariant structural motifs of protein families, they correlate 1D (i.e. sequential) and 3D information. In such a context, the predictive ability of the threading method directly depends on the representativeness and exhaustiveness of the core library. Such a library can be built upon a set of representative structures taken from expert structural classifications [2, 3] such as SCOP [3] and CATH [4]. However, because the data require careful manual inspection, these expert classifications have difficulty coping with the growing number of newly determined protein structures. For instance, since the last version of SCOP (1.75), the total number of non-redundant protein chains in the PDB has grown by about 21%, from 10417 to 12643 (VAST [5] non-redundant set for a BLAST p-value of 10⁻⁷, available at ftp://ftp.ncbi.nih.gov/mmdb/nrtable/). Hence automatic and fast clustering approaches become necessary.

Over the past decade there have been many attempts at developing automatic classification procedures, mainly by applying supervised classification methods that use part of a reference classification as labels for known 3D structures. Jain and Hirst [6] proposed such a supervised machine learning (ML) algorithm, based on random forests, to learn how to classify a new 3D structure in a SCOP family. In their approach, a 3D structure is described using a set of global structural descriptors composed from four to six secondary structural elements (SSEs) for protein domains. However, supervised classification methods depend heavily on the reference classification, whose labels are fixed, and therefore only partially address the problem of automatic classification of 3D structures.

Røgen and Fain [7] suggested an unsupervised approach using a description of protein structures derived from knot theory in order to describe the compared structures globally. Zemla et al. [8] proposed a similarity scoring function that aims at automatically identifying local and global structurally conserved regions in order to drive a clustering algorithm. Sam et al. [9] investigated varieties of tree-cutting strategies and found some irreducible differences between the best possible automatic partitions and SCOP classifications. These results have been confirmed by the work of Pascual-Garcia et al. [10], who investigated the non-transitivity of objective structural similarity measures: a protein A can be found similar to another protein B, the protein B can be found similar to a third protein C, and still proteins A and C may share no similarity. They have shown that non-transitivity, which does occur at low similarity levels, leads to non-uniqueness of the partition resulting from the clustering process. For fine granularity, i.e. high similarity levels, structural transitivity is satisfied with few violations within a given cluster and different classification procedures converge to the same partition. For coarser granularities, i.e. lower similarity levels, the similarity measures are computed on distorted and divergent 3D motifs, so partitioning the set of structures requires choosing which transitivity violations should be ignored. Depending on these choices classifications may differ significantly.

Furthermore, such similarity-based classification procedures of 3D structures only consider a single overall pairwise similarity measure or score, derived from local similarities, and do not make use of the detailed mapping of similar parts computed during the alignment process. As a consequence, these procedures, ignoring the mapping information, may produce clusters of proteins that do not all share a common motif. This point is illustrated further in the Simple case studies section. Prior to running a graph-based clustering process, we therefore propose to make use of the mapping information in ternary similarity constraints applied to triples of structures. Our experiments compare the agreement between automatic classifications, obtained with and without that preliminary processing, and the SCOP reference classification.

First we need to use the similarity degree between two protein structures in order to build a graph of similarities whose vertices are protein structures and whose edges correspond to similarities exceeding a given threshold. Such a graph can be directly given as an input to a graph-based clustering process. However, our proposal is to use the mapping information for defining similarities between protein alignments as follows. Let us define an alignment between two proteins A and B as a one-to-one mapping of (sub)parts of A onto (sub)parts of B. A similarity between two alignments is thus defined if the two alignments share a common sequence. More precisely, the alignment between protein A and protein B and the alignment between protein B and protein C are stated as similar if the (sub)parts of B involved in both alignments constitute a significant part of at least one of the two alignments. In other words, we consider a ternary similarity between A, B and C, centered on B, and such a ternary similarity is stronger if the regions of B involved in its similarity with A are also involved in its similarity with C. The aim of the preprocessing step is then to require that, whenever there is an edge between proteins A and B and an edge between proteins B and C, the ternary similarity centered on B, quantifying the common part shared by the three proteins, be high enough. In that case we state that the ternary constraints are satisfied. The preprocessing step then consists in reducing the original graph to a graph satisfying the ternary constraints.

To summarize, the method, briefly introduced in [11], starts with building a graph of 3D structures whose edges represent pairwise similarities. That graph is first transformed into its line graph, which represents the adjacencies between the graph edges. Applying the ternary constraints results in eliminating some vertices of the line graph. A maximal line graph is then extracted from the resulting graph. The graph of 3D structures corresponding to this maximal line graph now satisfies the ternary constraints: every triple of linked proteins corresponds to a significant structural motif. In our experiments, MCL [12], a graph clustering algorithm previously applied with success to the large-scale clustering of protein sequences into families [13], is used for achieving the final classification. That classification is then compared to the expert classification SCOP at the finest granularity, i.e. the SCOP “Family level”. We also experiment with a standard clustering method, suited for applications involving a large and unknown number of clusters, the preprocessing step being also applied in these experiments.

Definitions

In this work, as in [11], a protein structure is identified with an item $o$. Each item is defined as a set of parts $o = \{p_{i}\}$. Here each part $p_{i}$ represents a structural unit defined by a sequence of one or more amino acids. We first define the similarity of two parts by comparing their distance to a threshold.

Items and similarities

Definition 1 (Similarity of item parts)

Let $p_{i}$ and $p_{i}^{'}$ be parts of two different items, $D (p_{i}, p_{i}^{'})$ be a distance between parts, and $T_{P}$ a distance threshold defined on the distance range. We define $simP (p_{i}, p_{i}^{'})$, the similarity of item parts $p_{i}$ and $p_{i}^{'}$, as follows:

$simP (p_{i}, p_{i}^{'})$ is True iff $D (p_{i}, p_{i}^{'}) \leq T_{P}$ .

We also suppose that we have a mapping function M that maps subsets of items parts into a one-to-one correspondence. For protein sequences such a function is an alignment algorithm. Two items are then considered as similar if they have enough parts in common.

Definition 2 (Similarity of two items)

Let $O$ be a set of items and $M$ a mapping function. Let $o$ and $o'$ be two items, and $M(o, o')$ be the set of pairs of parts of $o$ and $o'$ in one-to-one correspondence. Then the symmetric similarity $simO(o, o')$ between items $o$ and $o'$ is defined as follows:

$simO(o, o')$ iff $Card(M(o, o')) \geq T_{O}$, where $T_{O}$ is a given threshold.

Elements of $M(o, o')$ are denoted as mapped pairs. We now define a ternary similarity relation over triples of items.

Definition 3 (Centered ternary similarity of items)

Let $(o', o, o'')$ be three items such that $simO(o', o)$ and $simO(o, o'')$ are true, and let $P_{o' o''}(o)$ be the subset of parts of $o$ related to both $o'$ and $o''$, i.e., $P_{o' o''}(o) = \{p \mid (p, p') \in M(o, o') \text{ and } (p, p'') \in M(o, o'')\}$. Then $sim3(o', o, o'')$, the ternary similarity centered on $o$, is defined as follows:

$sim3(o', o, o'')$ iff $Card(P_{o' o''}(o)) \geq T \times \min(Card(M(o, o')), Card(M(o, o'')))$, where the ternary similarity threshold $T$ lies in the range 0 to 1.
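Definitions 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the representation of an alignment as a dictionary keyed by the parts of the central item, the function names `sim_o`, `shared_parts` and `sim3`, and the toy alignments are all assumptions made for the example, not part of the method's actual implementation.

```python
from typing import Dict, Hashable, Set

# Hypothetical representation: an alignment M(o, x) is stored as a dict that
# maps each aligned part of the central item o to its partner part in x.
Alignment = Dict[Hashable, Hashable]


def sim_o(mapping: Alignment, t_o: int) -> bool:
    """Definition 2: two items are similar iff Card(M) >= T_O."""
    return len(mapping) >= t_o


def shared_parts(m_left: Alignment, m_right: Alignment) -> Set[Hashable]:
    """P_{o'o''}(o): parts of o mapped in both M(o, o') and M(o, o'')."""
    return set(m_left) & set(m_right)


def sim3(m_left: Alignment, m_right: Alignment, t: float) -> bool:
    """Definition 3: Card(P) >= T * min(Card(M(o,o')), Card(M(o,o'')))."""
    smallest = min(len(m_left), len(m_right))
    return len(shared_parts(m_left, m_right)) >= t * smallest


# Toy data (invented for illustration): parts of o are residue positions.
m_o_o1 = {1: 11, 2: 12, 3: 13, 4: 14}   # M(o, o')
m_o_o2 = {3: 23, 4: 24, 5: 25}          # M(o, o'')
```

With these toy alignments, two parts of $o$ (positions 3 and 4) occur in both mappings and the smaller alignment has three mapped pairs, so the test passes for $T = 0.5$ and fails for $T = 0.7$.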

We note and exemplify hereunder that the notion of ternary similarity should not be confused with the notion of transitivity, which only depends on the graph of similarities, i.e. on binary relations. As an example, we consider the case of three items, pairwise linked, i.e. forming a clique, and highlight a case in which none of the three centered ternary similarities exceeds the ternary similarity threshold.

Property 1 (Cliques may fail to satisfy centered ternary similarity)

Here is a counterexample. Let $o = \{p_{i}, p_{j}\}$, $o' = \{p_{i}, p_{k}\}$ and $o'' = \{p_{j}, p_{k}\}$, such that $M(o, o') = \{p_{i}\}$, $M(o', o'') = \{p_{k}\}$ and $M(o, o'') = \{p_{j}\}$. Assuming that $T_{O} = 1$, we obtain that $\{o, o', o''\}$ is a 3-clique, and therefore similarity is transitive. Nevertheless $sim3(o, o', o'')$ is False, $sim3(o', o, o'')$ is False and $sim3(o, o'', o')$ is False for any threshold $T > 0$, and therefore all ternary constraints are violated.
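The counterexample can be checked term by term against Definition 3. The short derivation below assumes that the mapped parts of each pair are exactly the parts the two items have in common, i.e. $o \cap o' = \{p_i\}$, $o \cap o'' = \{p_j\}$ and $o' \cap o'' = \{p_k\}$:

```latex
\begin{aligned}
P_{o' o''}(o)  &= \{p_i\} \cap \{p_j\} = \emptyset, \\
P_{o\, o''}(o') &= \{p_i\} \cap \{p_k\} = \emptyset, \\
P_{o\, o'}(o'') &= \{p_j\} \cap \{p_k\} = \emptyset, \\
\mathrm{Card}(\emptyset) = 0 &< T \times \min(1, 1) = T \quad \text{for any } T > 0 .
\end{aligned}
```

Each centered test therefore fails, although every pairwise test $Card(M) \geq T_O = 1$ succeeds.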

Graph model

Similarities between items are encoded as edges in an undirected graph G whose vertices are identified to items, and whose edges represent similarities between pairs of items. Conventional notations are those of [14].

Definition 4 (Graph of item similarities)

The graph $G$ of item similarities with respect to the above notions of pairwise similarities on a set $O$ of items is defined as follows:

$G = (O, E)$ where $V(G) = O$ and $E(G) = E = \{(o_{i}, o_{j}) \in O^{2} \mid simO(o_{i}, o_{j}) \text{ is True}\}$.

Definition 5 (Independent connected components)

A connected component of G is a subgraph of G in which any pair of vertices is connected through a path. Correlatively, independent connected components, named ICCs, are two subgraphs of G for which there is no path between any node of one component and any node of the other component.

Now we introduce a useful equivalent representation of G as a line graph whose definition is recalled here.

Definition 6 (The line graph of a graph)

Let $G = (O, E)$ be a graph. Its line graph is defined as $L(G) = (E, F)$ where $F = \{(e_{i}, e_{j}) \in E^{2} \mid e_{i} \text{ adjacent to } e_{j} \text{ in } G\}$.
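As an illustration of Definition 6, the following sketch builds the line graph of a small simple graph given as an edge list. The function name `line_graph`, the adjacency-dictionary representation and the vertex labels are assumptions chosen for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple

Edge = FrozenSet[Hashable]


def line_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Edge, Set[Edge]]:
    """Return L(G) as an adjacency dict keyed by the edges of G."""
    vertices = [frozenset(e) for e in edges]
    adjacency: Dict[Edge, Set[Edge]] = {e: set() for e in vertices}
    for e1, e2 in combinations(adjacency, 2):
        if e1 & e2:  # the two edges of G share an endpoint
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


# Path A - B - C plus a separate edge D - E (labels are made up).
lg = line_graph([("A", "B"), ("B", "C"), ("D", "E")])
```

Here the vertices of `lg` are the three edges of G; (A, B) and (B, C) are linked because they share B, while (D, E) stays isolated. The pairwise loop is quadratic in the number of edges, which is adequate for a sketch but not tuned for large graphs.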

The line graph transformation is bijective if node labels are known and has the following property:

Property 2

The connected components of G and of L(G) are in a one-to-one correspondence.

Indeed, given two ICCs $g_{i}$ and $g_{j}$ of $G$, according to definition 5 there is no edge linking a node of $g_{i}$ with a node of $g_{j}$. Consequently, by construction, no edge of $g_{i}$ can be adjacent to any edge of $g_{j}$. Then, according to definition 6, there is no edge between $L(g_{i})$ and $L(g_{j})$. The converse can easily be inferred.

Our purpose is to modify L(G) in order to satisfy the constraints derived from centered ternary similarities. Such modification relies on the following properties:

Property 3 (Line graph)

1. A vertex of $L(G)$ is an edge of $G$.
2. Two connected vertices of $L(G)$ correspond to two adjacent edges of $G$: if $(o', o)$ and $(o, o'')$ are two edges of $G$, the corresponding edge of $L(G)$ is $(o', o) - (o, o'')$.
3. Removing a vertex from a line graph $L(G)$ leads to the line graph of the subgraph of $G$ obtained by removing the corresponding edge from $G$.

From property 3 and definition 3, the centered ternary similarity can be checked on every L(G) edge as such an edge links two vertices representing two similarities sharing a common item.

Measures

In order to compare two classifications we use standard comparison measures of classification similarity. More precisely, let $P = \{P_{1}, P_{2}, \ldots, P_{n}\}$ be a partition of the set of items $O$; two items $o_{k} \in P_{i}$ and $o_{l} \in P_{j}$ are said to be co-classified iff $P_{i} = P_{j}$.

Let $P$ be a reference partition and $P'$ be another partition of the same set of items $O$ obtained by a classification procedure. We denote as TP the number of pairs of items co-classified in both $P$ and $P'$, as FN the number of pairs of items co-classified in the reference partition $P$ but not in $P'$, and as FP the number of pairs of items co-classified in the partition $P'$ but not in $P$.

The Precision and Recall of the partition $P'$ with respect to the reference partition $P$ are defined as follows:

$\mathrm{Recall}^{P}(P') = \frac{TP}{TP + FN}$,

$\mathrm{Precision}^{P}(P') = \frac{TP}{TP + FP}$.

$\mathrm{Recall}^{P}(P')$ measures the ability of the classification procedure to co-classify item pairs that are co-classified in the reference partition $P$ (ability to retrieve all the positives). $\mathrm{Precision}^{P}(P')$ measures the accuracy with which the classification procedure co-classifies item pairs according to the reference classification $P$ (ability to provide a correct prediction when predicting a positive).

The Jaccard similarity coefficient [15] is defined as follows:

$Jaccard (P, P^{'}) = \frac{TP}{TP + FN + FP}$

It is a measure of concordance between two partitions of a same set of items very similar to the F-measure. When negatives are much more numerous than positives, this measure has the advantage - over measures such as MCC (Matthews correlation coefficient) and plain accuracy - of not taking into account over-represented True Negatives. As a result, variations of concordance are easier to detect.
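The pair-counting measures above can be sketched as follows. The function names `pair_counts` and `scores`, the label-dictionary representation of a partition, and the two toy partitions are assumptions introduced for illustration; returning 0.0 for an empty denominator is also a convention chosen here, not one stated in the text.

```python
from itertools import combinations
from typing import Dict, Hashable, Tuple


def pair_counts(reference: Dict[Hashable, Hashable],
                predicted: Dict[Hashable, Hashable]) -> Tuple[int, int, int]:
    """Count (TP, FN, FP) over all unordered pairs of items.

    Both arguments map each item to the label of its class."""
    tp = fn = fp = 0
    for a, b in combinations(list(reference), 2):
        same_ref = reference[a] == reference[b]
        same_pred = predicted[a] == predicted[b]
        if same_ref and same_pred:
            tp += 1
        elif same_ref:
            fn += 1
        elif same_pred:
            fp += 1
        # pairs separated in both partitions are true negatives: ignored
    return tp, fn, fp


def scores(reference, predicted) -> Tuple[float, float, float]:
    """Return (precision, recall, jaccard); 0.0 when a denominator is empty."""
    tp, fn, fp = pair_counts(reference, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fn + fp) if tp + fn + fp else 0.0
    return precision, recall, jaccard


# Toy partitions of five items (invented labels).
reference = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
```

In this toy case item `c` is moved to the wrong class, which gives TP = 2, FN = 2 and FP = 2, hence a precision and recall of 0.5 and a Jaccard coefficient of 1/3. The four true-negative pairs play no role in any of the three scores.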

Simple case studies

As previously mentioned [10], similarity relations between protein structures belonging to the same class show high values and are considered to be almost transitive, i.e. whenever $o_{1}$, $o_{2}$ and $o_{3}$ belong to a given class, we should have $simO(o_{1}, o_{2}) \wedge simO(o_{1}, o_{3}) \wedge simO(o_{2}, o_{3}) = True$. According to our graph formalism, these three items are represented by a 3-clique in $G$ (cf. Figure 1-a). Clustering strategies such as the search for max-cliques should then allow identifying classes of proteins sharing a similar set of structural motifs but, as the following cases show, this is not the case.

For the sake of clarity, the definition of item similarity used in the first two case studies is simpler than definitions 1 and 2: two items are stated as similar when they share at least one identical part.

Case 1: Non transitive Graph G and no common sub parts

In Figure 2-a, considering items $o_{1} = \{p_{1}\}$, $o_{5} = \{p_{1}, p_{2}\}$ and $o_{8} = \{p_{2}\}$, we have $simO(o_{1}, o_{5})$ by part $\{p_{1}\}$ and $simO(o_{5}, o_{8})$ by part $\{p_{2}\}$. An item such as $o_{5}$, made of two subparts ($o_{5} = \{p_{1}, p_{2}\}$), is denoted as a modular item. Though $o_{5}$ similarities such as $(o_{1}, o_{5})$ and $(o_{5}, o_{8})$ are adjacent in $G$ (Figure 2-b), they represent different local similarities: edge $(o_{1}, o_{5})$ represents part $p_{1}$ and $(o_{5}, o_{8})$ represents $p_{2}$. A modular item can be considered as a linker between two or more classes: it is similar, and thus connected, to any member of class 1, the items comprising part $p_{1}$ (class 1 $= (o_{1}, o_{2}, o_{3}, o_{4}, o_{5})$), and to any member of class 2, the items comprising part $p_{2}$ (class 2 $= (o_{5}, o_{6}, o_{7}, o_{8})$). Consequently its degree is higher than those of its neighbors, which are only linked to members of a single class. Due to their higher degree, modular items act as a kind of “attractor” during clustering processes. Consequently, immediate neighbors from different classes tend to form a single class around the modular item, grouping together items having nothing in common (for example $o_{1}$ and $o_{8}$). Thus, in such a context, a direct search for the most connected components of $G$ does not seem appropriate.
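The attractor effect of the modular item can be reproduced on a toy version of Case 1. In the sketch below only $o_1$, $o_5$ and $o_8$ are taken from the text; the contents of the other items are assumed from the class memberships (items of class 1 carry only part p1, items of class 2 only part p2), and the variable names are illustrative.

```python
from itertools import combinations

# o1, o5 and o8 follow the text; the contents of the other items are assumed
# from the class memberships given above.
items = {
    "o1": {"p1"}, "o2": {"p1"}, "o3": {"p1"}, "o4": {"p1"},
    "o5": {"p1", "p2"},                      # modular item
    "o6": {"p2"}, "o7": {"p2"}, "o8": {"p2"},
}

# Simplified similarity of the case study: at least one shared part.
edges = {frozenset((a, b)) for a, b in combinations(items, 2)
         if items[a] & items[b]}

# Degree of every item in the resulting graph G.
degree = {name: sum(name in e for e in edges) for name in items}

# o1 and o8 share no part, so they are not linked directly ...
linked = frozenset(("o1", "o8")) in edges
# ... but both are neighbours of the modular item o5.
bridged = (frozenset(("o1", "o5")) in edges
           and frozenset(("o5", "o8")) in edges)
```

Under these assumptions `o5` has degree 7, against 4 for the other members of class 1 and 3 for the other members of class 2, and it is the only path between `o1` and `o8`.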

Case 2: Transitive Graph G and no common sub parts

In Figure 1-b, considering items $o_{1} = \{p_{1}, p_{3}\}$, $o_{2} = \{p_{1}, p_{2}\}$ and $o_{3} = \{p_{2}, p_{3}\}$, we have $simO(o_{1}, o_{2})$ due to part $\{p_{1}\}$, $simO(o_{2}, o_{3})$ due to part $\{p_{2}\}$ and $simO(o_{1}, o_{3})$ due to part $\{p_{3}\}$. Here transitivity exists at the similarity graph level: $o_{1}$, $o_{2}$ and $o_{3}$ constitute a clique. Nevertheless, considering similarities at the local level of shared sub parts, there is no transitivity, as no sub part is shared by all three items. This case shows that even if transitivity is assumed at the graph level for a set of items, nothing ensures the occurrence of a set of subparts common to all items. Therefore a direct search for max-clique components of $G$ does not seem appropriate.

Case 3: Non transitive Graph G and common sub parts

Similarity measures used for comparing modular and fuzzy motifs must be tolerant enough to take into account the flexibility and the divergence of the compared items, as in Yakusa [16], the algorithm used here for identifying, selecting and mapping similar 3D protein structures. As shown in Figure 1-c, with such a measure some similarities judged not significant under a user-defined selection threshold may be rejected even when a sub-part is found similar. Again, for the sake of clarity, the definition of item similarity in the following case study is simplified. Two items are considered as similar if more than 50% of the parts of the shortest item are mapped to sub-parts of the second item. Considering items $o_{1} = \{p_{1}, p_{2}, p_{3}\}$, $o_{2} = \{p_{1}, p_{2}, p_{3}, p_{4}\}$ and $o_{3} = \{p_{3}, p_{4}\}$, we have $simO(o_{1}, o_{2})$ and $simO(o_{2}, o_{3})$ but not $simO(o_{1}, o_{3})$, which corresponds to a non-transitive case at the graph level with the occurrence of a sub-part $p_{3}$ common to all items $o_{1}$, $o_{2}$ and $o_{3}$. In such a case, the search for max-cliques is not well suited.

Method

Use of ternary similarities

These case studies emphasize some difficulties encountered by classical graph clustering approaches in grouping together modular items in classes where all items share a common set of parts. Searching for max-cliques, i.e. sets of items with transitive relations in graph G, is neither adequate (cf. Case 2), as transitive relations in the graph may occur between items sharing no common subparts, nor necessary (cf. Case 3), as items whose relations are not transitive in the graph may share a common set of sub-parts. Searching for the most connected components (cf. Case 1) while considering all links of the initial graph G is not appropriate either, as some highly connected items may force the union of two significantly different classes.

These drawbacks could be corrected by searching for a maximal subgraph $G_{T}$ of $G$ in which the ternary similarity constraint is verified, before applying any classical connectivity-based clustering approach. Indeed, as described later, applying the ternary similarity constraint tends to reduce the connectivity between items not sharing the same set of subparts (Cases 1 and 2) and to preserve links of interest (Case 3), increasing their relative connectivity in the context of the modified graph $G_{T}$.

Applying ternary similarity constraint

Let $L(G) = (E, F)$ be the line graph of $G = (O, E)$. From property 3, each edge $((o', o), (o, o''))$ of $L(G)$ links two similarities having one item in common and can be submitted to the ternary similarity test. The edges of $L(G)$ are then divided into the subset $F_{T}$ of $F$ whose elements satisfy the ternary constraints and the subset $\bar{F_{T}}$ whose elements will be marked:

$F_{T} = \{((o', o), (o, o'')) \in F \mid sim3(o', o, o'') \text{ is True}\}$,
$\bar{F_{T}} \cup F_{T} = F$.

The graph of pairs $P_{T}$ is obtained by deleting marked edges from $L(G)$:

$P_{T} = (E, F_{T})$, i.e. $P_{T} = L (G) - \bar{F_{T}}$.

The modified graph $P_{T}$ is no longer homomorphic to a line graph, i.e. there is usually no graph $G'$ such that $P_{T} = L(G')$. The bijection established by the line graph transformation between $L(G)$ and $G$ is broken by the introduction of the ternary similarity constraints. We now search for a maximal line graph $L(G_{T})$ that is a subgraph of $P_{T}$. As the edges of $L(G_{T})$ are also edges of $P_{T}$, the ternary relations for the corresponding items $(o', o, o'')$ will necessarily hold in $G_{T}$. For that purpose a greedy heuristic ℋ eliminates vertices of $L(G)$ until it finds a subgraph, with no marked edges, corresponding to the line graph $L(G_{T})$ of some subgraph $G_{T}$ of $G$ (cf. property 3.3).

Heuristic for selecting a subgraph of L(G) homomorphic to a line graph with no marked edges

Let $N_{T}$ be the marked subgraph of $L(G)$, i.e. $N_{T} = L(G) - P_{T}$ and $E (N_{T}) = \bar{F_{T}}$. Let us recall that $L(G) - E'$, where $E' \subseteq E$, is the subgraph of $L(G)$ induced by $E \setminus E'$ (it contains all edges of $L(G)$ that join two vertices in $E \setminus E'$). We search for some minimal subset $E_{T}$ of $N_{T}$ vertices such that $L(G) - E_{T}$ contains no marked edges and therefore, following property 3.3, corresponds to the line graph of some maximal subgraph $G_{T}$ of $G$.

Removing first the vertices of $N_{T}$ with maximal degree maximizes the number of marked edges, i.e. edges keeping the graph away from a line graph, that are eliminated per deleted vertex. As minimizing $E_{T}$ is equivalent to maximizing $L(G_{T})$, it is also equivalent to maximizing $G_{T}$. This subgraph of $G$ both fulfills the ternary similarity constraint and tends to be maximal.

1/ $N \leftarrow N_{T}$ // initializes N as the marked subgraph of L(G) //

2/ $E_{T} \leftarrow \emptyset$ // initializes the set of vertices to be removed //

3/ while $E(N) \neq \emptyset$: // still some marked edges //

// identification of the N vertices of maximal degree //

4/ $\Delta(N) \leftarrow$ the maximal degree among N vertices,

5/ $E_{d} \leftarrow \{e \mid e \in V(N) \text{ and } \deg_{N}(e) = \Delta(N)\}$

6/ $N \leftarrow N - E_{d}$

7/ $E_{T} \leftarrow E_{T} \cup E_{d}$ // iterative definition of the $E_{T}$ vertex set //
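The seven steps of the heuristic can be transcribed almost literally into Python. The sketch below is one possible reading of the listing, assuming the marked subgraph is given as a list of marked L(G) edges; the function name `vertices_to_remove` and the toy inputs are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def vertices_to_remove(marked_edges: Iterable[Tuple[Hashable, Hashable]]) -> Set[Hashable]:
    """Greedy choice of L(G) vertices (edges of G) covering all marked edges."""
    # Step 1: N is the marked subgraph, stored as an adjacency dict.
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in marked_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    removed: Set[Hashable] = set()                      # step 2: E_T
    while any(adj.values()):                            # step 3: marked edges remain
        top = max(len(nbrs) for nbrs in adj.values())   # step 4: Delta(N)
        batch = {v for v, nbrs in adj.items() if len(nbrs) == top}  # step 5: E_d
        for v in batch:                                 # step 6: N <- N - E_d
            for w in adj.pop(v):
                if w in adj:
                    adj[w].discard(v)
        removed |= batch                                # step 7: E_T <- E_T U E_d
    return removed


# Toy marked subgraphs; vertex names stand for edges of G and are invented.
star = [("e0", "e1"), ("e0", "e2"), ("e0", "e3")]
single = [("a", "b")]
```

On the star, removing the hub `e0` alone clears all three marked edges. On a single marked edge both endpoints have the same degree and are removed together, since step 5 takes every vertex of maximal degree at once; a variant that breaks ties by removing one vertex at a time would delete fewer vertices in such cases.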

Material

The SCOP database is an expert classification of structures of protein domains. It is used as a source of data for our clustering studies and as the reference classification to which the classes formed by the clustering procedure are compared.

SCOP offers a hierarchical classification organized as a 6-level tree. Protein domains are successively divided into “Classes”, “Folds”, “SuperFamilies” and “Families”. The leaves of the tree are the protein domains. In this study automated classifications are compared to the finest-grained SCOP level: a group of protein domains belonging to the same SCOP Family is considered as a SCOP cluster.

The set of items is taken from the 3D structures of protein domains in the SCOP database [3]. Of the 488,567 available domain structures, we restrict our search to a non-redundant subset made of the 10,569 SCOP domain representatives exhibiting less than 40% sequence identity, i.e. the ASTRAL_40 data set (version 1.75) [17].

The mapping of two objects is performed by the YAKUSA software [16]. The program searches for the longest common similar substructures between the query structure and every structure in the structural database. Such common substructures consist of amino acids of proteins $o$ and $o'$ and are represented by the mapped parts $M(o, o')$.

The set of protein pairs showing a YAKUSA z-score greater than or equal to $T_{O} = 7.0$ defines the edges $E$ of our graph $G$.

Before applying the graph modification method we remove all the isolated proteins (proteins not similar to any other protein of the database), i.e. we remove all objects o such that deg(o) = 0. We then obtain the graph G=(O,E) representing the pairwise similarities between 6606 items (proteins) encoded in 18199 edges (cf. Figure 3). Items are grouped into 856 connected components, with a large component containing 2901 items (cf. Figure 4), which gives an initial coarse-grained clustering.
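The construction just described (thresholding the z-scores, dropping isolated items, and reading off the connected components) can be sketched as follows. The function name `similarity_components`, the triple-based input format and the scores are invented for illustration and are not YAKUSA output.

```python
from typing import Dict, Iterable, List, Set, Tuple


def similarity_components(scored_pairs: Iterable[Tuple[str, str, float]],
                          t_o: float = 7.0) -> List[Set[str]]:
    """Connected components of the graph whose edges have z-score >= t_o."""
    adj: Dict[str, Set[str]] = {}
    for a, b, z in scored_pairs:
        if z >= t_o:                 # items with no such edge never enter adj
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                 # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Invented scores, not YAKUSA output.
pairs = [("d1", "d2", 9.1), ("d2", "d3", 7.0), ("d4", "d5", 8.3), ("d6", "d7", 3.2)]
comps = similarity_components(pairs)
```

With these toy scores the pair (d6, d7) falls below the threshold, so both items are isolated and dropped, leaving two components: {d1, d2, d3} and {d4, d5}.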

Results

Clustering effect of the modification graph process

In order to test the method, G was submitted to the modification process using different values of the ternary similarity threshold T, ranging from T=0.05 to T=0.95 in steps of 0.05.

The heuristic ℋ selecting the vertices $E_{T}$ to be removed from $P_{T}$ can potentially select any vertex $(o_{i}, o_{j})$. If $(o_{i}, o_{j})$ is the only vertex where item $o_{i}$ appears, deletion of $(o_{i}, o_{j})$ leads to removal of item $o_{i}$. As $G_{T}$ is built from the inverse line-graph transformation (every vertex of $P_{T} - E_{T}$ leads to an edge of $G_{T}$), item $o_{i}$ is absent from the vertices of $G_{T}$.

By construction, our graph modification process implies a reduction of the connectivity of $G$. This results from the removal of marked edges ($P_{T} = L (G) - \bar{F_{T}}$) and then of the vertices of $P_{T}$ that kept the graph away from a line graph ($L(G_{T}) = P_{T} - E_{T}$). Removal of vertices from $P_{T}$ corresponds to the removal of edges from $G$ to obtain $G_{T}$. As expected, this loss of connectivity is directly correlated to the value of threshold $T$. Higher values of $T$ lead to a more stringent ternary similarity constraint, and finally to a less connected graph (cf. Figure 3-top).

Moreover, ICCs formed in the building of $P_{T}$ are transferred to $L(G_{T})$ and, from property 2, to $G_{T}$. As shown in Figure 3-bottom and Figure 4, this leads to a pre-partition of the objects. A more stringent ternary similarity constraint leads to more ICCs of smaller size, facilitating the work of the clustering algorithm.

Pre-clustering effect of ternary similarity constraints

Our graph modification process involves two deletion steps. The first step is the suppression of the L(G) edges that fail the centered ternary similarity test. The second step is the removal of L(G) nodes through application of the heuristic ℋ. According to property 3, node removal from L(G) is equivalent to edge removal from G.

In the second step, edge deletion can potentially split an ICC of $G$ into two or more ICCs in $G_{T}$. For a similarity threshold of $T = 0.65$, nine ICCs are split into two or three ICCs. As shown in Figure 5, in eight cases the deleted edge isolates a group of items of the same SCOP Family from items classified differently, showing that application of the ternary similarity constraint tends to separate items that are to be found in different SCOP Families.

One can notice in this Figure that protein domains from different SCOP Classes are linked in G. This is due to the flexibility of the YAKUSA similarity measure. Fortunately, the ternary constraints identify some of these cases and remove such links.

Ternary similarity threshold and 3D structural comparisons

Taken from one of the nine splits presented in Figure 5, Figure 6 illustrates the way the fractional ternary similarity threshold identifies the candidate edges to be deleted in the context of the ternary relation. Considering the three domains d1w0pa2, d1uaia_ and d1j1ta_, pairwise similarities are significant: 73 amino acids are mapped in the alignment (d1w0pa2,d1uaia_) and 75 amino acids are mapped in the alignment (d1uaia_,d1j1ta_). But considering the ternary relation, one considers the overlap of the mapped parts on the common domain d1uaia_, and finds that only 48% (35 aa) of the amino acids are common to both alignments. Therefore, with a threshold T=0.65=65%, the ternary similarity is considered to be not significant (48% < T) and one of the two edges of the ternary relation (d1w0pa2,d1uaia_,d1j1ta_) has to be deleted. Here, the heuristic selects the edge (d1w0pa2,d1uaia_), splitting the ICC into two components in accordance with the SCOP classification.
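The arithmetic of this example follows directly from Definition 3, taking $o$ = d1uaia_ as the central item, $o'$ = d1w0pa2 and $o''$ = d1j1ta_:

```latex
\begin{aligned}
\mathrm{Card}\big(P_{o' o''}(o)\big) &= 35, \qquad \min(73,\, 75) = 73, \\
\frac{35}{73} &\approx 0.48, \\
T \times \min(73,\, 75) &= 0.65 \times 73 = 47.45, \\
35 &< 47.45 \;\Longrightarrow\; sim3(o', o, o'') \text{ is False}.
\end{aligned}
```

For these two alignments the test would only be satisfied for thresholds $T \leq 35/73 \approx 0.48$.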

Classifications granularities

Application of ternary similarity constraints has a clustering effect that takes shared similarities into account. It affects the classes formed by MCL, the main clustering algorithm of our procedure. The granularity of the clustering has been studied for varying thresholds of ternary similarity T and inflation parameter I (cf. Figure 4).

The inflation parameter I is the main MCL parameter that rules the clustering granularity. Lower values of I lead to coarser clustering. Different values of I were tested (I∈[1.2,2.0] by step of 0.1 and I∈[2.0,3.0] by step of 0.2).

As expected, large ICC’s are rapidly split into small clusters when inflation parameter increases as shown in Figure 7-top. The size of the largest clusters formed for low inflation parameters 1.2<I≤1.4 (coarsest granularity) depends directly on the ternary similarity threshold used which rules the granularity of the pre-clustering process. For higher inflation parameters (fine granularity) the sizes of the largest clusters appear to be almost independent from T, and the cluster mean size (Figure 7-bottom) is also independent from T.

Thus, although the reduction of G to G_T changes the clustering of items, the granularity is not significantly affected.

Comparison of the MCL classes to standard expert classifications

We compare the MCL classifications obtained with or without the application of ternary similarity constraints to the reference classification SCOP. This is done by means of Precision/Recall (PR) curves rather than ROC curves because i) the information contained in both curves is largely equivalent [18] and ii) PR curves are usually preferred in a context where the number of negative examples greatly exceeds the number of positive examples, which is the case here.

As shown in Figure 8-left, increasing values of the MCL inflation parameter I (i.e. making smaller clusters) increase the ability to provide a correct prediction when co-classifying two items (cf. Precision), and decrease the ability to retrieve all the positives (cf. Recall). As expected, the recall decreases when the precision increases.
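The trade-off can be illustrated with a small pair-counting sketch. It assumes, as the wording "co-classifying two items" suggests, that a positive is a pair of items placed in the same reference class and a prediction is a pair placed in the same cluster; the exact measures are those defined in the Measures section. The item names, class labels and cluster labels below are hypothetical toy data.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, reference):
    """Precision and recall over item pairs; both arguments map item -> label."""
    true_pos = false_pos = false_neg = 0
    for a, b in combinations(sorted(predicted), 2):
        same_cluster = predicted[a] == predicted[b]
        same_class = reference[a] == reference[b]
        if same_cluster and same_class:
            true_pos += 1
        elif same_cluster:
            false_pos += 1
        elif same_class:
            false_neg += 1
    # Convention chosen here: an undefined ratio is reported as 0.0.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy reference classification and two clusterings of different granularity.
reference = {"a": "F1", "b": "F1", "c": "F1", "d": "F2", "e": "F2"}
coarse_clusters = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1}
fine_clusters = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 2}

coarse_pr = pairwise_precision_recall(coarse_clusters, reference)
fine_pr = pairwise_precision_recall(fine_clusters, reference)
```

On this toy data the coarse clustering gives (precision, recall) = (0.5, 0.75) and the fine one gives (1.0, 0.5), mirroring the behaviour described for increasing I.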

In contrast, for increasing values of the threshold T (triplets must share higher similarities), precision increases but, surprisingly, this gain in precision is not correlated with a loss of recall. Indeed, for T in range 0.0-0.6, the recall remains stable up to high values of T=0.8 (corresponding to very high required similarities between triplet alignments). As a consequence, ternary constraints allow the precision to be increased while preserving the recall. As shown in Figure 8-right, the use of ternary similarity can be considered an improvement of the classification (PR curves are shifted toward the upper-right part of the graphic when using increasing values of T).

Choice of the final clustering algorithm

In order to evaluate the real impact of the ternary similarity constraint independently from the choice of the final clustering algorithm, we compared classifications obtained with MCL to those obtained with a standard approach. We used a normalized spectral clustering algorithm [19] with a final k-means clustering initialized with centroids [20] computed from a hierarchical clustering of our data [21].

Neither MCL nor the spectral method tends to form clusters with only one member. As shown in Figure 9, for a number of clusters between 1100 and 1450, close to the number of clusters found in SCOP at the “Family level” and having more than one member (1241), the MCL and spectral clustering algorithms give very similar results, whether or not the ternary constraint is applied. For a number of clusters closest to the real number of represented SCOP Families (1977), the MCL algorithm gives better results and appears to be more robust.

Whatever the final clustering algorithm, Figure 9 highlights the enhancement of the quality of the automated classification procedure (with respect to SCOP reference) introduced by the ternary similarity constraint.

Discussion and conclusions

Classification of objects such as protein structures based on pairwise similarity relations is a classical problem. We have shown the advantages of applying ternary similarity constraints in the clustering process.

The method proposed here is in line with many constrained clustering methods as recently investigated [22]. However, in most of these methods only pairwise constraints are considered: a must-link (ML) constraint states that two objects should be placed in the same cluster, while a cannot-link (CL) constraint states that two objects should not be placed in the same cluster. Constraints acting on groups of objects have also been considered, such as ε-constraints and δ-constraints. However, both can be represented as conjunctions or disjunctions of pairwise constraints. It should be clear that the method proposed here deals with ternary constraints that cannot be represented as any combination of pairwise constraints. Besides, the ternary constraints introduced here concern the initial graph representation of the data: they are not constraints whose satisfaction is required (or maximized) in the clustering result. As a matter of fact, the initial graph representation, by directly linking only nodes that are similar enough, exerts some pairwise constraints on clustering: obviously, two nodes belonging to two different connected components are subject to a CL constraint. This is true for any graph-based clustering approach. In such approaches, the similarity (or distance) matrix defines the initial weighted graph, and edges are then removed until the graph is partitioned. For instance, in [23, 24] a minimum spanning tree (in terms of distances) is computed and then, using some similarity threshold, a forest is obtained. However, for large datasets, starting from a sparse graph by first applying some simple neighborhood criteria, as we do here, is a much more efficient procedure (see for instance [25] about the dependency of clustering results on such sparsification preprocessing). It would be interesting to investigate the use of our ternary constraints on various graph-based clustering schemes, as long as objects are modular. In biology, beyond protein structures, adding ternary constraints would also be relevant for clustering protein sequences using graph-based methods [26].

Authors’ contributions

JP, HS and GS conceived the graph-based algorithm. GS implemented the algorithm and carried out the experiments. All authors read and approved the final manuscript.

References

1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
2. Reeves GA, Dallman TJ, Redfern OC, Akpor A, Orengo CA: Structural diversity of domain superfamilies in the CATH database. J Mol Biol. 2006, 360 (3): 725-741. 10.1016/j.jmb.2006.05.035.
3. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008, 36 (Database issue): D419-D425.
4. Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA: The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 2007, 35 (Database issue): D291-D297.
5. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6: 377-385. 10.1016/S0959-440X(96)80058-3.
6. Jain P, Hirst JD: Automatic structure classification of small proteins using random forest. BMC Bioinformatics. 2010, 11: 364. 10.1186/1471-2105-11-364.
7. Rogen P, Fain B: Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci U S A. 2003, 100: 119-124. 10.1073/pnas.2636460100.
8. Zemla A, Geisbrecht B, Smith J, Lam M, Kirkpatrick B, Wagner M, Slezak T, Zhou CE: STRALCP–structure alignment-based clustering of proteins. Nucleic Acids Res. 2007, 35: e150.
9. Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ: Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinformatics. 2008, 9: 74. 10.1186/1471-2105-9-74.
10. Pascual-Garcia A, Abia D, Ortiz AR, Bastolla U: Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Comput Biol. 2009, 5 (3): e1000331. 10.1371/journal.pcbi.1000331.
11. Santini G, Soldano H, Pothier J: Use of ternary similarities in graph based clustering for protein structural family classification. BCB. Edited by: Zhang A, Borodovsky M, Özsoyoglu G, Mikler AR. 2010, ACM, 457-459.
12. Van Dongen S: Graph clustering via a discrete uncoupling process. SIAM J Matrix Anal Appl. 2008, 30: 121-141. 10.1137/040608635.
13. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30 (7): 1575-1584. 10.1093/nar/30.7.1575.
14. Bollobas B: Modern Graph Theory. 1998, Springer-Verlag, New-York
15. Jaccard P: Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles. 1901, 37: 547-579.
16. Carpentier M, Brouillet S, Pothier J: YAKUSA: a fast structural database scanning method. Proteins. 2005, 61: 137-151. 10.1002/prot.20517.
17. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL compendium in 2004. Nucleic Acids Res. 2004, 32 (Database issue): D189-D192.
18. Davis J, Goadrich M: The relationship between precision-recall and ROC curves. Edited by: Cohen WW, Moore A. 2006, ACM, New-York, 233-240.
19. Ng AY, Jordan MI, Weiss Y: On spectral clustering: analysis and an algorithm. NIPS. Edited by: Dietterich TG, Becker S, Ghahramani Z. 2001, MIT Press, Cambridge, 849-856.
20. Milligan GW: The validation of four ultrametric clustering algorithms. Pattern Recognit. 1980, 12: 41-50. 10.1016/0031-3203(80)90001-1.
21. Ward JH: Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963, 58: 236-244. 10.1080/01621459.1963.10500845.
22. Davidson I, Ravi SS: The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min Knowl Discov. 2007, 14: 25-61. 10.1007/s10618-006-0053-7.
23. Zahn CT: Graph-theoretical methods for detecting and describing gestalt clusters. Trans Comput. 1971, C-20: 68-86.
24. Foggia P, Percannella G, Sansone C, Vento M: A graph-based clustering method and its applications. Proceedings of the 2nd International Conference on Advances in Brain, Vision and Artificial Intelligence. 2007, Springer-Verlag, Berlin, Heidelberg, 277-287.
25. Maier M, von Luxburg U, Hein M: Influence of graph construction on graph-based clustering measures. NIPS. Edited by: Koller D, Schuurmans D, Bengio Y, Bottou L. 2008, MIT Press, Cambridge, 1025-1032.
26. Kawaji H, Takenaka Y, Matsuda H: Graph-based clustering for finding distant relationships in a large set of protein sequences. Bioinformatics. 2004, 20 (2): 243-252. 10.1093/bioinformatics/btg397.

Acknowledgements

Present work is part of the PROTEUS project which received support from ANR-06-CIS (Calcul Intensif et Simulation). Our thanks to Therese Pothier for the English proof reading. …

Author information

Authors and Affiliations

Université Paris 13, Sorbonne Paris Cité, Laboratoire d’Informatique de Paris-Nord (LIPN), CNRS (UMR 7030), Villetaneuse, F-93430, France
Guillaume Santini & Henry Soldano
UPMC, Université Paris 06, Atelier de BioInformatique, F-75005, Paris, France
Henry Soldano & Joël Pothier

Authors

Guillaume Santini
Henry Soldano
Joël Pothier

Corresponding author

Correspondence to Guillaume Santini.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Santini, G., Soldano, H. & Pothier, J. Automatic classification of protein structures relying on similarities between alignments. BMC Bioinformatics 13, 233 (2012). https://doi.org/10.1186/1471-2105-13-233

Received: 21 November 2011
Accepted: 20 August 2012
Published: 14 September 2012
DOI: https://doi.org/10.1186/1471-2105-13-233
