Université Paris 13, Sorbonne Paris Cité, Laboratoire d’Informatique de Paris-Nord (LIPN), CNRS (UMR 7030), Villetaneuse, F-93430, France

UPMC, Université Paris 06, Atelier de BioInformatique, F-75005 Paris, France

Abstract

Background

Identification of protein structural cores requires isolating sets of proteins that all share the same subset of structural motifs. With an ever-growing number of available 3D protein structures, standard automatic clustering algorithms need to be adapted to identify such sets of proteins efficiently.

Results

A pair of 3D structures is declared similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented as a graph of similarities in which a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Classifying proteins into structural families can therefore be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may place in the same cluster a subset of 3D structures that do not share a common substructure. In order to overcome this drawback we first define a

Conclusions

We show that filtering similarities with ternary similarity constraints before a standard graph-based clustering process i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph-based clustering algorithms with respect to the reference classification SCOP.

Background

During the past decade the databases of protein sequences have grown exponentially, reaching several million entries, while 3D protein structure databases grew quadratically so as to reach, regarding the Protein Data Bank (PDB)
^{0−7}available at

Over the past decade there have been many attempts to develop automatic classification procedures, mainly by applying supervised classification methods that use part of a reference classification as labels for known 3D structures. Jain and Hirst

Røgen and Fain
_{
o
k
} and still proteins

Furthermore, such similarity-based classification procedures for 3D structures only consider a single overall pairwise similarity measure or score, derived from local similarities, and do not make use of the detailed mapping of similar parts computed during the alignment process. As a consequence, these procedures, ignoring the mapping information, may cluster together proteins that do not all share a common motif. This point is further illustrated in the Simple case studies section. Then, prior to running a graph-based clustering process, we propose to use the mapping information in ternary similarity constraints applied to triples of structures. Our experiments compare the agreement between the SCOP reference classification and automatic classifications obtained with and without that preliminary processing.

First we need to use the similarity degree between two protein structures in order to build a graph of similarities whose vertices are protein structures and whose edges correspond to similarities exceeding a given threshold. Such a graph can be given directly as input to a graph-based clustering process. However, our proposal is to use the mapping information to define similarities between protein alignments as follows. Let us define an alignment between 2 proteins

To summarize it, the method, shortly introduced in

Definitions

In this work as in
$p_i$}. Here each part $p_i$ will represent a structural unit defined by a sequence of one or more amino acids. We first define the similarity of two parts by comparing their distance to a threshold.

Items and similarities

Definition 1 (Similarity of item parts)

Let _{
p
i
}and
_{
T
P
}a distance threshold defined on the distance range. We define
_{
p
i
}and

•

We also suppose that we have a mapping function

Definition 2 (Similarity of two items)

Let ^{
o
′
}) be two items, and ^{
o
′
}) be the set of pairs of parts of ^{
o
′
}in one-to-one correspondence, then, the ^{
o
′
}) between items

• ^{
o
′
}) iff ^{
o
′
}))≥_{
T
O
}, where _{
T
O
}is a given threshold.

Elements of ^{
o
′
}) are denoted as

Definition 3 (Centered ternary similarity of items)

Let (^{
o
′
},^{
o
′′
}) be three items such that ^{
o
′
},^{
o
′′
}) are true, and _{
P
o
′
,
o′
}(^{
o
′′
}, ^{
p
′′
})∈^{
o
′′
})}. Then _{
m3}(^{
o
′
},^{
o
′′
}), the

_{
m3}(^{
o
′
},^{
o
′′
}) iff _{
P
o
′
,
o′
}(^{
o
′
})),^{
o
′′
}))),where the ternary similarity threshold

We note and exemplify hereunder that the notion of ternary similarity should not be confused with the notion of transitivity, which only depends on the graph of similarities, i.e. on binary relations. As an example, we consider the case of three items, pairwise linked, i.e. forming a clique, and highlight a case in which none of the three centered ternary similarities exceeds the ternary similarity threshold.

Property 1 (Cliques can not satisfy centered ternary similarity)

Here is a counterexample. Let (_{
p
i
},_{
p
j
}},^{
o
′
}={_{
p
i
},_{
p
k
}},^{
o
′′
}={_{
p
j
},_{
p
k
}}) such that ^{
o
′
})=_{
p
i
}, ^{
o
′
},^{
o
′′
})=_{
p
j
}and ^{
o
′′
})=_{
p
k
}. Assuming that _{
T
O
}=1 we obtain that {^{
o
′
},^{
o
′′
}} is a 3-clique, and therefore similarity is transitive. Nevertheless _{
m3}(^{
o
′
},^{
o
′′
}) is _{
m3}(^{
o
′
},^{
o
′′
}) is _{
m3}(^{
o
′′
},^{
o
′
}) is

Graph model

Similarities between items are encoded as edges in an undirected graph

Definition 4 (Graph of item similarities)

The graph

• _{
o
i
},_{
o
j
})∈^{
O2}| _{
o
i
},_{
o
j
}) is True}.

Definition 5 (Independent connected components)

A connected component of

Now we introduce a useful equivalent representation of

Definition 6 (The line graph of a graph)

Let _{
e
i
}
_{
e
j
})∈^{
E2}| _{
e
i
} adjacent to _{
e
j
} in

The line graph transformation is bijective if node labels are known and has the following property:

Property 2

The connected components of

Indeed, given $g_i$ and $g_j$, two ICCs of the graph, no edge links a node of $g_i$ with a node of $g_j$. Consequently, by construction, there cannot be adjacency between any edge of $g_i$ and any edge of $g_j$. Then, according to definition 6, there is no edge between the line graph of $g_i$ and the line graph of $g_j$. The converse can easily be inferred.
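The correspondence between connected components of a graph and of its line graph can be checked on a toy example. The following is only a minimal sketch: the function names `line_graph` and `connected_components` are illustrative and not taken from the original work, and an ICC is assumed here to be a connected component containing at least one edge.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected graph given as a list of edges.

    Each vertex of the line graph is an edge (u, v) of the input graph;
    two such vertices are linked when the two edges share an endpoint.
    """
    nodes = [tuple(sorted(e)) for e in edges]
    links = set()
    for e, f in combinations(nodes, 2):
        if set(e) & set(f):
            links.add((e, f))
    return nodes, links


def connected_components(nodes, links):
    """Return the connected components (as sets of nodes) by depth-first search."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


# Toy graph with two components: a-b-c and d-e.
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
lg_nodes, lg_links = line_graph(toy_edges)
lg_components = connected_components(lg_nodes, lg_links)
print(len(lg_components))  # 2
```

In this toy graph the two components of the original graph give two components in the line graph, one made of the vertices `("a", "b")` and `("b", "c")` and one made of `("d", "e")` alone.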

Our purpose is to modify

Property 3

**Line-Graph**

1. A vertex of

2. Two connected vertices of ^{
o
′
},^{
o
′′
}), the corresponding edge of ^{
o
′
},^{
o
′′
}).

3. Removing a vertex in a line-graph

From property 3 and definition 3, the centered ternary similarity can be checked on every

Measures

In order to compare two classifications we use standard comparison measures of classification similarity. More precisely, let _{
P1},_{
P2},…,_{
P
n
}} be a partition of the set of items _{
o
k
}∈_{
P
i
}and _{
o
l
}∈_{
P
j
} are said _{
P
i
}=_{
P
j
}.

Let ^{
P
′
}be an other partition of the same set of items _{
C
p
}and in
^{
P
′
}, and as ^{
P
′
}but not in

The ^{
P
′
}with respect to the reference partition

$\mathrm{Recall}^{P}(P')$ measures the ability of the classification procedure to co-classify item pairs that are co-classified in the reference partition. $\mathrm{Precision}^{P}(P')$ measures the accuracy with which the classification procedure co-classifies item pairs according to the reference classification.

The Jaccard similarity coefficient

It is a measure of concordance between two partitions of the same set of items, very similar to the F-measure. When negatives are much more numerous than positives, this measure has the advantage, over measures such as MCC (Matthews correlation coefficient) and plain accuracy, of not taking into account over-represented True Negatives. As a result, variations of concordance are easier to detect.
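These measures can be illustrated with the usual pair-counting formulation. The sketch below assumes the standard definitions (true positives are pairs co-classified in both partitions, false positives are pairs co-classified only in the evaluated partition, false negatives are pairs co-classified only in the reference); the function names are illustrative.

```python
from itertools import combinations


def co_classified_pairs(partition):
    """Set of unordered item pairs that fall in the same class of a partition."""
    pairs = set()
    for cls in partition:
        pairs.update(frozenset(p) for p in combinations(sorted(cls), 2))
    return pairs


def pair_counting_scores(reference, candidate):
    """Return (precision, recall, jaccard) of `candidate` against `reference`."""
    ref_pairs = co_classified_pairs(reference)
    cand_pairs = co_classified_pairs(candidate)
    tp = len(ref_pairs & cand_pairs)   # co-classified in both partitions
    fp = len(cand_pairs - ref_pairs)   # co-classified only in the candidate
    fn = len(ref_pairs - cand_pairs)   # co-classified only in the reference
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, jaccard


reference = [{1, 2, 3}, {4, 5}]
candidate = [{1, 2}, {3, 4, 5}]
precision, recall, jaccard = pair_counting_scores(reference, candidate)
print(precision, recall)  # 0.5 0.5
```

Here two of the four co-classified pairs agree, so precision and recall are both 0.5 and the Jaccard coefficient is 2/6. True negatives never enter the computation, which is the property discussed above.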

Simple case studies

As previously mentioned
_{
o1}
_{
o2}
_{
o3} belongs to a given class, we should have that _{
o1}
_{
o2})∧_{
o1}
_{
o3})∧_{
o2}
_{
o3}) =

**Similarity transitivity and common sub-motif occurrence.** Description of items ($o_i$) and item parts ($p_i$), and corresponding graph of similarities. **(a)** Transitive case with a set of parts common to all items; **(b)** transitive case with no part common to all items; **(c)** non-transitive case with a part common to all items.

For the sake of clarity, the definition of item similarity for the first two case studies is simpler than definitions 1 and 2: two items are stated as similar when they share at least one identical common part.

Case 1: Non transitive Graph

In Figure, with $o_1=\{p_1\}$, $o_5=\{p_1,p_2\}$ and $o_8=\{p_2\}$, we have: $(o_1,o_5)$ by part $\{p_1\}$ and $(o_5,o_8)$ by part $\{p_2\}$. An item such as $o_5$, made of two subparts ($o_5=\{p_1,p_2\}$), is denoted as a modular item. Though $o_5$ similarities such as $(o_1,o_5)$ and $(o_5,o_8)$ are adjacent, $(o_1,o_5)$ represents part $p_1$ and $(o_5,o_8)$ represents $p_2$. A modular item can be considered as a linker between two or more classes: it is similar, and therefore connected, to any member of class 1 of items comprising part $p_1$ ($o_1,o_2,o_3,o_4,o_5$) and to any member of class 2 of items comprising part $p_2$ ($o_5,o_6,o_7,o_8$). Consequently its degree is higher than those of its neighbors that are only linked to members of a single class. Due to their higher degree, modular items will act as a kind of “attractor” during clustering processes. Consequently, immediate neighbors of different classes will tend to form a unique class around the modular item, grouping together items having nothing in common (for example $o_1$ and $o_8$). Thus, in such a context, direct search of the most connected components from

**Graph modification method.** **(a)** the description of object parts, **(b)** the graph, **(c)** the line graph, **(d)** the graph $P_T$ with marked edges, **(e)** the graph $P_T - E_T$ (the line graph of $G_T$), with vertices $E_T$ removed during the heuristic **ℋ** and their removed incident edges represented in dashed gray, **(f)** the graph $G_T$ fulfilling the ternary similarity.

Case 2: Transitive Graph

In Figure, with $o_1=\{p_1,p_3\}$, $o_2=\{p_1,p_2\}$ and $o_3=\{p_2,p_3\}$, we have $(o_1,o_2)$ due to part $\{p_1\}$, $(o_2,o_3)$ due to part $\{p_2\}$ and $(o_1,o_3)$ due to part $\{p_3\}$. Here transitivity exists at the similarity graph level: $o_1$, $o_2$ and $o_3$ constitute a clique. Nevertheless, considering similarities at the local level of shared subparts, there is no transitivity as no subpart is shared by all three items. This case shows that even if transitivity is assumed at the graph level for a set of items, nothing ensures the occurrence of a set of subparts common to all items. Therefore direct search for max-cliques components from

Case 3: Non transitive Graph

Similarity measures used for comparing modular and fuzzy motifs must be $o_1=\{p_1,p_2,p_3\}$, $o_2=\{p_1,p_2,p_3,p_4\}$ and $o_3=\{p_3,p_4\}$, we have $(o_1,o_2)$ and $(o_2,o_3)$ but not $(o_1,o_3)$, which corresponds to a non-transitive case at the graph level with the occurrence of a sub-part $p_3$ common to all items $o_1$, $o_2$ and $o_3$. In such a case, the search for max-clique is not well suited.
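The first two case studies can be replayed in a few lines, using the simplified similarity stated above (two items are similar when they share at least one part). This is a minimal sketch: the helper names are illustrative, and only the three items spelled out for Case 1 are included.

```python
from itertools import combinations


def similar(parts_a, parts_b):
    """Simplified similarity: two items are similar if they share a part."""
    return bool(parts_a & parts_b)


def similarity_graph(items):
    """Edges of the similarity graph, each edge a frozenset of two item names."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(items), 2)
        if similar(items[a], items[b])
    }


def common_parts(items):
    """Parts shared by every item of the collection."""
    return set.intersection(*(set(parts) for parts in items.values()))


# Case 1: o5 is a modular item linking o1 and o8, which share nothing.
case1 = {"o1": {"p1"}, "o5": {"p1", "p2"}, "o8": {"p2"}}
# Case 2: the three items form a clique without any common part.
case2 = {"o1": {"p1", "p3"}, "o2": {"p1", "p2"}, "o3": {"p2", "p3"}}

edges1 = similarity_graph(case1)
edges2 = similarity_graph(case2)

print(frozenset(("o1", "o8")) in edges1)      # False
print(len(edges2) == 3, common_parts(case2))  # True set()
```

In Case 1, `o1` and `o8` are only connected through `o5`. In Case 2, all three pairwise edges exist although the intersection of the three part sets is empty, which is the situation a purely pairwise graph cannot detect.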

Method

Use of ternary similarities

These case studies emphasize some difficulties encountered by classical graph clustering approaches in grouping together modular items in classes where all items share a common set of parts. Searching max-clique - sets of items with transitive relations in graph

These drawbacks could be corrected by searching a maximal subgraph _{
G
T
} of _{
G
T
}.

Applying ternary similarity constraint

Let ^{
o
′
},^{
o
′′
})) links two similarities having one item in common and can be submitted to the ternary similarity test. The edges of _{
F
T
}of

• _{
F
T
}={((^{
o
′
},^{
o
′′
}))∈_{
m3}(^{
o
′
},^{
o
′′
}) is True}

•

The _{
P
T
}is obtained by deleting marked edges from

•_{
P
T
}=(_{
F
T
}),

The modified graph _{
P
T
}is no more homomorphic to a line graph, ^{
G
′
} such that _{
P
T
}=^{
G
′
}). The bijection established by the line graph transformation between _{
G
T
}) that is a subgraph of _{
P
T
}. As the edges of _{
G
T
}) are also edges of _{
P
T
}, the ternary relations for the corresponding items (^{
o
′
},^{
o
′′
}) will necessarily hold in _{
G
T
}. For that purpose a greedy heuristic **ℋ** eliminates vertices of _{
G
T
}) of some subgraph _{
G
T
} of

Heuristic for selecting a subgraph of

Let _{
N
T
} be the marked subgraph of _{
N
T
}=_{
P
T
} and
^{
E
′
} where ^{
E
′
}⊆^{
E
′
}(^{
E
′
} contains all edges of ^{
E
′
}). We will search for some - minimal - subset _{
E
T
}of _{
N
T
} vertices such that _{
E
T
}contains no marked edges, and therefore, following property 3.3, corresponds to the line-graph of some - maximal - subgraph _{
G
T
} of

Removing first the vertices of $N_T$ showing the maximal degree maximizes the ratio of the number of deleted vertices over the number of edges taking away the graph from a line graph. As minimizing $E_T$ is equivalent to maximizing the line graph of $G_T$, it is also equivalent to maximizing $G_T$. This subgraph of

1/ _{
N
T
}

2/ _{
E
T
}←

3/ **while**

_{
N
T
}

4/

5/ _{
E
d
}←{_{
E
T
}and

6/ _{
E
d
}

7/ _{
E
T
}←_{
E
T
}∪_{
E
d
}
_{
E
T
}
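The greedy selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes the marked edges of the line graph as input, removes one vertex of maximal marked degree per iteration (the original steps may remove several at once), and breaks ties arbitrarily but deterministically. The name `greedy_vertex_removal` is hypothetical.

```python
def greedy_vertex_removal(marked_edges):
    """Greedily pick line-graph vertices until no marked edge survives.

    marked_edges: iterable of pairs (u, v) of line-graph vertices whose
    adjacency failed the ternary similarity test.
    Returns the set of removed vertices (the role played by E_T in the text).
    """
    remaining = {frozenset(edge) for edge in marked_edges}
    removed = set()
    while remaining:
        # Degree of each vertex in the subgraph made of marked edges only.
        degree = {}
        for edge in remaining:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        # Highest marked degree first; ties broken by repr (an assumption).
        best = max(degree, key=lambda v: (degree[v], repr(v)))
        removed.add(best)
        remaining = {edge for edge in remaining if best not in edge}
    return removed


# Line-graph vertices are similarities (edges of the original graph).
marked = [
    (("o1", "o5"), ("o5", "o8")),
    (("o2", "o5"), ("o5", "o8")),
]
removed = greedy_vertex_removal(marked)
all_vertices = {v for edge in marked for v in edge}
kept_similarities = all_vertices - removed
print(removed)  # {('o5', 'o8')}
```

The vertices that survive (`kept_similarities`) are read back as the edges of the modified graph. In this toy case, dropping the single similarity `("o5", "o8")` is enough to clear both marked edges.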

Material

The SCOP database is an expert classification of structures of protein domains. It is used as a source of data for our clustering studies and as the reference classification to which classes formed by the clustering procedure are compared.

SCOP offers a hierarchical classification organized as a 6-level tree. Protein domains are successively divided into “Classes”, “Folds”, “SuperFamilies” and “Families”. The leaves of the tree are the protein domains. In this study automated classifications are compared to the finest-grained SCOP level: a group of protein domains belonging to the same SCOP Family is considered as a SCOP cluster.

The set of items is taken from 3D protein structure of domains of SCOP database

The mapping function of two objects is performed by the YAKUSA software
^{
o
′
} and are represented by the mapped parts ^{
o
′
}).

The set of protein pairs showing a YAKUSA z-score over or equal to _{
T
O
}=7.0 defines the edges

Before applying the graph modification method we remove all the isolated proteins (proteins not similar to any other protein of the database),

**Modification of the graph.** Evolution of the size (top) and the number (bottom) of independent connected components (ICCs) of the modified graph $G_T$ for increasing threshold of ternary similarity. $G_T$ becomes more and more sparse, and connected components more numerous.

**Sizes of the connected components of the modified graph $G_T$.** (top) Size of the largest independent connected component (ICC) and (bottom) mean size of the independent connected components of the modified graph

Results

Clustering effect of the modification graph process

In order to experiment the method,

The heuristic **ℋ** selecting vertices $E_T$ to be removed from $P_T$ can potentially select any vertex $(o_i,o_j)$. If $(o_i,o_j)$ is the only vertex where item $o_i$ appears, deletion of $(o_i,o_j)$ leads to removal of item $o_i$. As $G_T$ is built from the inverse line-graph transformation (every vertex of $P_T - E_T$ leads to an edge of $G_T$), item $o_i$ is absent from the vertices of $G_T$.

By construction, our graph modification process implies a reduction of the part of $P_T$ that kept the graph away from a line graph (the line graph of $G_T$ is $P_T - E_T$). Removal of vertices from $P_T$ corresponds to the removal of edges from $G_T$. As expected, this loss of connectivity is directly correlated to the value of threshold

Moreover, ICCs formed in the building of $P_T$ are transferred to the line graph of $G_T$ and, from property 2, to $G_T$. As shown in Figure

Pre-clustering effect of ternary similarity constraints

Our modification graph process implies two edge deletion steps. First step is the suppression of **ℋ**. According to property 3, node removal from

In the second step, edge deletion can potentially split an ICC of _{
G
T
}. For a similarity threshold of

**Connected components split during graph modification $G_{0.65}$.** Cuts of the ICCs are represented by the thick (vertical) lines. Links removed (resp. kept) by the modification are shown in dashed (resp. continuous) lines. Items belonging to the same SCOP Family are circled in gray and the SCOP Family is given in the caption.

One can notice in this Figure that protein domains from different SCOP Classes are linked in

Ternary similarity threshold and 3D structural comparisons

Picked-up from one of the nine splits presented in Figure

**Protein domain structures comparison in the ternary relation context.** Domains d1w0pa2 (Sialidase with SCOP “Family” Id. b.29.1.8), d1uaia_ (Polyguluronate lyase with SCOP “Family” Id. b.29.1.

Classifications granularities

Applying ternary similarity constraints has a clustering effect that takes shared similarities into account. It affects the classes formed by MCL, the main clustering algorithm of our procedure. The granularity of the clustering has been studied for varying thresholds of ternary similarity T and inflation parameter

The inflation parameter

As expected, large ICC’s are rapidly split into small clusters when inflation parameter increases as shown in Figure


**MCL classifications.** Mean sizes (dashed lines) and ±2_{GT}, when the ternary similarity threshold

Thus, if the reduction of $G_T$ changes the clustering of items, the granularity is not significantly affected.

Comparison of the MCL classes to standard expert classifications

We compare the MCL classifications obtained with or without the application of ternary similarity constraints to the reference classification SCOP. This is done by means of Precision/Recall (PR) curves rather than ROC curves because i) the information contained in both curves is largely equivalent

As shown in Figure


**Comparison of MCL classes to reference classification SCOP.** Comparison of classes obtained from application of MCL to the initial graph _{GT}(continuous lines). Lighter continuous lines correspond to more stringent ternary similarity constraint (varying from 0.2 to 0.8 by step of 0.2). Left top (resp. bottom): represents Recall (resp. Precision) for increasing values of parameter I (MCL inflation parameter). Right: Precision versus Recall curves points, from right to left, correspond to decreasing inflation parameter I.

Differently, for increasing values of threshold

Choice of the final clustering algorithm

In order to evaluate the real impact of the ternary similarity constraint independently from the choice of the final clustering algorithm, we compared classifications obtained with MCL to those obtained with a standard approach. We used a normalized spectral clustering algorithm

Neither MCL nor the spectral method tends to form clusters with only one member. As shown in Figure


**Impact of final clustering algorithm.** Jaccard similarity coefficient between reference classification SCOP and MCL (in black) or spectral clustering (in gray) automated classifications obtained with no ternary similarity constraint (dashed lines) or with a ternary similarity constraint T=0.65 (continuous lines). The vertical line at 1977 clusters (resp. 1241 clusters) gives the number of classes in SCOP (resp. with more than one protein domain).

Whatever the final clustering algorithm, Figure

Discussion and conclusions

Classification of objects such as protein structures based on pairwise similarity relations is a classical problem. We have shown the advantages of applying ternary similarity constraints in the clustering process.

The method proposed here is in line with many $M_L$) constraint states that two objects should be placed in the same cluster while a cannot-link ($C_L$) constraint states that two objects should not be placed in the same cluster. Constraints acting on groups of objects have also been considered, as $C_L$ constraint. This is true for any graph-based clustering approach. In such approaches, the similarity (or distance) matrix defines the initial weighted graph, and edges are then removed until the graph is partitioned. For instance in

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JP, HS and GS conceived the graph-based algorithm. GS implemented the algorithm and carried out the experiments. All authors read and approved the final manuscript.

Acknowledgements

The present work is part of the PROTEUS project, which received support from ANR-06-CIS (Calcul Intensif et Simulation). Our thanks to Therese Pothier for the English proofreading. …