
ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval

Abstract

Background

The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. Such pairwise measures neglect the distribution of the other proteins in the database and thus cannot meet the accuracy requirements of retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning contextual dissimilarity/similarity measures can improve retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms are unsupervised and do not use the known class labels of the proteins in the database.

Results

In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure d_ij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is that, for a pair of proteins (i, j), if their contexts N(i) and N(j) are similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing d_ij by a factor learned from the contexts N(i) and N(j).

Moreover, we divide the context into hierarchical sub-contexts and compute a contextual dissimilarity vector for each protein pair. Using the class labels of the proteins, we select relevant protein pairs (pairs with the same class label) and irrelevant protein pairs (pairs with different labels), and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is then used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchical Context Coherently in an iterative algorithm, ProDis-ContSHC.

We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other unsupervised contextual dissimilarity measures that do not use the class label information.

Conclusions

Using the contextual proteins with their class labels in the database, we can improve the accuracy of the pairwise dissimilarity/similarity measures dramatically for the protein retrieval tasks. In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.

Background

Proteins are linear chains of amino acids. The polypeptide chains are folded into complicated three-dimensional (3D) structures. With different structures, proteins are able to perform specific functions in biological processes [1–14]. To study the structure-function relationship, biologists have a high demand for protein structure retrieval systems that search for similar sequences or 3D structures [15]. Protein pairwise comparison is one of the main functions of such retrieval systems [16]. The need to retrieve or classify proteins using 3D structure or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources for new treatments. In folding simulations, similar intermediate structures might be indicative of a common folding pathway [17].

Related work

The structural comparison problem in a protein structure retrieval system has been extensively studied. Aung and Tan [18] proposed a rapid protein structure retrieval system named ProtDex2, which adopted information retrieval techniques to perform rapid database search without accessing each 3D structure in the database. The retrieval process was based on an inverted-file index constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the protein structures in the database. To evaluate the similarity score between a query protein structure and a protein structure in the database, they adopted and modified the well-known ∑(tf × idf) scoring scheme commonly used in document retrieval systems [19]. In [20, 21], Daras et al. presented a 3D shape-based approach. The method relied primarily on the geometric 3D structure of the proteins, which was produced from the corresponding PDB files, and secondarily on their primary and secondary structures. Additionally, characteristic attributes of the primary and secondary structures of the protein molecules were extracted, forming attribute-based descriptor vectors. The descriptor vectors were then weighted and an integrated descriptor vector was produced. To compare a pair of protein descriptor vectors, Daras et al. [20, 21] used two similarity metrics. The first was based on the Euclidean distance [22] between the descriptor vectors, and the second on the Mean Euclidean Distance Measure [20, 21].

Later, Marsolo and Parthasarathy presented two normalized, stand-alone representations of proteins that enabled fast and efficient object retrieval based on sequence or structure information [17, 23]. For range queries, they specified a range value r and retrieved all the proteins in the database that lay within a distance r of the query. In their work, distance referred to the standard Euclidean distance [22]. In [24], Sael et al. introduced a global surface shape representation by 3D Zernike descriptors for protein structure similarity search. In their study, three distance measures were used for comparing the 3D Zernike descriptors of protein surface shapes, i.e., Euclidean distance, Manhattan distance [25], and correlation coefficient-based distance. Zhang et al. developed a fast protein comparison algorithm, IR Tableau, for protein retrieval purposes [26]; it leveraged the tableau representation to compare protein tertiary structures and compared tableaux using feature indexing techniques. In IR Tableau [26], a number of similarity functions were applied for comparing a pair of protein vectors, i.e., cosine similarity [27], Jaccard index [28], Tanimoto coefficient [29], and Euclidean distance.

The basic components of a protein retrieval system include a way to represent proteins and a dissimilarity measure that compares a pair of proteins. Most of the aforementioned studies focus on the feature representation of the proteins, while neglecting the comparison of the feature vectors. Such studies usually apply a simple similarity or dissimilarity measure for the comparison of the feature vectors, such as the Euclidean distance used in [17, 20, 21, 23, 24, 26]. Most of the existing protein comparison techniques suffer from the following two bottlenecks:

  • The dissimilarity measure is a pairwise distance measure, which is computed only considering the query protein x0 and a database protein x i as d(x0, x i ). It does not consider other proteins in the database, neglecting the effects of the contextual proteins. If we consider the distribution of the entire protein database X = {x j }, j = 1 ... N when computing the dissimilarity as d(x0, x i |X), the retrieval performance may benefit from the contextual proteins {x j }, j ≠ i.

  • The dissimilarity measure is computed in an unsupervised way, which does not use the known information of the class labels L = {l j }, j = 1 ... , N in the database. Although we may have no idea about whether x0 and x i belong to the same class (having the same folding type etc., l0 = l i ) or not (l0 ≠ l i ), we do know some prior information about other proteins L. In all of the previous studies, prior class labels L were not adopted to calculate the dissimilarity d(x0, x i ).

Due to these two bottlenecks, traditional protein retrieval systems that use pairwise, unsupervised dissimilarity measures usually do not achieve satisfactory performance, even though many effective protein feature descriptors have been developed and used. In this paper, we investigate the dissimilarity measure and propose a novel learning algorithm to improve the performance of a given dissimilarity measure.

Recent research in machine learning points out that contextual information can be used to improve dissimilarity or similarity measures. Algorithms of this kind are called contextual or context-sensitive dissimilarity learning algorithms [30–34]. Unlike the traditional pairwise distance d(x0, x i ), which only considers the two proteins being compared, x0 and x i , contextual dissimilarity also considers the contextual proteins X when computing the dissimilarity d(x0, x i |X). The existing contextual similarity learning algorithms can mainly be classified into the following two categories:

Dissimilarity regulation

The first contextual dissimilarity measure (CDM) was proposed by Jegou et al. in [30, 31], and it significantly improved the accuracy of image search. The CDM took the local distribution of the vectors into account and iteratively estimated the distance update terms in the spirit of Sinkhorn's scaling algorithm [35], thereby modifying the neighborhood structure. This regularization was motivated by the observation that a good ranking was usually not symmetric in an image search system. In this paper, we will focus on this type of contextual dissimilarity learning.

Similarity transduction on graph

In [32, 33], Bai et al. provided a novel perspective to the shape retrieval tasks by considering the existing shapes as a group and studying their similarity measures to the query shape in a graph structure. For a given similarity measure, a new similarity was learned through graph transduction. The learning was done in an iterative manner so that the neighbors of a given shape influenced the final similarity to the query. The basic idea is actually related to the PageRank algorithm, which forms a foundation of Google Web search. This method is further improved by Wang et al. in [36]. Similar learning algorithms were also used to rank proteins in a protein database as in [37, 38]. Kuang et al. proposed a general graph-based propagation algorithm called MotifProp to detect more subtle similarity relationship than the pairwise comparison methods. In [38], Weston et al. reviewed RankProp, a ranking algorithm that exploited the global network structure of similarity relationship among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges.

The drawbacks of the above algorithms are twofold. On the one hand, such algorithms do not utilize the class label information L of the database, and thus work in an unsupervised way. The only one that uses L is [38]. However, the algorithm proposed in [38] has basically the same framework as [32, 33, 37], i.e., the protein label information L is only used to estimate the parameters. On the other hand, the "context" is fixed in the iterative algorithms of most of the transduction methods [32, 33, 37, 38]. A better way is to update the context using the learned similarity measures as in [30, 31].

To overcome these drawbacks, we develop a novel contextual dissimilarity learning algorithm to improve the performance of a protein retrieval system. The novel dissimilarity measure is regularized by the dissimilarity of the contextual proteins (neighboring proteins), while the contextual proteins are updated using the learned dissimilarities coherently. The basic idea comes from [39, 40], which assume that if two local features in two images are similar, their contexts are likely to be similar. In comparison to [30, 31], which use the neighborhood as a single context, we partition the neighborhood into several hierarchical sub-contexts corresponding to the learned dissimilarities. With the sub-contexts, we compute the dissimilarity between the sub-contexts of a pair of proteins and construct the hierarchical sub-contextual dissimilarity vector. Moreover, using the label information L, we select pairs of proteins belonging to the same classes {(x i , x j )|l i = l j } as the relevant protein pairs. We also select the irrelevant protein pairs {(x k , x l )|l k ≠ l l }.

Finally, we train a support vector machine (SVM) [41] to distinguish between the relevant and the irrelevant protein pairs. The output of the SVM will further be used to regularize the dissimilarity in an iterative manner.

Methods

This section describes our contextual protein-protein dissimilarity learning algorithm, which utilizes the contextual proteins and class label information of the database proteins to index and search protein structures efficiently. We will demonstrate that our idea is general in the sense that it can be used to improve the existing similarity/dissimilarity measures.

Protein structure retrieval framework

In a protein retrieval system, the query and the database proteins are firstly represented as feature vectors. Here, we denote the query protein feature vector as x0 and database protein feature vectors as X = {x1, x2, ... , x N }, where N is the number of proteins in the database. Then, based on a distance measure d0i= d(x0, x i ), we compute the distance of x0 and all the proteins in the database, i.e., {d01, d02, ... , d0N}. The database proteins are then ranked according to the distances. The k most similar ones are returned as the retrieval results. We illustrate the outline of the protein retrieval system in Figure 1.
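The ranking step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `euclidean` and `retrieve` are hypothetical, the Euclidean distance stands in for whatever base measure d is actually used, and the toy feature vectors are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, database, k, dist=euclidean):
    """Return the indices of the k database vectors closest to the query.

    `dist` is the base distance d(x0, xi); any other pairwise measure
    with the same call signature could be plugged in instead.
    """
    distances = [dist(query, x) for x in database]
    # Rank database entries by ascending distance to the query.
    ranked = sorted(range(len(database)), key=lambda i: distances[i])
    return ranked[:k]


# Toy data (hypothetical): four 2-D "protein" feature vectors.
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
top2 = retrieve([1.0, 1.0], database, k=2)
```

Any learned dissimilarity can replace `dist` without changing the ranking logic, which is what makes the framework independent of the particular measure.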

Figure 1
figure 1

Flowchart of protein retrieval systems.

ProDis-ContSHC: the contextual dissimilarity learning algorithm

In this section, we will introduce the novel contextual protein-protein dissimilarity learning algorithm. We first give the definition of the hierarchical context of a protein, which will be used to compute the contextual dissimilarity and regularize the dissimilarity measure. Then a more discriminative regularization factor is learned using the class labels of the database proteins. Finally, we propose the Supervised regulating of Protein-protein Dissimilarity and updating of the Hierarchical Context Coherently in an iterative manner, resulting in the ProDis-ContSHC algorithm.

Using hierarchical context to regularize the dissimilarity measure

Here, we define the context of a protein x_i as its K nearest neighbors N(i). The dissimilarity between two contexts is measured by the contextual dissimilarity

r_{ij} = \frac{1}{K^2} \sum_{m \in N(i),\, n \in N(j)} d_{mn}
(1)

The contextual dissimilarity is illustrated in Figure 2(a).

Figure 2
figure 2

Illustration of context-based dissimilarity and hierarchical context-based dissimilarity. The two proteins x_i and x_j, on which the dissimilarity is to be measured, are in the first row. The nearest neighbors of these two proteins are listed below them as their respective contexts. (a) The traditional context N(i); (b) the proposed hierarchical context N_p(i), p = {1, 2, 3}.

Furthermore, instead of averaging all the pairwise dissimilarities between the two contexts N(i) and N(j), we propose the hierarchical context, which splits the context N(i) into P "sub-contexts" N_p(i), p = 1, ..., P, according to their distances to x_i. To be more specific, the sub-context N_p(i) is defined as

N_p(i) = \{ x_j \mid x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i \text{ according to } \{d_{ij}\},\ j \in \{1, \dots, i-1, i+1, \dots, N\} \}
(2)

where k' = (p - 1) × κ, k'' = (p - 1) × κ + κ, κ is the size of a sub-context, and P is the number of sub-contexts. In this way, we can compute the contextual dissimilarity by averaging the dissimilarities of the sub-contexts as

r_{ij} = \frac{1}{P} \sum_{p} \frac{1}{\kappa^2} \sum_{m \in N_p(i),\, n \in N_p(j)} d_{mn} = \frac{1}{P} \sum_{p} d_{ij}(p)
(3)

where d_{ij}(p) = \frac{1}{\kappa^2} \sum_{m \in N_p(i),\, n \in N_p(j)} d_{mn}, p = 1, ..., P, is the hierarchical sub-contextual dissimilarity. Figure 2(b) illustrates the idea of sub-contextual dissimilarity.
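Equations (2) and (3) can be illustrated with the following Python sketch. It is not the authors' code, and it rests on one stated assumption: the p-th sub-context is taken to hold the neighbors ranked (p - 1)κ + 1 through pκ, so that every sub-context has exactly κ members. The function names and the toy distance matrix are hypothetical.

```python
def sub_contexts(D, i, kappa, P):
    """Split the kappa * P nearest neighbors of protein i into P groups.

    D is a full pairwise dissimilarity matrix (list of lists). Assumption:
    group p (0-based) holds the neighbors ranked p*kappa+1 .. (p+1)*kappa.
    """
    others = [j for j in range(len(D)) if j != i]
    ranked = sorted(others, key=lambda j: D[i][j])
    return [ranked[p * kappa:(p + 1) * kappa] for p in range(P)]


def sub_context_dissimilarity(D, ctx_i, ctx_j):
    """Average of d_mn over all m in ctx_i and n in ctx_j."""
    total = sum(D[m][n] for m in ctx_i for n in ctx_j)
    return total / (len(ctx_i) * len(ctx_j))


def contextual_vector(D, i, j, kappa, P):
    """P-dimensional vector [d_ij(1), ..., d_ij(P)] of Eq. (3)."""
    ci = sub_contexts(D, i, kappa, P)
    cj = sub_contexts(D, j, kappa, P)
    return [sub_context_dissimilarity(D, ci[p], cj[p]) for p in range(P)]


def contextual_dissimilarity(D, i, j, kappa, P):
    """r_ij: mean of the sub-contextual dissimilarities."""
    return sum(contextual_vector(D, i, j, kappa, P)) / P


# Toy data (hypothetical): six points on a line, d = absolute difference.
pos = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in pos] for a in pos]
vec = contextual_vector(D, 0, 5, kappa=1, P=2)
r = contextual_dissimilarity(D, 0, 5, kappa=1, P=2)
```

With P = 1 and κ = K the same functions reduce to the single-context dissimilarity of Eq. (1). The sketch requires κ × P to be at most the number of other proteins.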

Intuitively, if the contexts of two proteins are dissimilar to each other (r_ij is higher than the average), the two proteins should have a higher dissimilarity value, and vice versa. We implement this by multiplying d_ij by a coefficient, namely the ratio of r_ij to the average of all the contextual dissimilarities, \bar{r} = \frac{1}{N^2} \sum_{i,j} r_{ij}:

d^{*}_{ij} = d_{ij} \times \frac{r_{ij}}{\bar{r}} = d_{ij} \times \delta_{ij}
(4)

Here, δ_ij = r_ij / \bar{r} is a regularization factor for d_ij, with which we can improve d_ij using its contextual information. Moreover, this procedure can be carried out iteratively: we can use the regularized dissimilarity measure d*_ij to re-define the hierarchical context N_p(i). In this way, we can learn the protein-protein dissimilarity d*_ij and the hierarchical context N_p(i) coherently.
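As a small worked example of Eq. (4), the sketch below rescales a dissimilarity matrix by the ratio of each contextual dissimilarity to the global mean. The function name `regularize` and the 2 × 2 matrices are hypothetical and serve only to make the arithmetic concrete.

```python
def regularize(D, R):
    """Apply Eq. (4): d*_ij = d_ij * r_ij / r_bar.

    D holds the pairwise dissimilarities d_ij and R the contextual
    dissimilarities r_ij; both are square lists of lists of equal size.
    """
    n = len(D)
    # r_bar is the mean of all n * n contextual dissimilarities.
    r_bar = sum(sum(row) for row in R) / (n * n)
    return [[D[i][j] * R[i][j] / r_bar for j in range(n)] for i in range(n)]


# Toy values (hypothetical): r_bar = (1 + 3 + 3 + 1) / 4 = 2.
D = [[0.0, 2.0], [2.0, 0.0]]
R = [[1.0, 3.0], [3.0, 1.0]]
D_star = regularize(D, R)
```

Here the off-diagonal contextual dissimilarity (3) exceeds the mean (2), so the corresponding pairwise dissimilarity is enlarged from 2 to 3.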

Supervised regularization factor learning

We utilize the label information L = {l_1, ..., l_N} of the database proteins to learn a better regularization factor δ_ij. The class information is adopted in both the intraclass and interclass dissimilarity computation to maximize the Fisher criterion [42] for protein class separability. First, we select a number of protein pairs {γ = (i, j) | i, j = 1, ..., N}. For each pair, we compute the hierarchical contextual dissimilarities and organize them as a P-dimensional dissimilarity vector d_γ = [d_ij(1), d_ij(2), ..., d_ij(P)]^⊤, as shown in Figure 3. Then, inspired by the score fusion rule [43, 44], using L, we label each pair γ = (i, j) as a relevant pair, y_γ = +1, if l_i = l_j, or as an irrelevant pair, y_γ = -1, otherwise.
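The pair-labeling rule can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up class labels; it enumerates unordered pairs only and does not compute the contextual dissimilarity vectors themselves.

```python
from itertools import combinations


def label_pairs(labels):
    """Map each unordered pair (i, j), i < j, to +1 or -1.

    +1 marks a relevant pair (same class label) and -1 an irrelevant
    pair (different class labels).
    """
    return {
        (i, j): (1 if labels[i] == labels[j] else -1)
        for i, j in combinations(range(len(labels)), 2)
    }


# Toy class labels (hypothetical): two proteins of class "a", three of "b".
labels = ["a", "a", "b", "b", "b"]
pairs = label_pairs(labels)
n_relevant = sum(1 for y in pairs.values() if y == 1)
```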

Figure 3
figure 3

Differentiate relevant and irrelevant proteins by classification. (x i , x j ) is assumed to be a relevant pair and (x i , x k ) is assumed to be an irrelevant pair. The contextual dissimilarity vectors of both pairs are distinguished by a binary SVM model.

Now, with the training samples Γ = {(d_γ, y_γ)}, γ = 1, ..., \binom{N}{2}, we train a binary SVM [41] classifier to distinguish between the relevant pairs and the irrelevant pairs. The publicly available package SVMlight [45] is used to train the SVM on our training set Γ. This package allows us to optimize a number of parameters and offers the option of using different kernel functions to obtain the best classification performance [46]. The separating hyperplane generated by the SVM model is given by

f(d) = d \cdot w + b
(5)

where w is a vector orthogonal to the hyperplane and b is the offset; w and b are chosen to minimize ||w||^2 subject to the following conditions:

y_\gamma (d_\gamma \cdot w + b) \geq 1
(6)

for all 1 ≤ γ ≤ \binom{N}{2}, where \binom{N}{2} is the total number of examples (protein pairs). An SVM model with a linear decision boundary is shown in Figure 3 to distinguish the relevant protein pairs from the irrelevant ones. Note that it is not necessary to include all the \binom{N}{2} possible protein pairs when training the SVM model (5). For any pair of proteins (x_i, x_j), after we compute its contextual dissimilarity vector d_ij, the trained SVM classifier is applied to obtain the distance of this point to the margin boundary of the SVM as ỹ_ij = f(d_ij). Since relevant pairs are labeled +1, ỹ_ij indicates how strongly the contexts of the pair resemble those of a relevant pair: the larger ỹ_ij, the more similar the contexts. Thus, it can be used to form a regularization factor as

\delta'_{ij} = \exp\left( -\frac{\tilde{y}_{ij}}{\sigma} \right) = \exp\left( -\frac{d_{ij} \cdot w + b}{\sigma} \right)
(7)

where σ is a parameter of the factor. With this regularization factor learned from the contextual proteins, we regularize the dissimilarity d_ij of the protein pair (x_i, x_j) as

d^{*}_{ij} = d_{ij} \times \delta'_{ij}
(8)
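The effect of Eqs. (5), (7), and (8) can be seen in a short sketch. It assumes a linear model whose weights are already known; the values of `w`, `b`, and `sigma` below are hypothetical placeholders, not parameters learned by SVMlight, and the function names are illustrative.

```python
import math


def decision_value(d, w, b):
    """Linear SVM output f(d) = d . w + b of Eq. (5)."""
    return sum(di * wi for di, wi in zip(d, w)) + b


def supervised_factor(d, w, b, sigma):
    """Regularization factor of Eq. (7): exp(-f(d) / sigma)."""
    return math.exp(-decision_value(d, w, b) / sigma)


# Hypothetical model: small contextual dissimilarities give f(d) > 0,
# i.e. the pair looks relevant.
w, b, sigma = [-1.0, -0.5], 2.0, 1.0

close = supervised_factor([0.5, 1.0], w, b, sigma)  # f = 1.0, factor < 1
far = supervised_factor([3.0, 2.0], w, b, sigma)    # f = -2.0, factor > 1

# Eq. (8): the factor shrinks or enlarges the original dissimilarity.
d_close_star = 0.8 * close
d_far_star = 0.8 * far
```

A pair classified as relevant (positive decision value) gets a factor below 1 and moves closer, while a pair classified as irrelevant gets a factor above 1 and moves apart. A larger σ flattens both effects.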

Updating the context and dissimilarity coherently

With the learned dissimilarity measure d*_ij, we can re-define the "context" of a protein x_i according to its dissimilarity to all the other proteins, d*_ij, j ∈ {0, ..., i - 1, i + 1, ..., N}. The new "hierarchical context" relying on d*_ij is denoted as N*_p(i), p = 1, ..., P. In this way, we can develop an iterative algorithm that learns d*_ij and N*_p(i), p = 1, ..., P, coherently. Since N*_p(i) implicitly depends on d*_ij through the nearest neighbors of x_i, we use a fixed-point recursion method [47] to solve for d*_ij. In each iteration, N*_p(i) is first computed using the previous estimate of d*_ij, which is then updated by multiplying it by the regularization factor δ'_ij as in (8). The iterations are carried out T times, as given in Algorithm 1.

With the learned dissimilarity matrix D^(T+1), we use the row D^(T+1)[0; 1, ..., N] as the dissimilarities between the query protein x_0 and the database proteins {x_1, ..., x_N}. Thus we can rank the database proteins in ascending order of dissimilarity.

Efficient implementation of ProDis-ContSHC

The proposed learning algorithm is time-consuming. Therefore, it is not suitable for realtime protein retrieval systems. Here we propose several techniques to significantly improve the efficiency of the algorithm.

  • Similar to [33], in order to increase the computational efficiency, it is possible to run ProDis-ContSHC on only part of the database of known proteins. Hence, for each query protein x_0, we first retrieve the N' ≪ N most similar proteins, and perform ProDis-ContSHC to learn the dissimilarity matrix of size (N' + 1) × (N' + 1) for only those proteins. Then we calculate the new dissimilarity measure D', of size (N' + 1) × (N' + 1), for only those (N' + 1) proteins. Here, we assume that all the relevant proteins will be among the top N' most similar proteins. This strategy is illustrated in Figure 4(a) and 4(b).

Figure 4
figure 4

Efficient implementation of ProDis-ContSHC. (a) Performing ProDis-ContSHC on the original matrix of size (N + 1) × (N + 1) from the entire dataset; (b) Performing ProDis-ContSHC on a subset of the database proteins, i.e., a dissimilarity matrix of size (N' + 1) × (N' + 1); (c) Using the symmetry property of the dissimilarity matrix to reduce the training time.

  • Most of the dissimilarity and similarity measures are symmetric ones, i.e., d ij = d ji . As can be observed in (13), the regularization of d ij is also symmetric. Therefore, it is possible to develop an efficient learning algorithm by using this property. In the algorithm, all the computation results of (i, j) (such as d ij and δ ij ) can be used directly by (j, i). In this way, we can save almost half of the computational time, as shown in Figure 4(c).

  • A bottleneck of ProDis-ContSHC may be the training procedure for the SVM model in each iteration. For a database of N proteins belonging to C classes, there are \binom{N}{2} protein pairs, of which \sum_{c=1}^{C} \binom{N_c}{2} are relevant pairs, while \sum_{c=1}^{C} \sum_{c' \neq c} N_c \times N_{c'} are irrelevant pairs, where C is the number of protein classes and N_c is the number of proteins in the c-th class (\sum_{c=1}^{C} N_c = N). There might be a huge number of protein pairs available for the SVM training. However, it is not necessary to include all of them in the training process. One can select a small but equal number of relevant and irrelevant pairs to train the SVM classifier. This is an effective way to reduce the training time of the SVM.
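The last two ideas in the list above, reusing symmetric results and training on a small balanced subset of pairs, can be combined in a short sketch. It is an illustration under stated assumptions rather than the authors' procedure: the function name, the uniform random sampling, and the fixed seed are all hypothetical choices.

```python
import random
from itertools import combinations


def sample_balanced_pairs(labels, n_per_class, seed=0):
    """Draw equally many relevant and irrelevant pairs for SVM training.

    Only unordered pairs (i, j) with i < j are enumerated, which exploits
    the symmetry d_ij = d_ji. Uniform sampling is an assumption; the
    source does not say how the pairs are selected.
    """
    rng = random.Random(seed)
    relevant, irrelevant = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            relevant.append((i, j))
        else:
            irrelevant.append((i, j))
    # Never request more pairs than either group can supply.
    n = min(n_per_class, len(relevant), len(irrelevant))
    return rng.sample(relevant, n), rng.sample(irrelevant, n)


# Toy class labels (hypothetical): three proteins in each of two classes.
labels = ["a", "a", "a", "b", "b", "b"]
rel, irr = sample_balanced_pairs(labels, n_per_class=4)
```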

Algorithm 1 ProDis-ContSHC: Supervised Learning of Protein Dissimilarity and Updating Hierarchical Context Coherently.

Require: Input D = [d_{ij}] of size (N+1) × (N+1): matrix of pairwise protein feature distances, where x_0 is the query protein and {x_1, ..., x_N} are the database proteins;

Require: Input κ: size of each hierarchical sub-context;

Require: Input P: number of hierarchical sub-contexts;

Initialize the dissimilarity matrix: D^{(1)} = D;

for t = 1, ..., T do

Update the hierarchical context N_p^{(t)}(i), p = 1, ..., P, for each protein x_i:

N_p^{(t)}(i) = \{ x_j \mid x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i \text{ according to } D^{(t)}(i; 1, \dots, N) \}
(9)

where k' = (p - 1) × κ, k'' = (p - 1) × κ + κ, and D^{(t)}(i; 0, \dots, N) = [d_{i0}^{(t)}, \dots, d_{iN}^{(t)}].

Compute the contextual dissimilarity vector d_{ij}^{(t)} for each pair of proteins (i, j), i, j ∈ {0, ..., N}:

d_{ij}^{(t)} = [ d_{ij}^{(t)}(1) \;\; d_{ij}^{(t)}(2) \;\; \cdots \;\; d_{ij}^{(t)}(P) ]^{\top}
(10)

where d_{ij}^{(t)}(p) = \frac{1}{\kappa^2} \sum_{m \in N_p^{(t)}(i),\, n \in N_p^{(t)}(j)} d_{mn}^{(t)}.

Select relevant and irrelevant protein pairs and label them as y_γ = +1 and y_γ = -1, respectively; train an SVM model on their contextual dissimilarity vectors d_γ^{(t)} as

f^{(t)}(d) = w^{(t)} \cdot d + b^{(t)}
(11)

Compute the distance to the SVM margin boundary for the contextual dissimilarity vector d_{ij}^{(t)} of each pair of proteins as \tilde{y}_{ij}^{(t)} = f^{(t)}(d_{ij}^{(t)}), and set a regularization factor for this pair of proteins:

\delta_{ij}^{(t)} = \exp\left( -\frac{\tilde{y}_{ij}^{(t)}}{\sigma} \right)
(12)

Update the pairwise protein dissimilarity measures:

for i = 0, 1, ..., N do

for j = 0, 1, ..., N do

d_{ij}^{(t+1)} = d_{ij}^{(t)} \times \delta_{ij}^{(t)}
(13)

end for

end for

D^{(t+1)} = [ d_{ij}^{(t+1)} ] of size (N+1) × (N+1).

end for

Output the dissimilarity matrix: D^{(T+1)}.

Benchmark sets

To evaluate the proposed ProDis-ContSHC algorithm, we conduct experiments on two different benchmark sets, i.e., the ones used in [21] and [26] respectively.

ASTRAL 1.73 protein domain dataset

Following [26], we use the following database and queries as our first benchmark set:

Database

The ASTRAL 1.73 [48] 95% sequence-identity non-redundant data set is used as the protein database. We generate our index database from the tableau data set published by Stivala et al. [49], which contains 15,169 entries.

Queries

A query data set containing 200 randomly selected protein domains is used in our experiment. For each query, a list that contains all the proteins in the respective index database is returned with the ranking scores.

We generate a vector of features x for a given protein based on its tableau representation [49].

FSSP/DALI protein dataset

To evaluate the performance of the proposed methods, a portion of the FSSP database [50] is selected as in [21]. This dataset has 3,736 proteins classified into 30 classes. It is constructed according to the DALI algorithm [51, 52]. The number of proteins per class varies from 2 to 561. For protein feature representation, the following two features are extracted from the 3D structure and the sequence of a protein as in [20, 21]:

  • The Polar-Fourier transform, resulting in the FT02 features;

  • Krawtchouk moments, resulting in the Kraw00 features.

The descriptor vectors are weighted and an integrated descriptor vector is produced as x, which will be used for the protein retrieval tasks.

Results and discussion

Results on ASTRAL 1.73 dataset

To compare a query protein x0 to a protein x i in the ASTRAL 1.73 dataset, we compute the cosine similarity [27] as the baseline similarity measure as in [26]. Cosine similarity [27] simply calculates the cosine of the angle between the two vectors x i and x j .

s_{ij} = C(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}
(14)

A higher cosine similarity score implies a smaller angle between the two vectors. Although ProDis-ContSHC is proposed to learn the protein-protein dissimilarity d_ij, it can easily be extended to learn a similarity s_ij as well. The only difference is to set the regularization factor to δ'_ij = exp(ỹ_ij / σ) instead of δ'_ij = exp(-ỹ_ij / σ) as in (7).
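The cosine baseline of Eq. (14) and the sign flip in the regularization factor can be sketched as follows. The function names are hypothetical, the decision value `y_tilde` is passed in directly rather than computed by a trained SVM, and both input vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero feature vectors, Eq. (14)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def regularize_similarity(s, y_tilde, sigma):
    """Similarity variant of Eq. (8): s* = s * exp(+y_tilde / sigma)."""
    return s * math.exp(y_tilde / sigma)


s_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
s_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Hypothetical decision values: a pair judged relevant (y_tilde > 0) has
# its similarity raised; a pair judged irrelevant has it lowered.
s_up = regularize_similarity(0.5, 1.0, sigma=1.0)
s_down = regularize_similarity(0.5, -1.0, sigma=1.0)
```

The positive exponent is what distinguishes the similarity case: evidence of relevance must increase a similarity but decrease a dissimilarity.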

ROC curve and precision-recall curve performance

SCOP [53] fold classification is used as the ground truth to evaluate the performance of the different methods. To fairly compare the accuracy, we use the receiver operating characteristic (ROC) curve [54], the area under the ROC curve (AUC) [54], and the precision-recall curve [55]. Given a query protein x0 which belongs to the SCOP fold l0, the top k proteins returned by the search algorithm are considered as the hits. The remaining proteins are considered as the misses. Let x i denote the protein ranked i-th in the returned list, and let l i be its SCOP fold. If l i = l0 and i ≤ k, x i is a true positive (TP). If l i ≠ l0 and i ≤ k, x i is a false positive (FP). If l i ≠ l0 and i > k, x i is a true negative (TN). Otherwise (l i = l0 and i > k), x i is a false negative (FN). Using these definitions, we can then compute the true positive rate (TPR, identical to recall), the false positive rate (FPR), and the precision as follows:

\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN}
(15)
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}
(16)

TPR_k, FPR_k, Recall_k, and Precision_k are calculated for all 1 ≤ k ≤ N, where N is the size of the database. The ROC curve is defined by the points with FPR_k as the abscissa and TPR_k as the ordinate. The precision-recall curve is defined by the points with Recall_k and Precision_k as abscissa and ordinate, respectively. We use the area under the ROC curve (AUC) as a single-figure measurement for the quality of a ROC curve [54], and use the AUC averaged over all the queries to evaluate the performance of the method.
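The evaluation protocol can be sketched for a single query as follows. This is an illustrative sketch, not the evaluation code used in the experiments: the function names are hypothetical, the trapezoidal rule for the AUC is an assumption (the text does not state how the area is computed), and the ranked list is assumed to contain at least one positive and one negative.

```python
def roc_pr_points(ranked_labels, query_label):
    """Compute TPR_k, FPR_k and Precision_k for every cutoff k.

    ranked_labels lists the class labels of the database proteins in the
    order returned by the retrieval system (most similar first).
    """
    hits = [label == query_label for label in ranked_labels]
    n_pos = sum(hits)
    n_neg = len(hits) - n_pos
    tp = fp = 0
    points = []
    for k, is_hit in enumerate(hits, start=1):
        if is_hit:
            tp += 1
        else:
            fp += 1
        points.append(
            {"k": k, "tpr": tp / n_pos, "fpr": fp / n_neg, "precision": tp / k}
        )
    return points


def auc(points):
    """Trapezoidal area under the ROC curve, starting from (0, 0)."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for pt in points:
        area += (pt["fpr"] - prev_fpr) * (pt["tpr"] + prev_tpr) / 2
        prev_fpr, prev_tpr = pt["fpr"], pt["tpr"]
    return area


# Toy ranking (hypothetical) for a query of class "a".
points = roc_pr_points(["a", "b", "a", "b"], "a")
area = auc(points)
```

In this toy ranking three of the four (positive, negative) pairs are ordered correctly, which matches the resulting area of 0.75. Averaging `auc` over all queries gives the single figure reported per method.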

To demonstrate the contribution of the supervised learning idea, we also compare ProDis-ContSHC with its unsupervised counterpart, ProDis-ContHC, a contextual dissimilarity algorithm based on unsupervised learning. ProDis-ContHC is also applied to improve the cosine similarity. We also compare with the widely used contextual dissimilarity measure (CDM) [30, 31], which takes into account the local distribution of the vectors and iteratively estimates distance update terms in the spirit of Sinkhorn's scaling algorithm, thereby modifying the neighborhood structures.

The performance of the different methods is compared in Figure 5. Figure 5(a) shows the ROC curves of the original cosine similarity and its versions improved by the three contextual similarity learning algorithms on the ASTRAL 1.73 [48] 95% dataset, with different numbers of proteins returned for each query. It can be seen from Figure 5(a) that the TPR of all the methods increases as the FPR grows. The reason is that, for a fixed number of queries, when the number k of proteins returned for each query is very small, the returned proteins are not enough to "represent" the class of the query, which causes a low TPR. Meanwhile, in this situation, most of the returned proteins belong to the same class as the query with high confidence, resulting in a low FPR. Moreover, the TPR is almost 100% when the FPR > 50%. The ROC curve of ProDis-ContSHC completely encloses the ROC curves of the other three methods, which implies that ProDis-ContSHC is the best method among the four. This also indicates that supervised learning is better than unsupervised learning for this purpose. ProDis-ContHC is the second best method among the four, which demonstrates the contribution of the hierarchical sub-context idea to the traditional contextual dissimilarity measures. The overall AUC results are listed in Table 1, from which similar conclusions can be drawn. Notably, the AUC of ProDis-ContSHC is very close to 1, which means that ProDis-ContSHC works almost perfectly on this dataset. We further compare these four methods by their precision-recall curves, shown in Figure 5(b). The proposed contextual similarity learning algorithms significantly outperform the traditional methods. ProDis-ContSHC, again, is consistently the best method among the four.

Figure 5

Performance of similarity measures on the ASTRAL 1.73 90% dataset. (a) The ROC curves of the original similarity measure, and the improved measures by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively. (b) The precision-recall curves of the original similarity measure, and the improved measures by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively.

Table 1 Performance of different retrieval methods on the ASTRAL 1

Regarding efficiency, in this experiment the learning time of ProDis-ContSHC is longer than that of ProDis-ContHC and CDM. This is because, in each iteration of the learning algorithm, a quadratic programming problem over many training protein pairs has to be solved to train the SVM. In addition, computing the regularization factor of the supervised similarity learning algorithm requires more function evaluations.

We also compare the proposed algorithms with seven other protein retrieval methods, i.e., tableau search [56], QP tableau [49], Yakusa [57], SHEBA [58], VAST [59, 60], and TOPS [61, 62]. The overall AUC values are shown in Table 1. It can be concluded that tableau-feature-based methods, such as tableau search, do not always achieve better performance than the other methods. Among the existing tableau-feature-based methods, IR tableau outperforms the others. Yakusa and SHEBA have comparable performance. As seen in Table 1, the AUC of the proposed algorithms is clearly higher than that of all the other methods.

Improving different similarity measures via contextual dissimilarity learning algorithms

To further evaluate the robustness of our method, we test the behavior of ProDis-ContSHC and the other contextual similarity learning algorithms on different similarity measures. A group of experiments is conducted on the ASTRAL 1.73 95% dataset with the following similarity measures:

  • The cosine similarity [27] as introduced in the previous section.

  • The Jaccard index [28]: it is defined as the size of the intersection divided by the size of the union of two sets, i.e.,

    $$J(x_i, x_j) = \frac{|x_i \cap x_j|}{|x_i \cup x_j|} \qquad (17)$$

  • The Tanimoto coefficient [29]: it is a generalization of the Jaccard index, defined as

    $$J(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i \cdot x_j} \qquad (18)$$

  • Squared Euclidean distance [22]: it is another means of measuring similarity of proteins.

    $$d_{ij} = (x_i - x_j)^\top (x_i - x_j) = \sum_m \left( x_i(m) - x_j(m) \right)^2 \qquad (19)$$

where x_i(m) is the m-th element of vector x_i.
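The four base measures above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, vectors are plain lists of numbers assumed to be non-zero, and the Jaccard index is applied to sets (here, the index sets of non-zero elements), which is one possible way to obtain sets from descriptor vectors and is not specified by the text.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard_index(set_x, set_y):
    """Equation (17): |intersection| / |union| of two non-empty sets."""
    set_x, set_y = set(set_x), set(set_y)
    return len(set_x & set_y) / len(set_x | set_y)


def tanimoto_coefficient(x, y):
    """Equation (18): x.y / (||x||^2 + ||y||^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def squared_euclidean(x, y):
    """Equation (19): sum over m of (x(m) - y(m))^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Toy binary vectors (hypothetical data).
x = [1, 0, 1]
y = [1, 1, 0]
support_x = {m for m, v in enumerate(x) if v != 0}
support_y = {m for m, v in enumerate(y) if v != 0}

print(cosine_similarity(x, y))
print(jaccard_index(support_x, support_y))
print(tanimoto_coefficient(x, y))
print(squared_euclidean(x, y))
```

For binary vectors, the Tanimoto coefficient coincides with the Jaccard index of the corresponding index sets, which is the sense in which it generalizes the Jaccard index. Note that the first three are similarities (larger means closer), whereas the squared Euclidean distance is a dissimilarity (smaller means closer).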

ProDis-ContSHC, ProDis-ContHC, and the CDM algorithm are each applied to improve these similarity measures. The AUC values of the corresponding retrieval systems are plotted in Figure 6. In general, improving the original similarity measure by ProDis-ContSHC leads to the largest improvement. The only exception is the Tanimoto coefficient, on which ProDis-ContSHC has a slightly lower AUC than ProDis-ContHC, but an AUC comparable to that of the CDM. One possible reason is that the supervised classifier fails to capture the real distribution of the contextual similarity. ProDis-ContHC, on the other hand, also performs better than the CDM algorithm and the original similarity measures. This suggests that our previous conclusions are valid and consistent. That is, hierarchical sub-contextual information can remarkably improve the traditional context-based similarity measures, whereas supervised learning can further improve the accuracy for most of the input similarity measures.

Figure 6

Performance of similarity measures on different base measures on the ASTRAL 1.73 90% dataset. The four base measures being tested are cosine similarity [27], the Jaccard index [28], the Tanimoto coefficient [29], and the Euclidean distance [22].

Results on FSSP/DALI dataset

Unlike the similarity measure used in the last experiment, here we use the Euclidean distance [22] as the baseline dissimilarity measure for comparing a pair of proteins, as in [20, 21]. In this way, we can see how our algorithms work with both similarity and dissimilarity measures. For a query protein x_0, the pairwise Euclidean distances d_0i, i = 1, 2, ..., N, are ranked. The top k proteins are returned as the retrieval results. To evaluate the performance of the proposed algorithms, we test them on both the protein retrieval and the protein classification tasks, following [20, 21].

Performance on protein retrieval

The effectiveness of the proposed dissimilarity learning algorithm is first evaluated in terms of its performance on the protein retrieval task. In this case, each protein x_i ∈ X of the dataset is used as a query x_0, and the retrieved proteins are ranked according to the shape dissimilarity d_0j to the query, where j = 1, 2, ..., i - 1, i + 1, ..., N. We again use the precision-recall curve to demonstrate the performance of the proposed methods, where precision is the proportion of the retrieved proteins that are relevant to the query, and recall is the proportion of the relevant proteins in the entire dataset that are retrieved.
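The two quantities can be made concrete with a minimal sketch for one query. The function name `precision_recall_at_k` and the toy ranking are assumptions for illustration only; the ranking is assumed to be already sorted by increasing dissimilarity to the query, with the query itself excluded.

```python
def precision_recall_at_k(ranked_relevance, k):
    """Precision and recall when the top k proteins are returned.

    ranked_relevance is a list of booleans ordered from least to most
    dissimilar; True means the protein is relevant to the query.
    """
    if not 1 <= k <= len(ranked_relevance):
        raise ValueError("k must be between 1 and the database size")
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("the query has no relevant proteins")
    hits = sum(ranked_relevance[:k])
    return hits / k, hits / total_relevant


# Toy ranking for one query (hypothetical data).
ranking = [True, True, False, True, False, False]
for k in (2, 4, 6):
    print(k, precision_recall_at_k(ranking, k))
```

Sweeping k from 1 to N and plotting the resulting pairs gives one precision-recall curve per query; in this toy ranking, precision falls from 1.0 to 0.5 while recall rises to 1.0, which illustrates the usual tradeoff between the two.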

To test the robustness and consistency of our methods, we apply them to three different protein descriptor vectors, i.e., Daras et al.'s FT02, Kraw00, and FT02&Kraw00 [20, 21] geometric descriptor vectors. We also apply the unsupervised version of our algorithm, ProDis-ContHC, and the CDM algorithm to the same dissimilarity measure and the same descriptor vectors for comparison with ProDis-ContSHC. Figure 7 shows the precision-recall curves of the different algorithms on the different protein descriptor vectors. As mentioned in [20, 21], there is always a tradeoff between the precision and recall values. This is clearly shown in Figure 7(a), 7(b), and 7(c), in which the algorithms reach their peak precision values at the smallest recall values. It can be seen that ProDis-ContSHC performs clearly better than any other method, whereas ProDis-ContHC is the second best. This is consistent with what is observed in the last experiment, in which a similarity measure is used. Therefore, our algorithms consistently improve both the similarity and the dissimilarity measures tested here. Among the three protein descriptor vectors, ProDis-ContSHC performs the best on the combined vector, i.e., FT02&Kraw00. This is because this vector not only employs the context, but also their relevant information to predict the relationship between the query and the database proteins.

Figure 7

Performance of dissimilarity measures on the FSSP/DALI dataset. (a) The precision-recall curves of the original dissimilarity measure, and the improved measures by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively, with the descriptor vector FT02&Kraw00. (b) The precision-recall curves with the descriptor vector FT02. (c) The precision-recall curves with the descriptor vector Kraw00.

Performance on protein classification

The performance of the method is also evaluated in terms of the overall classification accuracy [20, 21]. To be more specific, for each protein x_i in the database, a dissimilarity measure is applied after removing that protein from the database ("leave-one-out" experiment [63]). A class label l_0 is then assigned to the query x_0 according to the label of the nearest database protein. The overall classification accuracy is given by:

$$\text{Overall Classification Accuracy} = \frac{\text{Number of correctly predicted proteins}}{\text{Total number of proteins in the database}} \qquad (20)$$
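The leave-one-out procedure and equation (20) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy descriptor vectors are hypothetical, and the plain squared Euclidean distance stands in for whichever (learned or baseline) dissimilarity measure is being evaluated.

```python
def squared_euclidean(x, y):
    """Baseline dissimilarity between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def leave_one_out_accuracy(vectors, labels, dissimilarity):
    """Overall classification accuracy as in equation (20).

    Each protein is used once as the query; it is excluded from the
    database and receives the label of its nearest remaining protein.
    """
    correct = 0
    for q, query in enumerate(vectors):
        candidates = [j for j in range(len(vectors)) if j != q]
        nearest = min(candidates, key=lambda j: dissimilarity(query, vectors[j]))
        if labels[nearest] == labels[q]:
            correct += 1
    return correct / len(vectors)


# Toy database of five "proteins" in two classes (hypothetical data).
vectors = [(0, 0), (0, 1), (5, 5), (5, 6), (2, 2)]
labels = ["a", "a", "b", "b", "b"]
accuracy = leave_one_out_accuracy(vectors, labels, squared_euclidean)
print(accuracy)
```

In this toy database the last protein lies closer to class "a" than to its own class, so it is the single misclassified query. Passing a different `dissimilarity` function is all that is needed to compare measures under the same protocol.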

We again conduct this experiment with the three descriptor vectors, i.e., FT02, Kraw00, and FT02&Kraw00. The overall classification accuracy is shown in Table 2. It can be seen that ProDis-ContSHC consistently achieves an accuracy above 99% on all three descriptor vectors. Each dissimilarity measure achieves its highest accuracy on FT02&Kraw00. Among the four dissimilarity measures, ProDis-ContSHC has the highest accuracy, whereas ProDis-ContHC is the second best. Therefore, this conclusion has been demonstrated on both similarity and dissimilarity measures, on different datasets, and with different descriptor vectors.

Table 2 Overall classification accuracy using different protein descriptors and the Euclidean distance measure

Conclusions

We have introduced in this paper a novel contextual dissimilarity learning algorithm for protein-protein comparison in protein database retrieval tasks. Its strength resides in the use of the hierarchical context between a pair of proteins and their class label information. By extensive experiments, this novel algorithm has been demonstrated to outperform the traditional context-based methods and their unsupervised version.

We formulate the protein dissimilarity learning problem as a context-based classification problem. Under such a formulation, we try to regularize the protein pairwise dissimilarity in a supervised way rather than the traditional unsupervised way. To the best of our knowledge, this is the first study on supervised contextual dissimilarity learning. We propose a novel algorithm, ProDis-ContSHC, which updates a protein's hierarchical sub-context and the dissimilarity measure coherently. The regularization factors are learned based on the classification of the relevant and the irrelevant protein pairs. The algorithm works in an iterative manner.

Experimental results demonstrate that supervised methods are almost always better than their unsupervised counterparts on all the databases with all the feature vectors. The proposed method, even though mainly presented for protein database retrieval tasks, can be easily extended to other tasks, such as RNA sequence-structure pattern indexing [64], retrieval of high throughput phenotype data [65], and retrieval of genomic annotation from large genomic position datasets [66]. The approach may also be extended to the database retrieval and pattern classification problems in other domains, such as medical image retrieval [67–69], speech recognition, and texture classification [70].

References

  1. Chen SA, Lee TY, Ou YY: Incorporating significant amino acid pairs to identify O-linked glycosylation sites on transmembrane proteins and non-transmembrane proteins. BMC Bioinformatics 2010, 11: 536. 10.1186/1471-2105-11-536

  2. Sobolev B, Filimonov D, Lagunin A, Zakharov A, Koborova O, Kel A, Poroikov V: Functional classification of proteins based on projection of amino acid sequences: application for prediction of protein kinase substrates. BMC Bioinformatics 2010, 11: 313. 10.1186/1471-2105-11-313

  3. Albayrak A, Otu HH, Sezerman UO: Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets. BMC Bioinformatics 2010, 11: 428. 10.1186/1471-2105-11-428

  4. Ezkurdia L, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML: Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform 2009, 10(3):233–246.

  5. Cook T, Sutton R, Buckley K: Automated flexion crease identification using internal image seams. Pattern Recognition 2010, 43(3):630–635. 10.1016/j.patcog.2009.08.012

  6. Ofran Y, Rost B: Protein-protein interaction hotspots carved into sequences. PLoS Comput Biol 2007, 3(7):e119. 10.1371/journal.pcbi.0030119

  7. Yhou ZH, Lei YK, Gui J, Huang DS, Zhou X: Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 2010, 26(21):2744–2751. 10.1093/bioinformatics/btq510

  8. Xia JF, Zhao XM, Song J, Huang DS: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics 2010, 11: 174. 10.1186/1471-2105-11-174

  9. Yhou ZH, Yin Z, Han K, Huang DS, Zhou X: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 2010, 11: 343. 10.1186/1471-2105-11-343

  10. Xia JF, Zhao XM, Huang DS: Predicting protein-protein interactions from protein sequences using meta predictor. Amino Acids 2010, 39(5):1595–1599. 10.1007/s00726-010-0588-1

  11. Shi MG, Xia JF, Li XL, Huang DS: Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids 2010, 38(3):891–899. 10.1007/s00726-009-0295-y

  12. Huang DS, Zhao XM, Huang GB, Cheung YM: Classifying protein sequences using hydropathy blocks. Pattern Recognition 2006, 39(12):2293–2300. 10.1016/j.patcog.2005.11.012

  13. Li JJ, Huang DS, Wang B, Chen P: Identifying protein-protein interfacial residues in heterocomplexes using residue conservation scores. Int J Biol Macromol 2006, 38: 241–247. 10.1016/j.ijbiomac.2006.02.024

  14. Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett 2006, 580(2):380–384. 10.1016/j.febslet.2005.11.081

  15. Wang J, Li Y, Zhang Y, Tang N, Wang C: Class conditional distance metric for 3D protein structure classification. 2011 5th International Conference on Bioinformatics and Biomedical Engineering, (iCBBE). 2011, 1–4.

  16. Chi PH, Scott G, Shyu CR: A fast protein structure retrieval system using image-based distance matrices and multidimensional index. International Journal of Software Engineering and Knowledge Engineering 2005, 15(3):527–545. 10.1142/S0218194005002439

  17. Marsolo K, Parthasarathy S: On the use of structure and sequence-based features for protein classification and retrieval. Knowledge and Information Systems 2008, 14: 59–80. 10.1007/s10115-007-0088-0

  18. Aung Z, Tan K: Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics 2004, 20(7):1045–1052. 10.1093/bioinformatics/bth036

  19. Zhang W, Yoshida T, Tang X: A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst Appl 2011, 38(3):2758–2765. 10.1016/j.eswa.2010.08.066

  20. Daras P, Zarpalas D, Tzovaras D, Strintzis M: 3D shape-based techniques for protein classification. IEEE International Conference on Image Processing, 2005. ICIP 2005. 2005, 1130–1133.

  21. Daras P, Zarpalas D, Axenopoulos A, Tzovaras D, Strintzis MG: Three-dimensional shape-structure comparison method for protein classification. IEEE/ACM Trans Comput Biol Bioinform 2006, 3(3):193–207. 10.1109/TCBB.2006.43

  22. Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, Knight R: Comparison of methods for estimating the nucleotide substitution matrix. BMC Bioinformatics 2008, 9: 511. 10.1186/1471-2105-9-511

  23. Marsolo K, Parthasarathy S: On the use of structure and sequence-based features for protein classification and retrieval. Proceedings of the Sixth International Conference on Data Mining, 2006. ICDM '06. 2006, 394–403. 10.1109/ICDM.2006.119

  24. Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins 2008, 72: 1259–1273. 10.1002/prot.22030

  25. Mittelmann H, Peng J: Estimating bounds for quadratic assignment problems associated with Hamming and Manhattan distance matrices based on semidefinite programming. SIAM J Optim 2010, 20(6):3408–3426. 10.1137/090748834

  26. Zhang L, Bailey J, Konagurthu AS, Ramamohanarao K: A fast indexing approach for protein structure comparison. BMC Bioinformatics 2010, 11(Suppl 1):S46. 10.1186/1471-2105-11-S1-S46

  27. Lee B, Lee D: Protein comparison at the domain architecture level. BMC Bioinformatics 2009, 10(Suppl 15):S5. 10.1186/1471-2105-10-S15-S5

  28. Rahman M, Hassan MR, Buyya R: Jaccard index based availability prediction in enterprise grids. International Conference on Computer Science, ICCS 2010. 2010, 2701–2710.

  29. Garavaglia S: Statistical analysis of the Tanimoto coefficient self-organizing map (TCSOM) applied to health behavioral survey data. International Joint Conference on Neural Networks, 2001. IJCNN '01. 2001, 2483–2488.

  30. Jegou H, Harzallah H, Schmid C: A contextual dissimilarity measure for accurate and efficient image search. IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR '07. 2007, 1–8.

  31. Jegou H, Schmid C, Harzallah H, Verbeek J: Accurate image search using the contextual dissimilarity measure. IEEE Trans Pattern Anal Mach Intell 2010, 32(1):2–11.

  32. Yang X, Bai X, Latecki LJ, Tu Z: Improving shape retrieval by learning graph transduction. 10th European Conference on Computer Vision. ECCV 2008. 2008, 788–801.

  33. Bai X, Yang X, Latecki LJ, Liu W, Tu Z: Learning context-sensitive shape similarity by graph transduction. IEEE Trans Pattern Anal Mach Intell 2010, 32(5):861–874.

  34. Bai X, Wang B, Wang X, Liu W, Tu Z: Co-transduction for shape retrieval. 11th European Conference on Computer Vision. ECCV 2010. 2010, 328–341.

  35. Sinkhorn R: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann Math Statist 1964, 35(2):876–879. 10.1214/aoms/1177703591

  36. Wang J, Li Y, Bai X, Zhang Y, Wang C, Tang N: Learning context-sensitive similarity by shortest path propagation. Pattern Recognition 2011, 44(10–11):2367–2374. 10.1016/j.patcog.2011.02.007

  37. Kuang R, Weston J, Noble W, Leslie C: Motif-based protein ranking by network propagation. Bioinformatics 2005, 21(19):3711–3718. 10.1093/bioinformatics/bti608

  38. Weston J, Kuang R, Leslie C, Noble WS: Protein ranking by semi-supervised network propagation. BMC Bioinformatics 2006, 7(Suppl 1):S10. 10.1186/1471-2105-7-S1-S10

  39. Sahbi H, Audibert JY, Rabarisoa J, Keriven R: Object recognition and retrieval by context dependent similarity kernels. International Workshop on Content-Based Multimedia Indexing, 2008. CBMI 2008. 2008, 216–223.

  40. Sahbi H, Audibert J, Keriven R: Context-dependent kernels for object classification. IEEE Trans Pattern Anal Mach Intell 2011, 33(4):699–708.

  41. Ding J, Zhou S, Guan J: MiRenSVM: towards better prediction of microRNA precursors using an ensemble SVM classifier with multi-loop features. BMC Bioinformatics 2010, 11(Suppl 11):S11. 10.1186/1471-2105-11-S11-S11

  42. González AJ, Liao L: Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines. BMC Bioinformatics 2010, 11: 537. 10.1186/1471-2105-11-537

  43. Wang J, Li Y, Liang P, Zhang G, Ao X: An effective multi-biometrics solution for embedded device. IEEE International Conference on Systems, Man and Cybernetics, 2009. SMC 2009. 2009, 917–922.

  44. Wang J, Li Y, Ao X, Wang C, Zhou J: Multi-modal biometric authentication fusing iris and palmprint based on GMM. IEEE/SP 15th Workshop on Statistical Signal Processing, 2009. SSP '09. 2009, 349–352.

  45. Shih-Wen Ke G, Oakes MP, Palomino MA, Xu Y: Comparison between SVM-Light, a search engine-based approach and the mediamill baselines for assigning concepts to video shot annotations. International Workshop on Content-Based Multimedia Indexing, 2008. CBMI 2008. 2008, 381–387.

  46. Ramana J, Gupta D: LipocalinPred: a SVM-based method for prediction of lipocalins. BMC Bioinformatics 2009, 10: 445. 10.1186/1471-2105-10-445

  47. Ey K, Poetzsche C: Asymptotic behavior of recursions via fixed point theory. Journal of Mathematical Analysis and Applications 2008, 337(2):1125–1141. 10.1016/j.jmaa.2007.04.052

  48. Brenner S, Koehl P, Levitt R: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000, 28(1):254–256. 10.1093/nar/28.1.254

  49. Stivala A, Wirth A, Stuckey PJ: Tableau-based protein substructure search using quadratic programming. BMC Bioinformatics 2009, 10: 153. 10.1186/1471-2105-10-153

  50. FSSP/DALI Database [http://ekhidna.biocenter.helsinki.fi/dali/start]

  51. Holm L, Sander C: The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res 1996, 24(1):206–209. 10.1093/nar/24.1.206

  52. Holm L, Sander C: The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994, 22: 3600–3609.

  53. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.

  54. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M: pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011, 12: 77. 10.1186/1471-2105-12-77

  55. Tsai RT, Lai PT: Dynamic programming re-ranking for PPI interactor and pair extraction in full-text articles. BMC Bioinformatics 2011, 12: 60. 10.1186/1471-2105-12-60

  56. Konagurthu AS, Stuckey PJ, Lesk AM: Structural search and retrieval using a tableau representation of protein folding patterns. Bioinformatics 2008, 24(5):645–651. 10.1093/bioinformatics/btm641

  57. Carpentier M, Brouillet S, Pothier J: YAKUSA: a fast structural database scanning method. Proteins 2005, 61(1):137–151. 10.1002/prot.20517

  58. Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Eng 2000, 13(8):535–543. 10.1093/protein/13.8.535

  59. Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins 1995, 23(3):356–369. 10.1002/prot.340230309

  60. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385. 10.1016/S0959-440X(96)80058-3

  61. Gilbert D, Westhead D, Nagano N, Thornton J: Motif-based searching in TOPS protein topology databases. Bioinformatics 1999, 15(4):317–326. 10.1093/bioinformatics/15.4.317

  62. Torrance G, Gilbert D, Michalopoulos I, Westhead D: Protein structure topological comparison, discovery and matching service. Bioinformatics 2005, 21(10):2537–2538. 10.1093/bioinformatics/bti331

  63. Zhang W, Sun F, Jiang R: Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinformatics 2011, 12(Suppl 1):S11. 10.1186/1471-2105-12-S1-S11

  64. Meyer F, Kurtz S, Backofen R, Will S, Beckstette M: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinformatics 2011, 12: 214. 10.1186/1471-2105-12-214

  65. Chang WE, Sarver K, Higgs BW, Read TD, Nolan NM, Chapman CE, Bishop-Lilly KA, Sozhamannan S: PheMaDB: a solution for storage, retrieval, and analysis of high throughput phenotype data. BMC Bioinformatics 2011, 12: 109. 10.1186/1471-2105-12-109

  66. Krebs A, Frontini M, Tora L: GPAT: retrieval of genomic annotation from large genomic position datasets. BMC Bioinformatics 2008, 9: 533. 10.1186/1471-2105-9-533

  67. Wang J, Li Y, Zhang Y, Wang C, Xie H, Chen G, Gao X: Bag-of-features based medical image retrieval via multiple assignment and visual words weighting. IEEE Trans Med Imaging 2011, 30(11):1996–2011.

  68. Wang J, Li Y, Zhang Y, Xie H, Wang C: Boosted learning of visual word weighting factors for bag-of-features based medical image retrieval. 2011 Sixth International Conference on Image and Graphics (ICIG). 2011, 1035–1040.

  69. Wang J, Li Y, Zhang Y, Xie H, Wang C: Bag-of-features based classification of breast parenchymal tissue in the mammogram via jointly selecting and weighting visual words. 2011 Sixth International Conference on Image and Graphics (ICIG). 2011, 622–627.

  70. Liu Z, Wang J, Li Y, Zhang Y, Wang C: Quantized image patches co-occurrence matrix: a new statistical approach for texture classification using image patch exemplars. Proceedings of SPIE 8009. 2011, 80092P.

Acknowledgements

The study was supported by grants from Shanghai Key Laboratory of Intelligent Information Processing, China (Grant No. IIPL-2011-003), Key Laboratory of High Performance Computing and Stochastic Information Processing, Ministry of Education of China (Grant No. HS201107), National Grand Fundamental Research (973) Program of China (Grant No. 2010CB834303 and 2011CB911102), National Natural Science Foundation of China (Grant No. 60973154), Hubei Provincial Science Foundation, China (Grant No. 2010CDA006 and 2010CD06601), and a start-up grant from King Abdullah University of Science and Technology.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 7, 2012: Advanced intelligent computing theories and their applications in bioinformatics. Proceedings of the 2011 International Conference on Intelligent Computing (ICIC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S7.

Author information

Corresponding author

Correspondence to Xin Gao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JW: designed the algorithm, carried out the experiments, analyzed the results, and wrote the manuscript. XG: designed the algorithm and the experiments, improved the manuscript. QW: carried out the experiments, analyzed the results, improved the manuscript. YL: improved the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Wang, J., Gao, X., Wang, Q. et al. ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics 13 (Suppl 7), S2 (2012). https://doi.org/10.1186/1471-2105-13-S7-S2

  • DOI: https://doi.org/10.1186/1471-2105-13-S7-S2

Keywords