Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia

Qatar Computing Research Institute, Doha 5825, Qatar

Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia

Abstract

Background

Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the protein domain database. Recently, graph regularized ranking that exploits the global structure of the graph defined by the pairwise similarities has been proposed. However, the existing graph regularized ranking methods are very sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most of the protein domain ranking methods.

Results

To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold of the protein domain distribution by combining multiple initial graphs for the regularization. Graph weights and ranking scores are learned jointly and automatically by alternately minimizing an objective function in an iterative algorithm. Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiG-Rank achieves better ranking performance than single graph regularized ranking methods and pairwise similarity based ranking methods.

Conclusion

The problem of graph model and parameter selection in graph regularized protein domain ranking can be solved effectively by combining multiple graphs. This generalization opens a new direction for applying multiple graph learning to protein domain ranking applications.

Background

Proteins contain one or more domains, each of which could have evolved independently from the rest of the protein structure and could have unique functions.

The output of a ranking procedure is usually a list of database protein domains that are ranked in descending order according to a measure of their similarity to the query domain. The choice of a similarity measure largely defines the performance of a ranking system, as argued previously.

**Pairwise protein domain comparison algorithms** compute the similarity between a pair of protein domains either by protein domain structure alignment or by comparing protein domain features.

**Graph-based similarity learning algorithms** address a limitation of the traditional protein domain comparison methods mentioned above, which focus on pairwise comparisons while neglecting all the other protein domains in the database and their distribution. To tackle this problem, a graph-based transductive similarity learning algorithm has been proposed.

The main component of graph-based ranking is the construction of a graph as the estimation of the intrinsic manifold of the database. As argued by Cai et al., there is no reliable way to choose, in advance, the graph model and parameters that best approximate this manifold.

In this paper, we propose a novel protein domain ranking method, the **Multi**ple **G**raph regularized **Rank**ing method, **MultiG-Rank**. It is composed of an off-line graph weights learning algorithm and an on-line ranking algorithm.

Methods

Graph model and parameter selection

Given a data set of protein domains X = {x_{1}, ⋯, x_{N}} represented by their tableau 32-D feature vectors, where x_{q} is the query protein domain and the others are database protein domains, we define the ranking score vector as **f** = [f_{1}, ⋯, f_{N}]^{⊤}, where f_{i} is the ranking score of x_{i} with respect to the query domain. The problem is to rank the protein domains in X according to their relevance to the query, where l_{i} denotes the label of x_{i} and l_{q} is the query label. The optimal ranking scores of the relevant protein domains {x_{i}}, l_{i} = l_{q}, should be larger than those of the irrelevant ones {x_{i}}, l_{i} ≠ l_{q}, so that the relevant protein domains are returned to the user.

Graph regularized protein domain ranking

We applied two constraints on the optimal ranking score vector **f** to learn the optimal ranking scores:

**Relevance constraint** Because the query protein domain reflects the search intention of the user, we define a relevance vector **y** = [y_{1}, ⋯, y_{N}]^{⊤}, where y_{i} = 1 if x_{i} is relevant to the query and y_{i} = 0 if it is not. Because the type label l_{q} of a query protein domain x_{q} is usually unknown, we know only that the query is relevant to itself and have no prior knowledge of whether or not the others are relevant; therefore, we can only set y_{q} = 1, while the other y_{i}, i ≠ q, remain unknown.

To assign different weights to different protein domains in X, we define a diagonal matrix U, in which U_{ii} = 1 when y_{i} is known, and U_{ii} = 0 otherwise. To impose the relevance constraint on the learning of **f**, we minimize the term

(**f** − **y**)^{⊤}U(**f** − **y**)

**Graph constraint** For each protein domain x_{i}, its nearest neighbors N_{i} are found and assigned edge weights W_{ij}, which can be computed using different graph definitions and parameters as described in the next section. The edge weights are further organized in a weight matrix W = [W_{ij}] ∈ ℝ^{N×N}, where W_{ij} is the weight of edge (i, j). The graph constraint assumes that if two protein domains x_{i} and x_{j} are close (i.e., W_{ij} is big), then their ranking scores f_{i} and f_{j} should also be close. To impose the graph constraint on the learning of **f**, we minimize the term

(1/2) Σ_{i,j} W_{ij}(f_{i} − f_{j})^{2} = **f**^{⊤}L**f**

where L = D − W is the graph Laplacian matrix and D is a diagonal matrix with D_{ii} = Σ_{j} W_{ij}.

When the two constraints are combined, the learning of **f** is based on the minimization of the following objective function:

min_{**f**} **f**^{⊤}L**f** + α(**f** − **y**)^{⊤}U(**f** − **y**)

where α is a trade-off parameter. Setting the derivative of the objective function with respect to **f** to zero gives the closed-form solution **f** = (L + αU)^{−1}αU**y**. In this way, information from both the query protein domain provided by the user and the relationships among all the protein domains in the database is used to rank the protein domains. We call this ranking method **G**raph regularized **Rank**ing (G-Rank).
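As a concrete illustration, the G-Rank closed form **f** = (L + αU)^{−1}αU**y** can be sketched in a few lines of NumPy. This is a minimal sketch rather than the paper's implementation: the Gaussian-kernel kNN graph, the toy feature matrix, and all parameter values are illustrative assumptions, and for numerical stability it uses the common simplification of treating every entry of **y** as known (U = I, with the unknown relevances set to 0) instead of weighting only the known entries.

```python
import numpy as np

def gaussian_knn_graph(X, k=3, sigma=1.0):
    """Symmetric k-nearest-neighbor graph with Gaussian kernel edge weights."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared Euclidean distances
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]          # k nearest neighbors of x_i (excluding x_i)
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                      # symmetrize

def g_rank(X, q, alpha=1.0, k=3, sigma=1.0):
    """Graph regularized ranking f = (L + alpha*U)^-1 alpha*U*y, with U = I here."""
    n = X.shape[0]
    W = gaussian_knn_graph(X, k, sigma)
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W
    y = np.zeros(n)
    y[q] = 1.0                                     # only the query is known to be relevant
    U = np.eye(n)                                  # simplification: weight every entry of y
    return np.linalg.solve(L + alpha * U, alpha * U @ y)

# toy usage: two clusters of domains in a 2-D feature space, query is domain 0
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
f = g_rank(X, q=0)
ranking = np.argsort(-f)                           # database domains in descending score order
```

Domains in the query's own cluster receive visibly higher scores than the distant cluster, which is exactly the behavior the graph constraint is designed to produce.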

Multiple graph learning and ranking: MultiG-Rank

Here we describe the multiple graph learning method used to directly learn a self-adaptive graph for ranking regularization. The graph is assumed to be a linear combination of multiple predefined graphs (referred to as base graphs). The graph weights are learned in a supervised way by considering the SCOP fold types of the protein domains in the database.

Multiple graph regularization

The main component of graph regularization is the construction of a graph. As described previously, there are many ways to find the neighbors N_{i} of a protein domain x_{i} and to define the weight matrix W. In this study, we use the following candidate graphs:

• **Gaussian kernel weighted graph:** the neighborhood N_{i} is found by comparing the squared Euclidean distance,

d(x_{i}, x_{j}) = ||x_{i} − x_{j}||^{2}

and the weighting is computed using a Gaussian kernel as

W_{ij} = exp(−||x_{i} − x_{j}||^{2} / 2σ^{2}) if x_{j} ∈ N_{i}, and 0 otherwise,

where σ is the bandwidth parameter of the kernel.

• **Dot-product weighted graph:** the neighborhood N_{i} is found by comparing the squared Euclidean distance, and the weighting is computed as the dot-product,

W_{ij} = x_{i}^{⊤}x_{j} if x_{j} ∈ N_{i}, and 0 otherwise.

• **Cosine similarity weighted graph:** the neighborhood N_{i} is found by comparing the cosine similarity,

s(x_{i}, x_{j}) = x_{i}^{⊤}x_{j} / (||x_{i}|| ||x_{j}||)

and the weighting is also assigned as the cosine similarity,

W_{ij} = s(x_{i}, x_{j}) if x_{j} ∈ N_{i}, and 0 otherwise.

• **Jaccard index weighted graph:** the neighborhood N_{i} is found by comparing the Jaccard index J(x_{i}, x_{j}), and the weighting is assigned as

W_{ij} = J(x_{i}, x_{j}) if x_{j} ∈ N_{i}, and 0 otherwise.

• **Tanimoto coefficient weighted graph:** the neighborhood N_{i} is found by comparing the Tanimoto coefficient,

T(x_{i}, x_{j}) = x_{i}^{⊤}x_{j} / (||x_{i}||^{2} + ||x_{j}||^{2} − x_{i}^{⊤}x_{j})

and the weighting is assigned as

W_{ij} = T(x_{i}, x_{j}) if x_{j} ∈ N_{i}, and 0 otherwise.
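The candidate edge-weight definitions above can be made concrete with a short NumPy sketch. This is a sketch only: it computes dense similarity matrices over nonnegative feature vectors, omits the kNN sparsification and the parameter grids, and the generalized min/max form of the Jaccard index for real-valued vectors is our own assumption.

```python
import numpy as np

def gaussian_w(X, sigma=1.0):
    """Gaussian kernel weights from squared Euclidean distances."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def dot_w(X):
    """Dot-product weights."""
    return X @ X.T

def cosine_w(X):
    """Cosine similarity weights."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def jaccard_w(X):
    """Generalized Jaccard index for nonnegative vectors (our assumption)."""
    num = np.minimum(X[:, None, :], X[None, :, :]).sum(-1)
    den = np.maximum(X[:, None, :], X[None, :, :]).sum(-1)
    return num / den

def tanimoto_w(X):
    """Tanimoto coefficient for real-valued vectors."""
    dot = X @ X.T
    sq = (X ** 2).sum(axis=1)
    return dot / (sq[:, None] + sq[None, :] - dot)

def laplacian(W):
    """Graph Laplacian L = D - W."""
    return np.diag(W.sum(axis=1)) - W
```

Each weight matrix, restricted to the nearest neighbors and symmetrized, yields one candidate Laplacian L_{m} for the graph pool.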

With so many possible choices of graphs, the most suitable graph with its parameters for the protein domain ranking task is often not known in advance; thus, an exhaustive search on a predefined pool of graphs is necessary. When the size of the pool becomes large, an exhaustive search will be quite time-consuming and sometimes not possible. Hence, a method for efficiently learning an appropriate graph, one that makes the performance of the employed graph-based ranking method robust or even improved, is crucial for graph regularized ranking. To tackle this problem we propose a multiple graph regularized ranking framework that provides a series of initial guesses of the graph Laplacian and combines them to approximate the intrinsic manifold in a conditionally optimal way, inspired by a previously reported method.

Given a set of M graph Laplacians L_{m}, m = 1, ⋯, M, in the candidate pool, we approximate the intrinsic manifold by their linear combination:

L = Σ_{m=1}^{M} μ_{m}L_{m}, s.t. μ_{m} ≥ 0, Σ_{m=1}^{M} μ_{m} = 1

where μ_{m} is the weight of the m-th graph.

To use the information from the data distribution approximated by the new composite graph Laplacian L, the graph regularization term becomes

**f**^{⊤}(Σ_{m=1}^{M} μ_{m}L_{m})**f**

where μ = [μ_{1}, ⋯, μ_{M}]^{⊤} is the graph weight vector.

Off-line supervised multiple graph learning

In the on-line querying procedure, the relevance of query x_{q} to the database protein domains is unknown, so the optimal graph weights cannot be learned from the relevance vector **y** at query time. In the off-line procedure, in contrast, we treat each database protein domain x_{q} in turn as a query and regard its relevance vector **y**_{q} = [y_{1q}, ⋯, y_{Nq}]^{⊤} as known, because the SCOP-fold labels of all the database protein domains are known:

y_{iq} = 1 if l_{i} = l_{q}, and y_{iq} = 0 otherwise.

Therefore, we set U = I ∈ ℝ^{N×N} as the identity matrix, and denote by **f**_{q} = [f_{1q}, ⋯, f_{Nq}]^{⊤} the ranking score vector learned for the q-th database protein domain. Substituting **f**_{q}, **y**_{q} and U into the graph regularized ranking objective with the composite Laplacian gives the per-query objective (16):

min_{**f**_{q}, μ} ||**f**_{q} − **y**_{q}||^{2} + α**f**_{q}^{⊤}(Σ_{m=1}^{M} μ_{m}L_{m})**f**_{q} + β||μ||^{2}, s.t. μ_{m} ≥ 0, Σ_{m=1}^{M} μ_{m} = 1

To prevent the graph weight vector μ from collapsing onto a single graph, we add the l_{2} norm regularization term ||μ||^{2} to the objective function. The difference between **f**_{q} and **y**_{q} should be noted: **y**_{q} ∈ {1, 0}^{N} plays the role of the given ground truth in the supervised learning procedure, while **f**_{q} is the ranking score vector to be learned; although **y**_{q} is the ideal solution for **f**_{q}, it is not always achieved after the learning. Thus, we introduce the first term in (16) to make **f**_{q} as similar to **y**_{q} as possible during the learning procedure.

Objective function

Using all the protein domains in the database as queries, the complete objective function (17) is obtained by summing (16) over q:

min_{F, μ} Σ_{q=1}^{N} ||**f**_{q} − **y**_{q}||^{2} + α Σ_{q=1}^{N} **f**_{q}^{⊤}(Σ_{m=1}^{M} μ_{m}L_{m})**f**_{q} + β||μ||^{2}, s.t. μ_{m} ≥ 0, Σ_{m=1}^{M} μ_{m} = 1

where F = [**f**_{1}, ⋯, **f**_{N}] is the ranking score matrix with the q-th column being the ranking score vector of the q-th query, and Y = [**y**_{1}, ⋯, **y**_{N}] is the relevance matrix with the q-th column being the relevance vector of the q-th query.

Optimization

Because direct optimization of (17) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize the ranking score matrix F and the graph weight vector μ:

• **On optimizing F**: by fixing μ, the objective is minimized with respect to F, which gives the closed-form update (18): F = (I + α Σ_{m=1}^{M} μ_{m}L_{m})^{−1}Y.

• **On optimizing μ**: by fixing F, the problem reduces to (19): min_{μ} α**e**^{⊤}μ + β||μ||^{2}, s.t. μ_{m} ≥ 0, Σ_{m=1}^{M} μ_{m} = 1,

where e_{m} = Tr(F^{⊤}L_{m}F) and **e** = [e_{1}, ⋯, e_{M}]^{⊤}. The optimization of (19) with respect to the graph weight vector μ is a constrained quadratic programming problem that can be solved with a standard QP solver.

Off-line algorithm

The off-line graph weights learning procedure is summarized as Algorithm 1.

Algorithm 1.

**MultiG-Rank: off-line graph weights learning algorithm.**

**Require:** Candidate graph Laplacian set {L_{m}}, m = 1, ⋯, M;

**Require:** SCOP type label set of the database protein domains;

**Require:** Maximum iteration number T;

Construct the relevance matrix Y = [y_{iq}] ∈ ℝ^{N×N}, where y_{iq} = 1 if l_{i} = l_{q}, 0 otherwise; initialize the graph weights as μ_{m}^{0} = 1/M, m = 1, ⋯, M;

**for** t = 1, ⋯, T **do**

Update the ranking score matrix F^{t} according to (18) with the previous graph weights μ^{t−1};

Update the graph weights μ^{t} according to (19) with the updated F^{t};

**end for**

Output the graph weight vector μ = μ^{T}.
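Algorithm 1 can be sketched in NumPy as follows. Because the μ-step (19) is a small quadratic program over the probability simplex, it is solved here by Euclidean projection of −α**e**/(2β) onto the simplex, which is algebraically equivalent; the α and β values and the toy graphs in the usage example are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {mu : mu >= 0, sum(mu) = 1}."""
    u = np.sort(v)[::-1]                                # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]  # largest feasible support size
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def multig_rank_offline(laplacians, Y, alpha=1.0, beta=1.0, T=20):
    """Alternating optimization: F-step (eq. 18 analogue) and mu-step
    (eq. 19 analogue, solved by projection onto the simplex)."""
    M, N = len(laplacians), Y.shape[0]
    mu = np.full(M, 1.0 / M)                            # uniform initialization
    F = Y.astype(float)
    for _ in range(T):
        L = sum(m * Lm for m, Lm in zip(mu, laplacians))
        F = np.linalg.solve(np.eye(N) + alpha * L, Y)   # fix mu, solve for F
        e = np.array([np.trace(F.T @ Lm @ F) for Lm in laplacians])
        mu = project_simplex(-alpha * e / (2.0 * beta)) # fix F, solve QP for mu
    return mu, F

# toy usage: 4 domains in 2 folds; graph 0 links within folds, graph 1 across folds
Y = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
W_good = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
W_bad = 1.0 - W_good - np.eye(4)                        # edges only across folds
lap = lambda W: np.diag(W.sum(axis=1)) - W
mu, F = multig_rank_offline([lap(W_good), lap(W_bad)], Y)
```

On this toy input, the label-consistent graph receives the larger weight, which is the supervised behavior the off-line procedure is designed to achieve.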

On-line ranking regularized by multiple graphs

Given a newly discovered protein domain submitted by a user as query x_{0}, its SCOP type label l_{0} will be unknown and the domain will not be in the database. To rank the database protein domains against x_{0}, we extend the size of the database to N + 1 by adding x_{0} into the database and then solve the ranking score vector **f** = [f_{0}, f_{1}, ⋯, f_{N}]^{⊤} for the extended database. The terms needed in the solution are computed as follows:

• **Laplacian matrix L**: We first compute the M weight matrices and their graph Laplacians L_{m} for the extended database, and then combine them as L = Σ_{m=1}^{M} μ_{m}L_{m} using the graph weights μ learned off-line.

• **Relevance vector y**: The relevance vector for the extended database is **y** = [y_{0}, y_{1}, ⋯, y_{N}]^{⊤}, in which only y_{0} = 1 is known.

• **Matrix U**: In this situation, only y_{0} is known; thus, U is a diagonal matrix with U_{00} = 1 and all the other diagonal entries equal to 0.

Then the ranking score vector is obtained as in (20):

**f** = (Σ_{m=1}^{M} μ_{m}L_{m} + αU)^{−1}αU**y**

The on-line ranking algorithm is summarized as Algorithm 2

Algorithm 2.

**MultiG-Rank: on-line ranking algorithm.**

**Require:** Protein domain database X = {x_{1}, ⋯, x_{N}};

**Require:** Query protein domain x_{0};

**Require:** Graph weight vector μ learned off-line;

Extend the database to (N + 1) protein domains by adding x_{0}, and compute the composite graph Laplacian L; construct the relevance vector **y** with y_{0} = 1 and the diagonal matrix U with
U_{ii} = 1 if i = 0, 0 otherwise; solve the ranking score vector **f** for x_{0} as in (20); rank the protein domains in the database according to
**f** in descending order.
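Algorithm 2 can be sketched as follows, assuming the per-graph Laplacians for the extended database (with the query at index 0) are precomputed. For numerical stability this sketch uses the common simplification U = I (unknown relevances treated as known zeros) rather than the single-entry U of (20); with an unnormalized Laplacian of a connected graph, the single-entry U makes the system degenerate.

```python
import numpy as np

def multig_rank_online(laplacians, mu, alpha=1.0):
    """On-line MultiG-Rank over the extended (N+1)-domain database, query at index 0.
    Sketch of eq. (20), simplified with U = I for stability."""
    n = laplacians[0].shape[0]
    L = sum(m * Lm for m, Lm in zip(mu, laplacians))   # composite Laplacian
    y = np.zeros(n)
    y[0] = 1.0                                         # only the query's relevance is known
    U = np.eye(n)                                      # simplification (exact (20): U_00 = 1 only)
    f = np.linalg.solve(L + alpha * U, alpha * U @ y)
    order = np.argsort(-f[1:]) + 1                     # database domains, descending score
    return order, f

# toy usage: one base graph with weight 1, a chain 0-1-2-3 with the query at node 0
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L_chain = np.diag(W.sum(axis=1)) - W
order, f = multig_rank_online([L_chain], np.array([1.0]))
```

On the chain graph, the scores decay with graph distance from the query, so the returned order is simply the distance order from node 0.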

Protein domain database and query set

We used the SCOP 1.75A database.

Protein domain database

Our protein domain database was selected from the ASTRAL SCOP 1.75A 40% database.

Distribution of protein domains with different fold types in the ASTRAL SCOP 1.75A 40% database


Query set

We also randomly selected 540 protein domains from the SCOP 1.75A database to construct a query set. For each query protein domain that we selected, we ensured that there was at least one protein domain belonging to the same SCOP fold type in the ASTRAL SCOP 1.75A 40% database, so that for each query there was at least one "positive" sample in the protein domain database. However, it should be noted that the 540 protein domains in the query set were randomly selected and do not necessarily represent 540 different folds. Here we refer to these 540 protein domains as our query set.

Evaluation metrics

A ranking procedure is run against the protein domain database using a query domain. A list of all matching protein domains along with their ranking scores is returned. We adopted the same evaluation metric framework as was described previously. For a query protein domain x_{q} belonging to the SCOP fold F_{q}, a list of protein domains is returned from the database by the on-line MultiG-Rank algorithm or by another ranking method. For a database protein domain x_{r} in the returned list, if its fold label F_{r} is the same as that of x_{q}, i.e., F_{r} = F_{q}, it is identified as a true positive (TP); else it is identified as a false positive (FP). For a database protein domain not in the returned list, it is identified as a false negative (FN) if its fold label matches that of the query, and as a true negative (TN) otherwise.

By varying the length of the returned list, different values of the true positive rate, TPR = TP/(TP + FN), the false positive rate, FPR = FP/(FP + TN), the recall, TP/(TP + FN), and the precision, TP/(TP + FP), can be obtained.

ROC curve

Using the FPR as the abscissa and the TPR as the ordinate, the ROC curve can be plotted. For a high-performance ranking system, this curve should be close to the top-left corner of the plot.

Recall-precision curve

Using recall as the abscissa and precision as the ordinate, the recall-precision curve can be plotted. For a high-performance ranking system, this curve should be close to the top-right corner of the plot.

AUC

The AUC is computed as a single-figure measurement of the quality of an ROC curve. AUC is averaged over all the queries to evaluate the performances of different ranking methods.
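For concreteness, these metrics can be computed from a returned list as follows (a sketch: `ranked_labels` marks each returned domain with 1 if it shares the query's SCOP fold, and the AUC is computed via the equivalent pairwise-comparison formula rather than by integrating the ROC curve):

```python
import numpy as np

def curves(ranked_labels):
    """TPR/FPR (ROC) and precision values at every cut-off length.

    Recall equals the TPR, so the recall-precision curve is (tpr, precision)."""
    labels = np.asarray(ranked_labels, dtype=float)
    P = labels.sum()                       # relevant domains in the database
    Nneg = len(labels) - P                 # irrelevant domains in the database
    tp = np.cumsum(labels)                 # TP at each cut-off
    fp = np.cumsum(1.0 - labels)           # FP at each cut-off
    tpr = tp / P                           # TP / (TP + FN)
    fpr = fp / Nneg                        # FP / (FP + TN)
    precision = tp / np.arange(1, len(labels) + 1)
    return fpr, tpr, precision

def auc(ranked_labels):
    """Area under the ROC curve: the fraction of (relevant, irrelevant)
    pairs in which the relevant domain is ranked higher."""
    labels = np.asarray(ranked_labels)
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    wins = sum(p < n for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking (all relevant domains first) gives an AUC of 1, and a fully inverted ranking gives 0.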

Results and discussion

We first compared our MultiG-Rank against several popular graph-based ranking score learning methods for ranking protein domains. We then evaluated the ranking performance of MultiG-Rank against other protein domain ranking methods that use different protein domain comparison strategies. Finally, a case study of the TIM barrel fold is described.

Comparison of MultiG-Rank against other graph-based ranking methods

We compared our MultiG-Rank to two graph-based ranking methods, G-Rank and GT.

Comparison of MultiG-Rank against other protein domain ranking methods

**Comparison of MultiG-Rank against other protein domain ranking methods.** Each curve represents a graph-based ranking score learning algorithm. MultiG-Rank, the Multiple Graph regularized Ranking algorithm; G-Rank, Graph regularized Ranking; GT, graph transduction; Pairwise Rank, pairwise protein domain ranking method
**(a)** ROC curves of the different ranking methods; **(b)** Recall-precision curves of the different ranking methods.

The figure shows the ROC and the recall-precision curves obtained using the different graph-based ranking methods. As can be seen, the MultiG-Rank algorithm significantly outperformed the other graph-based ranking algorithms; the precision difference became larger as the recall value increased, and the curves then tended to converge as the precision approached zero (Figure

The AUC results for the different ranking methods on the query set are listed in the following table:

| **Method** | **AUC** |
| --- | --- |
| MultiG-Rank | 0.9730 |
| G-Rank | 0.9575 |
| GT | 0.9520 |
| Pairwise-Rank | 0.9478 |

We have made three observations from the results listed in Table

1. G-Rank and GT produced similar performances on our protein domain database, indicating that there is no significant difference in performance between the two single graph-based ranking methods.

2. Pairwise ranking produced the worst performance even though the method uses a carefully selected similarity function, as reported previously.

3. MultiG-Rank produced the best ranking performance, implying that both the discriminant and geometrical information in the protein domain database are important for accurate ranking. In MultiG-Rank, the geometrical information is estimated by multiple graphs and the discriminant information is included by using the SCOP-fold type labels to learn the graph weights.

Comparison of MultiG-Rank with other protein domain ranking methods

We compared MultiG-Rank against several other popular protein domain ranking methods: IR Tableau, YAKUSA, SHEBA, and QP tableau.

Comparison of the performances of protein domain ranking algorithms

**Comparison of the performances of protein domain ranking algorithms.****(a)** ROC curves for different field-specific protein domain ranking algorithms. TPR, true positive rate; FPR, false positive rate. **(b)** Recall-precision curves for different field-specific protein domain ranking algorithms.

| **Method** | **AUC** |
| --- | --- |
| MultiG-Rank | 0.9730 |
| IR Tableau | 0.9478 |
| YAKUSA | 0.9537 |
| SHEBA | 0.9421 |
| QP tableau | 0.9364 |

The results in the table above show that MultiG-Rank achieved the highest AUC of all the compared protein domain ranking methods.

To evaluate the effect of using protein domain descriptors for ranking instead of direct protein domain structure comparisons, we compared IR Tableau with YAKUSA and SHEBA. The main difference between them is that IR Tableau considers both protein domain feature extraction and comparison procedures, while YAKUSA and SHEBA compare only pairs of protein domains directly. The quantitative results in the table above show that IR Tableau outperforms SHEBA but not YAKUSA, so using a protein domain descriptor does not, by itself, guarantee better ranking than direct structure comparison.

This result strongly suggests that ranking performance improvements are achieved mainly by graph regularization and not by using the power of a protein domain descriptor.

Plots of TPR versus FPR obtained using MultiG-Rank and various field-specific protein domain ranking methods as the ranking algorithms are shown in Figure

Case Study of the TIM barrel fold

Besides considering the results obtained for the whole database, we also studied an important protein fold, the TIM beta/alpha-barrel fold (c.1). The TIM barrel is a conserved protein fold consisting of eight α-helices and eight parallel β-strands that alternate along the peptide backbone. We evaluated the ranking results for a TIM barrel query at three levels of the SCOP hierarchy:

Fold level

When the returned database protein domain is from the same fold type as the query protein domain.

Superfamily level

When the returned database protein domain is from the same superfamily as the query protein domain.

Family level

When the returned database protein domain is from the same family as the query protein domain.

The ROC and the recall-precision plots of the protein domain ranking results of MultiG-Rank for the query TIM beta/alpha-barrel domain at the three levels are given in Figure

Ranking results for the case study using the TIM beta/alpha-barrel domain as the query

**Ranking results for the case study using the TIM beta/alpha-barrel domain as the query.****(a)** ROC curves of the ranking results for the TIM beta/alpha-barrel domain at the fold, superfamily, and family levels. TPR, true positive rate; FPR, false positive rate. **(b)** Recall-precision curves of the ranking results for the TIM beta/alpha-barrel domain at the fold, superfamily, and family levels.

Conclusion

The proposed MultiG-Rank method introduces a new paradigm that broadens the scope of existing graph-based ranking techniques. The main advantage of MultiG-Rank lies in its ability to learn a unified space of ranking scores for a protein domain database from multiple graphs. Such flexibility is important for tackling complicated protein domain ranking problems because it allows more prior knowledge to be exploited when analyzing a given protein domain database, including the possibility of choosing a proper set of graphs to better characterize diverse databases, and the ability of a multiple graph-based ranking method to appropriately model the relationships among the protein domains. Here, MultiG-Rank has been evaluated comprehensively on a carefully selected subset of the ASTRAL SCOP 1.75A protein domain database. The promising experimental results further confirm the usefulness of our ranking score learning approach.

Competing interests

The authors declare no competing interests.

Authors’ contributions

JW invented the algorithm, performed the experiments and drafted the manuscript. HB drafted the manuscript. XG supervised the study and made critical changes to the manuscript. All the authors have approved the final manuscript.

Acknowledgements

The study was supported by grants from National Key Laboratory for Novel Software Technology, China (Grant No. KFKT2012B17), 2011 Qatar Annual Research Forum Award (Grant No. ARF2011), and King Abdullah University of Science and Technology (KAUST), Saudi Arabia. We appreciate the valuable comments from Prof. Yuexiang Shi, Xiangtan University, China.