This article is part of the supplement: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012)

Open Access Proceedings

Protein localization prediction using random walks on graphs

Xiaohua Xu*, Lin Lu, Ping He and Ling Chen

Author Affiliations

Department of Computer Science, Yangzhou University, Yangzhou 225009, China

BMC Bioinformatics 2013, 14(Suppl 8):S4  doi:10.1186/1471-2105-14-S8-S4


The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/14/S8/S4


Published: 9 May 2013

© 2013 Xu et al.; licensee BioMed Central Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Understanding the localization of proteins in cells is vital to characterizing their functions and possible interactions. As a result, identifying the (sub)cellular compartment within which a protein is located becomes an important problem in protein classification. This classification issue thus involves predicting labels in a dataset with a limited number of labeled data points available. By utilizing a graph representation of protein data, random walk techniques have performed well in sequence classification and functional prediction; however, this method has not yet been applied to protein localization. Accordingly, we propose a novel classifier in the site prediction of proteins based on random walks on a graph.

Results

We propose a graph theory model for predicting protein localization using data generated in yeast and gram-negative (Gneg) bacteria. We tested the performance of our classifier on the two datasets, optimizing the model training parameters by varying the laziness values and the number of steps taken during the random walk. Using 10-fold cross-validation, we achieved an accuracy of above 61% for yeast data and about 93% for gram-negative bacteria.

Conclusions

This study presents a new classifier derived from the random walk technique and applies this classifier to investigate the cellular localization of proteins. The prediction accuracy and additional validation demonstrate an improvement over previous methods, such as support vector machine (SVM)-based classifiers.

Background

Protein localization is a general term that refers to the study of where proteins are located within the cell. In many cases, proteins cannot perform their designated function until they are transported to the proper location at the appropriate time. Improper localization of proteins can exert a significant impact on cellular processes or on the entire organism. Therefore, a central issue for biologists is to predict the (sub)cellular localization of proteins [1-3], which has implications for the functions and interactions [4,5] of proteins.

With the development of new approaches in computer science, coupled with an improved dataset of proteins with known localization, computational tools can now provide fast and accurate localization predictions for many organisms as an alternative to laboratory-based methods, and many studies have begun to address this issue. Paul Horton and Kenta Nakai first proposed a probabilistic classification system to predict the cellular localization of 336 E. coli proteins and 1484 yeast proteins [6]. Soon afterwards, they compared this specifically designed probabilistic model with three other classifiers on the same datasets [7]: the k-nearest-neighbor (kNN) classifier, the binary decision tree classifier, and the naive Bayes classifier. The accuracy obtained with stratified cross-validation showed that the kNN classifier performed better than the other methods, with an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes.

Feng [8] presented an overview of the prediction of protein subcellular localization, and in 2004, Donnes and Hoglund [9] reviewed past and current work on this type of prediction and offered a guideline for future studies. Chou and Shen [10] summarized the more recent advances in the prediction of protein subcellular localization up to 2007. A variety of artificial intelligence technologies [11-15] have now been developed, including neural networks, the covariant discriminant algorithm, hidden Markov models (HMMs), decision trees and support vector machines (SVMs). Among these methods, SVMs are widely regarded as a powerful algorithm for supervised learning.

Other methods have also been proposed, such as the YLoc tool implemented by Briesemeister et al. [16] and PROlocalizer [17], which integrates web services to aid the prediction. Recently, the random-walk-on-graph technique [18-20] has been applied to biological questions such as the classification of proteins into functional and structural classes based on their amino acid sequences. Weston et al. presented a random-walk kernel based on PSI-BLAST E-values [21] for protein remote homology detection. Min et al. [22] applied the convex combination algorithm to approximate the random-walk kernel with optimal random steps and applied this approach to classify protein sequences. Freschi et al. [23] proposed a random walk ranking algorithm to predict protein functions from interaction networks. Random walks are closely linked to Markov chains, which inspired Yuan [24] to apply a first-order Markov chain and extend the residue pair probability to higher-order models to predict protein subcellular locations. Garagea et al. [25] also presented a semi-supervised method for prediction using abstraction augmented Markov models.

This study introduces a novel random walk method for protein subcellular localization based on amino acid composition. By mapping the protein data into a weighted and partially labeled graph where each node represents a protein sequence, we implemented a random walk classification model to predict labels of unlabeled nodes based on our previous theoretical work [26]. We present an intuitive interpretation of the graph representation, label propagation and model formulation. We additionally analyzed the performance of the method in predicting the (sub)cellular localization of proteins. This method produced results that were both competitive and promising when compared to the state-of-the-art SVM classifier.

Results

Our random walk classifier (RaWa) was coded in MATLAB. Given the training data and their classes, we computed the state matrix Y and weight matrix W. In our experiment, the similarity or weight between two nodes was given by the radial basis function (RBF)

$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$

To assess the classification performance of our method, we compared our classifier with an RBF-SVM implemented with LibSVM [27]. For both RaWa and the RBF-SVM, the parameter γ = 1/(2σ^2) was optimized over the set {2^-11, 2^-9, ..., 2^9, 2^11}. In this study, we adopted n-fold cross-validation and report the highest prediction accuracy, which was computed by dividing the number of correctly classified data points by the size of the entire unlabeled dataset.
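As a minimal sketch of the weight computation and the search grid described above (written in Python with NumPy rather than the MATLAB used for RaWa), the RBF similarity can be expressed with γ = 1/(2σ^2). The function name `rbf_weights` and the toy data are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def rbf_weights(X, gamma):
    """Pairwise RBF similarities w_ij = exp(-gamma * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from round-off
    return np.exp(-gamma * sq_dists)

# Candidate values 2^-11, 2^-9, ..., 2^9, 2^11 for gamma = 1 / (2 * sigma^2)
gamma_grid = [2.0 ** k for k in range(-11, 12, 2)]

# Toy data (hypothetical): three points in two dimensions
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
W_demo = rbf_weights(X_demo, gamma=0.5)
```

In a full experiment, each value in `gamma_grid` would be scored by cross-validation and the best-performing one retained.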

Predicting the (sub)cellular localization of proteins

Since our classifier involved two parameters, the laziness parameter α for constructing transition matrix and the random walk step t, we first tested the performance of our classifier on different combinations of α and t. Then, under the optimized parameter settings, we compared our approach with various measurements to the SVM classifier.

Influence of α and t

We investigated walks of up to 30 steps and five values of the laziness parameter α: 0.05, 0.25, 0.5, 0.75 and 0.95. Figure 1 and Figure 2 depict the predictive accuracy curves of our random walk classifier on the yeast and Gneg datasets, respectively. Each figure contains five lines, one for each value of α, and depicts the trend in accuracy with increasing t. The test results were obtained from 10-fold cross-validation.

Figure 1. Classification accuracies (in %) of yeast data given varying random walk steps and laziness parameters.

Figure 2. Classification accuracies (in %) of gram-negative bacteria data given different random walk steps and laziness parameters.

We found that a large number of steps was unnecessary for the RaWa classifier to achieve the best results. First, because the graph is complete, each label has a chance to reach the unlabeled node in as few as one step. Second, both figures show that good accuracy was always obtained when the value of t was low, whereas the accuracy gradually declined once t passed the value at which it peaked. This decline is probably because, as t increases, P^t becomes trivial and in turn misleads the classification. This situation is quite apparent in Figure 2. In addition, Szummer and Jaakkola [28] found that small constant values of t (about t = 8) were effective on a dataset with several thousand examples.
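One way to read the statement that P^t becomes trivial is that, for a complete graph with positive weights, the rows of P^t converge to a common stationary distribution, so every starting node sees almost the same transition probabilities. The following sketch illustrates this under that interpretation; the 3×3 weight matrix and the helper `row_spread` are hypothetical.

```python
import numpy as np

# Hypothetical symmetric weight matrix of a complete three-node graph
W = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

def row_spread(M):
    """Largest gap between any two rows of M, taken over all columns."""
    return float(np.max(M.max(axis=0) - M.min(axis=0)))

# The spread shrinks toward zero as the number of steps grows
spreads = {t: row_spread(np.linalg.matrix_power(P, t)) for t in (1, 5, 30)}
```

When the spread is close to zero, the walk no longer distinguishes between starting nodes, which is consistent with the loss of accuracy observed for large t.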

Since the labeled training data is often deterministic, the transition matrix built over the labeled data is commonly treated as a unit matrix in semi-supervised random walk methods. However, the best result for the yeast data was achieved when α = 0.75, a value that gives the labeled nodes more freedom to move to each other, whereas the best result for the Gneg data was achieved when α = 0.95. Consequently, it is necessary to introduce the laziness parameter when the training data is not fully reliable; α can usually be set above 0.5.

Comparisons with SVM

According to the above results, our method achieved a total prediction accuracy of 61% for yeast data and >93% for Gneg data. Furthermore, to quantify the performance of our proposed algorithm, we employed SVMs and compared the two methods by computing the widely used measures of sensitivity and specificity. Table 1 compares the ability of the two methods to classify yeast data into 10 classes, while Table 2 shows the comparison for the Gneg data with 5 classes. We also compared the total accuracy of both classifiers; these data are presented in the final row of each table.

Table 1. Sensitivity and specificity for yeast data using 10-fold cross-validation, including the total prediction accuracy.

Table 2. Sensitivity and specificity for gram-negative bacteria data using 10-fold cross-validation, including the total prediction accuracy.
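For reference, per-class sensitivity and specificity can be derived from a multi-class confusion matrix as sketched below. The one-vs-rest definition used here is an assumption, since the text does not state how the measures were computed for more than two classes; the function name and toy labels are likewise illustrative.

```python
import numpy as np

def per_class_sensitivity_specificity(y_true, y_pred, n_classes):
    """One-vs-rest sensitivity and specificity per class, plus total accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    conf = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(conf, (y_true, y_pred), 1)       # rows: true class, columns: predicted
    tp = np.diag(conf).astype(float)
    fn = conf.sum(axis=1) - tp
    fp = conf.sum(axis=0) - tp
    tn = conf.sum() - tp - fn - fp
    with np.errstate(divide="ignore", invalid="ignore"):
        sensitivity = tp / (tp + fn)           # NaN if a class has no true members
        specificity = tn / (tn + fp)
    accuracy = tp.sum() / conf.sum()
    return sensitivity, specificity, accuracy

# Toy labels (hypothetical): class 2 is never predicted, so its sensitivity is 0
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 1, 0]
sens_demo, spec_demo, acc_demo = per_class_sensitivity_specificity(
    y_true_demo, y_pred_demo, n_classes=3)
```

A class that a classifier never predicts, as in the toy example, receives a sensitivity of zero while its specificity stays at one.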

For the yeast data (Table 1), each classifier produced results with high sensitivity and specificity, but neither could identify the proteins that localized to the VAC site. RaWa performed slightly better since it could predict the proteins that localized to POX and ERL, whereas the SVM could not. As illustrated in Table 2, both classifiers produced high sensitivities and specificities on the 5 locations, but according to the total accuracy listed in the last row, our classifier outperformed the SVM by 1%.

We further compared the two classifiers using receiver operating characteristic (ROC) curves. Figure 3 and Figure 4 depict the results for yeast and Gneg, respectively, and each figure contains the ROC curve for the RaWa method on the left and the ROC curve for the SVM method on the right. These figures together offer an intuitive comparison and show that our RaWa classifier is effective and that the results are comparable to those derived from an SVM-based method.

Figure 3. ROC curves illustrating the comparison of RaWa and SVM methods on data from yeast.

Figure 4. ROC curves illustrating the comparison of RaWa and SVM methods on data from gram-negative bacteria.

Discussion

Herein, we propose a novel classification model for label propagation through random walks on graphs. We first initialized an undirected complete graph over the labeled data whose data points act as the nodes and pairwise distances act as the weights. Then, labels and weights are employed to construct the state matrix and state transition matrix so that any node can start a random walk and propagate its label to any unlabeled data point after several steps. This model is also optimized by a kernel method and regularization so as to provide flexible control over the transition matrix.

One interesting possibility for future work is to develop algorithms for a clever selection of the labeled dataset and the kernel based on the data. In this study, we used the very simple Gaussian kernel with the identity covariance matrix, which likely does not exploit the similarity information conveyed in the data points.

Conclusions

Protein cellular and subcellular localization has been an important facet of research because of its role in characterizing protein functions and protein-protein interactions. In this study, we developed a novel approach based on a random walk technique to predict protein localization. We demonstrated that this approach improves the accuracy of predicting protein (sub)cellular localization and is easy to train. When compared to the SVM classifier, our results are both competitive and promising.

Methods

Data preparation

To apply our method to predict and classify protein (sub)cellular localization, we utilized two datasets: the widely used yeast data from the UCI database and the gram-negative bacteria proteins from the Cell-PLoc package. The yeast data, including 1484 items with 8 attributes, were used to predict the cellular localization of proteins and have been categorized into 10 classes. The second dataset was first used by Shen and Chou in their predictors [29,30] particularly for the prediction of gram-negative bacteria proteins. This dataset contained 1114 gram-negative (Gneg) bacterial proteins classified into 5 subcellular locations according to experimental annotations. None of the proteins had more than 25% sequence identity to any other in the same subset (subcellular location). Detailed information is provided in Table 3.

Table 3. Information about gram-negative and yeast data

First, we represented a protein sample P with L amino acid residues by its evolutionary and sequence information. Here, to simplify the formulation without loss of generality, we use the numerical codes 1, 2, ..., 20 to represent the 20 native amino acid types according to the alphabetical order of their single-character symbols. Then, the position-specific scoring matrix (PSSM) was introduced as a descriptor of evolutionary information. The PSSM is an L×20 matrix M, where M_{i→j} represents the score of the amino acid residue in the ith position of the protein sequence being mutated to amino acid type j through evolution.

However, according to the PSSM descriptor, proteins with different lengths will correspond to matrices with different numbers of rows. To give the PSSM descriptor a uniform representation, a given protein sample P can be represented by the mean of each of the 20 columns of its PSSM, taken over the L rows:

$$P_{\mathrm{PSSM}} = [\bar{M}_1, \bar{M}_2, \ldots, \bar{M}_{20}]^{T}, \qquad \bar{M}_j = \frac{1}{L}\sum_{i=1}^{L} M_{i \to j}, \quad j = 1, 2, \ldots, 20$$

However, as a result, all the sequence-order information would be lost. To avoid the complete loss of the sequence-order information, we also adopted the concept of the pseudo-amino acid composition (PseAA), as originally proposed in [31]. According to the representation of the PseAA, the protein P is formulated by

$$P_{\mathrm{PseAA}} = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{T}$$

where p1,p2,...,p20 are associated with the conventional amino acid composition, reflecting the occurrence frequencies of the 20 native amino acids in the protein P.

We thus represented the protein P by combining PSSM and PseAA in the following form: [Equation M6: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M6].

To obtain the PseAA values, λ was set to 49 and the weight factor to 0.05. Because 3 proteins were shorter than 49 amino acids and had to be excluded, we obtained 1111 proteins with 89 features (20 averaged PSSM scores plus 20 + 49 PseAA components).
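The feature construction can be sketched as follows. This is an illustration under explicit assumptions: the PseAA correlation factors follow the commonly used pseudo-amino acid composition scheme, a single property (the Kyte-Doolittle hydropathy scale) stands in for the physicochemical properties, which the text does not specify, and the order in which the PSSM and PseAA parts are concatenated is chosen arbitrarily. All function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # single-letter codes in alphabetical order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# ASSUMPTION: Kyte-Doolittle hydropathy as the only property; the paper does not
# state which properties were used.
_RAW = np.array([1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
                 1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3])
PROPERTY = (_RAW - _RAW.mean()) / _RAW.std()  # zero mean, unit variance

def pssm_mean(pssm):
    """Average an L x 20 PSSM over its L rows, giving 20 values."""
    return np.asarray(pssm, dtype=float).mean(axis=0)

def pseaa(sequence, lam=49, weight=0.05):
    """Pseudo-amino acid composition with 20 + lam components."""
    if len(sequence) <= lam:
        raise ValueError("sequence must be longer than lam residues")
    idx = np.array([AA_INDEX[aa] for aa in sequence])
    freq = np.bincount(idx, minlength=20) / len(sequence)
    h = PROPERTY[idx]
    # theta_k: mean squared property difference between residues k positions apart
    theta = np.array([np.mean((h[:-k] - h[k:]) ** 2) for k in range(1, lam + 1)])
    denom = freq.sum() + weight * theta.sum()
    return np.concatenate([freq, weight * theta]) / denom

def protein_features(sequence, pssm, lam=49, weight=0.05):
    """Concatenate 20 PSSM means with 20 + lam PseAA components."""
    return np.concatenate([pssm_mean(pssm), pseaa(sequence, lam, weight)])

# Toy input (hypothetical): a 60-residue sequence and a made-up 60 x 20 PSSM
seq_demo = AMINO_ACIDS * 3
pssm_demo = np.arange(60 * 20, dtype=float).reshape(60, 20)
features_demo = protein_features(seq_demo, pssm_demo)
```

With λ = 49 the vector has 20 + 20 + 49 = 89 entries, and a sequence that is too short leaves the largest-lag correlation factor undefined, which is why short proteins cannot be encoded.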

Problem formulation

Usually, a training set (X, C) specifies the set of labeled data X and the set of their classes C, where n is the number of tuples in X; the classes of a test set can then be predicted. We first considered an initial graph of the form G(V, E, W), constructed over the training set, where V is the set of nodes and each member v_i corresponds only to (x_i, c_i). This graph is assumed to be complete; therefore the edge set E is trivial. We thus provided the labeled nodes with a certain probability to travel to other nodes (explained below). W represents the n×n edge weight matrix and indicates the pairwise similarities, w_ij = sim(v_i, v_j) = sim(x_i, x_j).

We also let Y be a set of m labels that can be applied to nodes of the graph. After the initial weighted graph was generated, a state transition matrix P = [p_ij] of size n×n was defined, where p_ij is the probability that node v_i transitions to the state of node v_j. P is generally computed as P = D^{-1}W, where the diagonal matrix D = diag(W 1_n) and 1_n is an n-dimensional vector with all values set to 1. We next converted y_i into a vector of labels (i.e., [Equation M7: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M7]), where y_i = [y_1i, y_2i, ..., y_mi]^T. Therefore, the label or state of v_i is c_j if and only if y_ji = 1. Y can also be referred to as the state matrix of V or X.
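The two matrices defined above can be built in a few lines. The sketch below is in Python with NumPy; the helper names and the toy weights are assumptions made for illustration only.

```python
import numpy as np

def transition_matrix(W):
    """Row-normalise the weights: P = D^{-1} W with D = diag(W 1_n)."""
    W = np.asarray(W, dtype=float)
    degrees = W @ np.ones(W.shape[0])
    return W / degrees[:, None]

def state_matrix(labels, n_classes):
    """m x n indicator matrix Y with Y[j, i] = 1 iff node i has class j."""
    labels = np.asarray(labels)
    Y = np.zeros((n_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

# Toy graph (hypothetical): three labeled nodes, two classes
W_toy = np.array([[1.0, 0.5, 0.1],
                  [0.5, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
P_toy = transition_matrix(W_toy)
Y_toy = state_matrix([0, 0, 1], n_classes=2)
```

Each row of `P_toy` sums to one, and each column of `Y_toy` contains a single 1 marking the class of the corresponding node.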

Given the state matrix and transition matrix, a simple random walk on V is described as the process in which the state y_i of any node v_i transitions with probability p_ij to the state y_j of node v_j. Thus, the states of labeled data are not encoded as absorbing states. A random walk over nodes that are all already labeled is meaningless by itself: our aim is to use the information encoded in a partially labeled graph to predict labels, but the initial graph G is a fully labeled graph. Therefore, each data point from the test set that lacks a label was added to graph G as an unlabeled node. In this way, the traditional classification problem is converted to a node classification problem on a partially labeled graph.

Random walk classification model

We next aimed to deduce a simple classifier based on the nodes that are labeled so it can be applied to predict the labels of the unlabeled nodes. Our solution was a state vector y that provides the label for an unlabeled data point x.

We first provide an example to clarify the process of label propagation through random walks. Consider an initial graph G constructed over the training data (X, Y) = {(x1, c1), (x2, c1), (x3, c2)}. Each data point lacking a label is added into graph G as an unlabeled node. Figure 5 displays such a graph G' after three unlabeled data points were added. The graph G' is often assumed to be label-connected to become completely labeled [32]; that is, it is possible to reach a labeled node from any unlabeled node in a finite number of steps. For example, if in a random walk, the sixth node v6 ends at the second node v2, then this node will be labeled as c1.

Figure 5. A simple partially labeled graph.

Node classification relies on a random walk that originates at the unlabeled node vj and ends at a labeled node vi after several steps; in this way, vj obtains its label from vi. If during the walk an unlabeled node reaches a labeled node for the first time, it will not remain at that node because the labeled nodes are not absorbing states; rather, the unlabeled node will move to another node with a certain probability. Since graphs G and G' are undirected and symmetric, a random walk that starts at vj and ends at vi is also reversible.

Next, we assume p(vi, v) to be the state-transition probability with which a walk proceeds from node vi in V to the new node v represented by unlabeled data point x. The state y of new node v is represented as

[Equation M8: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M8]

where

[Equation M9: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M9]

For the node vi in V, we have

[Equation M10: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M10]

Similarly, for the new node v not in V, p(V, v) is computed as:

[Equation M11: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M11]

Therefore, the state y of v can be obtained by the following equation:

[Equation M12: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M12]

where W^+ denotes the pseudo-inverse matrix of W. This is preferred over the inverse of W because W may sometimes be singular. w(V, v) is a column vector that indicates the similarity between the new node v and the nodes in V.
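The role of the pseudo-inverse can be illustrated with a small sketch. The exact expression for y is only linked above, so the code assumes a one-step rule of the form y = Y P W^+ w(V, v); this form, the function names and the toy data are assumptions and may differ from the published equation.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF similarities between the rows of A and the rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-gamma * np.sum(diff ** 2, axis=-1))

def predict_state(Y, P, W, w_new):
    """ASSUMED one-step rule: y = Y @ P @ pinv(W) @ w(V, v)."""
    coeffs = np.linalg.pinv(W) @ w_new  # expresses w(V, v) via the columns of W
    return Y @ P @ coeffs

# Toy training set (hypothetical): two nodes of class 0 and one of class 1
X_train = np.array([[0.0], [0.2], [3.0]])
Y = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
W = rbf(X_train, X_train)
P = W / W.sum(axis=1, keepdims=True)

# A new point lying close to the two class-0 nodes
w_new = rbf(X_train, np.array([[0.1]]))[:, 0]
y_new = predict_state(Y, P, W, w_new)
predicted_class = int(np.argmax(y_new))
```

For a node that is already in V, w(V, v) is a column of W, so W^+ w(V, v) reduces to a unit vector when W is invertible and the rule falls back to the corresponding column of P.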

Model training

In order to train an effective classifier, the labeled data should be fully utilized; however the influence of noise within the training data should be avoided, especially because biological measurements always contain a certain amount of noise.

Therefore, we trained our classification model with a prediction adjustment using complementary training data. We first partitioned the training data X in a balanced fashion, which resulted in two subsets of similar size, each containing a certain amount of data from each class in C. The two subsets S and T thus satisfy S ∪ T = X and S ∩ T = ∅. Next, we let the two complementary sets predict each other with the above equation, which gives:

[Equation M13: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M13]

[Equation M14: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M14]
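The balanced partition into S and T can be sketched as a per-class split. The procedure below is one plausible reading of "balanced" (each class is divided as evenly as possible); the function name, the random seed and the toy labels are assumptions.

```python
import numpy as np

def balanced_split(labels, seed=0):
    """Split indices into two disjoint halves with every class in both."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    s_parts, t_parts = [], []
    extra_to_s = True  # alternate which half receives the odd member of a class
    for c in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == c))
        half = len(members) // 2
        if len(members) % 2 == 1:
            if extra_to_s:
                half += 1
            extra_to_s = not extra_to_s
        s_parts.append(members[:half])
        t_parts.append(members[half:])
    return np.sort(np.concatenate(s_parts)), np.sort(np.concatenate(t_parts))

# Toy labels (hypothetical): class sizes 5, 4 and 3
labels_demo = np.array([0] * 5 + [1] * 4 + [2] * 3)
S_idx, T_idx = balanced_split(labels_demo)
```

The two index sets are disjoint and together cover the whole training set, so each subset can be predicted from the other.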

To evaluate the performance of this prediction, we computed the test loss on S and T according to the following equations:

[Equation M15: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M15]

[Equation M16: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M16]

where the classifier's performance increases with decreasing test loss. Moreover, we defined the total loss as

[Equation M17: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M17]

Though the total loss could be minimized through repeated random partitions of the training data, doing so is time consuming. We note that the test loss also indicates the importance of its corresponding subset, so we can impose a weight on each subset to highlight this difference. We then defined the state matrix to be:

[Equation M18: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M18]

The weight vector was computed as follows:

[Equation M19: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M19]

For the transition matrix, we usually consider a multi-step random walk; for t steps, we simply replace P with P^t. During a random walk of t steps, the state of the new node v or new data point x is:

[Equation M20: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M20]

Previous studies have treated the labeled nodes as absorbing states, such that P = I, but here we considered lazy random walks, i.e., P^t = (αI + (1-α)P)P^{t-1}, where α ∈ (0,1) is a laziness parameter indicating that the nodes will stay at their current positions with probability α.
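The lazy recursion can be sketched as follows. The base case is not stated in the text; the code assumes P^0 = I, under which the recursion yields the t-th power of the lazy matrix αI + (1-α)P. The function name and the toy matrix are illustrative.

```python
import numpy as np

def lazy_walk_power(P, alpha, t):
    """t-step lazy transition matrix, assuming the base case P^0 = I."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    lazy_step = alpha * np.eye(n) + (1.0 - alpha) * P  # stay put with probability alpha
    Pt = np.eye(n)
    for _ in range(t):
        Pt = lazy_step @ Pt
    return Pt

# Toy transition matrix (hypothetical)
P_demo = np.array([[0.50, 0.50, 0.00],
                   [0.25, 0.50, 0.25],
                   [0.00, 0.50, 0.50]])
Pt_demo = lazy_walk_power(P_demo, alpha=0.75, t=8)
```

A larger α keeps more probability mass on the starting node, so the labels of the training nodes are altered less by the walk.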

Further improvement with the kernel method and regularization

Usually, k(u, v) denotes the kernel function so that k(X, x) = [k(x1, x), k(x2, x), ..., k(xn, x)]^T. We defined the kernel matrix K in the space (X, X) as K = k(X, X) = [k(xi, xj)] of size n×n, and F was defined as a classifier. The kernel function k(X, x) and kernel matrix K were employed to substitute for the similarity metric w(V, v) and weight matrix W, respectively.

[Equation M21: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M21]

With the kernel method embedded, we formulated our random walk classifier as:

[Equation M22: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M22]

Again, assuming [Equation M23: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M23], the final classification model is represented as:

[Equation M24: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M24]

The idea underlying the random walk methods is that the probability of labeling a node v with a label (or state) y is the total probability that a random walk starting at v will end at a node labeled y. F(x) therefore is more likely to return a probability distribution such as [Equation M25: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M25], where each f_ji refers to the total probability that a random walk starting at node v_i stops at any node labeled c_j after t steps. The largest f_ji allows v_i to be assigned label c_j.

[Equation M26: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M26]

K is sometimes a singular matrix because of insufficient data or the existence of noise, or there could be more than one optimized solution for W. In either case, computing w is not recommended. We thus use regularization to deal with such ill-posed problems. To enhance the robustness of our classifier, we introduced a regularization parameter λ into the kernel matrix, thereby formulating the regularized random walk basic classifier. In our experiments, we fixed λ to 0.0001 to avoid interference from the original data.

[Equation M27: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M27]
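The effect of regularization on a singular kernel matrix can be illustrated as follows. The sketch assumes the common form K + λI (adding λ to the diagonal); the regularized expression actually used is only linked above and may differ. The duplicated data point is a hypothetical way to make K singular.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """RBF kernel matrix K = k(X, X)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.exp(-gamma * np.sum(diff ** 2, axis=-1))

# Toy data (hypothetical): the first two points are identical, so K is singular
X_dup = np.array([[0.0, 1.0],
                  [0.0, 1.0],
                  [2.0, 0.0]])
K = rbf_kernel(X_dup)

lam = 1e-4                        # value of the regularization parameter in the text
K_reg = K + lam * np.eye(len(K))  # ASSUMED regularized form

rank_before = np.linalg.matrix_rank(K)
rank_after = np.linalg.matrix_rank(K_reg)
condition_after = np.linalg.cond(K_reg)
```

With a small λ the off-diagonal similarities are untouched and the diagonal changes only marginally, while the matrix becomes invertible.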

If the dimension of X is d, then the time costs of computing the kernel matrix and the pseudo-inverse matrix to build the model for our classifier are O(dn^2) and O(n^3), respectively. [Equation M28: http://www.biomedcentral.com/1471-2105/14/S8/S4/mathml/M28] requires a complexity of O(mn^2), where m ≪ n, so the overall cost is estimated as O(dn^2) + O(n^3) + O(mn^2) = O(max{d, n}n^2).

List of Abbreviations

HMM: Hidden Markov Models; kNN: k Nearest Neighbor; SVM: Support Vector Machine; RBF: Radial Basis Function; PSI: Position-Specific Iterated; BLAST: Basic Local Alignment Search Tool; PseAA: Pseudo Amino acid; RaWa: Random Walk Classifier; Gneg: gram-negative bacteria; ROC: receiver operating characteristic curve.

Authors' contributions

XX conceptualized the theoretical framework for this study. LL implemented the idea and conducted the experiments. PH, LC managed and coordinated the project. All authors participated in writing and revising the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under grant No. 61003180, No. 61070047 and No. 61103018; Natural Science Foundation of Education Department of Jiangsu Province under contract 09KJB20013; Natural Science Foundation of Jiangsu Province under contracts BK2010318 and BK2011442; Research Innovation Program for College Graduates of Jiangsu Province (CXLX12_0917); and The New Century Talent Project of Yangzhou University.

Declarations

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 8, 2013: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S8.

References

  1. Bork P, Eisenhaber F: Wanted: subcellular localization of proteins based on sequence.

    Trends Cell Biol 1998, 8:169-170. PubMed Abstract | Publisher Full Text OpenURL

  2. Olof Emanuelsson: Predicting protein subcellular localisation from amino acid sequence information.

    Briefings in bioinformatics 2002, 3(4):4361-376. PubMed Abstract | Publisher Full Text OpenURL

  3. Kenichiro Imai, Kenta Nakai: Prediction of subcellular locations of proteins: Where to proceed?

    Proteomics 2010, 10(22):3970-3983. PubMed Abstract | Publisher Full Text OpenURL

  4. Junfeng Xia, Xingming Zhao, Jiangning Song, Deshuang Huang: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility.

    Bioinformatics 2010, 11:174. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Junfeng Xia, Xingming Zhao, Jiangning Song, Deshuang Huang: Predicting protein-protein interactions from protein sequences using Meta predictor.

    Amino Acids 2010, 39(5):1595-1599. PubMed Abstract | Publisher Full Text OpenURL

  6. Paul Horton, Kenta Nakai: A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins.

    In proceedings of fourth international conference on Intelligent Systems in Molecular Biology: 12-15 June 1996; St. Louis, USA Edited by David J. States, David Johnson. 1996, 4:109-115.

  7. Paul Horton, Kenta Nakai: Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier.

    In proceedings of fifth international conference on Intelligent Systems in Molecular Biology: 21-25 June 1997; Halkidiki, Greece Edited by Terry Gaasterland, Theresa. 1997, 5:147-152.

  8. Feng ZP: An overview on predicting subcellular location of a protein.

    In Silico Biology 2002, 2(3):291-303.

  9. Donnes P, Hoglund A: Predicting protein subcellular localization: past, present, and future.

    Genomics Proteomics Bioinform 2004, 2(4):209-215.

  10. Kuochen Chou, Hongbin Shen: Recent progress in protein subcellular location prediction.

    Analytical Biochemistry 2007, 370:1-16.

  11. Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R: Predicting subcellular localization of proteins using machine-learned classifiers.

    Bioinformatics 2004, 20(4):547-556.

  12. Gardy JL, Brinkman FS: Methods for predicting bacterial protein subcellular localization.

    Nat Rev Microbiol 2006, 4(10):741-751.

  13. Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization.

    Proteins 2006, 64(3):643-651.

  14. Nair R, Rost B: Protein subcellular localization prediction using artificial intelligence technology.

    Methods Mol Biol 2008, 484:435-463.

  15. Eric JuanYT, Chang JH, Li CH, Chen BY: Methods for Protein Subcellular Localization Prediction.

    In proceedings of fifth International Conference on Complex, Intelligent, and Software Intensive Systems: 30 June - 2 July 2011; Seoul, Korea Edited by Leonard Barolli, Fatos Xhafa, Ilsun You, Nik Bessis. 2011, 553-558.

  16. Briesemeister S, Rahnenfuhrer J, Kohlbacher O: Going from where to why -- interpretable prediction of protein subcellular localization.

    Bioinformatics 2010, 26(9):1232-1238.

  17. Laurila K, Vihinen M: PROlocalizer: integrated web service for protein subcellular localization prediction.

    Amino Acids 2011, 40(3):975-980.

  18. Lovász L: Random Walks on Graphs: A survey.

    Combinatorics, Paul Erdős is Eighty (Vol. 2), Keszthely (Hungary) 1993, 2:1-46.

  19. Bhagat S, Cormode G, Muthukrishnan S: Node classification in social networks.

    arXiv preprint arXiv:1101.3291, 2011.

  20. Gregory F Lawler: Simple Random Walk.

    Intersections of Random Walks, Modern Birkhäuser Classics 2013, 11-46.

  21. Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, William Stafford Noble: Semi-supervised protein classification using cluster kernels.

    Bioinformatics 2005, 21:3241-3247.

  22. Min R, Bonner A, Li J, Zhang Z: Learned random-walk kernels and empirical-map kernels for protein sequence classification.

    J Comput Biol 2009, 16(3):457-474.

  23. Freschi V: Protein function prediction from interaction networks using a random walk ranking algorithm.

    In Proceedings of the seventh IEEE international conference on Bioinformatics and Biomedicine (BIBM 2007): 2-4 November; California, USA Edited by Xiaohua Hu, Ion Mandoiu, Zoran Obradovic, Jiali Xia. 2007, 42-48.

  24. Yuan Z: Prediction of protein subcellular locations using Markov chain models.

    FEBS Letters 1999, 451:23-26.

  25. Caragea C, Caragea D, Silvescu A, Honavar V: Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models.

    BMC Bioinformatics 2010, 11(Suppl 8):S6.

  26. Xiaohua Xu: Random Walk Learning on Graph. PhD thesis. Nanjing University of Aeronautics and Astronautics, Computer Science Department; 2008.

  27. Chang CC, Lin CJ: LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]

  28. Szummer M, Jaakkola T: Partially labeled classification with Markov random walks.

    Advances in Neural Information Processing Systems 2002, 14:945-952.

  29. Kuochen Chou, Hongbin Shen: Large-scale predictions of Gram-negative bacterial protein subcellular locations.

    J Proteome Res 2007, 5:3420-3428.

  30. Shen HB, Chou KC: Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins.

    J Theor Biol 2010, 264(2):326-333.

  31. Shen H, Chou K: Nuc-Ploc: a new web-server for predicting protein subnuclear localization by fusing PseAA and PsePSSM.

    Protein Engineering Design & Selection 2007, 20(11):561-567.

  32. Azran A: The rendezvous algorithm: Multiclass semi-supervised learning with Markov random walks.

    In proceedings of the twenty-fourth International Conference on Machine Learning: 20-24 June 2007; Corvallis, Oregon, USA Edited by Zoubin Ghahramani. 2007, 1144-1151.