Abstract
Background
Understanding the localization of proteins in cells is vital to characterizing their functions and possible interactions. As a result, identifying the (sub)cellular compartment within which a protein is located becomes an important problem in protein classification. This classification issue thus involves predicting labels in a dataset with a limited number of labeled data points available. By utilizing a graph representation of protein data, random walk techniques have performed well in sequence classification and functional prediction; however, this method has not yet been applied to protein localization. Accordingly, we propose a novel classifier in the site prediction of proteins based on random walks on a graph.
Results
We propose a graph theory model for predicting protein localization using data generated in yeast and gramnegative (Gneg) bacteria. We tested the performance of our classifier on the two datasets, optimizing the model training parameters by varying the laziness values and the number of steps taken during the random walk. Using 10fold crossvalidation, we achieved an accuracy of above 61% for yeast data and about 93% for gramnegative bacteria.
Conclusions
This study presents a new classifier derived from the random walk technique and applies this classifier to investigate the cellular localization of proteins. The prediction accuracy and additional validation demonstrate an improvement over previous methods, such as support vector machine (SVM)based classifiers.
Background
Protein localization is a general a term that refers to the study of where proteins are located within the cell. In many cases, proteins cannot perform their designated function until they are transported to the proper location at the appropriate time. Improper localization of proteins can exert a significant impact on cellular processes or on the entire organism. Therefore, a central issue for biologists is to predict the (sub)cellular localization of proteins[13], which has implications for the functions and interactions[4,5] of proteins.
With the development of new approaches in computer science, coupled with an improved dataset of proteins with known localization, computational tools can now provide fast and accurate localization predictions for many organisms as an alternative to laboratorybased methods. Therefore, many studies have begun to address this issue. To predict the cellular localization of proteins, soon after their proposal of a probabilistic classification system to identify 336 E.coli proteins and the 1484 yeast proteins [6], Paul Horton and Kenta Nakai [7] also compared their specifically designed probabilistic model with three other classifiers on the same datasets: the knearestneighbor (kNN) classifier, the binary decision tree classifier, and the naive Bayes classifier. The resulting accuracy using stratified crossvalidation showed that the kNN classifier performed better than the other methods, with an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes.
Feng [8] presented an overview about the prediction of protein subcellular localization, and in 2004, Donnes and Hoglund [9] introduced past and current work on this type of prediction as well as a guideline for future studies. Chou and Shen [10] summarized the more recent advances in the prediction of protein subcellular localization up to 2007. A variety of artificial intelligence technologies [1115] have now been developed, including neural networks, the covariant discriminate algorithm, hidden Markov models (HMMs), Decision Tree and support vector machines (SVMs). Among these methods, the SVMs are always considered as a powerful algorithm for supervised learning.
Besides, there are other methods proposed too, like the YLoc tool implemented by Briesemeister et al. [16] and the PROlocalizer [17] which integrated web service to aid the prediction. Recently, the randomwalkongraph technique [1820] has been applied to biological questions such as the classification of proteins into functional and structural classes based on their amino acid sequences. Weston et al. presented a randomwalk kernel based on PSIBLAST Evalues [21] for protein remote homology detection. Min et al. [22] applied the convex combination algorithm to approximate the randomwalk kernel with optimal random steps and applied this approach to classify protein sequence. Freschi et al. [23] proposed a random walk ranking algorithm to predict protein functions from interaction networks. Random walks are closely linked to Markov chains, which inspired Yuan [24] to apply a firstorder Markov chain and extend the residue pair probability to higherorder models to predict protein subcellular locations. Garagea et al. [25] also presented a semisupervised method for prediction using abstraction augmented Markov models.
This study introduces a novel random walk method for protein subcellular localization based on amino acid composition. By mapping the protein data into a weighted and partially labeled graph where each node represents a protein sequence, we implemented a random walk classification model to predict labels of unlabeled nodes based on our previous theoretical work [26]. We present an intuitive interpretation of the graph representation, label propagation and model formulation. We additionally analyzed the performance of the method in predicting the (sub)cellular localization of proteins. This method produced results that were both competitive and promising when compared to the stateoftheart SVM classifier.
Results
Our random walk classifier (RaWa) was coded in MATLAB. Given the training data and their classes, we computed the state matrix Y and weight matrix W. In our experiment, the similarity or weight between two nodes was given according to the radius basis function (RBF)
To prove the effective classification performance of our method, we compared our classifier with RBFSVM by implementing LibSVM [27], and the γ = 1/2σ^{2 }of our RaWa and RBFSVM was optimized over the interval {2^{11}, 2^{9}, ..., 2^{9}, 2^{11}}. In this study, we adopted an nfold crossvalidation measurement to produce the highest predication accuracy, which was computed by dividing the number of correctly classified data points by the size of the entire unlabeled dataset.
Predicting the (sub)cellular localization of proteins
Since our classifier involved two parameters, the laziness parameter α for constructing transition matrix and the random walk step t, we first tested the performance of our classifier on different combinations of α and t. Then, under the optimized parameter settings, we compared our approach with various measurements to the SVM classifier.
Influence of α and t
We investigated a maximum walk of 30 steps and five parameters: 0.05, 0.25, 0.5, 0.75 and 0.95. Figure 1 and Figure 2 depict the predictive accuracy curves of our random walk classifier on yeast and Gneg datasets, respectively. Each figure contains five lines that correspond to each α and depicts the trend of accuracy ratios with increasing t. The test results were obtained from 10fold cross validation.
Figure 1. Classification accuracies (in %) of yeast data given varying random walk steps and laziness parameters.
Figure 2. Classification accuracies (in %) of gramnegative bacteria data given different random walk steps and laziness parameters.
We found that a large number of steps were unnecessary for the RaWa classifier to achieve the best results. First, the complete graph offers each label a chance to reach the unlabeled node in at least one step. Second, both figures show that good accuracy was always obtained when the value of t was low. In contrast, the accuracy gradually declines after the peak value of t. This decline may probably due to the fact that with the increasing of t, P^{t }will become trivial and in turn mislead the classification. This situation is quite apparent in Figure 2. In addition, Szummer and Jaakola [28] found that small constant values of t (about t = 8) were effective on a dataset with several thousand examples.
Since the labeled training data is often deterministic, the transition matrix built over the labeled data is commonly treated as a unit matrix in semisupervised random walk methods. However, the best result for the yeast data was achieved when α = 0.75. This value gave the labeled nodes more freedom to move to each other, whereas the best result for the Gneg data was achieved when α = 0.95. Consequently, it is necessary to import the laziness parameter when the training data is not fully reliable; α can usually be set above 0.5.
Comparisons with SVM
According to the above results, our method achieved a total prediction accuracy of 61% for yeast data, and >93% accuracy for Gneg data. Furthermore, to quantify the performance of our proposed algorithm, we employed SVMs and compared the two methods by computing the widely used measures of Specificity and Sensitivity. Table 1 compares the ability of the two methods to classify yeast data into 10 classes, while Table 2 shows the comparison for the Gneg data with 5 classes. We also compared the total accuracy of both classifiers; these data are presented in the final row of the table.
Table 1. Sensitivity and Specificity for yeast data using 10fold crossvalidation including the total predication accuracy
Table 2. Sensitivity and Specificity for gramnegative bacteria data using 10fold crossvalidation including the total predication accuracy.
Each classifier was able to produce results with high sensitivity and specificity, but neither could identify the proteins that localized to the VAC site. The RaWa performs slightly better since it could predict the proteins that localized to POX and ERL, whereas the SVM could not. As illustrated in Table 2, both classifiers produced high sensitivities and specificities on the 5 locations, but according to the total accuracy listed in the last row, our classifier outperformed the SVM by 1%.
We further compared the two classifiers using receiver operating characteristic curves (ROCs). Figure 3 and Figure 4 depict the results for yeast and Geng, respectively, and each figure contains the ROC curve for the RaWa method on the left and the ROC curve for the SVM method on the right. These figures together offer an intuitive comparison and show that our RaWa classifier is effective and that the results are comparable to those derived from a SVMbased method.
Discussion
Herein, we propose a novel classification model for label propagation through random walks on graphs. We first initialized an undirected complete graph over the labeled data whose data points act as the nodes and pairwise distances act as the weights. Then, labels and weights are employed to construct the state matrix and state transition matrix so that any node can start a random walk and propagate its label to any unlabeled data point after several steps. This model is also optimized by a kernel method and regularization so as to provide flexible control over the transition matrix.
One interesting possibility for future work is to develop algorithms for a clever selection of the labeled dataset and the kernel based on the data. In this study, we used the very simple Gaussian kernel with the identity covariance matrix, which likely does not exploit the similarity information conveyed in the data points.
Conclusions
Protein cellular and subcellular localization has been an important facet of research because of its role in characterizing protein functions and proteinprotein interactions. In this study, we developed a novel approach based on a random walk technique to predict protein localization. We demonstrated that this approach improves the accuracy of predicting protein (sub)cellular localization and is easy to train. When compared to the SVM classifier, our results are both competitive and promising.
Methods
Data preparation
To apply our method to predict and classify protein (sub)cellular localization, we utilized two datasets: the widely used yeast data from the UCI database and the gramnegative bacteria proteins from the CellPLoc package. The yeast data, including 1484 items with 8 attributes, were used to predict the cellular localization of proteins and have been categorized into 10 classes. The second dataset was first used by Shen and Chou in their predictors [29,30] particularly for the prediction of gramnegative bacteria proteins. This dataset contained 1114 gramnegative (Gneg) bacterial proteins classified into 5 subcellular locations according to experimental annotations. None of the proteins had more than 25% sequence identity to any other in the same subset (subcellular location). Detailed information is provided in Table 3.
Table 3. Information about gramnegative and yeast data
First, we represented a protein sample P with L amino acid residues by its evolutionary and sequence information. Here, for simplifying the formulation without losing generality, we use the numerical codes 1, 2... 20 to represent the 20 native amino acid types according to their single character symbols in alphabetical order. Then, the positionspecific scoring matrix (PSSM) was introduced as a descriptor of evolutionary information. The PSSM produced a matrix M_{L×20 }where M_{i→j }represents the score of the amino acid residue in the ith position of the protein sequence being mutated to amino acid type j through evolution.
However, according to the PSSM descriptor, proteins with different lengths will correspond to matrices with different numbers of rows. To allow the PSSM descriptor to have a uniform representation, a given protein sample P could be represented by the mean value of each row:
However, as a result, all the sequenceorder information would be lost. To avoid the complete loss of the sequenceorder information, we also adopted the concept of the pseudoamino acid composition (PseAA), as originally proposed in [31]. According to the representation of the PseAA, the protein P is formulated by
where p_{1},p_{2},...,p_{20 }are associated with the conventional amino acid composition, reflecting the occurrence frequencies of the 20 native amino acids in the protein P.
We thus represented the protein P by combining PSSM and PseAA in the following form .
In order to obtain the PseAA values, the lambda was set to 49, and the weight was 0.05. Since there are 3 proteins whose lengths were shorter than 49 amino acids, we obtained 1111 proteins with 89 features.
Problem formulation
Usually, a training set (X, C) specifies the set of labeled data and the set of their classes, n is the number of tuples in X, and then the classes of a test set can be predicted. We first considered an initial graph of the form G(V, E, W), which was constructed over the training set, where V is the set of nodes and its member v_{i }only responds to (x_{i}, c_{i}). This graph is assumed to be complete; therefore the edge set E is trivial. We thus provided the labeled nodes with a certain probability to travel to other nodes (explained below). W represents the edge weight matrix sized n×n and indicates the pairwise similarities, w_{ij }= sim(v_{i},v_{j}) = sim(x_{i},x_{j}).
We also let Y be a set of m labels that can be applied to nodes of the graph. After the initial weighted graph was generated, a state transition matrix P = [P_{ij}]_{n×n }was defined to infer the probability p_{ij }that one node v_{i }transitions to the state of node v_{j}. P is generally computed as P = D^{1}W, where the diagonal matrix D = diag(W1_{n}) and 1_{n }is a ndimensional vector with all values set to 1. We next converted y_{i }into a vector of labels (i.e., ), where y_{i }= [y_{1i},y_{2i},...,y_{mi}]^{T}. Therefore, the label or state of v_{i }is c_{j }if and only if y_{ji }= 1. Y can be also referred to as the state matrix of V or X.
Given the state matrix and transition matrix, a simple random walk on V is described as the process that the state y_{i }of any node v_{i }transitions with the probability p_{ij }to the state y_{j }of node v_{j}. Thus, the states of labeled data are not encoded as the absorbing states. Random walks on readily labeled nodes are meaningless since we utilized the information already encoded in the partially labeled graph to help us predict labels, but the initial graph G is just a labeled graph. Therefore, given each data point lacking a label from the test set, we added it to graph G as an unlabeled node. The traditional classification problem has thus been converted to a node classification problem on a partially labeled graph by this method.
Random walk classification model
We next aimed to deduce a simple classifier based on the nodes that are labeled so it can be applied to predict the labels of the unlabeled nodes. Our solution was a state vector y that provides the label for an unlabeled data point x.
We first provide an example to clarify the process of label propagation through random walks. Consider an initial graph G constructed over the training data (X, Y) = {(x_{1}, c_{1}), (x_{2}, c_{1}), (x_{3}, c_{2})}. Each data point lacking a label is added into graph G as an unlabeled node. Figure 5 displays such a graph G' after three unlabeled data points were added. The graph G' is often assumed to be labelconnected to become completely labeled [32]; that is, it is possible to reach a labeled node from any unlabeled node in a finite number of steps. For example, if in a random walk, the sixth node v_{6 }ends at the second node v_{2}, then this node will be labeled as c_{1}.
Figure 5. A simple partially labeled graph.
Node classification relies on a random walk originating at the unlabeled node v_{j }and ends at one labeled node v_{i }after several steps, and in this way, v_{j }obtains its label from v_{i}. If during the walk an unlabeled node reaches a labeled node for the first time, it will not remain at that node because the labeled nodes are not absorbing states; rather, the unlabeled node will move to another node with a certain probability. Since graphs G and G' are undirected and symmetric, a random walk that starts at v_{j }and ends at v_{i }can be also revertible.
Next, we assume p(v_{i}, v) to be the statetransition probability with which a walk proceeds from node v_{i }in V to the new node v represented by unlabeled data point x. The state y of new node v is represented as
where
For the node v_{i }in V, we have
Similarly, for the new node v not in V, p(V, v) is computed as:
Therefore, the state y of v can be obtained by the following equation:
where W^{+ }denotes the pseudoreverse matrix of W. This is preferred over the inverse of W because W may sometimes be singular. w(V, v) is a column vector that indicates the similarity between the new node v and nodes in V.
Model training
In order to train an effective classifier, the labeled data should be fully utilized; however the influence of noise within the training data should be avoided, especially because biological measurements always contain a certain amount of noise.
Therefore, we trained our classification model with a prediction adjustment using complementary training data. We first partitioned the training data X in a balanced fashion, which resulted in two subsets with a similar size, each having a certain amount of data belonging to each class in C. The two subsets S and T thus have properties such that S∪T = X and S∩T = Φ. Next, we allow the two complementary sets to predict each other with the above equation, and we can get:
To evaluate the performance of this prediction, we computed the test loss on S and T according to the following equations:
where classifier's performance increases with decreasing test loss. Moreover, we defined the total loss as
Though the total loss could be minimized through repeated random partitions of the training data, it is time consuming. We note that the test loss also indicates the importance of its corresponding subset, so we can impose a weight on each subset to highlight this difference. We then defined the state matrix to be:
The weight vector was computed as follows:
For the transition matrix, we usually consider a multistep random walk; for t steps, we just replace P with P^{t}. During a random walk of t steps, the state of the new node v or new data point x is:
Previous studies have treated the labeled nodes as absorbing states, such that P = I, but here we considered lazy random walks, i.e., P^{t }= (αI + (1α)P)P^{t}1, where α∈(0,1) is a laziness parameter indicating that the nodes will stay at their current positions with probability α
Further improvement with the kernel method and regularization
Usually, k(u, v) denotes the kernel function so that k(X, x)=[k(x_{1}, x), k(x_{2}, x),..., k(x_{n}, x)]^{T}. We defined the kernel matrix K in the space (X, X) as K = k(X, X) = [k(x_{i}, x_{j})]_{n×n}, and F was defined as a classifier. The kernel function k(X, x) and kernel matrix K were employed to substitute for the similarity metric w(V, v) and weighted matrix W, respectively.
With the kernel method embedded, we formulated our random walk classifier as:
Again, assuming , the final classification model is represented as:
The idea underlying the random walk methods is that the probability of labeling a node v with a label (or state) y is the total probability that a random walk starting at v will end at a node labeled y. F(x) therefore is more likely to return a probability distribution such as , where each distribution f_{ji }refers to the total probability that the a random walk starting at node v_{i}stops at any node labeled c_{j }after t steps. The largest f_{ji }allows v_{i }to be assigned label c_{j}.
K sometimes is a singular matrix because of insufficient data or the existence of noise, or there could be more than one optimized solution for W. In either case, computing w is not recommended. We thus use regularization to improve upon illposed problems. To enhance the robustness of our classifier, we introduced a regularization parameter λ into the kernel matrix, thereby formulating the regularized random walk basic classifier. In our experiments, we fixed λ to 0.0001 to avoid interference from the original data.
If the dimension of X is d, then the time cost for computing the kernel matrix and pseudoreverse matrix to build the model for our classifier is O(dn^{2}) and O(n^{3}), respectively. requires a complexity of O(mn^{2}), where m ≤ n, so the overall cost is estimated as O(dn^{2}) + O(n^{3}) + O(mn^{2}) = O(max{d, n}n^{2}).
List of Abbreviations
HMM: Hidden Markov Models; kNN: k Nearest Neighbor; SVM: Support Vector Machine; RBF: Radial Basis Function; PSI: PositionSpecific Iterated; BLAST: Basic Local Alignment Search Tool; PseAA: Pseudo Amino acid; RaWa: Random Walk Classifier; Gneg: gramnegative bacteria; ROC: receiver operating characteristic curve.
Authors' contributions
XX conceptualized the theoretical framework for this study. LL implemented the idea and conducted the experiments. PH, LC managed and coordinated the project. All authors participated in writing and revising the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under grant No. 61003180, No. 61070047 and No. 61103018; Natural Science Foundation of Education Department of Jiangsu Province under contract 09KJB20013; Natural Science Foundation of Jiangsu Province under contracts BK2010318 and BK2011442; Research Innovation Program for College Graduates of Jiangsu Province (CXLX12_0917); and The New Century Talent Project of Yangzhou University.
Declarations
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 8, 2013: Proceedings of the 2012 International Conference on Intelligent Computing (ICIC 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S8.
References

Bork P, Eisenhaber F: Wanted: subcellular localization of proteins based on sequence.
Trends Cell Biol 1998, 8:169170. PubMed Abstract  Publisher Full Text

Olof Emanuelsson: Predicting protein subcellular localisation from amino acid sequence information.
Briefings in bioinformatics 2002, 3(4):4361376. PubMed Abstract  Publisher Full Text

Kenichiro Imai, Kenta Nakai: Prediction of subcellular locations of proteins: Where to proceed?
Proteomics 2010, 10(22):39703983. PubMed Abstract  Publisher Full Text

Junfeng Xia, Xingming Zhao, Jiangning Song, Deshuang Huang: APIS: accurate prediction of hot spots in protein interfaces by combining protrusion index with solvent accessibility.
Bioinformatics 2010, 11:174. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Junfeng Xia, Xingming Zhao, Jiangning Song, Deshuang Huang: Predicting proteinprotein interactions from protein sequences using Meta predictor.
Amino Acids 2010, 39(5):15951599. PubMed Abstract  Publisher Full Text

Paul Horton, Kenta Nakai: A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins. In proceedings of fourth international conference on Intelligent Systems in Molecular Biology 1215 June 1996. Volume 4. Edited by David J. States, David Johnson. St. Louis USA; 1996::109115. PubMed Abstract

Paul Horton, Kenta Nakai: Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier.
In proceedings of fifth international conference on Intelligent Systems in Molecular Biology: 2125 June 1997; Halkidiki, Greece Edited by Terry Gaasterland, Theresa. 1997, 5:147152. PubMed Abstract

Feng ZP: An overview on predicting subcellular location of a protein.
Silico Biology 2002, 2(3):291303. PubMed Abstract  Publisher Full Text

Donnes P, Hoglund A: Predicting protein subcellular localization: past, present, and future.
Genomics Proteomics Bioinform 2004, 2(4):209215. PubMed Abstract

Kuochen Chou, Hongbin Shen: Recent progress in protein subcellular location prediction.
Analytical Biochemistry 2007, 37:0116. PubMed Abstract  Publisher Full Text

Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R: Predicting subcellular localization of proteins using machinelearned classifiers.
Bioinformatics 2004, 20(4):547556. PubMed Abstract  Publisher Full Text

Gardy JL, Brinkman FS: Methods for predicting bacterial protein subcellular localization.
Nat Rev Microbiol 2006, 4(10):741751. PubMed Abstract  Publisher Full Text

Yu CS, Chen YC, Lu CH, Hwang JK: Prediction of protein subcellular localization Proteins.
Proteins 2006, 64(3):643651. PubMed Abstract  Publisher Full Text

Nair R, Rost B: Protein subcellular localization prediction using artificial intelligence technology.
Methods Mol Biol 2008, 484:435463. PubMed Abstract  Publisher Full Text

Eric JuanYT, Chang JH, Li CH, Chen BY: Methods for Protein Subcellular Localization Prediction.
In proceedings of fifth International Conference on Complex, Intelligent, and Software Intensive Systems: 30 June  2 July 2011; Seoul, Korea Edited by Leonard Barolli, Fatos Xhafa, llsun You, Nik Bessis. 2011, 553558.

Briesemeister S, Rahnenfuhrer J, Kohlbacher O: Going from where to why  interpretable prediction of protein subcellular localization.
Bioinformatics 2010, 26(9):12321238. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Laurila K, Vihinen M: PROlocalizer: integrated web service for protein subcellular localization prediction.
Amino Acids 2011, 40(3):975980. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Lovász L: Random Walks on Graphs: A survey.
Combinatorics, Paul Erdӧs is Eighty (Vol. 2), Keszthely(Hungary) 1993, 2:0146.

Smriti B, Graham C, Muthukrishnan S: Node classification in social networks.

Gregory LawlerF: Simple Random Walk.
Intersections of Random Walks Modern Birkhäuser Classics 2013, 1146.

Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, William Stafford Noble: Semisupervised protein classification using cluster kernels.
Bioinformatics 2005, 21:32413247. PubMed Abstract  Publisher Full Text

Min R, Bonner A, Li J, Zhang Z: Learned randomwalk kernels and empiricalmap kernels for protein sequence classification.
J Comput Biol 2009, 16(3):457474. PubMed Abstract  Publisher Full Text

Freschi V: Protein function prediction from interaction networks using a random walk ranking algorithm.
In Proceedings of the seventh IEEE international conference on Bioinformatics and Biomedicine BIBM(2007): 24 November California USA Edited by Xiaohuo Hu, lon mandoiu, Zoran Obradovic, Jiali Xia. 2007, 4248.

Yuan Z: Prediction of protein subcellular locations using Markov chain models.
FEBS Letters 1999, 451:2326. PubMed Abstract  Publisher Full Text

Caragea C, Caragea D, Silvescu A, Honavar V: Semisupervised prediction of protein subcelular localization using abstraction augmented Markov models.
BMC Bioinformatics 2010, 11(Suppl 8):S6. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Xiaohua Xu: Random Walk Learning on Graph. PhD thesis. Nanjing University of Aeronautics and Astronautics, Computer Science Department; 2008.

Chang CC, Lin CJ: LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm] webcite

Szummer M, Jaakkola T: Patially labeled classification with markov random walk.
Advances in neural Information Processing Systems 2002, 14:945952.

Kuochen Chou, Hongbin Shen: Largescale predictions of Gramnegative bacterial protein subcellular locations.
J Proteome Res 2007, 5:34203428. PubMed Abstract  Publisher Full Text

Shen HB, Chou KC: GnegmPLoc: a topdown strategy to enhance the quality of predicting subcellular localization of Gramnegative bacterial proteins.
J Theor Biol 2010, 264(2):326333. PubMed Abstract  Publisher Full Text

Shen H, Chou K: NucPloc: a new webserver for predicting protein subnuclear localization by fusing PseAA and PsePSSM.
Protein Engineering Design & Selection 2007, 20(11):561567. PubMed Abstract  Publisher Full Text

Azran A: The rendezvous algorithm:Multiclass semisupervised learning with markov random walks.
In proceedings of the twentyforth International Conference on Machine Learning 2024 June 2007; Corvallis, Oregon USA Edited by Zoubin Ghahramani. 2007, 11441151.