Abstract
Background
Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data.
Results
In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs, both of which make use of unlabeled data.
Conclusions
The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM-based semi-supervised MMs; and (iii) are comparable in performance with, and in some cases outperform, the co-training based semi-supervised MMs.
Background
The problem of predicting subcellular protein localization is important in cell biology, because it can provide valuable information for predicting protein function and protein-protein interactions. Furthermore, the location of proteins in their designated subcellular compartments is essential for the proper functioning of the cell. Abnormal subcellular localization has been correlated with diseases such as cancer [1].
Many supervised machine learning methods have been successfully applied to the problem of predicting the subcellular localization of a protein, which can be formulated as a sequence classification problem [2], where the amino acid sequence of a protein is used to classify it into localization classes. For example, Park and Kanehisa [3] trained Support Vector Machine (SVM) classifiers using as features the frequencies of occurrence of pairs of amino acids, with 0 to 3 gaps between them. Emanuelsson et al. [4] developed a Neural Network-based approach using only information available in the N-terminal sequence. Höglund et al. [5] integrated information from the N-terminal sequence, amino acid composition, and protein sequence motifs in an SVM-based approach. Ong and Zien [6] trained multi-class SVMs and used an automated combination of protein motif kernels, with motifs of length up to 5 extracted from the whole sequence and from different subsequences of it, i.e., the first 15 and 60 amino acids, and the last 15 amino acids of the sequence. Scott et al. [7] developed a Bayesian network that predicts the subcellular localization of a target protein using its features, e.g., InterPro motifs and the subcellular localization of its interacting partners. Yuan [8] trained k-th order Markov chain models, with k ranging from 1 to 8, and used an approximation technique to estimate the probability of each element in a sequence given the k contiguous previous elements.
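To make the last of these approaches concrete: a k-th order Markov chain estimates the probability of each sequence element given the k elements preceding it. The following is a minimal sketch of this idea, not Yuan's implementation; the add-one smoothing and the toy sequences are our own assumptions:

```python
import math
from collections import defaultdict

def train_markov(sequences, k=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Estimate P(next symbol | k preceding symbols) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def prob(context, a):
        total = sum(counts[context].values())
        return (counts[context][a] + 1) / (total + len(alphabet))

    return prob

def log_likelihood(seq, prob, k=2):
    """Log-probability of a sequence under the chain (first k symbols ignored)."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))
```

A per-class model of this kind can classify a new sequence by comparing the class-conditional log-likelihoods.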
The accuracy of classifiers obtained using supervised learning algorithms depends in part on the quantity of labeled data that is available. Recent advances in sequencing technologies have resulted in an exponential increase in the rate at which DNA and protein sequence data are being acquired [9]. Because annotating the sequences with their subcellular localization requires costly experiments and manual curation effort, reliable annotations are available for only a small fraction of protein sequences. However, even the unlabeled data can provide valuable information, i.e., they contain information about the joint probability distribution over sequence elements. Consequently, there is a significant interest in semi-supervised algorithms [10] that can exploit large amounts of unlabeled data together with limited amounts of labeled data in training classifiers to predict protein subcellular localization.
Formally, the semi-supervised learning problem can be defined as follows: given training data D = (D_{L}, D_{U}) consisting of labeled examples D_{L} = {(x_{l}, y_{l})}, where x_{l} ∈ R^{d} and y_{l} ∈ Y, and unlabeled examples D_{U} = {x_{u}}, where x_{u} ∈ R^{d}, with |D_{L}| ≪ |D_{U}|; a hypothesis space H; and a performance criterion P, a learning algorithm L outputs a classifier h ∈ H that optimizes P. If D_{L} = ∅, the problem reduces to unsupervised learning; if D_{U} = ∅, it reduces to supervised learning. The input x can also represent a sequence over a finite alphabet X, i.e., x ∈ X*. During classification, the task of the classifier h is to accurately assign a new example x_{test} to a class label y ∈ Y.
Recently, a variety of methods for semi-supervised learning have been developed in the literature (see [11], [10] for reviews). Such methods have been successfully applied in many areas including text classification [12], [13], [14], natural language processing [15], [16], [17], image annotation [18], and more recently bioinformatics and computational biology [19], [20], [21]. However, semi-supervised learning methods have not been widely applied to the subcellular localization prediction problem.
One notable exception is the work of Xu et al. (2009) [22]. The authors applied CoForest, an algorithm proposed by Li and Zhou [23], to exploit unlabeled data in order to improve predictive accuracy on the protein subcellular localization prediction task. CoForest extends the co-training approach of Blum and Mitchell [13] by using an ensemble of N classifiers, called a Random Forest [24]. Note that the original co-training approach uses only two classifiers [13]. CoForest works as follows: let H^{N} = {h_{1},…,h_{N}} denote an ensemble of N classifiers. For each classifier h_{i} ∈ H^{N}, let H_{i} denote the concomitant ensemble of h_{i}, where H_{i} is defined as the set of classifiers in H^{N} except h_{i}, i.e., H_{i} = H^{N} \ {h_{i}}. An ensemble H^{N} of N random trees is initially trained on the labeled data D_{L}. At each subsequent iteration, for each classifier h_{i} ∈ H^{N}, its concomitant ensemble H_{i} examines the unlabeled examples in D_{U}. An unlabeled example, along with the label predicted by H_{i}, is added to the newly labeled set of h_{i} if the number of classifiers in H_{i} that predict a particular label exceeds a predefined threshold. The classifier h_{i} is then retrained on the labeled data augmented with its newly labeled set. The process is repeated until no tree in the Random Forest changes from one iteration to another [23].
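The iterative scheme can be sketched as follows. This is a deliberately simplified illustration, not the CoForest implementation of [23]: a toy nearest-centroid learner stands in for the random trees, a majority-vote agreement fraction plays the role of the confidence threshold, and a fixed iteration cap replaces the convergence test:

```python
import numpy as np

class NearestCentroid:
    """Toy base learner standing in for a random tree."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(-1)
        return self.classes[d.argmin(axis=1)]

def coforest(X_lab, y_lab, X_unl, n_members=6, theta=0.75, n_iter=5, seed=0):
    rng = np.random.RandomState(seed)
    # Initial ensemble: each member is trained on a bootstrap sample of D_L.
    members = []
    for _ in range(n_members):
        idx = rng.randint(0, len(X_lab), len(X_lab))
        members.append(NearestCentroid().fit(X_lab[idx], y_lab[idx]))
    for _ in range(n_iter):
        for i in range(n_members):
            # The concomitant ensemble of member i votes on the unlabeled data.
            votes = np.array([m.predict(X_unl)
                              for j, m in enumerate(members) if j != i])
            labels, agreement = [], []
            for col in votes.T:
                vals, cnts = np.unique(col, return_counts=True)
                labels.append(vals[cnts.argmax()])
                agreement.append(cnts.max() / len(col))
            mask = np.array(agreement) >= theta  # confidence threshold
            if mask.any():  # retrain member i on D_L plus newly labeled examples
                X_new = np.vstack([X_lab, X_unl[mask]])
                y_new = np.concatenate([y_lab, np.array(labels)[mask]])
                members[i] = NearestCentroid().fit(X_new, y_new)
    return members

def majority_vote(members, X):
    votes = np.array([m.predict(X) for m in members])
    out = []
    for col in votes.T:
        vals, cnts = np.unique(col, return_counts=True)
        out.append(vals[cnts.argmax()])
    return np.array(out)
```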
In this paper, we present a novel semi-supervised approach to the problem of predicting protein subcellular localization. Specifically, we use abstraction augmented Markov models (AAMMs), which are variants of Markov models, to incorporate information available in the unlabeled data. AAMMs model the dependency of each element in a sequence on abstractions of the k preceding elements [25]. The abstractions are organized into an abstraction hierarchy that groups together k-grams that induce similar conditional probabilities of the next letter in the sequence. An AAMM corresponds to a generative model for sequence data expressed in terms of random variables whose values correspond to abstractions over k-grams, in addition to the MM random variables [25]. AAMMs provide a simple way to incorporate unlabeled data into the model: first, the abstraction hierarchy is constructed using the entire training set including the unlabeled data. Next, the labeled data is used to estimate the parameters of a set of AAMMs (one for each class) based on the resulting abstraction hierarchy.
Thus, in effect, AAMMs: (i) exploit the relatively large amount of unlabeled data to discover abstractions that transform the sequence data x and, hence, effectively reduce the number of parameters used to specify the probability p(x); and (ii) use the resulting representation to estimate the posterior probability p(y|x). Hence, we hypothesize that AAMMs are likely to yield more robust estimates of p(y|x) than MMs when the amount of labeled data is much smaller than the amount of unlabeled data.
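The two-phase training procedure (abstraction hierarchy from all data, per-class parameter estimation from labeled data only) can be illustrated with a deliberately crude sketch. This is not the AAMM algorithm of [25]: here, as a stand-in for agglomeratively merging k-grams with similar conditional next-symbol distributions, k-grams are grouped simply by their most frequent next symbol, and the sequences below are toy examples:

```python
import math
from collections import defaultdict

AA = "ACDEFGHIKLMNPQRSTVWY"

def abstraction_map(sequences, k=2):
    """Phase 1: build a crude 'abstraction hierarchy' from ALL sequences
    (labeled and unlabeled), by grouping k-grams by most frequent next symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sequences:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return {g: max(c, key=c.get) for g, c in counts.items()}

def train_aamm(labeled, abst, k=2):
    """Phase 2: per-class counts of (abstraction -> next symbol), labeled data only."""
    models = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for seq, y in labeled:
        for i in range(k, len(seq)):
            a = abst.get(seq[i - k:i], "?")  # "?" marks an unseen k-gram
            models[y][a][seq[i]] += 1
    return models

def classify(seq, models, abst, k=2):
    """Assign the class with the highest smoothed log-likelihood."""
    best, best_ll = None, -math.inf
    for y, m in models.items():
        ll = 0.0
        for i in range(k, len(seq)):
            c = m[abst.get(seq[i - k:i], "?")]
            ll += math.log((c[seq[i]] + 1) / (sum(c.values()) + len(AA)))
        if ll > best_ll:
            best, best_ll = y, ll
    return best
```

Note how the abstraction map is built from all sequences, while the per-class counts come only from the labeled set.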
To test our hypothesis on the protein subcellular localization prediction task, we compare AAMMs that use both labeled and unlabeled data with AAMMs that use only labeled data, with the standard MMs, which cannot make use of unlabeled data, and also with MMs that can incorporate unlabeled data through an expectation maximization approach (EM-MM) and a co-training approach. The results of our experiments show that AAMMs can make effective use of unlabeled data and significantly outperform EM-MMs when the amount of labeled data is very small and relatively large amounts of unlabeled data are readily available. Here, because of the small amounts of labeled data available for estimating parameters, the ability of AAMMs to minimize overfitting (through parameter smoothing) turns out to be especially useful. The results also show that AAMMs are competitive with, and in some cases significantly outperform, two MMs co-trained on different views of the data.
Experiments and results
We present results of experiments on three protein subcellular localization data sets: the psortNeg, plant, and non-plant data sets (see the Data sets section for details).
Experimental design
Our experiments on the protein subcellular localization prediction task are designed to explore the following questions: (i) How does the performance of semi-supervised AAMMs, which use both labeled and unlabeled data, compare to that of MMs trained only on labeled data? (ii) How do AAMMs compare with MMs when both use unlabeled data? (iii) How effective are AAMMs at exploiting unlabeled data to improve classification accuracy when the amount of labeled data is limited? Specifically, how does the performance of an AAMM trained using both labeled and unlabeled data compare to that of an AAMM trained using only labeled data when both take advantage of abstraction? To answer the first and second questions, we compared AAMMs trained using an abstraction hierarchy constructed from both labeled and unlabeled data with the standard MMs, which cannot make use of unlabeled data, and with MMs that can incorporate unlabeled data through an expectation maximization approach (EM) [26]. To answer the third question, we compared AAMMs trained using an abstraction hierarchy constructed from both labeled and unlabeled data with AAMMs trained using an abstraction hierarchy constructed only from labeled data.
In the first set of experiments, we trained semi-supervised AAMMs and supervised MMs for the psortNeg, plant, and non-plant data sets. We ran experiments with 1%, 10%, and 25% of the training data being used as labeled examples, and the rest being treated as unlabeled examples (by ignoring the class label). To obtain the subsets of 1%, 10%, and 25% of labeled examples, we sampled examples, using a uniform distribution, from the training set. Semi-supervised AAMMs are trained for values of m that range from 1 to N, where m is the cardinality of the set of abstractions A_{m} used as “features” in the classification model, and N is the number of unique k-grams. We learned a single abstraction hierarchy (AH) from both labeled and unlabeled training data and used it to train an AAMM for each class (from the labeled sequences). An MM is trained on the same fraction of labeled data as its AAMM counterpart.
In the second set of experiments, we trained AAMMs, MMs, and EM-MMs for all three data sets. In the case of AAMMs, we trained classifiers for m = 1500 (m is set to 1500 because this partitioning of the set of k-grams produces classifiers that use a substantially smaller number of “features” compared to MMs, i.e., ≈ 8000 k-grams, while at the same time the model compression is not so stringent as to lose important information in the data through abstraction). We denote by AAMM(l+u) an AAMM trained using an AH constructed from both labeled and unlabeled data, and by AAMM(l) an AAMM trained using an AH constructed only from labeled data, when it is necessary to distinguish between AAMM training procedures. EM-MMs are trained on the same fractions of labeled and unlabeled data as their AAMM(l+u) counterparts, and AAMM(l) and MMs are trained on the same fraction of labeled data as their AAMM(l+u) counterparts.
Here, we fixed the number of unlabeled examples and varied the number of labeled examples. Specifically, we performed experiments with 1%, 5%, 10%, 15%, 20%, 25%, 35%, and 50% of the training data being used as labeled examples, and 50% being treated as unlabeled examples (by ignoring the class label). Note that the unlabeled subset of the training data is the same across all the experiments; the labeled subset of the training data is successively augmented to increase the amount of labeled data that is provided to the learner.
In the third set of experiments, we compared AAMMs with EM-MMs using a fixed number of labeled examples and a variable number of unlabeled examples. We performed experiments with (i) 1% of the training data being treated as labeled, while 1%, 10%, 25%, 50%, 75%, 90%, and 99% is treated as unlabeled; (ii) 10% of the training data being treated as labeled, while 1%, 10%, 25%, 50%, 75%, and 90% is treated as unlabeled; and (iii) 25% of the training data being treated as labeled, while 1%, 10%, 25%, 50%, and 75% is treated as unlabeled. As before, to obtain the subsets of labeled and unlabeled examples, we sampled, using a uniform distribution, from the training set. In all experiments, the class distribution in each labeled subset is the same as that in the entire training set.
In the fourth set of experiments, we compare the semi-supervised AAMM with the co-training procedure described in [13]. The co-training procedure can be used with any learning algorithm to train two classifiers h_{1} and h_{2}. In this study, we used two Markov models. Inspired by the work of Ong and Zien [6], instead of considering features extracted from the entire protein sequence, we considered two views based on different subsequences. Specifically, the first view corresponds to features extracted from the first 60 amino acids of each sequence, whereas the second view corresponds to features extracted from the last 15 amino acids of each sequence. We trained each Markov model on a different view. Furthermore, in this experiment, we trained AAMMs on the two subsequences of the first 60 and the last 15 amino acids of each sequence, rather than the entire sequence.
Results
For all of the experiments, we report the average classification accuracy obtained in a 5-fold cross-validation experiment. All models are trained using 3-grams extracted from the data. For the psortNeg, plant, and non-plant data sets, the number of 3-grams is 7970, 7965, and 7999, respectively. Although the number of all unique k-grams is exponential in k, for large values of k, many of the k-grams may not appear in the data (consequently, the counts for such k-grams are zero). Note that the number of unique k-grams is bounded by the cardinality of the multiset of k-grams extracted from D.
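The bound in the last sentence is easy to see in code; the sequences below are invented toy examples, not data from the three data sets:

```python
def kgrams(seq, k=3):
    """All overlapping k-grams of a sequence, as a list (i.e., a multiset)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy sequences; a real data set yields thousands of distinct 3-grams.
seqs = ["ACACACAC", "MKTAYIAK"]
multiset = [g for s in seqs for g in kgrams(s)]
unique = set(multiset)
# The number of unique k-grams is bounded both by the alphabet (20^3 for
# amino acid 3-grams) and by the size of the extracted multiset.
assert len(unique) <= min(len(multiset), 20 ** 3)
```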
We define the relative reduction in classification error between two classifiers to be the difference in error divided by the larger of the two error rates. To test the statistical significance of results, we used the 5-fold cross-validated paired t test for the difference in two classification accuracies [27]. The null hypothesis that the two learning algorithms M_{1} and M_{2} have the same accuracy on the same test set can be rejected if t(M_{1}, M_{2}) > t_{4,0.975} = 2.776 (p < 0.05). We abbreviate t(M_{1}, M_{2}) by t in what follows.
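Both quantities can be computed directly from these definitions; the following sketch uses invented fold accuracies for illustration:

```python
import math

def relative_error_reduction(acc1, acc2):
    """Difference in error divided by the larger of the two error rates."""
    e1, e2 = 1.0 - acc1, 1.0 - acc2
    return abs(e1 - e2) / max(e1, e2)

def paired_t(acc_a, acc_b):
    """Paired t statistic over per-fold accuracies (df = n_folds - 1)."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)
```

With 5 folds, the difference is significant at p < 0.05 when the statistic exceeds t_{4,0.975} = 2.776.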
AAMMs can provide more accurate models than MMs on the protein subcellular localization prediction task when the amount of labeled data is small compared to that of unlabeled data.
Figure 1 shows results of the comparison of AAMMs with MMs for 1%, 10%, and 25% of labeled data, for the non-plant, plant, and psortNeg data sets. Note that the x axis of all subfigures shows the number of abstractions m on a logarithmic scale. When only 1% and 10% of the training data are labeled (Figure 1, first and second rows), AAMMs significantly outperform MMs for many choices of m, on all three data sets. For example, on the 1% plant data set, with m = 200, the accuracy of AAMM is 38.72%, whereas that of MM is 30.53%, which represents a 12% reduction in classification error, with t = 3.16 (the largest values of t are 27.58 for m = 4905, 21.91 for m = 2070, and 27.34 for m = 535 on non-plant, plant, and psortNeg, respectively). On the 10% plant data set, with m = 560, AAMM achieves an accuracy of 47.97%, compared to 37.87% for MM, with t = 10.01. This represents a 16% reduction in classification error. When we increased the fraction of labeled data to 25%, AAMMs still outperform MMs for many choices of m on the non-plant and plant data sets, but become comparable in performance with MMs on the psortNeg data set.
Figure 1. Comparison of AAMMs with MMs. Comparison of AAMMs with MMs for 1% (first row), 10% (second row), and 25% (third row) of labeled data available for non-plant (left), plant (center), and psortNeg (right), respectively.
AAMMs trained using abstraction hierarchies constructed from both labeled and unlabeled protein subcellular localization data significantly outperform AAMMs trained using abstraction hierarchies constructed only from labeled protein subcellular localization data.
Figure 2 shows results of experiments that compare AAMM(l+u) with AAMM(l), MM, and EM-MM on the non-plant, plant, and psortNeg data sets. The x axis indicates the number of labeled examples in each data set. The number of unlabeled examples is kept fixed and is equal to the rightmost number of labeled examples on the x axis of each plot.
Figure 2. Comparison of AAMM(l+u) with AAMM(l), MMs, and EM-MMs. Comparison of AAMMs trained using an abstraction hierarchy learned from both labeled and unlabeled data, AAMM(l+u), with (i) AAMMs trained using an abstraction hierarchy learned only from labeled data, AAMM(l); (ii) Expectation-Maximization with Markov models, EM-MM; and (iii) Markov models, MM, on the non-plant (left), plant (center), and psortNeg (right) data sets. The x axis indicates the number of labeled examples in each data set corresponding to fractions of 1%, 5%, 10%, 15%, 20%, 25%, 35%, and 50% of the training data being treated as labeled data. The fraction of unlabeled data in each data set is fixed to 50%.
As can be seen in the figure, AAMM(l+u) significantly outperforms AAMM(l) on all three data sets when small fractions of labeled data are available. For example, with 110 labeled sequences on non-plant (i.e., 5% of labeled data), AAMM(l+u) achieves 63% accuracy while AAMM(l) achieves 52%, which gives a 23% reduction in classification error (t = 7.2). Strikingly, on the same data set, with only 22 labeled sequences (i.e., 1% of labeled data), AAMM(l+u) achieves 59% accuracy as compared to 43% obtained by AAMM(l), which gives a 28% reduction in classification error (t = 9.73). Hence, AAMM(l+u) is able to incorporate information available in the unlabeled data (i.e., joint probability distributions of contiguous amino acids in a sequence) to learn more robust abstraction hierarchies than AAMM(l) when the labeled training set is limited in size (thereby reducing the risk of overfitting).
Furthermore, AAMM(l+u) decreases the need for large amounts of labeled data. Specifically, on non-plant, AAMM(l+u) achieves 63% accuracy with 110 labeled examples, which is matched by AAMM(l) only with 438 labeled examples (≈ 4 times more labeled data). However, when the fraction of labeled data is large, and hence good estimates of model parameters can be obtained from such data, there is not much need for unlabeled data. For example, AAMM(l+u) becomes similar in performance to AAMM(l) on non-plant using 35% and 50% of labeled data (the null hypothesis is not rejected, t = 1.38 and t = 0.26, respectively).
As expected, the performance of AAMM(l+u) increases with the amount of labeled data. For example, on psortNeg with 12 labeled sequences (i.e., 1% of labeled data), AAMM(l+u) achieves 32% accuracy, while AAMM(l+u) with 289 labeled sequences (i.e., 25% of labeled data) achieves 58% accuracy, which corresponds to a 38% reduction in classification error.
AAMMs are able to incorporate information available in the unlabeled protein subcellular localization data, and hence produce more robust classifiers than MMs and EM-MMs, when the fraction of labeled protein subcellular localization data is small.
Again, as can be seen in Figure 2, AAMM(l+u) is superior in performance to MM, especially when small amounts of labeled data are available. For example, on plant, with 75 labeled sequences (i.e., 10% of labeled data), MM achieves 39% accuracy as compared to 44% obtained using AAMM(l+u) (t = 3.07). On non-plant, with 219 labeled sequences (i.e., 10% of labeled data), MM achieves 51% accuracy whereas AAMM(l+u) achieves 64% (t = 14). AAMM(l+u) not only incorporates information available in the unlabeled data (see the previous comparison), but also performs parameter smoothing. Thus, AAMM(l+u) provides more robust estimates of model parameters than MMs, and hence helps reduce overfitting when the labeled training set is limited in size.
Both AAMM(l+u) and EM-MM make use of information available in the unlabeled data (i.e., both improve the performance of their counterpart classifiers trained only from labeled data) on all three data sets, although the improvement is not very large on psortNeg (Figure 2). However, AAMM(l+u) uses the joint distribution over amino acids (independent of the class variable) to learn a more robust abstraction hierarchy (i.e., a finer partitioning of the set of k-grams), especially when the amount of labeled data is small, so that better estimates of parameters can be obtained. On the other hand, EM-MM uses the joint distribution over amino acids after an initial classifier has made predictions on the unlabeled data. When small amounts of labeled data are available, the predictions made by the initial classifier may not be reliable.
AAMM(l+u) significantly outperforms EM-MMs on the non-plant, plant, and psortNeg data sets when the fraction of labeled data is small (see Figure 2). For example, with only 22 labeled sequences on non-plant (i.e., 1% of labeled data), AAMM(l+u) achieves 59% accuracy while EM-MM achieves 42%, which gives a 29% reduction in classification error (t = 8.83). Similarly, with only 8 labeled sequences on plant (i.e., 1% of labeled data), AAMM(l+u) achieves 34% accuracy as compared to 28% obtained by EM-MM, which gives an 8% reduction in classification error (t = 4.66). As the amount of labeled data increases, EM-MM significantly outperforms AAMM(l+u). For example, with 767 labeled sequences on non-plant (i.e., 35% of labeled data), EM-MM achieves 69% accuracy while AAMM(l+u) achieves 67% (t = 4.87).
Note that EM may decrease, rather than increase, the accuracy of classifiers if the generative model assumptions are not satisfied (see Figure 2, plant data set). A weighted EM (i.e., weighting unlabeled sequences less) [12] helped improve the performance of EM-MMs (data not shown). A similar approach could be considered in AAMMs when learning the abstraction hierarchies.
Figure 3 shows results of the comparison of AAMMs with EM-MMs on the non-plant, plant, and psortNeg data sets, while varying the amount of unlabeled data for three different fractions of labeled data (i.e., 1%, 10%, and 25% of labeled data) that are kept fixed. The x axis indicates the number of unlabeled examples in each data set.
Figure 3. Comparison of AAMMs with EM-MMs. Comparison of AAMMs with EM-MMs for three different fractions of labeled data (i.e., 1%, 10%, and 25%) while varying the amount of unlabeled data on the non-plant (left), plant (center), and psortNeg (right) data sets. The x axis indicates the number of unlabeled examples in each data set corresponding to fractions of 1%, 10%, 25%, 50%, 75%, 90%, and 99% of the training data being treated as unlabeled data.
As can be seen in Figure 3, the improvement in performance of AAMMs over EM-MMs is rather dramatic when the amount of labeled data is quite small. For example, when only 1% of labeled data is used, regardless of the amount of unlabeled data, AAMMs consistently and significantly outperform EM-MMs on the non-plant and plant data sets (the largest and smallest t values on non-plant are 10.96 and 5.66, respectively). However, the difference in performance between AAMMs and EM-MMs diminishes as more and more labeled data become available (and eventually levels off). When the amount of labeled data is increased (e.g., 25% of labeled data), EM-MMs often significantly outperform AAMMs (Figures 3(a) and 3(c)). For example, on non-plant with 25% of unlabeled data, EM-MM achieves 68% accuracy, whereas AAMM achieves 66% (t = 7).
The classification accuracy of AAMMs typically increases with the amount of unlabeled data (when the subset of labeled data is fixed) (see Figure 3). For example, on non-plant, AAMM with 22 labeled sequences (i.e., 1% of labeled data) and 219 unlabeled sequences (i.e., 10% of unlabeled data) achieves an accuracy of 56%, as compared to 49% obtained by AAMM with 22 labeled sequences and 22 unlabeled sequences (i.e., 1% of unlabeled data), a 14% reduction in classification error.
AAMMs are comparable in performance with, and in some cases outperform, the co-training procedure, which uses MMs trained on different views of the protein subcellular localization data.
Figure 4 shows results of experiments that compare AAMMs with co-training MMs on the non-plant, plant, and psortNeg data sets, where we fixed the number of unlabeled examples (to 50%) and varied the number of labeled examples (from 1% to 50%, as before). The x axis indicates the number of labeled examples in each data set. The number of unlabeled examples is kept fixed and is equal to the rightmost number of labeled examples on the x axis of each plot.
Figure 4. Comparison of AAMMs with co-training MMs. Comparison of AAMMs with co-training MMs on the non-plant (left), plant (center), and psortNeg (right) data sets. AAMMs are trained on the first 60 and the last 15 amino acids of each protein sequence, AAMM(60 + 15). Co-training MMs consists of two co-trained MMs, one trained on the first 60 amino acids of each sequence, the other trained on the last 15 amino acids of each sequence. The x axis indicates the number of labeled examples in each data set corresponding to fractions of 1%, 5%, 10%, 15%, 20%, 25%, 35%, and 50% of the training data being treated as labeled data. The fraction of unlabeled data in each data set is fixed to 50%.
As can be seen in the figure, AAMMs trained on the first 60 and last 15 amino acids of each protein sequence significantly outperform two co-trained MMs, one trained on the first 60 amino acids of each sequence and the other trained on the last 15 amino acids of each sequence, on the non-plant data set. For example, with 22 labeled sequences (i.e., 1% of labeled data), AAMM achieves 59% accuracy while the co-trained MMs achieve 56% (t = 7.14). With 548 labeled sequences (i.e., 25% of labeled data), the accuracy of AAMM is 61%, whereas that of the co-trained MMs is 57% (t = 7.12). These results correspond to reductions in classification error of roughly 7% and 9%, respectively.
However, on the plant and psortNeg data sets, AAMMs are comparable in performance with co-training MMs. For example, on the plant data set using 188 labeled sequences (i.e., 25% of labeled data), the accuracy of AAMM is 57%, whereas that of co-training MMs is 58% (the null hypothesis is not rejected, t = 0.61).
Summary and discussion
Identifying the subcellular localization of proteins is an important problem with broad applications in computational biology, e.g., rational drug design. Computational tools for identifying protein subcellular localization that can exploit large amounts of unlabeled data together with limited amounts of labeled data are especially important because of the high cost and effort involved in labeling the data.
In this study, we presented an abstraction-based approach to semi-supervised learning of classifiers for the protein subcellular localization prediction task. Our approach utilizes abstraction augmented Markov models [25], which extend higher order Markov models by adding new variables corresponding to abstractions of k-grams (i.e., substrings of a fixed length k). AAMMs are probabilistic generative models that have the ability to incorporate information available in the unlabeled data: initially, an abstraction hierarchy over the k-grams is constructed from both labeled and unlabeled data, independent of the class variable. The labeled data is then used to estimate the model parameters, based on the resulting abstraction hierarchy.
In this paper, we compared AAMMs with MMs, EM-MMs, and co-trained MMs. The results of our experiments on the subcellular localization prediction task show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM-based semi-supervised MMs; and (iii) are comparable in performance with, and in some cases outperform, the co-training based semi-supervised MMs.
Related work on semi-supervised learning
A variety of approaches to semi-supervised learning have been studied in the literature (see [11], [10] for reviews). Most of the existing semi-supervised learning algorithms, including those based on co-training [13], Expectation Maximization (EM) [12], Transductive Support Vector Machines (TSVM) [14], cluster kernels [28], and manifold-based approaches [29], [30], essentially involve different means of transferring labels from labeled to unlabeled samples in the process of learning a classifier that can generalize well on new unseen data.
EM-based methods provide a way to estimate the parameters of a generative model from incomplete data [26], i.e., samples that contain missing values for some of the variables. Semi-supervised learning is a special case of such inference where it is the class labels that are missing for a subset of the data [12]. Specifically, the parameters of the model are estimated initially from the labeled fraction of the training data, D_{L}, and the resulting model is used to predict p(y|x) for each of the unlabeled samples in D_{U}. The parameters are re-estimated using the entire training data D, and this process is repeated until the estimates converge. Co-training [13] is a variant of this approach where unlabeled data are labeled with two different classifiers trained on different subsets of the features in x.
Several semi-supervised learning algorithms based on discriminative approaches to classification have been investigated. TSVM [14] can be seen as a discriminative counterpart of EM. TSVM starts by training an SVM on the labeled data and uses the trained SVM to label the unlabeled data. The algorithm then iteratively attempts to maximize the margin of separation between the sets of samples labeled by the SVM (by considering, at each iteration, alternative labels for pairs of originally unlabeled samples that have been assigned different labels by the SVM). A similar outcome can be achieved by adding an additional regularization term for unlabeled data to the objective function optimized by the SVM [10]. Similar approaches for exploiting unlabeled data in training discriminative classifiers include [31], [32], [33], [34].
An alternative approach to exploiting unlabeled data relies on the manifold assumption: high-dimensional data lies on a lower-dimensional manifold, making it possible to propagate labels from labeled samples to unlabeled samples based on some measure of closeness of the data points on the manifold. The manifold can be approximated by a weighted graph in which the nodes correspond to data samples and the weights on the links between nodes correspond to the pairwise similarity of the corresponding data points [35]. A number of techniques for label propagation have been proposed [29], [30]. Note that graph Laplacian based techniques can be interpreted as a more general type of regularization, where not only the L2 norm of the hypothesis is penalized but also the L2 norm of the hypothesis gradient.
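A minimal version of graph-based label propagation, in the spirit of [29], [30], iteratively averages neighbor labels while clamping the labeled nodes; the graph below is a toy example, not one of the techniques cited:

```python
import numpy as np

def label_propagation(W, y, labeled_idx, n_iter=100):
    """Iterative label propagation on a similarity graph.
    W: (n x n) symmetric, non-negative similarity matrix; y maps each
    labeled node index to its class index; labels are clamped at those nodes."""
    n = W.shape[0]
    n_classes = max(y[i] for i in labeled_idx) + 1
    F = np.zeros((n, n_classes))
    for i in labeled_idx:
        F[i, y[i]] = 1.0
    d_inv = 1.0 / W.sum(axis=1)  # assumes every node has at least one edge
    for _ in range(n_iter):
        F = d_inv[:, None] * (W @ F)  # each node averages its neighbors' labels
        for i in labeled_idx:         # re-clamp the labeled nodes
            F[i] = 0.0
            F[i, y[i]] = 1.0
    return F.argmax(axis=1)
```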
In contrast to the approaches reviewed above, we present a novel abstraction-based approach to semi-supervised learning of sequence classifiers. We compare the semi-supervised AAMMs with semi-supervised variants of Markov models trained using expectation maximization [12], and using co-training [13], [10].
Expectation Maximization applied to Markov models
EM applied to MMs (EM-MMs) involves an iterative alternation of E- and M-steps. Specifically, an initial Markov model is learned only from the labeled sequences D_{L} using Equations (3), (4), and (5) (initialization step). The current model is used to assign probabilistic labels to the (originally) unlabeled sequences D_{U}, i.e., to calculate the probability p(y_{u} = c_{j} | x_{u}) that each class generated an unlabeled sequence x_{u}, u = 1,…,|D_{U}|, using Equation (6) (E-step). Next, a new model is learned from the labeled sequences combined with the newly probabilistically labeled (originally unlabeled) sequences, using Equations (3), (4), and (5) (M-step) (see the Methods section for Equations (3)-(6)). The E- and M-steps are repeated until the model does not change from one iteration to the next [12].
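To make the E/M alternation concrete, the following minimal Python sketch runs EM over class-conditional k-th order transition estimates. This is our own illustrative code, not the implementation evaluated in the paper; the two-letter toy alphabet, the order K = 2, the function names, and the Laplace-smoothed estimates are our simplifying assumptions.

```python
from collections import defaultdict
from math import log, exp

K = 2            # Markov order (illustrative choice)
ALPHABET = "AB"  # toy alphabet standing in for the 20 amino acids

def log_likelihood(seq, trans):
    """log p(x_K..x_{n-1} | x_0..x_{K-1}, c) under one class's transition table."""
    return sum(log(trans.get((seq[i - K:i], seq[i]), 1.0 / len(ALPHABET)))
               for i in range(K, len(seq)))

def m_step(data, weights, classes):
    """Re-estimate per-class transition probabilities and priors from
    (possibly fractional) class weights, with Laplace smoothing."""
    params, priors = {}, {}
    for c in classes:
        counts, totals = defaultdict(float), defaultdict(float)
        for seq, w in zip(data, weights):
            for i in range(K, len(seq)):
                counts[(seq[i - K:i], seq[i])] += w[c]
                totals[seq[i - K:i]] += w[c]
        params[c] = {sg: (n + 1.0) / (totals[sg[0]] + len(ALPHABET))
                     for sg, n in counts.items()}
        priors[c] = sum(w[c] for w in weights) / len(data)
    return params, priors

def e_step(data, params, priors, classes, fixed):
    """Posterior p(c | x) for each sequence; labeled sequences keep 0/1 weights."""
    weights = []
    for idx, seq in enumerate(data):
        if idx in fixed:
            weights.append(fixed[idx])
            continue
        logs = {c: log(priors[c]) + log_likelihood(seq, params[c]) for c in classes}
        m = max(logs.values())
        z = sum(exp(v - m) for v in logs.values())
        weights.append({c: exp(logs[c] - m) / z for c in classes})
    return weights

classes = (0, 1)
labeled = [("AAAAAB", 0), ("BBBBBA", 1)]
unlabeled = ["AAAAAA", "BBBBBB"]
data = [s for s, _ in labeled] + unlabeled
fixed = {i: {c: float(c == y) for c in classes} for i, (_, y) in enumerate(labeled)}

# initialization: model from the labeled sequences only, then alternate E and M
params, priors = m_step(data[:len(labeled)],
                        [fixed[i] for i in range(len(labeled))], classes)
weights = e_step(data, params, priors, classes, fixed)
for _ in range(10):
    params, priors = m_step(data, weights, classes)
    weights = e_step(data, params, priors, classes, fixed)
```

After the loop, `weights` holds p(y | x) for every sequence; each originally unlabeled toy sequence is assigned predominantly to the class whose labeled sequence it resembles.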
Co-training of Markov models
Let D_{L} be a set of labeled examples and D_{U} a set of unlabeled examples. A smaller pool D_{U′} is obtained by sampling u examples from D_{U} (we used u = 75 in our experiments). Each example x has two views, i.e., it can be encoded with two different sets of features, x^{(1)} and x^{(2)}. First, use D_{L} and the x^{(1)} encoding to train a classifier h_{1}, and D_{L} and the x^{(2)} encoding to train another classifier h_{2}. Second, classify the examples in D_{U′} using h_{1} and h_{2} separately. Select each classifier's top (k_{j})_{j=1,…,|C|} most confident predictions from each class (with the k_{j} chosen to reflect the underlying class distribution), add them to D_{L}, and remove them from D_{U′}. Sample 2 examples from D_{U} and move them to D_{U′}. This process is repeated for a fixed number of iterations, or until all unlabeled data are used up [13], [10] (in our experiments, we iterated until all unlabeled data were used). The idea behind co-training is that the two classifiers teach one another: each classifier is retrained on data enriched with the examples about which the other classifier is most confident.
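The control flow of the procedure above can be sketched as follows. This is a toy sketch under our own assumptions: a generic `train` callback returning `(label, confidence)` predictions, a nearest-centroid learner on two one-dimensional views, and a small pool; none of these names or settings come from the paper.

```python
def cotrain(labeled, unlabeled, train, pool_size=4, picks_per_view=1, iters=10):
    """Generic co-training loop in the style of Blum & Mitchell.
    labeled:   list of ((view1, view2), y) pairs (extended in place)
    unlabeled: list of (view1, view2) pairs
    train:     assumed interface: train(examples, view) -> predict, where
               predict(x_view) returns (label, confidence)."""
    unlabeled = list(unlabeled)
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]
    for _ in range(iters):
        if not pool:
            break
        h = [train(labeled, v) for v in (0, 1)]
        chosen = set()
        for v in (0, 1):
            # rank the pool by this view's confidence; take the top picks
            scored = sorted(((h[v](x[v]), i) for i, x in enumerate(pool)),
                            key=lambda t: -t[0][1])
            for (label, conf), i in scored[:picks_per_view]:
                if i not in chosen:
                    labeled.append((pool[i], label))   # teach the other view
                    chosen.add(i)
        pool = [x for i, x in enumerate(pool) if i not in chosen]
        while unlabeled and len(pool) < pool_size:     # replenish the pool
            pool.append(unlabeled.pop())
    return labeled

def nearest_centroid(examples, v):
    """Toy per-view learner: one centroid per class; confidence = margin."""
    from collections import defaultdict
    sums, ns = defaultdict(float), defaultdict(int)
    for x, y in examples:
        sums[y] += x[v]
        ns[y] += 1
    cents = {y: sums[y] / ns[y] for y in ns}
    def predict(xv):
        d = sorted((abs(xv - c), y) for y, c in cents.items())
        conf = d[1][0] - d[0][0] if len(d) > 1 else 1.0
        return d[0][1], conf
    return predict

labeled = [((0.0, 0.1), 0), ((1.0, 0.9), 1)]
unlabeled = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.1), (0.8, 0.9)]
result = cotrain(list(labeled), unlabeled, nearest_centroid)
```

On this toy data the two views agree, so every pool example eventually enters the labeled set with the correct label.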
Semi-supervised abstraction augmented Markov models: our approach
Our abstraction-based approach to learning classifiers for the protein subcellular localization prediction task exploits large amounts of unlabeled data, together with small amounts of labeled data, to construct more robust abstraction hierarchies over the values of the parents of each node in a Markov model. Two values (k-grams) are clustered together if they induce similar conditional distributions of the next node, independent of the class. When the data are scarce, the estimates of joint probabilities are not reliable. However, the unlabeled data contain information about the joint probability distribution over sequence elements, and can help improve the statistical estimates of the parameters. The abstraction hierarchy is subsequently used to learn a Markov model with abstract values of the parents.
It is worth mentioning that the representation of the clustering is part of the AAMM. Specifically, after the abstraction hierarchy is learned, for a given choice of the size m of an m-cut that defines an AAMM, an array of indices of size equal to the number of unique k-grams specifies the mapping between k-grams and abstractions (the space complexity is |X|^{k}, where X is the alphabet). However, the number of parameters of an AAMM (for a given class) based on such an abstraction hierarchy and m-cut is m·|X|, as opposed to |X|^{k}·|X| in the case of MMs, where m ≪ |X|^{k}.
While AAMMs reduce the complexity of the learned model, some information is lost due to abstraction. It is therefore of interest to incorporate into AAMMs some means of gracefully trading off the complexity of the model against its predictive accuracy. One way to do this is to augment the algorithm, e.g., by designing an MDL-based scoring function to guide a top-down search for an optimal cut [36].
AAMMs not only significantly outperform MMs but are also simpler than MMs, and hence easier to interpret from a biological standpoint: the set of k-grams in an abstraction can be seen as a sequence profile (e.g., a position-specific scoring matrix).
The results of our experiments show that AAMMs can make effective use of unlabeled data and that AAMMs significantly outperform EM-MMs when the amount of labeled data is very small and relatively large amounts of unlabeled data are readily available. Here, because of the small amounts of labeled data available, the ability of AAMMs to minimize overfitting (through parameter smoothing) turns out to be especially beneficial. In comparing semi-supervised AAMMs with previous semi-supervised work on the protein subcellular localization prediction task, we found that AAMMs are competitive with, and in some cases outperform, co-training of MMs.
The results presented here demonstrate the effectiveness of an abstraction-based approach to exploiting unlabeled data in a semi-supervised setting on the protein subcellular localization prediction task. Such an approach can, in principle, be combined with existing semi-supervised learning techniques, including those that use EM, co-training, or the manifold assumption (propagation of labels from labeled to unlabeled samples based on some measure of similarity between samples).
Our current implementation of AAMM constructs an abstraction hierarchy over the values of the k predecessors of a sequence element by grouping them together if they induce similar conditional distributions over that element of the sequence. It would be interesting to explore alternative approaches to building abstraction hierarchies, e.g., probabilistic suffix trees (PSTs) [37].
Methods
In this section, we briefly describe the data sets used in our experiments, provide some background on Markov models for sequence classification, and then present our novel AAMM-based approach to semi-supervised learning.
Data sets
The first and second data sets used in our experiments, plant and non-plant [38], were first introduced in [4]. The plant data set contains 940 examples, each belonging to one of the following four classes: chloroplast (141), mitochondrial (368), secretory pathway/signal peptide (269), and other (consisting of 54 examples labeled nuclear and 108 examples labeled cytosolic). The non-plant data set contains 2738 examples, each in one of the following three classes: mitochondrial (371), secretory pathway/signal peptide (715), and other (consisting of 1214 examples labeled nuclear and 438 examples labeled cytosolic).
The third data set used in our experiments, the set of PSORTdb v.2.0 [39] Gram-negative sequences introduced in [40], contains experimentally verified localization sites. We refer to this data set as psortNeg. We use all proteins that belong to exactly one of the following five classes: cytoplasm (278), cytoplasmic membrane (309), periplasm (276), outer membrane (391), and extracellular (190). The total number of examples (proteins) in this data set is 1444.
Markov models
Markov models (MMs) are probabilistic generative models that assume a mixture model as the underlying model that generated the sequence data. Each mixture component corresponds to a class c_{j} ∈ C = {c_{1},…,c_{|C|}}. A sequence is generated according to a set of parameters, denoted by θ, that define the model.
Let x = x_{0}…x_{n−1} be a sequence over a finite alphabet X, x ∈ X*, and let y denote x's class (note that if x was generated by the j^{th} mixture component, then y = c_{j}). Let X_{i}, for i = 0,…,n − 1, denote the random variables corresponding to the sequence elements x_{i} in x. In a k^{th} order MM, the sequence elements satisfy the Markov property: p(X_{i} | X_{0},…,X_{i−1}) = p(X_{i} | X_{i−k},…,X_{i−1}). That is, X_{i} is conditionally independent of X_{0},…,X_{i−k−1} given X_{i−k},…,X_{i−1}, for i = k,…,n − 1. X_{i−k},…,X_{i−1} are called the parents of X_{i}. Figure 5 shows the dependency of X_{i} on X_{i−k},…,X_{i−1} in a k^{th} order MM. Hence,
p(x_{i} | x_{0}…x_{i−1}, c_{j}; θ) = p(x_{i} | x_{i−k}…x_{i−1}, c_{j}; θ). (1)
Figure 5. Markov model for sequence classification. Dependency of X_{i} on X_{i}_{−}_{k},…,X_{i}_{−1} in a k^{th} order Markov model.
The probability of x given its class c_{j}, p(x | c_{j}; θ), can be written as follows:
p(x | c_{j}; θ) = p(x_{0}…x_{k−1} | c_{j}; θ) ∏_{i=k,…,n−1} p(x_{i} | x_{i−k}…x_{i−1}, c_{j}; θ). (2)
Let S_{i−1} denote the parents X_{i−k}…X_{i−1} of X_{i}. The values of S_{i−1} represent instantiations of X_{i−k}…X_{i−1}, which are substrings of length k (i.e., k-grams) over the alphabet X. Let S denote the set of k-grams over X, s denote a k-gram in S, and σ a symbol in X. The cardinality of S is |X|^{k} and is denoted by N.
The set of parameters θ of an MM is: θ = {θ_{c_{j}}, θ_{s|c_{j}}, θ_{σ|s,c_{j}}}, where θ_{c_{j}} = p(c_{j}), θ_{s|c_{j}} = p(s | c_{j}), and θ_{σ|s,c_{j}} = p(σ | s, c_{j}).
Learning Markov models
Given a labeled training set D_{L} = {(x_{l}, y_{l})}_{l=1,…,|D_{L}|}, learning a Markov model reduces to estimating the set of parameters θ from D_{L} using maximum likelihood estimation [41]. The estimate of θ_{σ|s,c_{j}} is obtained from D_{L} as follows:
θ̂_{σ|s,c_{j}} = Σ_{l=1,…,|D_{L}|} #[sσ, x_{l}] p(y_{l} = c_{j} | x_{l}) / Σ_{σ′∈X} Σ_{l=1,…,|D_{L}|} #[sσ′, x_{l}] p(y_{l} = c_{j} | x_{l}), (3)
where #[sσ, x_{l}] is the number of times the symbol σ “follows” the k-gram s in the sequence x_{l}, and p(y_{l} = c_{j} | x_{l}) ∈ {0,1} is obtained from the sequence label.
The estimate of θ_{s|c_{j}} is obtained from D_{L} as follows:
θ̂_{s|c_{j}} = Σ_{l=1,…,|D_{L}|} #[s, x_{l}] p(y_{l} = c_{j} | x_{l}) / Σ_{s′∈S} Σ_{l=1,…,|D_{L}|} #[s′, x_{l}] p(y_{l} = c_{j} | x_{l}), (4)
where #[s, x_{l}] is the number of times s occurs in x_{l}.
The class prior probabilities are estimated as follows:
θ̂_{c_{j}} = Σ_{l=1,…,|D_{L}|} p(y_{l} = c_{j} | x_{l}) / |D_{L}|. (5)
We used Laplace correction to obtain smoothed estimates.
Using Markov models for classification
Classification of a new sequence x requires computation of the conditional probability p(c_{j} | x; θ) for each class. Applying Bayes rule:
p(c_{j} | x; θ) = p(x | c_{j}; θ) p(c_{j}) / Σ_{j′=1,…,|C|} p(x | c_{j′}; θ) p(c_{j′}). (6)
The class with the highest posterior probability, argmax_{c_{j}∈C} p(c_{j} | x; θ), is assigned to x.
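The estimation and classification steps of Equations (3)-(6) can be sketched as follows. This is our own toy code with Laplace-smoothed counts over a two-letter alphabet, restricted to hard labels, i.e., p(y_{l} = c_{j} | x_{l}) ∈ {0,1}; the names and the example data are ours.

```python
from collections import defaultdict
from math import log

def train_mm(sequences, k, alphabet):
    """Count-based estimates behind Equations (3)-(5); Laplace correction is
    applied at query time. `sequences` is a list of (string, class) pairs."""
    trans = defaultdict(lambda: defaultdict(int))   # (c, k-gram) -> next-symbol counts
    start = defaultdict(lambda: defaultdict(int))   # c -> first k-gram counts
    prior = defaultdict(int)                        # c -> number of sequences
    for x, c in sequences:
        prior[c] += 1
        start[c][x[:k]] += 1
        for i in range(k, len(x)):
            trans[(c, x[i - k:i])][x[i]] += 1
    return trans, start, prior, k, alphabet

def log_joint(x, c, model):
    """log p(x | c) + log p(c) with Laplace-smoothed estimates."""
    trans, start, prior, k, alphabet = model
    lp = log((prior[c] + 1) / (sum(prior.values()) + len(prior)))     # Eq. (5)
    tot = sum(start[c].values())
    lp += log((start[c][x[:k]] + 1) / (tot + len(alphabet) ** k))     # Eq. (4)
    for i in range(k, len(x)):                                        # Eq. (3)
        row = trans[(c, x[i - k:i])]
        lp += log((row[x[i]] + 1) / (sum(row.values()) + len(alphabet)))
    return lp

def classify(x, model):
    """Eq. (6): Bayes rule; the shared denominator is irrelevant for argmax."""
    prior = model[2]
    return max(prior, key=lambda c: log_joint(x, c, model))

model = train_mm([("AAAB", 0), ("ABAB", 1)], k=2, alphabet="AB")
```

Working in log space avoids underflow for long sequences, and computing only the joint p(x | c)p(c) suffices because the denominator of Equation (6) is the same for every class.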
Semi-supervised AAMM
We first provide the AAMM definitions and then describe how to learn semi-supervised AAMMs.
AAMMs
AAMMs effectively reduce the number of parameters of a k^{th} order MM (which is exponential in k) by learning an abstraction hierarchy (AH) over the set of k-grams S.
Definition 1 (Abstraction Hierarchy). An abstraction hierarchy T over a set of k-grams S is a rooted tree such that: (1) the root of T denotes S; (2) the leaves of T correspond to singleton sets containing individual k-grams in S; and (3) the children of each internal node (say, a) correspond to a partition of the set of k-grams denoted by a. Thus, a denotes an abstraction, or grouping, of “similar” k-grams.
Note that each internal node (or abstraction) a contains the subset of k-grams at the leaves of the subtree rooted at a. Figure 6(a) shows an example of an AH T on a set S = {s_{1},…,s_{9}} of 2-grams over an alphabet of size 3.
Figure 6. Abstraction augmented Markov models. (a) An abstraction hierarchy T on a set S = {s_{1},…,s_{9}} of 2-grams over an alphabet of size 3. The abstractions a_{1} to a_{9} correspond to the 2-grams s_{1} to s_{9}, respectively. The subset of nodes A = {a_{15}, a_{6}, a_{14}} represents a 3-cut γ_{3} through T; (b) Dependency of X_{i} on A_{i}, which takes values in a set of abstractions A corresponding to an m-cut γ_{m}, in a k^{th} order AAMM.
Definition 2 (m-Cut). An m-cut γ_{m} through an abstraction hierarchy T is a subset of m nodes of T such that, for any leaf s ∈ S, either s ∈ γ_{m} or s is a descendant of some node in γ_{m}. The set of abstractions A at any given m-cut γ_{m} forms a partition of S.
Specifically, an m-cut γ_{m} partitions the set S of k-grams into m (m ≤ N = |S|) non-overlapping subsets A = {a_{1} : S_{1},…,a_{m} : S_{m}}, where a_{i} denotes the i^{th} abstraction and S_{i} denotes the subset of k-grams that are grouped together into the i^{th} abstraction based on some similarity measure. Note that S_{1} ∪ … ∪ S_{m} = S and, for all 1 ≤ i, j ≤ m with i ≠ j, S_{i} ∩ S_{j} = ∅. In Figure 6(a), the subset of nodes {a_{15}, a_{6}, a_{14}} represents a 3-cut γ_{3} through T.
AAMMs extend the graphical structure of MMs by introducing new variables A_{i} that represent abstractions over the values of S_{i−1}, for i = k,…,n − 1 (Figure 6(b)). Each A_{i} takes values in the set of abstractions A = {a_{1},…,a_{m}} corresponding to an m-cut γ_{m}. We model the fact that A_{i} is an abstraction of S_{i−1} by defining p(A_{i} = a_{i} | S_{i−1} = s_{i−1}) = 1 if s_{i−1} ∈ a_{i}, and 0 otherwise, where s_{i−1} ∈ S and a_{i} ∈ A represent instantiations of the variables S_{i−1} and A_{i}, respectively. Furthermore, in AAMMs, the node X_{i} directly depends on A_{i} instead of being directly dependent on S_{i−1}, as in standard MMs. Hence, the probability of x given its class, p(x | c_{j}; θ), can be written as follows:
p(x | c_{j}; θ) = p(x_{0}…x_{k−1} | c_{j}; θ) ∏_{i=k,…,n−1} p(x_{i} | a_{i}, c_{j}; θ), (7)
where a_{i} ∈ A is the abstraction that contains the k-gram x_{i−k}…x_{i−1}.
The set of parameters θ of an AAMM is: θ = {θ_{c_{j}}, θ_{s|c_{j}}, θ_{σ|a,c_{j}}}, where θ_{c_{j}} = p(c_{j}), θ_{s|c_{j}} = p(s | c_{j}), and θ_{σ|a,c_{j}} = p(σ | a, c_{j}).
Learning semi-supervised AAMMs
In what follows, we show how to learn AAMMs from both labeled and unlabeled data. This involves: learning abstraction hierarchies from both labeled and unlabeled data; and learning the model parameters from labeled data using the resulting abstraction hierarchy.
Learning abstraction hierarchies
The algorithm for learning AHs over a set S of k-grams starts by initializing the set of abstractions A such that each abstraction a_{j} ∈ A corresponds to a k-gram s_{j} ∈ S, j = 1,…,N. The leaves of the AH T are initialized with the elements of S. The algorithm recursively merges the pair of abstractions that are most “similar” to each other and terminates with an abstraction hierarchy after N − 1 steps. We store T in a last-in-first-out (LIFO) stack. For a given choice of the size m of an m-cut through T, the set of abstractions that define an AAMM can be extracted by discarding m − 1 elements from the top of the stack.
We consider two k-grams to be “similar” if they occur within similar contexts. In our case, we define the context of a k-gram s ∈ S as the conditional probability distribution of the next letter in the sequence given the k-gram, p(X_{i} | s), independent of the class variable. Hence, it can be estimated from both the labeled sequences D_{L} and the unlabeled sequences D_{U} as follows:
p̂(X_{i} = σ | s) = (Σ_{x_{l}∈D_{L}} #[sσ, x_{l}] + Σ_{x_{u}∈D_{U}} #[sσ, x_{u}]) / Σ_{σ′∈X} (Σ_{x_{l}∈D_{L}} #[sσ′, x_{l}] + Σ_{x_{u}∈D_{U}} #[sσ′, x_{u}]), (8)
where #[sσ, x_{l}] and #[sσ, x_{u}] represent the number of times the symbol σ “follows” the k-gram s in the sequences x_{l} and x_{u}, respectively.
Since an abstraction is a set of k-grams, the context of an abstraction a = {s_{1},…,s_{|a|}} is obtained by a weighted aggregation of the contexts of its k-grams. That is,
p(X_{i} | a) = Σ_{s∈a} w_{s} p(X_{i} | s), (9)
where w_{s} = p̂(s) / Σ_{s′∈a} p̂(s′) is the relative frequency of the k-gram s among the k-grams in a.
We identify the most “similar” abstractions as those that have the smallest weighted Jensen-Shannon (JS) divergence between their contexts. The JS divergence [42] provides a natural way to compute the distance between two probability distributions that represent the contexts of two abstractions. Specifically, we define the distance between two abstractions a′ and a″ as follows:
d(a′, a″) = H(π′ p(X_{i} | a′) + π″ p(X_{i} | a″)) − π′ H(p(X_{i} | a′)) − π″ H(p(X_{i} | a″)),
where π′ = p̂(a′) / (p̂(a′) + p̂(a″)) and π″ = p̂(a″) / (p̂(a′) + p̂(a″)) are the relative weights of the two abstractions, and H(·) denotes the Shannon entropy.
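Putting Equation (8), Equation (9), and the weighted JS distance together, the agglomerative construction of an AH, and the extraction of an m-cut from the merge order, can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the toy 2-gram contexts are ours, and contexts are pooled as counts so that aggregation uses frequency weights in the spirit of Equation (9).

```python
from collections import Counter
from math import log

def context(counts):
    """Normalize next-symbol counts into a conditional distribution."""
    tot = sum(counts.values())
    return {sym: n / tot for sym, n in counts.items()}

def entropy(p):
    return -sum(v * log(v) for v in p.values() if v > 0)

def js_distance(ca, cb):
    """Weighted Jensen-Shannon divergence between two context count tables."""
    wa, wb = sum(ca.values()), sum(cb.values())
    pa, pb = context(ca), context(cb)
    pia, pib = wa / (wa + wb), wb / (wa + wb)
    mix = {s: pia * pa.get(s, 0) + pib * pb.get(s, 0) for s in set(pa) | set(pb)}
    return entropy(mix) - pia * entropy(pa) - pib * entropy(pb)

def learn_ah(kgram_contexts):
    """Agglomerative construction of an abstraction hierarchy.
    kgram_contexts: {k-gram: next-symbol counts} estimated from labeled plus
    unlabeled sequences (Eq. (8)). Returns the list of merges, in order."""
    clusters = {frozenset([s]): Counter(c) for s, c in kgram_contexts.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        _, a, b = min(((js_distance(clusters[a], clusters[b]), a, b)
                       for i, a in enumerate(keys) for b in keys[i + 1:]),
                      key=lambda t: t[0])
        clusters[a | b] = clusters.pop(a) + clusters.pop(b)  # aggregate contexts
        merges.append(a | b)
    return merges

def m_cut(kgrams, merges, m):
    """The m-block partition obtained by discarding the last m - 1 merges."""
    cut = [frozenset([s]) for s in kgrams]
    for merged in merges[:len(merges) - (m - 1)]:
        cut = [c for c in cut if not c <= merged] + [merged]
    return cut

contexts = {"AA": {"A": 9, "B": 1}, "AB": {"A": 8, "B": 2},
            "BA": {"A": 1, "B": 9}, "BB": {"A": 2, "B": 8}}
merges = learn_ah(contexts)
cut = m_cut(list(contexts), merges, 2)
```

In the toy example, the 2-grams followed mostly by "A" ("AA" and "AB") merge first, as do those followed mostly by "B", so the 2-cut recovers the two context groups.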
Learning AAMM parameters
Given a labeled training set D_{L}, learning an AAMM reduces to estimating the set of parameters θ from D_{L}. This can be done as follows: use Equation (3) to obtain the estimates θ̂_{σ|s,c_{j}} for every k-gram s ∈ S (note that these correspond to the estimates θ̂_{σ|a,c_{j}} when a = {s}, i.e., at the leaf level of the AH T). The estimates θ̂_{σ|a,c_{j}} when a = {s_{1},…,s_{|a|}} are a weighted aggregation of the estimates of a's constituent k-grams, i.e.,

θ̂_{σ|a,c_{j}} = Σ_{s∈a} w_{s,c_{j}} θ̂_{σ|s,c_{j}}, (10)

where w_{s,c_{j}} = p̂(s | c_{j}) / Σ_{s′∈a} p̂(s′ | c_{j}). Use Equations (4) and (5) to obtain the estimates of θ_{s|c_{j}} and θ_{c_{j}}, respectively.
Using AAMMs for classification
Given a new sequence x = x_{0}…x_{n−1} and an m-cut γ_{m} through T, p(x | c_{j}; θ) can be computed as follows: initialize p(x | c_{j}; θ) with p(x_{0}…x_{k−1} | c_{j}; θ) and parse the sequence from left to right. For each k-gram x_{i−k}…x_{i−1}, find the abstraction a_{w} ∈ γ_{m} to which it belongs and retrieve the parameters associated with a_{w}. Successively multiply by p(x_{i} | a_{w}, c_{j}; θ), for i = k,…,n − 1, to obtain p(x | c_{j}; θ).
As in MMs, apply Bayes rule to obtain p(c_{j} | x; θ) and assign the class with the highest posterior probability to x.
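The parameter aggregation of Equation (10) and the abstraction lookup used during classification can be sketched as follows. This is our own toy code; the hypothetical `trans` table and the hand-picked cut are ours, and pooling raw transition counts within an abstraction is used here because it is equivalent to weighting each constituent k-gram's estimate by its relative frequency.

```python
from collections import defaultdict
from math import log

def aamm_params(trans, cut):
    """Eq. (10) by count pooling: per-class transition estimates for
    abstractions, aggregated over each abstraction's constituent k-grams.
    trans: {(class, k-gram): {symbol: count}} estimated from labeled data."""
    gram_to_abs = {s: a for a in cut for s in a}   # index: k-gram -> abstraction
    agg = defaultdict(lambda: defaultdict(float))
    for (c, s), row in trans.items():
        for sym, n in row.items():
            agg[(c, gram_to_abs[s])][sym] += n     # pooled = frequency-weighted
    return gram_to_abs, agg

def aamm_log_lik(x, c, k, alphabet, gram_to_abs, agg):
    """log p(x_k..x_{n-1} | x_0..x_{k-1}, c) under the AAMM, Laplace-smoothed."""
    lp = 0.0
    for i in range(k, len(x)):
        row = agg[(c, gram_to_abs[x[i - k:i]])]    # abstraction lookup
        lp += log((row[x[i]] + 1.0) / (sum(row.values()) + len(alphabet)))
    return lp

trans = {(0, "AA"): {"A": 4}, (0, "AB"): {"A": 2},
         (1, "AA"): {"B": 4}, (1, "BA"): {"B": 2}}
cut = [frozenset({"AA", "AB"}), frozenset({"BA", "BB"})]
gram_to_abs, agg = aamm_params(trans, cut)
```

The `gram_to_abs` dictionary plays the role of the index array described earlier: one entry per unique k-gram, while the per-class transition tables have only one row per abstraction in the cut.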
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors read and approved the final manuscript.
Acknowledgements
This research was funded in part by an NSF grant IIS 0711356 to Vasant Honavar and Doina Caragea.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.
References

1. Alberts B, Bray D, et al. (Eds): Molecular Biology of the Cell. New York and London: Garland Publishing; 1994.
2. Baldi P, Brunak S: Bioinformatics: the Machine Learning Approach. MIT Press; 2001.
3. Park K, Kanehisa M: Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003, 19(13):1656-1663.
4. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000, 300:1005-1016.
5. Höglund A, Donnes P, Blum T, Adolph HW, Kohlbacher O: MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs, and amino acid composition. Bioinformatics 2006, 22(10):1158-1165.
6. Ong CS, Zien A: An Automated Combination of Kernels for Predicting Protein Subcellular Localization. In Proceedings of the 8th Workshop on Algorithms in Bioinformatics (WABI). Springer, Lecture Notes in Bioinformatics; 2008:186179.
7. Scott MS, Calafell SJ, Thomas DY, Hallett MT: Refining Protein Subcellular Localization. PLoS Comput Biol 2005, 1(6):e66.
8. Yuan Y: Prediction of Protein Subcellular Locations using Markov Chain Models. FEBS Letters 1999, 451:23-26.
9. Ansorge W: Next-generation DNA sequencing techniques. New Biotechnology 2009, 25(4):195-203.
10. Zhu X, Goldberg A: Introduction to Semi-Supervised Learning. Morgan & Claypool; 2009.
11. Chapelle O, Schöelkopf B, Zien A (Eds): Semi-Supervised Learning. MIT Press; 2006.
12. Nigam K, McCallum AK, Thrun S, Mitchell T: Text Classification from Labeled and Unlabeled Documents using EM.
13. Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In Proc. of COLT '98. New York, NY, USA: ACM; 1998:92-100.
14. Joachims T: Transductive Inference for Text Classification using Support Vector Machines.
15. Niu ZY, Ji DH, Tan CL: Word sense disambiguation using label propagation based semi-supervised learning.
16. Goldberg A, Zhu X: Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization.
17. Qi Y, Kuksa P, Collobert R, Sadamasa K, Kavukcuoglu K, Weston J: Semi-Supervised Sequence Labeling with Self-Learned Features. In Proc. of ICDM. Washington, DC, USA; 2009:428-437.
18. Camps-Valls G, Bandos TV, Zhou D: Semi-supervised graph-based hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 2007, 45:3044-3054.
19. Käll L, Canterbury J, Weston J, Noble W, MacCoss M: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4(11):923-925.
20. Lafferty J, Zhu X, Liu Y: Kernel conditional random fields: Representation and clique selection.
21. Kuksa P, Huang PH, Pavlovic V: Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics 2009, 10(Suppl 4):S2.
22. Xu Q, Hu DH, Xue H, Yu W, Yang Q: Semi-supervised protein subcellular localization. BMC Bioinformatics 2009, 10(Suppl 1):S47.
23. Li M, Zhou ZH: Improve Computer-Aided Diagnosis with Machine Learning Techniques Using Undiagnosed Samples. 2007.
24. Machine Learning 2001, 45:532.
25. Caragea C, Silvescu A, Caragea D, Honavar V: Abstraction Augmented Markov Models. NIPS Workshop on “Machine Learning in Comp. Biol.”; 2009.
26. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39:1-38.
27. Dietterich TG: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation 1998, 10:1895-1923.
28. Weston J, Leslie CS, Zhou D, Elisseeff A, Noble WS: Semi-supervised Protein Classification Using Cluster Kernels.
29. Bengio Y, Delalleau O, Le Roux N: Label Propagation and Quadratic Criterion. In Semi-Supervised Learning. Edited by Chapelle O, Schoelkopf B, Zien A. MIT Press; 2006:193-217.
30. Jebara T, Wang J, Chang SF: Graph construction and b-matching for semi-supervised learning. In ICML '09: Proc. of the 26th Annual ICML. ACM; 2009:441-448.
31. Lawrence ND, Jordan MI: Semi-supervised learning via Gaussian processes.
32. Szummer M, Jaakkola T: Information regularization with partially labeled data. In Advances in Neural Information Processing Systems 15; 2002.
33. Grandvalet Y, Bengio Y: Semi-supervised Learning by Entropy Minimization. In Advances in Neural Information Processing Systems 17. MIT Press; 2005:529-536.
34. Jaakkola T, Meila M, Jebara T: Maximum entropy discrimination.
35. Belkin M, Niyogi P, Sindhwani V: Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples.
36. Zhang J, Kang DK, Silvescu A, Honavar V: Learning Accurate and Concise Naive Bayes Classifiers from Attribute Value Taxonomies and Data. Knowledge and Information Systems 2006, 9(2):157-179.
37. Ron D, Singer Y, Tishby N: The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning 1996, 117-149.
38. TargetP data sets [http://www.cbs.dtu.dk/services/TargetP/datasets/datasets.php]
39. PSORTdb v.2.0 [http://www.psort.org/dataset/datasetv2.html]
40. Gardy JL, et al.: PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. NAR 2003, 31(13):3613-3617.
41. Casella G, Berger RL: Statistical Inference. Duxbury; 2002.
42. Lin J: Divergence measures based on the Shannon entropy. IEEE Trans. on Inf. Theory 1991, 37:145-151.