This article is part of the supplement: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics
Prediction of nuclear proteins using nuclear translocation signals proposed by probabilistic latent semantic indexing
1 Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
2 Comparative Bioinformatics, Bioinformatics and Genomics, Centre for Genomic Regulation (CRG), Barcelona, 08003, Spain
3 Universitat Pompeu Fabra (UPF), Barcelona, 08003, Spain
4 Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
BMC Bioinformatics 2012, 13(Suppl 17):S13 doi:10.1186/1471-2105-13-S17-S13
Published: 13 December 2012
Identifying the subcellular localization of proteins is crucial for elucidating cellular processes and molecular functions in a cell. However, given the tremendous amount of sequence data generated in the post-genomic era, determining protein localization through biological experiments can be expensive and time-consuming. Prediction systems that analyze uncharacterised proteins efficiently have therefore come to play an important role in high-throughput protein analyses. In a eukaryotic cell, many essential biological processes take place in the nucleus. Nuclear proteins shuttle between the nucleus and the cytoplasm through recognition of nuclear translocation signals, which include nuclear localization signals (NLSs) and nuclear export signals (NESs). Currently, only a few approaches have been developed specifically to predict nuclear localization from sequence features such as putative NLSs. However, prediction coverage based on NLSs has been shown to be very low. In addition, most existing approaches attain a prediction accuracy of only around 54%~70% and a Matthews correlation coefficient (MCC) of only around 0.250~0.380 on an independent test set. Moreover, no predictor can generate sequence motifs that characterize the features of potential NESs, whose biological properties are not well understood from existing experimental studies.
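For readers unfamiliar with the two evaluation measures quoted above, the following minimal sketch computes accuracy and MCC from the four cells of a binary confusion matrix (nuclear versus non-nuclear). The function names and the example counts are illustrative assumptions and are not taken from the study; the formulas are the standard definitions.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_accuracy = accuracy(tp=40, tn=40, fp=10, fn=10)
example_mcc = mcc(tp=40, tn=40, fp=10, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the nuclear and non-nuclear classes differ in size.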
In this study, first we propose PSLNuc (
Experimental results demonstrate that the proposed method significantly improves nuclear localization prediction. To compare our predictive performance with that of other approaches, we use two non-redundant benchmark data sets: a training set and an independent test set. Evaluated by five-fold cross-validation on the training set, PSLNuc attains an overall accuracy of 79.7%, a 4.8% improvement over the state-of-the-art system. Our method also raises the MCC from 0.497 to 0.595. On the independent test set, PSLNuc outperforms other predictors by 3.9%~19.9% in accuracy and by 0.077~0.207 in MCC. This suggests that, in addition to NLSs, which have been shown to be important for nuclear proteins, NESs can also be an effective indicator for detecting non-nuclear proteins. Most notably, using only a few proposed gapped-dipeptide signatures as input features for the SVM classifier, PSLNTS further raises the accuracy and MCC to 80.9% and 0.618, respectively. Our results demonstrate that gapped-dipeptide signatures can better discriminate between nuclear and non-nuclear proteins. Moreover, the proposed gapped-dipeptide signatures can be biologically interpreted and used in further experimental analyses of nuclear translocation signals, including NLSs and NESs.
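To make the notion of a gapped dipeptide concrete, the sketch below counts such features in a protein sequence. It assumes the common definition of a gapped dipeptide as a pair of residues separated by a fixed number of intervening positions; the function name, the `max_gap` parameter, and the toy sequence are hypothetical. It shows only raw feature counting and does not reproduce the weighting, the probabilistic latent semantic indexing step, or the SVM classifier used in this study.

```python
from collections import Counter


def gapped_dipeptide_counts(sequence, max_gap):
    """Count gapped dipeptides in a protein sequence.

    A feature (a, d, b) records residue `a` followed by residue `b`
    with exactly `d` residues in between, for d = 0 .. max_gap.
    """
    counts = Counter()
    for gap in range(max_gap + 1):
        # The second residue sits gap + 1 positions after the first.
        for i in range(len(sequence) - gap - 1):
            counts[(sequence[i], gap, sequence[i + gap + 1])] += 1
    return counts


# Toy sequence, for illustration only.
toy_counts = gapped_dipeptide_counts("KRKR", max_gap=1)
```

With 20 amino acids and gap distances from 0 to `max_gap`, the feature space has 400 × (`max_gap` + 1) dimensions, which is why reducing it to a small set of signatures is attractive.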