A machine learning approach for the identification of odorant binding proteins from sequence-derived properties

Pugalenthi, Ganesan; Tang, Ke; Suganthan, PN; Archunan, G; Sowdhamini, R

doi:10.1186/1471-2105-8-351

Research article
Open access
Published: 19 September 2007

Ganesan Pugalenthi¹, Ke Tang¹,², PN Suganthan¹, G Archunan³ & R Sowdhamini⁴

BMC Bioinformatics volume 8, Article number: 351 (2007)

Abstract

Background

Odorant binding proteins (OBPs) are believed to shuttle odorants from the environment to the underlying odorant receptors, for which they could potentially serve as odorant presenters. Although several sequence based search methods have been exploited for protein family prediction, less effort has been devoted to the prediction of OBPs from sequence data and this area is more challenging due to poor sequence identity between these proteins.

Results

In this paper, we propose a new algorithm that uses a Regularized Least Squares Classifier (RLSC) in conjunction with multiple physicochemical properties of amino acids to predict odorant-binding proteins. The algorithm was applied to a dataset derived from the Pfam and GenDiS databases, and we obtained an overall prediction accuracy of 97.7% (94.5% and 98.4% for the positive and negative classes, respectively).

Conclusion

Our study suggests that RLSC is potentially useful for predicting odorant binding proteins from sequence-derived properties irrespective of sequence similarity. Our method correctly predicts 92.8% of 56 odorant binding proteins that are non-homologous to any protein in the SWISS-PROT database and 97.1% of the 414 proteins in the independent dataset, suggesting the usefulness of the RLSC method for predicting odorant binding proteins from sequence information.

Background

Olfaction is an important process for establishing behavioural responses and involves the binding of small, hydrophobic, volatile molecules to receptors of the nasal neuroepithelia [1]. The olfaction mechanism has been well studied and is generally similar in vertebrates, insects, crustaceans, and nematodes [2–4]. The first step in olfaction is the solubilization of the hydrophobic odorants in the hydrophilic nasal mucus.

Odorant binding proteins (OBPs) play a vital role in olfaction. OBPs are small soluble polypeptides that are thought to act as carriers, transporting odorants from the environment to the nasal epithelium in vertebrates and to the sensillar lymph in insects [5, 6]. Vertebrate OBPs are members of the large lipocalin family and share an eight-stranded beta barrel [7]. Insect OBPs include the general odorant-binding proteins (GOBPs) and the pheromone-binding proteins (PBPs), which are completely different from their vertebrate counterparts in both sequence and three-dimensional fold [8]. Insect OBPs contain an alpha-helical barrel and six highly conserved cysteines [9]. Another class of putative OBPs, named chemosensory proteins (CSPs), has been reported in different orders of insects, including Lepidoptera [10–12]. These polypeptides, of about 12 kDa, do not exhibit significant homology to PBPs and GOBPs and contain four conserved cysteine residues, all involved in intramolecular disulphide bridges. In spite of the differences in sequence and structure, their general chemical properties indicate similar functions in olfactory transduction.

Previous reports have shown that OBPs are present in large numbers within a species [13]. This suggests that OBPs play an active role in odorant recognition rather than merely serving as passive odorant shuttles [14, 15]. Several reports have demonstrated selective binding of odorants to different OBPs derived from a given species [16–18]. OBPs are also suspected to participate in the deactivation of odorants and in signal termination [19]. The presence of OBPs in non-sensory tissues of insects suggests that they also have non-sensory roles [20].

Although many efforts have been made to study the role of OBPs, their physiological function is still unclear, and more sequence data are required for a complete understanding of the odorant binding and transport mechanism. With the rapid increase in newly found protein sequences entering the databanks, an efficient method is needed to identify OBPs in sequence databases. At present, prediction of odorant binding proteins is primarily based on sequence similarity search methods [21, 22]. These methods are not effective for OBPs, because OBPs show very low sequence similarity both between species and within the same species [23, 24]. So far, SVM and other statistical learning methods have not been explored for predicting odorant binding proteins. Here, we propose a method based on a regularized least squares classifier (RLSC) to predict odorant binding proteins from sequence-derived properties irrespective of sequence similarity.

Results and discussion

The dataset used for the prediction was obtained from the GenDiS [25] and Pfam [26] databases. The positive class consists of 476 odorant binding protein domains [see Additional file 1], whereas the negative class has 2157 non-odorant binding protein domains [see Additional file 2]. A regularized least squares classifier (RLSC) [27, 28] was used to conduct the training and testing on the dataset. First, the classification was carried out without feature selection, i.e. all 1463 features were used. The confusion matrix achieved by RLSC is given in Table 1.

Table 1 Confusion matrix for RLSC on the training dataset

To analyze the impact of the feature selection procedure on the classification performance, we selected eight feature subsets of decreasing size. The performance of the method in discriminating between odorant binding proteins and non-odorant binding proteins is summarized in Table 2. In this table, TP and TN stand for true positives (correctly predicted OBPs) and true negatives (correctly predicted non-class-members), respectively. The results show that our method can distinguish odorant binding proteins from other protein sequences with an accuracy of >90% and a Matthews Correlation Coefficient (MCC) of 0.922 when evaluated through leave-one-out cross-validation. Using all 1463 features, the RLSC achieved a TP rate of 94.5% and a TN rate of 98.4%. The overall leave-one-out accuracy (LOOA), balanced LOOA and MCC were 97.7%, 96.5% and 0.922, respectively. As seen in Table 2, feature selection generally does not degrade the classification performance much. Using a smaller number of features leads only to a decrease in the TN rate; the TP rate is less affected. In some cases, feature selection even leads to a slight increase in the TP rate.

Table 2 Classification results achieved on different feature subsets. The optimal values of σ and λ are also given.

To test its capability, our algorithm was evaluated on an independent dataset obtained from the NCBI database using keyword search. The keywords used for the search include "odorant binding proteins", "pheromone binding proteins", "chemosensory proteins", "antennal protein" and "juvenile hormone binding proteins". Sequences that are present in the positive training dataset were removed from the list. After careful manual inspection, 414 odorant binding proteins were selected for independent testing [see Additional file 3]. The performance of our algorithm was compared with PSI-BLAST [29] and HMM [30]. The PSI-BLAST search for each sequence was carried out against a database built from the positive training dataset. The HMM analysis for each query sequence was performed against the HMM profile obtained from the positive training dataset. Our approach correctly predicts 402 proteins as odorant binding proteins, whereas the PSI-BLAST and HMM methods predict 369 and 360 proteins, respectively [see Additional file 4]. The overall prediction accuracy of our approach, PSI-BLAST and the HMM method is 97.1%, 89.1% and 86.9%, respectively (Table 3).

Table 3 Prediction result of 414 odorant binding proteins by RLSC, PSI-BLAST and HMM methods

Further analysis of the 414 odorant binding proteins shows that 56 of them have no homologous protein in the SWISS-PROT [31] database according to the PSI-BLAST search results. An E-value threshold of 0.01 was used for the homologue search to ensure maximum exclusion of proteins that have a homologue. Our method correctly predicts 52 of these 56 proteins as odorant binding proteins. This result shows the capability of our prediction system to recognize novel odorant binding proteins that are non-homologous to other proteins.
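The percentages quoted above can be checked against the raw counts given in the text. The short sketch below does this arithmetic; the helper name `truncated_percent` is ours, and the observation that the reported figures match truncation to one decimal place (rather than rounding) is an inference from the numbers, not something the text states.

```python
def truncated_percent(correct, total):
    """Percentage truncated (not rounded) to one decimal place."""
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (correct * 1000 // total) / 10

# Counts taken from the text: correct predictions out of 414 (or 56) proteins.
counts = {
    "RLSC, independent set": (402, 414),
    "PSI-BLAST, independent set": (369, 414),
    "HMM, independent set": (360, 414),
    "RLSC, non-homologous subset": (52, 56),
}

for name, (correct, total) in counts.items():
    exact = 100 * correct / total
    print(f"{name}: {exact:.3f}% exact, {truncated_percent(correct, total)}% truncated")
```

With conventional rounding, the HMM and non-homologous figures would read 87.0% and 92.9%; the difference is cosmetic and does not affect the comparison between methods.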

In this work, nine physicochemical properties, secondary structural content, and the frequencies of dipeptides and tripeptides were used to represent each protein sequence. It has been reported that not all features contribute equally to the classification of proteins; some have been found to play a relatively more prominent role than others in specific aspects of proteins [32]. It is therefore of interest to examine which features play more prominent roles in the classification of odorant-binding proteins. Our analysis suggests that molecular weight, hydrophobicity, hydration potential, average accessible surface area and refractivity play a more prominent role. Hydrophobicity is an important factor in the formation of the binding pocket and in the interaction between the OBP and the odorant molecule. We also observed that tripeptides play a more significant role in our classification scheme than dipeptides.

Conclusion

The overall prediction accuracy of 97.7% (94.5% and 98.4% for the positive and negative classes, respectively) shows that RLSC is a potentially useful tool for the prediction of odorant-binding proteins. It is also a computationally efficient method for the prediction of odorant binding proteins despite their low sequence identity. Further, the capability of our method was tested on an independent dataset of 414 odorant binding proteins, of which it correctly predicted 97.1%. This approach can be used to identify novel odorant binding proteins in genome sequence databases using sequence-derived properties.

Methods

Classification models

All results presented in this paper were obtained through a leave-one-out cross-validation (LOOCV) procedure. A regularized least squares classifier (RLSC) is used as the classification model. From the machine learning viewpoint, RLSC belongs to the large family of kernel methods and is closely related to the well-known support vector machines (SVM) [33, 34]. RLSC and SVM differ in how they formulate the classification problem, but both can achieve comparable classification performance [35]. Recall that our dataset is represented as S = {(x₁, y₁), ..., (x_n, y_n)}, where x_i denotes an instance (i.e. a protein sequence) and y_i is the corresponding class label. An RLSC (denoted as f) classifies a data point x by

f(x) = \operatorname{sign}\left[ \sum_{i=1}^{n} \alpha_{i} k(x_{i}, x) \right] (1)

where k is the so-called kernel function that models the relationship between the data points x_i and x, and the coefficients α_i are computed by training. In practice, the kernel function is defined before training the RLSC, and the α_i are then obtained by solving a system of linear equations:

(K + λn I)α = Y (2)

where α = [α₁, α₂, ..., α_n]^T, Y = [y₁, y₂, ..., y_n]^T and λ is a predefined positive constant called the regularization parameter. I is the identity matrix of size n. K is the kernel matrix, whose components are computed as K_ij = k(x_i, x_j).

In our experiment, a Gaussian kernel k(x_i, x_j) = exp(-σ² ||x_i - x_j||²) is used for the RLSC, since the Gaussian kernel is suggested as the first choice for most kernel methods. The values of the kernel parameter σ and the regularization parameter λ are crucial to the RLSC's performance. Thus, both parameters are optimized to maximize the balanced leave-one-out accuracy. LOOCV would normally require retraining the model once per left-out sample, which is slow when it is repeated for every candidate parameter setting. Owing to the specific formulation of RLSC, however, the leave-one-out predictions can be obtained by carrying out the training process only once.
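The training and prediction steps in equations (1) and (2) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the Python standard library (a small Gaussian-elimination `solve` stands in for a linear-algebra package), the toy data and parameter values are made up, and `ridge` is our name for the product λn. The function `loo_decision_values` uses a standard identity for regularized least squares, f₋ᵢ(xᵢ) = yᵢ − αᵢ / [(K + λnI)⁻¹]ᵢᵢ, which is one way the leave-one-out predictions can be obtained from a single fit; the text does not say which formula was actually used, and the identity assumes the same constant λn is kept when a sample is left out.

```python
import math

def gaussian_kernel(a, b, sigma):
    """k(a, b) = exp(-sigma^2 * ||a - b||^2), the form given in the text."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-(sigma ** 2) * sq_dist)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def regularized_kernel_matrix(X, sigma, ridge):
    """Return K + ridge * I, where ridge plays the role of lambda * n."""
    n = len(X)
    return [[gaussian_kernel(X[i], X[j], sigma) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def train_rlsc(X, y, sigma, ridge):
    """Equation (2): solve (K + ridge * I) alpha = Y for alpha."""
    return solve(regularized_kernel_matrix(X, sigma, ridge), y)

def decision_value(X_train, alpha, x, sigma):
    """Sum over i of alpha_i * k(x_i, x); its sign is the predicted class."""
    return sum(a * gaussian_kernel(xi, x, sigma) for a, xi in zip(alpha, X_train))

def predict(X_train, alpha, x, sigma):
    """Equation (1): the sign of the decision value."""
    return 1 if decision_value(X_train, alpha, x, sigma) >= 0 else -1

def loo_decision_values(X, y, sigma, ridge):
    """Leave-one-out decision values without retraining n times."""
    n = len(X)
    G = regularized_kernel_matrix(X, sigma, ridge)
    alpha = solve(G, y)
    values = []
    for i in range(n):
        unit = [1.0 if j == i else 0.0 for j in range(n)]
        inv_ii = solve(G, unit)[i]  # i-th diagonal entry of the inverse of G
        values.append(y[i] - alpha[i] / inv_ii)
    return values

# Toy data (made up): two well-separated clusters labelled -1 and +1.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
sigma, lam = 0.5, 0.01
ridge = lam * len(X)

alpha = train_rlsc(X, y, sigma, ridge)
print(predict(X, alpha, [0.2, 0.3], sigma))   # point near the -1 cluster
print(predict(X, alpha, [5.5, 5.2], sigma))   # point near the +1 cluster
print(loo_decision_values(X, y, sigma, ridge))
```

For clarity the sketch solves one linear system per diagonal entry of the inverse; a practical implementation would factorize K + λnI once and reuse the factorization for both α and the diagonal.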

Datasets

All odorant binding proteins were obtained from the GenDiS [25] and Pfam [26] databases. Sequences having more than 40% sequence identity were removed from the dataset. After careful manual examination, a total of 476 odorant binding proteins were retained for the positive dataset, which includes 40 vertebrate odorant binding proteins, 282 insect general odorant binding proteins, 46 pheromone binding proteins and 108 chemosensory proteins [see Additional file 1]. Owing to the limited number of known odorant binding proteins, the positive dataset could not be extended any further. In future, as more sequences are confirmed to belong to the family, the positive dataset can be enriched. The negative samples were taken from seed proteins of Pfam protein families that are unrelated to odorant binding proteins. Our final negative dataset consists of 2157 non-odorant binding domains [see Additional file 2].

Derivation of physicochemical properties from protein sequence

Amino acid composition is one of the most basic characteristics of proteins and is extensively used in sequence based prediction studies [36]. Instead of using the conventional 20-D amino acid composition, another new concept called "pseudo amino acid composition" has been reported in order to include the sequence-order information which leads to a higher success rate in sequence based prediction studies [37–40]. Owing to the wide applications of PseAA (pseudo amino acid) composition, recently, a webserver called PseAA [41] was designed in a flexible manner to generate various kinds of PseAA composition for a given protein sequence [37, 38] according to the needs of users. Apart from the amino acid composition, sequence-derived structural and physicochemical features have frequently been used for various prediction studies.

In this work, amino acid composition and nine physicochemical properties were employed to describe each protein. Given the sequence of a protein, its amino acid composition and the properties of every constituent amino acid are computed and then used to generate a feature vector. The computed amino acid properties include molecular weight, hydrophobicity, hydrophilicity, hydration potential, refractivity, average and total accessible surface area, secondary structural content and propensity of amino acids at secondary structures [42]. The secondary structure of each sequence is predicted using PSIPRED [43]. Additionally, frequencies of dipeptides and tripeptides are used to represent protein sequences for classification [44]. To reduce the dimensionality of the feature space, the amino acids are clustered into 11 groups with similar physicochemical or structural properties, as shown in Table 4. All possible pairwise and triplet combinations are computed from the 11 groups, which gives rise to 66 dipeptide and 1331 triplet combinations. The dipeptide and tripeptide frequencies are computed from each sequence and are represented by one or more pairwise and triplet combinations, respectively. In total, each protein sequence is represented by a feature vector with 1463 components.
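The reduced-alphabet dipeptide and tripeptide features can be illustrated as follows. This is a sketch under two loudly stated assumptions. First, the grouping in `GROUPS` is hypothetical and only serves to give 11 groups; it is not the grouping of Table 4. Second, the counts in the text (66 and 1331) are consistent with treating group pairs as unordered (11 × 12 / 2 = 66) and group triplets as ordered (11³ = 1331), so the sketch does the same; the text does not spell this out.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

# HYPOTHETICAL 11-group partition of the 20 amino acids (not Table 4).
GROUPS = ["AG", "ST", "C", "P", "DE", "NQ", "KR", "H", "ILMV", "FY", "W"]
AA_TO_GROUP = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}
N_GROUPS = len(GROUPS)

# Unordered group pairs (66) and ordered group triplets (1331).
DIPEPTIDE_KEYS = list(combinations_with_replacement(range(N_GROUPS), 2))
TRIPEPTIDE_KEYS = list(product(range(N_GROUPS), repeat=3))

def reduced_peptide_frequencies(sequence):
    """Return 66 dipeptide + 1331 tripeptide frequencies for one sequence."""
    # Map residues to group indices; non-standard letters are skipped.
    codes = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    di = Counter(tuple(sorted(codes[i:i + 2])) for i in range(len(codes) - 1))
    tri = Counter(tuple(codes[i:i + 3]) for i in range(len(codes) - 2))
    n_di = max(len(codes) - 1, 1)    # guard against very short sequences
    n_tri = max(len(codes) - 2, 1)
    return ([di[key] / n_di for key in DIPEPTIDE_KEYS]
            + [tri[key] / n_tri for key in TRIPEPTIDE_KEYS])

features = reduced_peptide_frequencies("MKTLLVAGCSP")  # made-up sequence
print(len(DIPEPTIDE_KEYS), len(TRIPEPTIDE_KEYS), len(features))
```

These 1397 peptide features account for most of the 1463 components; the remaining components come from the composition, physicochemical and secondary-structure descriptors described above, which this sketch does not compute.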

Table 4 Amino acid groupings (11 groups) according to their physical and chemical properties

Feature selection

In this work, the main purpose of feature selection is to remove possibly redundant features from the original feature set. By redundant, we mean that the feature has a negligible influence on the final classification performance. We design a wrapper approach [45] to conduct feature selection for our dataset. In this method, the balanced leave-one-out accuracy (BLOOA) of RLSC is used as the selection criterion, and sequential backward elimination (also called recursive feature elimination) is employed as the search scheme. Specifically, we start from the full feature set (i.e. all 1463 features) and calculate the BLOOA. Features are then pruned iteratively from the feature set: at each iteration, the feature whose omission leads to the largest BLOOA is pruned. Assuming that the number of features is to be reduced from 1463 to d, the feature selection (or redundant feature elimination) procedure is shown in Figure 1, where |F| denotes the cardinality of F.
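The backward elimination loop described above can be sketched generically. In this illustration the scoring function is passed in as a parameter; in the paper's setting it would be the BLOOA of an RLSC trained on the candidate subset, whereas here `toy_score` is a made-up stand-in so that the example stays small and runnable. The function and variable names are ours.

```python
def backward_elimination(features, score, d):
    """Prune features one at a time until only d remain.

    At each iteration every remaining feature is tentatively omitted, and the
    omission that yields the largest score is made permanent.
    """
    selected = list(features)
    history = [(tuple(selected), score(selected))]
    while len(selected) > d:
        candidates = ([f for f in selected if f != omitted] for omitted in selected)
        selected = max(candidates, key=score)
        history.append((tuple(selected), score(selected)))
    return selected, history

# Made-up scoring function: features 0, 2 and 5 are "informative"; every
# extra feature costs a small penalty, so redundant ones are pruned first.
INFORMATIVE = {0, 2, 5}

def toy_score(subset):
    return len(INFORMATIVE & set(subset)) - 0.01 * len(subset)

selected, history = backward_elimination(range(8), toy_score, d=3)
print(selected)
for subset, value in history:
    print(len(subset), round(value, 2))
```

Each iteration evaluates the score once per remaining feature, so pruning from 1463 features requires a very large number of BLOOA evaluations; this is where an inexpensive leave-one-out estimate matters.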

Leave-one-out cross-validation

Three tests are often used to examine the accuracy of a statistical prediction method: the independent dataset test, the sub-sampling test (e.g., 5-fold or 10-fold sub-sampling) and the jackknife test. Of these, the jackknife test is deemed the most rigorous and objective, as analyzed in a comprehensive review [46], and it has been increasingly adopted by leading investigators to test the power of various prediction methods [47–51].

In this paper, we have used the leave-one-out (i.e., jackknife) cross-validation approach to estimate the generalization performance of the classifier. It involves removing one protein from the training set, training the classifier (in our case, the RLSC) on the remaining proteins and then predicting the class label of the removed (left-out) protein using the trained classifier. This process is repeated until every protein has been left out once. The leave-one-out accuracy is then computed by counting the total number of correct predictions and dividing it by n (i.e. the number of samples in the original dataset).
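The leave-one-out procedure just described can be written as a short loop. The sketch below is generic: `fit` and `predict` are parameters, and a 1-nearest-neighbour classifier on made-up one-dimensional data is used purely as a small stand-in for the RLSC.

```python
def leave_one_out_accuracy(X, y, fit, predict):
    """Fraction of samples predicted correctly when each is left out in turn."""
    correct = 0
    for i in range(len(X)):
        # Train on everything except sample i ...
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        # ... and test on the left-out sample.
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Stand-in classifier (not RLSC): 1-nearest neighbour.
def fit_1nn(X_train, y_train):
    return list(zip(X_train, y_train))

def predict_1nn(model, x):
    def squared_distance(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], x))
    return min(model, key=squared_distance)[1]

# Made-up data: two separated groups on a line.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [-1, -1, -1, 1, 1, 1]

print(leave_one_out_accuracy(X, y, fit_1nn, predict_1nn))
```

Written this way, the loop retrains the model n times, which is the cost that the single-fit property of RLSC avoids.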

Balanced LOOA for unbalanced population of classes

Although LOOA has been commonly used in the literature, it is known that LOOA may not provide a precise evaluation of the performance of a classifier if the classes in the data of interest are highly unbalanced. Specifically, a good classifier is usually expected to provide high accuracy on both the positive and the negative data, but LOOA is dominated by the true positive rate if the dataset contains many more positive samples, and vice versa. Since our dataset contains many more negative instances than positive instances, an alternative metric needs to be used in addition to the LOOA. We resort to the balanced LOOA (BLOOA) [52], which can be computed as:

\mathrm{BLOOA} = \frac{1}{2} (TP + TN) (3)

where TP and TN denote the true positive and true negative rate, respectively.
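To make the difference between LOOA and BLOOA concrete, the sketch below computes both, together with the Matthews correlation coefficient in its standard form, from the four cells of a confusion matrix. The counts are hypothetical and chosen only to illustrate class imbalance; they are not the values of Table 1.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Accuracy, balanced accuracy (equation 3) and MCC from raw counts."""
    tp_rate = tp / (tp + fn)            # fraction of positives recovered
    tn_rate = tn / (tn + fp)            # fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced = 0.5 * (tp_rate + tn_rate)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"tp_rate": tp_rate, "tn_rate": tn_rate,
            "accuracy": accuracy, "balanced": balanced, "mcc": mcc}

# Hypothetical unbalanced dataset: 100 positives, 900 negatives.
reasonable = classification_metrics(tp=90, fn=10, tn=880, fp=20)
# A degenerate classifier that labels everything negative.
all_negative = classification_metrics(tp=0, fn=100, tn=900, fp=0)

print(reasonable)
print(all_negative)
```

The degenerate classifier still reaches 90% plain accuracy on this hypothetical data, but its balanced accuracy is 50% and its MCC is 0, which is why the balanced measure is the safer selection criterion when negatives greatly outnumber positives.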

References

1. Buck L, Axel R: A novel multigene family may encode odorant receptors: a molecular basis for odor recognition. Cell. 1991, 65 (1): 175-187.
2. Ache BW: Towards a common strategy for transducing olfactory information. Semin Cell Biol. 1994, 5 (1): 55-63.
3. Hildebrand JG, Shepherd GM: Mechanisms of olfactory discrimination: Converging evidence for common principles across phyla. Ann Rev Neurosci. 1997, 20: 595-631.
4. Pelosi P: Perireceptor events in olfaction. J Neurobiol. 1996, 30 (1): 3-19.
5. Vogt RG, Riddiford LM: Pheromone binding and inactivation by moth antennae. Nature. 1981, 293: 161-163.
6. Pelosi P: Odorant-binding proteins. Crit Rev Biochem Mol Biol. 1994, 29 (3): 199-228.
7. Bianchet MA, Bains G, Pelosi P, Pevsner J, Snyder SH, Monaco HL, Amzel LM: The three-dimensional structure of bovine odorant binding protein and its mechanism of odor recognition. Nat Struct Biol. 1996, 3 (11): 934-939.
8. Pelosi P, Maida R: Odorant-binding proteins in insects. Comp Biochem Physiol B Biochem Mol Biol. 1995, 111 (3): 503-514.
9. Vogt RG, Callahan FE, Rogers ME, Dickens JC: Odorant binding protein diversity and distribution among the insect orders, as indicated by LAP, an OBP-related protein of the true bug Lygus lineolaris (Hemiptera, Heteroptera). Chem Senses. 1999, 24 (5): 481-495.
10. Jacquin-Joly E, Vogt RG, Francois MC, Nagnan-Le Meillour P: Functional and expression pattern analysis of chemosensory proteins expressed in antennae and pheromonal gland of Mamestra brassicae. Chem Senses. 2001, 26 (7): 833-844.
11. Danty E, Arnold G, Huet JC, Masson C, Pernollet JC: Separation, characterization and sexual heterogeneity of multiple putative odorant-binding proteins in the honeybee Apis mellifera L. (Hymenoptera: Apidea). Chem Senses. 1998, 23 (1): 83-91.
12. Wanner KW, Willis LG, Theilmann DA, Isman MB, Feng Q, Plettner E: Analysis of the insect os-d-like gene family. J Chem Ecol. 2004, 30 (5): 889-911.
13. Felicioli A, Ganni M, Garibotti M, Pelosi P: Multiple types and forms of odorant-binding proteins in the Old-World porcupine Hystrix cristata. Comp Biochem Physiol B. 1993, 105 (3–4): 775-784.
14. Raming K, Krieger J, Breer H: Primary structure of a pheromone-binding protein from Antheraea pernyi: Homologies with other ligand-carrying proteins. J Comp Physiol B. 1990, 160 (5): 503-509.
15. Krieger J, Raming K, Breer H: Cloning of genomic and complementary DNA encoding insect pheromone binding proteins: Evidence for microdiversity. Biochim Biophys Acta. 1991, 1088 (2): 277-284.
16. Vogt RG, Köhne AC, Dubnau JT, Prestwich GD: Expression of pheromone binding proteins during antennal development in the gypsy moth Lymantria dispar. J Neurosci. 1989, 9 (9): 3332-3346.
17. Du G, Prestwich GD: Protein structure encodes the ligand binding specificity in pheromone binding proteins. Biochemistry. 1995, 34 (27): 8726-8732.
18. Kaissling KE: Pheromone deactivation catalyzed by receptor molecules: a quantitative kinetic model. Chem Senses. 1998, 23 (4): 385-395.
19. Graham LA, Tang W, Baust JG, Liou YC, Reid TS, Davies PL: Characterization and cloning of a Tenebrio molitor hemolymph protein with sequence similarity to insect odorant-binding proteins. Insect Biochem Mol Biol. 2001, 31 (6–7): 691-702.
20. Kodrik D, Filippov VA, Filippova MA, Sehnal F: Sericotropin: an insect neurohormonal factor affecting RNA transcription. Neth J Zool. 1995, 45 (1–2): 68-70.
21. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402.
22. Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763.
23. Dear TN, Campbell K, Rabbitts TH: Molecular cloning of putative odorant-binding and odorant-metabolizing proteins. Biochemistry. 1991, 30 (43): 10376-10382.
24. Pes D, Mameli M, Andreini I, Krieger J, Weber M, Breer H, Pelosi P: Cloning and expression of odorant-binding proteins Ia and Ib from mouse nasal tissue. Gene. 1998, 212 (1): 49-55.
25. Pugalenthi G, Bhaduri A, Sowdhamini R: GenDiS: Genomic Distribution of protein structural domain Superfamilies. Nucleic Acids Res. 2005, 33: D252-D255.
26. Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997, 28 (3): 405-420.
27. Evgeniou T, Pontil M, Poggio T: Regularization networks and support vector machines. Advances in Computational Mathematics. 2000, 13: 1-50.
28. Rifkin R, Yeo G, Poggio T: Regularized least-squares classification. Advances in Learning Theory: Methods, Models and Applications, NATO Science Series III: Computer and Systems Sciences. 2003, 190: 131-153.
29. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402.
30. Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763.
31. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28 (1): 45-48.
32. Ding C, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. 2001, 17 (4): 349-358.
33. Cortes C, Vapnik V: Support vector networks. Machine Learning. 1995, 20: 273-297.
34. Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 1998, 2: 121-167.
35. Zhang P, Peng J: SVM vs. regularized least squares classification. Proceedings of the 17th International Conference on Pattern Recognition. 2004, 176-179.
36. Zhang CT, Chou KC: An optimization approach to predicting protein structural class from amino acid composition. Protein Sci. 1992, 1 (3): 401-408.
37. Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics. 2001, 43: 246-255.
38. Chou KC: Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005, 21: 10-19.
39. Shen HB, Chou KC: Ensemble classifier for protein fold pattern recognition. Bioinformatics. 2006, 22: 1717-1722.
40. Chou KC, Cai YD: Prediction of membrane protein types by incorporating amphipathic effects. J Chem Inf Model. 2005, 45 (2): 407-413.
41. [http://chou.med.harvard.edu/bioinf/PseAA/]
42. Kawashima S, Ogata H, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 1999, 27: 368-369.
43. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16 (4): 404-405.
44. Smialowski P, Schmidt T, Cox J, Kirschner A, Frishman D: Will my protein crystallize? A sequence-based predictor. Proteins. 2006, 62 (2): 343-355.
45. Kohavi R, John GH: Wrappers for feature subset selection. Artificial Intelligence. 1997, 97: 273-324.
46. Chou KC, Zhang CT: Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology. 1995, 30: 275-349.
47. Chou KC, Shen HB: Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization. Biochem Biophys Res Commun. 2006, 347: 150-157.
48. Shen HB, Chou KC: Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem Biophys Res Commun. 2007, 355: 1006-1011.
49. Chou KC, Shen HB: Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. Journal of Proteome Research. 2006, 5: 1888-1897.
50. Chou KC, Shen HB: Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. Journal of Proteome Research. 2007, 6: 1728-1734.
51. Chou KC, Shen HB: Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. Biochem Biophys Res Commun. 2007, 357: 633-640.
52. Cawley GC: Leave-One-Out Cross-Validation Based Model Selection Criteria for Weighted LS-SVMs. Proceedings of the International Joint Conference on Neural Networks (IJCNN-2006) Vancouver BC Canada. 2006, 16-21.

Acknowledgements

GP, KT and PNS acknowledge the financial support offered by A*Star (Agency for Science, Technology and Research, Singapore) under grant # 052 101 0020. RS acknowledges the National Centre for Biological Sciences (TIFR) for infrastructural and financial support. RS also acknowledges the Wellcome Trust (UK) for funding. The authors thank Professor Dmitrij Frishman for his comments on this work.

Author information

Authors and Affiliations

School of Electrical and Electronic Engineering, Nanyang Technological University, 639798, Singapore
Ganesan Pugalenthi, Ke Tang & PN Suganthan
Nature Inspired Computation and Applications Laboratory (NICAL), Department of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, China
Ke Tang
Department of Animal Science, Bharathidasan University Trichirapalli, Tamilnadu, 620 024, India
G Archunan
National Centre for Biological Sciences, UAS-GKVK campus, Bellary Road, Bangalore, 560 065, India
R Sowdhamini

Corresponding author

Correspondence to R Sowdhamini.

Additional information

Competing interests

The authors declare that there are no competing interests.

Authors' contributions

GP and KT contributed equally to the analysis and manuscript preparation. RS and PNS coordinated the study, helped draft the manuscript and critically revised its content. GA provided useful suggestions to improve the classification scheme. All authors read and approved the manuscript.

Electronic supplementary material

12859_2007_1723_MOESM1_ESM.doc

Additional file 1: Positive training dataset. This data provides 476 protein sequences that are used for training. (DOC 226 KB)

12859_2007_1723_MOESM2_ESM.doc

Additional file 2: Negative training dataset. This data provides 2157 protein sequences that are used for training. (DOC 194 KB)

12859_2007_1723_MOESM3_ESM.doc

Additional file 3: Independent testing dataset. This data provides 414 protein sequences that are used for testing. (DOC 205 KB)

12859_2007_1723_MOESM4_ESM.doc

Additional file 4: Prediction results of 414 odorant binding proteins. This table provides prediction results for 414 odorant binding proteins by our method, BLAST and HMM, where "+" represents proteins correctly predicted as odorant binding proteins, and "-" represents proteins incorrectly predicted as non odorant binding proteins. (DOC 748 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Pugalenthi, G., Tang, K., Suganthan, P. et al. A machine learning approach for the identification of odorant binding proteins from sequence-derived properties. BMC Bioinformatics 8, 351 (2007). https://doi.org/10.1186/1471-2105-8-351

Received: 08 May 2007
Accepted: 19 September 2007
Published: 19 September 2007
DOI: https://doi.org/10.1186/1471-2105-8-351
