Research article

A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition

Alok Sharma1,3*, Kuldip K Paliwal2, Abdollah Dehzangi2, James Lyons2, Seiya Imoto1 and Satoru Miyano1

Author Affiliations

1 Laboratory of DNA Information Analysis, University of Tokyo, Minato-ku, Tokyo, Japan

2 School of Engineering, Griffith University, Brisbane, Australia

3 School of Engineering and Physics, University of the South Pacific, Suva, Fiji


BMC Bioinformatics 2013, 14:233  doi:10.1186/1471-2105-14-233


Received: 25 July 2012
Accepted: 20 June 2013
Published: 24 July 2013

© 2013 Sharma et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Assigning a protein to one of its folds is a transitional step in discovering three dimensional protein structure, which is a challenging task in biomolecular (biological) science. The present research focuses on: 1) the development of classifiers, and 2) the development of feature extraction techniques based on syntactic and/or physicochemical properties.


Apart from the above two main categories of research, we have shown that the selection of physicochemical attributes of the amino acids is an important step in protein fold recognition and has not been explored adequately. We have presented a multi-dimensional successive feature selection (MD-SFS) approach to systematically select attributes. The proposed method is applied to protein sequence data, and an improvement of around 24% in fold recognition has been noted when attributes are selected appropriately.


The MD-SFS has been applied successfully in selecting physicochemical attributes of the amino acids. The selected attributes show improved protein fold recognition performance.


Discovering the three dimensional structure of a protein from its amino acid sequence via computational means is a challenging task and open for research in biological science and bioinformatics. Deciphering protein structure elucidates protein functions. This has a profound impact on understanding the heterogeneity of proteins, protein-protein interactions and protein-peptide interactions. This further helps in drug design. A usual way to predict the structure of a protein is to first acquire proteins with known structures (e.g. by crystallography techniques) and then to develop recognition techniques from their sequences. Thereafter, the developed techniques can be used to classify an unknown protein sequence into one of its classes or folds. The length of a protein sequence (i.e., the number of amino acids in it) is usually different from the length of another protein sequence. However, two proteins with different lengths and low sequential similarities can be categorized to the same fold. The identification of protein folds from a protein sequence would bring us one step closer to the recognition of protein structures. A wide range of techniques have been developed over the past two decades to recognize protein folds. Despite numerous contributions and significant enhancements achieved [1,2], the protein fold recognition problem is yet to be completely solved.

The focus in protein fold recognition can be broadly classified into two categories: 1) the development of classifiers to improve fold recognition, and 2) the development of feature extraction techniques using alphabetical sequence (syntactical-based) and/or using physicochemical properties of the amino acids (attribute-based or physicochemical-based). For the former case, several classifiers have been developed or used including linear discriminant analysis [3], Bayesian classifiers [4], Bayesian decision rule [5], K-Nearest Neighbor [6,7], Hidden Markov Model [8,9], Artificial Neural Network [10,11] and ensemble classifiers [1,12]–[14]. For the latter case, several feature extraction techniques have been developed including composition, transition and distribution [15], occurrence [16], pairwise frequencies [17], pseudo-amino acid composition [18], bigrams [19], autocorrelation [6,20,21] and deriving features by considering more physicochemical properties [22].

Dubchak et al. [15] proposed syntactical and physicochemical-based features for protein fold recognition. They used the following five attributes of amino acids for deriving physicochemical-based features: hydrophobicity (H), predicted secondary structure based on normalized frequency of α-helix (X), polarity (P), polarizability (Z) and van der Waals volume (V). The features proposed by Dubchak et al. [15] have been widely used in the field of protein fold recognition [4,12,22]–[28]. Apart from the 5 attributes used by Dubchak et al. [15], features have also been extracted by incorporating other attributes of the amino acids. Some of the other attributes used are: solvent accessibility [29], flexibility [30], bulkiness [31], first and second order entropy [32], and size of the side chain of the amino acids [22]. Attributes have usually been picked for feature extraction in an arbitrary way. In contrast, Taguchi and Gromiha [16] argued that features from attributes of amino acids can be ignored because they carry insufficient information, and that only syntactical-based features should be considered. This shows that proper exploration of the amino acid attributes has not been conducted. We therefore posed a question: ‘which of the attributes of the amino acids are to be selected for the protein fold recognition problem?’ The answer to this would open a third category of research apart from 1) the development of classifiers, and 2) the development of feature extraction techniques based on the syntactic and/or physicochemical properties.

In this study, we develop a methodology for selecting the attributes of the amino acids for protein fold recognition in a systematic manner. In order to do this, a successive feature selection (SFS) technique based on an exhaustive greedy search algorithm can be applied [33,34]. The SFS technique can find important features from a group of features. However, since several features could be extracted from an attribute (e.g. composition, transition and distribution from hydrophobicity of amino acids) and there could be many attributes, this would lead to selecting multi-dimensional features belonging to an attribute. Therefore, we develop a scheme to identify important attributes by investigating multi-dimensional features corresponding to attributes. For brevity we call the proposed technique multi-dimensional SFS (MD-SFS).

We show two schemes of MD-SFS: backward elimination and forward selection. In the backward elimination scheme, the search for the best subset of attributes starts by retaining all the given attributes. Then, at each iteration, the attribute whose removal causes the minimum loss of information for the subset is discarded. This elimination of attributes from a subset is performed until all the attributes are ranked. This scheme is useful for finding attributes of low individual importance that could perform well if selected in an appropriate subset. In the forward selection scheme, the best attribute is selected first, and a subsequent attribute is included in the subset such that the included attribute improves the performance (e.g., in terms of classification) of the subset. This scheme, however, could be biased towards the highest ranking attribute.

Experiments are carried out using Dubchak’s (DD) dataset [25], Taguchi’s (TG) dataset (Taguchi and Gromiha, [16]) and the extended Ding and Dubchak (EDD) dataset [2]. The selection of physicochemical attributes by the MD-SFS technique improves protein fold recognition by around 18% to 24% on all the datasets when 10-fold cross-validation is applied. The MD-SFS technique is illustrated in the next section and its usefulness is demonstrated in the subsequent sections.

Multi-dimensional successive feature selection

The MD-SFS scheme has been illustrated in Figures 1 and 2. The backward-elimination procedure of MD-SFS has been shown in Figure 1 and the forward-selection procedure has been shown in Figure 2. The purpose of MD-SFS is to select the best attributes for protein fold recognition. In the figures, four attributes (Ta = 4) have been depicted. A feature extraction technique has been used to extract d-dimensional features from each attribute. Attributes are represented as Aj (where j = 1, 2, ..., Ta) and the extracted features of Aj are represented as f1j, f2j, …, fdj. In the figures, there are 4 levels in total, including the beginning state. The number of attributes at each level is denoted by NA. The classification accuracy using k-fold cross-validation of a subset of attributes is denoted by H( · ) (Figure 2). The highest average classification accuracy using k-fold cross-validation at each level is denoted by αl, where l = 0, 1, …, Ta − 1. The output is the ranked attributes.

Figure 1. Multi-dimensional successive feature selection: backward elimination scheme.

Figure 2. Multi-dimensional successive feature selection: forward selection scheme.

MD-SFS: backward elimination

For the backward-elimination case of MD-SFS (Figure 1), a group of features belonging to an attribute is dropped one at a time in each of the successive levels. This gives subsets of attributes containing features. The number of features in a subset at level l is (Ta − l)d. A classifier is used to compute the average classification accuracy using the k-fold cross-validation procedure on each of the subsets. The subset of attributes with the highest average classification accuracy progresses to the next level. The size of the subset is reduced by d features as we progress across the levels. This process is terminated when all the attributes are ranked. In Figure 1, at level 1, the highest average classification accuracy (α1) is obtained by attribute subset {A1, A2, A4}. It is also possible that the average classification accuracy of more than one subset is the same. In that case, the subsets with the highest average classification accuracies would progress to the next level. In Figure 1, subset {A1, A2, A4} progresses to level 2 and at this level the subset with the highest average classification accuracy (α2) is {A2, A4}. At level 3, the subset with the highest average classification accuracy (α3) is {A2}. In Figure 1, the ranked attributes are {A2, A4, A1, A3}, where A2 is the top ranked attribute and A3 is the bottom ranked or least important attribute. Furthermore, there are two criteria by which attributes can be selected. For instance, if we want to select the best 3 attributes for the design then we can take {A2, A4, A1} from the ranked attributes. However, a better way would be to find the argument of the maximum of αl, i.e., $r = \arg\max_{l} \alpha_l$. For instance, if r = 2 then this indicates that subset {A2, A4} at level 2 exhibits the maximum accuracy among all the selected subsets at all the levels. Therefore, attributes of subset {A2, A4} can be selected for the design. We refer to the former criterion of selection as brute-n (where n is the number of attributes to be selected) and to the latter criterion as the maximum accuracy (MA) based criterion.
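The backward elimination procedure and the two selection criteria can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `score` callback (assumed to return the average k-fold cross-validation accuracy of a classifier on the features of a subset) and the accuracy values in `TOY_SCORES` are all hypothetical, and ties are simplified by keeping only the first best subset. The toy values are chosen only so that the run reproduces the ranking described for Figure 1.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Score = Callable[[FrozenSet[str]], float]
Level = Tuple[FrozenSet[str], float]


def md_sfs_backward(attributes: Sequence[str], score: Score) -> Tuple[List[str], List[Level]]:
    """Rank attributes by dropping one attribute (its whole feature block) per level."""
    current = frozenset(attributes)
    levels: List[Level] = [(current, score(current))]  # level 0: all attributes
    removed: List[str] = []                            # attributes in order of elimination
    while len(current) > 1:
        # Score every subset obtained by dropping one remaining attribute.
        scored = [(score(current - {a}), a) for a in sorted(current)]
        best_acc, dropped = max(scored, key=lambda t: t[0])  # first maximum wins on ties
        current = current - {dropped}
        removed.append(dropped)
        levels.append((current, best_acc))
    # The last surviving attribute is top ranked; the first one dropped is bottom ranked.
    ranked = sorted(current) + removed[::-1]
    return ranked, levels


def select_brute_n(ranked: List[str], n: int) -> List[str]:
    """brute-n criterion: take the n top-ranked attributes."""
    return ranked[:n]


def select_max_accuracy(levels: List[Level]) -> FrozenSet[str]:
    """MA criterion: the subset at level r = argmax_l alpha_l."""
    return max(levels, key=lambda lv: lv[1])[0]


# Hypothetical accuracies, invented for illustration only.
TOY_SCORES = {
    frozenset({"A1", "A2", "A3", "A4"}): 0.50,
    frozenset({"A2", "A3", "A4"}): 0.55,
    frozenset({"A1", "A3", "A4"}): 0.40,
    frozenset({"A1", "A2", "A4"}): 0.60,
    frozenset({"A1", "A2", "A3"}): 0.45,
    frozenset({"A2", "A4"}): 0.70,
    frozenset({"A1", "A4"}): 0.48,
    frozenset({"A1", "A2"}): 0.50,
    frozenset({"A2"}): 0.40,
    frozenset({"A4"}): 0.35,
}

ranked, levels = md_sfs_backward(["A1", "A2", "A3", "A4"], TOY_SCORES.__getitem__)
print(ranked)                              # ranking from best to worst
print(select_brute_n(ranked, 3))           # brute-3 selection
print(sorted(select_max_accuracy(levels))) # MA-based selection
```

With these toy values the MA-based criterion stops at two attributes, whereas brute-3 keeps a third attribute even though it lowers the toy accuracy.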

The MD-SFS backward elimination procedure would approximately require between [expression missing] and [expression missing] search combinations, where Ta is the total number of attributes and the term mCn is the n-combination of m elements. If ts denotes the number of attributes in a subset s then this subset would have tsd features. Therefore, the computational complexity of a classifier for doing classification using subset s will be based on tsd number of features.

MD-SFS: forward selection

For the forward-selection case of MD-SFS (Figure 2), one attribute with its corresponding d-dimensional features is taken at a time for computing the average classification accuracy using the k-fold cross-validation procedure. The attribute corresponding to the highest average classification accuracy is stored; i.e., the attribute Aj that maximizes H(Aj). The selected attribute containing the features goes to the next level. In the next level, the attribute that exhibits the highest average classification accuracy in combination with the selected attribute from the previous level is retained. This process continues until all the attributes are ranked. The number of features used in computing classification accuracy at level l is (l + 1)d. Further, we can apply the same two criteria (brute-n and MA-based) for obtaining attributes from the ranked set of attributes, as discussed for the MD-SFS backward elimination approach.
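The forward selection loop can be sketched in the same spirit. Again this is only an illustration under stated assumptions: `md_sfs_forward`, the `score` callback (standing in for H( · ), the average k-fold cross-validation accuracy), the weights and the toy scoring rule are hypothetical and not taken from the paper.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Score = Callable[[FrozenSet[str]], float]


def md_sfs_forward(attributes: Sequence[str], score: Score):
    """Rank attributes by adding, at each level, the attribute that scores best
    in combination with the attributes already selected."""
    selected: List[str] = []
    remaining = sorted(attributes)
    levels: List[Tuple[Tuple[str, ...], float]] = []
    while remaining:
        scored = [(score(frozenset(selected + [a])), a) for a in remaining]
        best_acc, best_attr = max(scored, key=lambda t: t[0])  # first maximum wins on ties
        selected.append(best_attr)
        remaining.remove(best_attr)
        levels.append((tuple(selected), best_acc))
    return selected, levels


# Hypothetical scoring rule, invented for illustration: each extra attribute
# contributes less than the previous one and carries a small fixed penalty.
WEIGHTS = {"A1": 0.30, "A2": 0.40, "A3": 0.10, "A4": 0.35}
n_calls = [0]  # counts how many subsets were evaluated


def toy_score(subset: FrozenSet[str]) -> float:
    n_calls[0] += 1
    ws = sorted((WEIGHTS[a] for a in subset), reverse=True)
    return sum(w * 0.5 ** i for i, w in enumerate(ws)) - 0.06 * len(subset)


ranked, levels = md_sfs_forward(["A1", "A2", "A3", "A4"], toy_score)
best_subset = max(levels, key=lambda lv: lv[1])[0]  # MA-based criterion
print(ranked, best_subset, n_calls[0])
```

For four attributes the loop evaluates 4 + 3 + 2 + 1 = 10 subsets, since each level only extends the subset chosen at the previous level.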

The MD-SFS forward selection would require around Ta(Ta + 1)/2 search combinations, where Ta is the total number of attributes. A subset s with ts attributes would have tsd number of features. The computational complexity of a classifier used to compute classification accuracy would depend on tsd number of features.
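As a worked instance of these expressions, taking the values used in this study (Ta = 30 attributes and d = 20 autocorrelation features per attribute) and assuming that every level is evaluated:

```latex
\[
\frac{T_a (T_a + 1)}{2} = \frac{30 \times 31}{2} = 465 \quad \text{search combinations},
\]
\[
(l + 1)\,d \,\Big|_{\,l = T_a - 1} = 30 \times 20 = 600 \quad \text{features at the last level}.
\]
```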



In this study, three protein sequence datasets have been used: 1) DD-dataset [25], 2) TG-dataset (Taguchi and Gromiha, [16]) and 3) EDD-dataset [2]. The DD-dataset that we have used consists of 311 protein sequences in the training set, where two proteins have no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set consists of 383 protein sequences where sequence identity is less than 40%. Both sets belong to 27 SCOP folds, which represent all major structural classes: α, β, α/β, and α + β [25]. The training set and test set have been merged into a single set of data in order to perform the k-fold cross-validation process.

TG-dataset consists of 1612 protein sequences belonging to 30 different folding types of globular proteins. The names and the number of protein sequences in each of the 30 folds have been described in Taguchi and Gromiha [16]. The protein sequences of TG-dataset have been first transformed into their corresponding PSSM (position-specific-scoring-matrix) [35] sequences by using PSIBLAST (the cut off E-value is set to E = 0.001).

EDD-dataset consists of 3418 proteins with less than 40% sequential similarity belonging to the 27 folds that were originally used in the DD-dataset. We extracted the EDD-dataset from SCOP 1.75 in a similar manner to Dong et al. [2] in order to study our proposed method using a larger number of samples.

Physicochemical attributes

In this study 30 physicochemical attributes (endnote a) have been utilized, including 5 popular attributes as used by Dubchak et al. [15]. The attributes with the corresponding symbols are listed in Table 1. The residues of amino acids of these 30 attributes are given in Table 2.

Table 1. Physicochemical attributes used in the study

Table 2. Residues of amino acids of the 30 attributes1

Feature extraction

As discussed in the Background Section, there exist several feature extraction techniques. Given a classifier, the features derived from different feature extraction techniques would exhibit different fold recognition performances. Since in this paper the aim is not to find a feature extraction technique for a particular classifier, we use a simple autocorrelation of the residues of protein sequences. The expression for autocorrelation features used in the paper is given as follows:

[equation missing]


where N is the length of protein sequence, sk is the residue of kth amino acid in a protein sequence and μ is the mean (or average) of N residues. In this work, we use i = 1, 2, …, 20. Therefore, each protein sequence will give 20-dimensional autocorrelation features.
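Because the expression itself is not reproduced here, the following is only a sketch of one common form of lag-i autocorrelation that is consistent with the symbols defined above (N, sk, μ and lags i = 1, …, 20). It assumes a mean-centred product averaged over the N − i available pairs; the normalization used in the paper may differ. The function name and the tiny attribute scale in the usage example are hypothetical.

```python
from typing import Dict, List, Sequence


def autocorrelation_features(values: Sequence[float], max_lag: int = 20) -> List[float]:
    """Return one feature per lag i = 1..max_lag.

    Assumed form (not confirmed by the source):
        R_i = 1/(N - i) * sum_{k=1}^{N-i} (s_k - mu) * (s_{k+i} - mu)
    """
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mu = sum(values) / n
    centred = [v - mu for v in values]
    return [
        sum(centred[k] * centred[k + i] for k in range(n - i)) / (n - i)
        for i in range(1, max_lag + 1)
    ]


def encode(sequence: str, scale: Dict[str, float]) -> List[float]:
    """Replace each amino acid letter by its attribute value."""
    return [scale[aa] for aa in sequence]


# Made-up attribute values for three residues, for illustration only.
toy_scale = {"A": 0.6, "G": 0.5, "L": 1.1}
toy_sequence = "AGLLAGALGA" * 3  # 30 residues
features = autocorrelation_features(encode(toy_sequence, toy_scale))
print(len(features))  # one feature per lag
```

Each attribute therefore contributes a block of 20 features per protein, which is the d used by MD-SFS in this study.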


In the literature, several classifiers have been used for the protein fold recognition problem. We used three techniques for classification: support vector machine (SVM), Naïve Bayes (NB) and linear discriminant analysis (LDA) with nearest centroid classifier [57]–[59]. SVM and NB classifiers are used from WEKA environment [60] by using WEKA’s default parameter settings.
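The evaluation loop behind each reported accuracy can be sketched as below. This is a stand-in only: it uses a plain nearest centroid classifier without the LDA projection, it is not the WEKA SVM or NB implementation used in the study, and the function names, the fold assignment and the toy data are assumptions made for illustration.

```python
import random
from typing import Dict, Hashable, List, Sequence


def fit_centroids(X: Sequence[Sequence[float]], y: Sequence[Hashable]) -> Dict[Hashable, List[float]]:
    """Compute the mean feature vector of each class."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}


def predict(centroids: Dict[Hashable, List[float]], x: Sequence[float]) -> Hashable:
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda lab: sum((a - b) ** 2 for a, b in zip(centroids[lab], x)))


def kfold_accuracy(X, y, k: int = 10, seed: int = 0) -> float:
    """Average accuracy over k folds; assumes len(X) >= k so no fold is empty."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(centroids, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


# Two well separated toy classes, invented for illustration.
X_toy = [[i * 0.01, 0.0] for i in range(20)] + [[5.0 + i * 0.01, 5.0] for i in range(20)]
y_toy = [0] * 20 + [1] * 20
accuracy = kfold_accuracy(X_toy, y_toy, k=10)
print(accuracy)
```

A function of this shape is what the `score` callback of an attribute selection loop would wrap, after concatenating the feature blocks of the attributes in the subset.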

Results and discussions

Five attributes used by Ding and Dubchak [25] are used as a benchmark. These attributes are H, P, Z, X and V (see Table 1 for the description of these symbols). In all the experiments we use a 10-fold cross-validation process to obtain the recognition performance. First we present in Table 3 the fold recognition using these 5 attributes on DD, TG and EDD datasets. It can be clearly observed that the highest fold recognition on DD-dataset obtained by HPZXV is 32.8%, on TG-dataset is 28.8% and on EDD-dataset is 38.4%.

Table 3. Protein fold recognition (shown in percentage) on all the datasets using HPZXV attributes used by Ding and Dubchak [25]

Next we apply the MD-SFS backward elimination approach on the DD-dataset, TG-dataset and EDD-dataset, respectively, in three cases: 1) using the top 10 attributes of the amino acids from Table 1, 2) using the top 15 attributes of the amino acids from Table 1, and 3) using all 30 attributes from Table 1. We use two criteria, brute-n and MA-based (as discussed in Section MD-SFS: Backward Elimination), to select the attributes. Since in Table 3 the results are reported using 5 attributes, we apply brute-5 to compare the results with those of Table 3. The selected attributes with their corresponding protein fold recognition (abbreviated as PFR in Tables 4, 5, 6, 7, 8, 9, 10 and 11) performance on the DD-dataset using the brute-5 criterion are given in Table 4 and using the MA-based criterion in Table 5. The first row of results is for HPZXV (taken from Table 3). The first column indicates the number of attributes taken for attribute selection. The same setup has been used for all the remaining tables (Tables 6, 7, 8, 9, 10 and 11). It can be seen from Tables 4 and 5 that incorporating more attributes and then performing attribute selection helps to improve the recognition performance. By using only 5 attributes (Table 4), the recognition performance has improved by 5.7% to 16.6% as compared with the recognition performance of the HPZXV attributes. If the number of attributes is not fixed and selection is based on the MA criterion then the improvement is between 14.1% and 18.1%.

Table 4. MD-SFS backward elimination approach on DD-dataset using brute-5 criterion

Table 5. MD-SFS backward elimination approach on DD-dataset using MA-based criterion

Table 6. MD-SFS backward elimination approach on TG-dataset using brute-5 criterion

Table 7. MD-SFS backward elimination approach on TG-dataset using MA-based criterion

Table 8. MD-SFS backward elimination approach on EDD-dataset using brute-5 criterion

Table 9. MD-SFS backward elimination approach on EDD-dataset using MA-based criterion

Table 10. MD-SFS forward selection approach on DD-dataset using brute-5 criterion

Table 11. MD-SFS forward selection approach on DD-dataset using MA-based criterion

A similar scheme has been applied using the TG-dataset and the results are reported in Tables 6 and 7 (Table 6 using brute-5 criterion and Table 7 using MA-based criterion). It can be observed from Table 6 that recognition performance has been improved between 7.5% and 10.2%. Also the improvement from Table 7 is between 12.6% and 18.1%.

We have also employed the EDD-dataset for the experiment and the results are reported in Tables 8 and 9 (Table 8 using brute-5 criterion and Table 9 using MA-based criterion). From Table 8, we note that the improvement in recognition performance is between 6.5% and 8.8%, and from Table 9, it is between 15.5% and 24.3%.

Subsequently we applied the MD-SFS forward selection approach on the DD, TG and EDD datasets. Again we use the brute-5 and MA-based criteria. The protein fold recognition performance using the DD-dataset with the brute-5 criterion is shown in Table 10 and with the MA-based criterion in Table 11. It can be observed from Table 10 that by using only 5 attributes the recognition performance can be improved by between 4.5% and 12.2%. In a similar way, the improvement using the MA-based criterion is from 13.3% to 17.7%.

On TG-dataset, MD-SFS forward selection with brute-5 criterion is depicted in Table 12 and with MA-based criterion is depicted in Table 13. The improvement from Table 12 using only 5 attributes is between 8.1% and 10.4%; and, from Table 13 we have improvement from 12.4% to 17.5%.

Table 12. MD-SFS forward selection approach on TG-dataset using brute-5 criterion

Table 13. MD-SFS forward selection approach on TG-dataset using MA-based criterion

Similarly, on EDD-dataset, MD-SFS forward selection with brute-5 criterion is shown in Table 14 and with MA-based criterion is shown in Table 15. The improvement from Table 14 using only 5 attributes is between 7.4% and 8.7%; and, from Table 15 we have improvement from 10.5% to 16.2%.

Table 14. MD-SFS forward selection approach on EDD-dataset using brute-5 criterion

Table 15. MD-SFS forward selection approach on EDD-dataset using MA-based criterion

From the results, we can deduce that physicochemical-based attributes are important for the prediction accuracy of protein folds. An appropriately selected subset of attributes could enhance the prediction accuracy significantly. The subsets of attributes selected for different datasets are different. The attributes in a subset also vary depending on the classifier used. However, some attributes repeatedly appear in the obtained subsets. For instance, a subset BPEVO is selected from all 30 attributes using the brute-5 criterion on the DD-dataset when LDA is used, and a subset BPDFM is selected when SVM is used (see Table 4). It can be observed that the attributes B and P are common to both subsets. This could imply that these attributes contain more discriminative information for protein fold recognition than others. When we analyzed all the subsets using the brute-5 criterion on all three datasets (Tables 4, 6, 8, 10, 12 and 14), we found that the 5 most frequent attributes are J (appeared 12 times), B (appeared 9 times), T (appeared 9 times), F (appeared 8 times) and M (appeared 6 times). Therefore, these attributes (J, B, T, F and M) can be seen as important attributes. However, this does not imply that a subset containing all 5 of these attributes would perform the best, as the performance of attributes in combination with other attributes is also crucial.
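The occurrence analysis described above amounts to counting attribute symbols across the selected subsets. A minimal sketch, using only the two subsets quoted in the text (the full set of subsets is in the tables and is not reproduced here; the variable names are hypothetical):

```python
from collections import Counter

# The two brute-5 subsets quoted in the text for the DD-dataset
# (LDA and SVM, selected from all 30 attributes).
subsets = ["BPEVO", "BPDFM"]

# Each attribute is denoted by a one-letter symbol, so a subset can be
# counted character by character.
occurrences = Counter(symbol for subset in subsets for symbol in subset)

# Attributes that appear in every subset.
shared = sorted(s for s, count in occurrences.items() if count == len(subsets))
print(shared)
print(occurrences.most_common(2))
```

Extending `subsets` with every brute-5 subset from the tables would give the occurrence counts that the analysis reports.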

We have also carried out a statistical hypothesis test to show the significance of the results achieved. In order to do this, we randomly selected m attributes from a given set of n attributes and computed the prediction accuracy using these m attributes. We repeated this random selection r times and computed the average prediction accuracy. All three classifiers (LDA, SVM and NB) are used for this purpose. We applied this testing on all three benchmark datasets (DD, TG and EDD) and compared the results with the proposed schemes. In this testing, we used m = 5, n = 30 and r = 20. The results are reported in Tables 16, 17 and 18. It can be observed from these tables that the prediction accuracy using a random selection approach is inferior to that of the proposed schemes. This indicates that systematically selecting attributes (using the MD-SFS procedures) contributed to the prediction accuracy of protein folds.
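The random selection baseline can be sketched as follows, with m = 5, n = 30 and r = 20 as stated above. The function name, the placeholder attribute labels, the fixed seed and the toy scoring function are assumptions made for illustration; in the study the score would be the cross-validated accuracy of a classifier on the features of the sampled attributes.

```python
import random
from typing import Callable, FrozenSet, List, Sequence


def random_subset_baseline(
    attributes: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    m: int = 5,
    r: int = 20,
    seed: int = 0,
) -> float:
    """Average score of r random subsets, each with m attributes drawn without replacement."""
    rng = random.Random(seed)
    accuracies = [score(frozenset(rng.sample(list(attributes), m))) for _ in range(r)]
    return sum(accuracies) / r


# Placeholder labels for n = 30 attributes (not the symbols of Table 1).
all_attributes = [f"A{i}" for i in range(1, 31)]
seen_sizes: List[int] = []  # records the size of every sampled subset


def toy_score(subset: FrozenSet[str]) -> float:
    # Invented scoring rule: attributes with an even index help slightly.
    seen_sizes.append(len(subset))
    return 0.20 + 0.01 * sum(1 for a in subset if int(a[1:]) % 2 == 0)


baseline = random_subset_baseline(all_attributes, toy_score, m=5, r=20)
print(round(baseline, 4))
```

The resulting average would then be compared with the accuracy of the 5 attributes chosen by MD-SFS under the same classifier and dataset.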

Table 16. Statistical analysis using DD-dataset

Table 17. Statistical analysis using TG-dataset

Table 18. Statistical analysis using EDD-dataset

Furthermore, we have carried out paired t-test with 5% significance level to study the statistical significance of the prediction accuracy obtained. We used MD-SFS backward elimination method (using brute-5 criterion) as a prototype and used all the three classifiers (LDA, SVM and NB). We compared the results obtained by all the classifiers for HPZXV attributes for DD, TG and EDD benchmarks (the degree of freedom is 2). The paired t-test results for LDA, SVM and NB are 0.029, 0.003 and 0.004, respectively. These results show that the prediction accuracies obtained are significant.
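A paired t-test with 2 degrees of freedom (three paired observations, one per dataset) can be sketched with the standard library only, because the Student t distribution with 2 degrees of freedom has a closed-form cumulative distribution. The sketch assumes a two-sided test, which the text does not state, and the accuracy values below are invented for illustration and are not the paper's results.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_test_df2(before: Sequence[float], after: Sequence[float]) -> Tuple[float, float]:
    """Paired t-test for exactly three pairs (2 degrees of freedom).

    Returns (t, two-sided p). For 2 degrees of freedom the t CDF is
    F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)), so p = 1 - |t| / sqrt(t^2 + 2).
    """
    if len(before) != 3 or len(after) != 3:
        raise ValueError("this closed form is only valid for three pairs")
    diffs = [a - b for a, b in zip(after, before)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p


# Hypothetical accuracies (percent) on three datasets for one classifier:
# benchmark attributes versus selected attributes.
benchmark = [30.0, 28.0, 35.0]
selected = [31.0, 30.0, 38.0]
t_stat, p_value = paired_t_test_df2(benchmark, selected)
print(round(t_stat, 3), round(p_value, 3))
```

With only three pairs, the test is sensitive to how consistent the improvement is across datasets, not only to its average size.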

We can summarize that the performance of the protein fold recognition improved when the attributes are appropriately selected. This also shows that physicochemical attributes can play an important role in protein fold recognition if selected appropriately. It should also be noted that the performance can be improved further by considering several other feature extraction techniques with sophisticated ensemble classifiers.


In this study, we have shown that selecting physicochemical attributes of amino acids improves protein fold recognition performance significantly. It is, therefore, beneficial to explore important attributes in the process of determining the three dimensional structure of proteins. To do this, we have developed a multi-dimensional successive feature selection (MD-SFS) technique and demonstrated it with both backward elimination and forward selection approaches. There are several attributes available (e.g. a list of 544 attributes can be found in AAindex [61]) and the investigation of these attributes by an exhaustive search would help in solving the problem better. Though it is always useful to explore as many attributes as possible, this comes at the expense of additional computational cost and memory requirements. Nonetheless, computationally efficient techniques for an exhaustive exploration of important attributes should be developed along with feature extraction and classification techniques.


a. Though there is a large number of physicochemical-based attributes defined for amino acids, many authors (e.g. [31,62]–[65]) have used only a limited number of attributes (up to 8) in their studies. We attempted to study the attributes that were given more emphasis in the literature.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AS designed and carried out the experiments, and wrote the first draft of the manuscript. KKP assisted in designing a section of experiments. AD provided the dataset and helped in the second draft of the manuscript. JL also helped in the second draft of the manuscript. SI and SM financed the project. All authors read and approved the final manuscript.


  1. Yang T, Kecman V, Cao L, Zhang C, Huang JZ: Margin-based ensemble classifier for protein fold recognition. Expert Syst Appl 2011, 38:12348-12355.

  2. Dong Q, Zhou S, Guan G: A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 2009, 25(20):2655-2662.

  3. Klein P: Prediction of protein structural class by discriminant analysis. Biochim Biophys Acta 1986, 874:205-215.

  4. Chinnasamy A, Sung WK, Mittal A: Protein structure and fold prediction using tree-augmented naive Bayesian classifier. J Bioinform Comput Biol 2005, 3(4):803-819.

  5. Wang ZZ, Yuan Z: How good is prediction of protein-structural class by the component-coupled method? Proteins 2000, 38:165-175.

  6. Shen HB, Chou KC: Ensemble classifier for protein fold pattern recognition. Bioinformatics 2006, 22:1717-1722.

  7. Ding YS, Zhang TL: Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: an approach with immune genetic algorithm-based ensemble classifier. Patt Recog Letters 2008, 29:1887-1892.

  8. Bouchaffra D, Tan J: Protein fold recognition using a structural Hidden Markov Model. Proceedings of the 18th International Conference on Pattern Recognition 2006, 3:186-189.

  9. Deschavanne P, Tuffery P: Enhanced protein fold recognition using a structural alphabet. Proteins: Structure, Function, and Bioinformatics 2009, 76:129-137.

  10. Chen K, Zhang X, Yang MQ, Yang JY: Ensemble of probabilistic neural networks for protein fold recognition. Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE) 2007, I:66-70.

  11. Ying Y, Huang K, Campbell C: Enhanced protein fold recognition through a novel data integration approach. BMC Bioinforma 2009, 10(1):267.

  12. Dehzangi A, Amnuaisuk SP, Ng KH, Mohandesi E: Protein fold prediction problem using ensemble of classifiers. Proceedings of the 16th International Conference on Neural Information Processing 2009, Part II:503-511.

  13. Dehzangi A, Amnuaisuk SP, Dehzangi O: Enhancing protein fold prediction accuracy by using ensemble of different classifiers. Aust J Intell Inf Process Syst 2010, 26(4):32-40.

  14. Dehzangi A, Karamizadeh S: Solving protein fold prediction problem using fusion of heterogeneous classifiers. INF, Int Interdiscip J 2011, 14(11):3611-3622.

  15. Dubchak I, Muchnik I, Kim SK: Protein folding class predictor for SCOP: approach based on global descriptors. In Proceedings, 5th International Conference on Intelligent Systems for Molecular Biology. Halkidiki, Greece; 1997:104-107.

  16. Taguchi Y-h, Gromiha MM: Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinforma 2007, 8:404.

  17. Ghanty P, Pal NR: Prediction of protein folds: extraction of new features, dimensionality reduction, and fusion of heterogeneous classifiers. IEEE Trans On Nano Bioscience 2009, 8:100-110.

  18. Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition.

    Proteins 2001, 43:246-255.

    erratum: 2001, vol. 44, 60

    PubMed Abstract | Publisher Full Text OpenURL

  19. Sharma A, Lyons J, Dehzangi A, Paliwal KK: A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition.

    J Theor Biol 2013, 320(7):41-46. PubMed Abstract | Publisher Full Text OpenURL

  20. Kurgan LA, Cios KJ, Chen K: SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences.

    BMC Bioinforma 2008, 9:226. BioMed Central Full Text OpenURL

  21. Liu T, Geng X, Zheng X, Li R, Wang J: Accurate Prediction of Protein Structural Class Using Auto Covariance Transformation of PSI-BLAST Profiles.

    Amino Acids 2012, 42:2243-2249. PubMed Abstract | Publisher Full Text OpenURL

  22. Dehzangi A, Amnuaisuk SP: Fold prediction problem: the application of new physical and physicochemical-based features.

    Protein Pept Lett 2011, 18:174-185. PubMed Abstract | Publisher Full Text OpenURL

  23. Krishnaraj Y, Reddy CK: Boosting methods for protein fold recognition: an empirical comparison.

    IEEE Int Conf Bioinfor Biomed 2008, 393-396. OpenURL

  24. Valavanis IK, Spyrou GM, Nikita KS: A comparative study of multi-classification methods for protein fold recognition.

    Int J Comput Intell Bioinform Syst Biol 2010, 1(3):332-346.

  25. Ding C, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks.

    Bioinformatics 2001, 17(4):349-358.

  26. Kecman V, Yang T: Protein fold recognition with adaptive local hyperplane algorithm. In Computational Intelligence in Bioinformatics and Computational Biology, CIBCB '09. IEEE Symposium. Nashville, TN, USA; 2009:75-78.

  27. Kavousi K, Moshiri B, Sadeghi M, Araabi BN, Moosavi-Movahedi AA: A protein fold classifier formed by fusing different modes of pseudo amino acid composition via PSSM.

    Comput Biol Chem 2011, 35(1):1-9.

  28. Chmielnicki W, Stapor K: A hybrid discriminative-generative approach to protein fold recognition.

    Neurocomputing 2012, 75:194-198.

  29. Zhang H, Zhang T, Gao J, Ruan J, Shen S, Kurgan LA: Determination of protein folding kinetic types using sequence and predicted secondary structure and solvent accessibility.

    Amino Acids 2010, 1-13.

  30. Najmanovich R, Kuttner J, Sobolev V, Edelman M: Side-chain flexibility in proteins upon ligand binding.

    Proteins: Structure, Function, and Bioinformatics 2000, 39(3):261-268.

  31. Huang JT, Tian J: Amino acid sequence predicts folding rate for middle-size two-state proteins.

    Proteins: Structure, Function, and Bioinformatics 2006, 63(3):551-554.

  32. Zhang TL, Ding YS, Chou KC: Prediction protein structural classes with pseudo amino acid composition: approximate entropy and hydrophobicity pattern.

    J Theor Biol 2008, 250:186-193.

  33. Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to algorithms. USA: MIT Press; 1990.

  34. Sharma A, Imoto S, Miyano S: A top-r feature selection algorithm for microarray gene expression data.

    IEEE/ACM Trans Comput Biol Bioinform 2012, 9(3):754-764.

  35. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.

    Nucleic Acids Res 2001, 29:2994-3005.

  36. Argos P, Rao JKM, Hargrave PA: Structural prediction of membrane-bound proteins.

    Eur J Biochem 1982, 128:565-575.

  37. Zimmerman JM, Eliezer N, Simha R: The characterization of amino acid sequences in proteins by statistical methods.

    J Theor Biol 1968, 21:170-201.

  38. Charton M, Charton BI: The structural dependence of amino acid hydrophobicity parameters.

    J Theor Biol 1982, 99:629-644.

  39. Burgess AW, Ponnuswamy PK, Scheraga HA: Analysis of conformations of amino acid residues and prediction of backbone topography in proteins.

    Isr J Chem 1974, 12:239-286.

  40. Fauchere JL, Charton M, Kier LB, Verloop A, Pliska V: Amino acid side chain parameters for correlation studies in biology and pharmacology.

    Int J Peptide Protein Res 1988, 32:269-278.

  41. Bundi A, Wuthrich K: 1H-nmr parameters of the common amino acid residues measured in aqueous solutions of the linear tetrapeptides H-Gly-Gly-X-L-Ala-OH.

    Biopolymers 1979, 18:285-297.

  42. Charton M, Charton BI: The dependence of the Chou-Fasman parameters on amino acid side chain structure.

    J Theor Biol 1983, 111:447-450.

  43. Khanarian G, Moore WJ: The Kerr effect of amino acids in water.

    Aust J Chem 1980, 33:1727-1741.

  44. Cid H, Bunster M, Canales M, Gazitua F: Hydrophobicity and structural classes in proteins.

    Protein Eng 1992, 5:373-375.

  45. Chou PY, Fasman GD: Prediction of the secondary structure of proteins from their amino acid sequence.

    Adv Enzymol 1978, 47:45-148.

  46. Levitt M: Conformational preferences of amino acids in globular proteins.

    Biochemistry 1978, 17:4277-4285.

  47. Dawson DM: The Biochemical Genetics of Man. Edited by Brock DJH, Mayo O. Academic Press; 1972.

  48. Dayhoff MO, Hunt LT, Hurst-Calderone S: Composition of proteins.

    Atlas of Protein Sequence and Structure 1978, 5(3):363-375.

  49. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins.

    Atlas of Protein Sequence and Structure 1978, 5(3):345-352.

  50. Eisenberg D, McLachlan AD: Solvation energy in protein folding and binding.

    Nature 1986, 319:199-203.

  51. Fasman GD (Ed): Handbook of Biochemistry: Section A In Proteins. 3rd edition. CRC Press; 1976.

  52. Geisow MJ, Roberts RDB: Amino acid preferences for secondary structure vary with protein class.

    Int J Biol Macromol 1980, 2:387-389.

  53. Grantham R: Amino acid difference formula to help explain protein evolution.

    Science 1974, 185:862-864.

  54. Guy HR: Amino acid side-chain partition energies and distribution of residues in soluble proteins.

    Biophys J 1985, 47:61-70.

  55. Hutchens JO: Heat capacities, absolute entropies, and entropies of formation of amino acids and related compounds. In Handbook of Biochemistry. 2nd edition. Edited by Sober HA. Cleveland, Ohio: Chemical Rubber Co; 1970.

  56. Janin J, Wodak S, Levitt M, Maigret B: Conformation of amino acid side-chains in proteins.

    J Mol Biol 1978, 125:357-386.

  57. Sharma A, Paliwal KK: Rotational linear discriminant analysis technique for dimensionality reduction.

    IEEE Trans Knowl Data Eng 2008, 20(10):1336-1347.

  58. Sharma A, Paliwal KK: A gradient linear discriminant analysis for small sample sized problem.

    Neural Processing Letters 2008, 27(1):17-24.

  59. Sharma A, Paliwal KK: Cancer classification by gradient LDA technique using microarray gene expression data.

    Data Knowl Eng 2008, 66(2):338-347.

  60. Witten IH, Frank E: Data mining: practical machine learning tools with java implementations. San Francisco, CA: Morgan Kaufmann; 2000.

  61. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008.

    Nucleic Acids Res 2008, 36:D202-D205.

  62. Li ZC, Zhou XB, Lin YR, Zou XY: Prediction of protein structure class by coupling improved genetic algorithm and support vector machine.

    Amino Acids 2008, 35:581-590.

  63. Liu L, Hu X: Based on improved parameters predicting protein fold.

    Sixth Int Conf Nat Comput (ICNC 2010) 2010, 6:3291-3295.

  64. Kurgan L, Chen K: Prediction of protein structural class for the twilight zone sequences.

    Biochem Biophys Res Commun 2007, 357:453-460.

  65. Gromiha M: A statistical model for predicting protein folding rates from amino acid sequence with structural class information.

    J Chem Inf Model 2005, 45:494-501.