  • Methodology article
  • Open access

Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences

Abstract

Background

Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences.

Results

The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy, which ranges between 80% and 96.7% (83.5% for the twilight-zone datasets) depending on the dataset. This translates into 19% and 8% error rate reductions when compared against the best-performing competing method on the two largest datasets. For the membrane protein class, which is not considered by the majority of existing methods, the proposed predictor achieves 58% accuracy even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.

Conclusions

The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.

Background

Protein function, regulation, and interactions can be learned from their structure [1, 2], which motivates development of novel methods for the prediction of the protein structure. These predictions concern various levels and aspects of the protein structure including the tertiary structure [3, 4], solvent accessibility, depth, flexibility and packing of residues [5], and secondary structure [6]. In contrast to the tertiary structure, which describes the position of each of the protein's atoms, the secondary structure simplifies the protein structure to a set of spatially local folding patterns that include α-helices, β-strands and coils. The spatial distribution of these local patterns determines the overall, three-dimensional shape of proteins in which individual secondary structures interact with each other, creating more complex structures such as parallel or antiparallel β-sheets, β-barrels, and others. Although the final product is complex, protein structures can be categorized into a few structural classes depending on the amount, types and spatial distribution of the secondary structures found in their fold.

Knowledge of the structural class has been shown to stimulate the development of methods for identification of other structural and functional characteristics of proteins [7]. Examples include prediction of protein unfolding rates [8], characterization and prediction of folding rates [9–11], quantification of the relation between chain lengths and folding rates of two-state proteins [12], prediction of DNA-binding sites [13], discrimination of outer membrane proteins [14], fold prediction [15], secondary structure and secondary structure content prediction [16, 17], reduction of the conformation search space [18] and implementation of a heuristic approach to find tertiary structure [19], to name just a few. At the same time, the structural classes are known for a relatively small number of proteins. The most recent release 1.75 of the SCOP database [20, 21] includes 110,800 protein domains with the annotated classes, while release 36 of the NCBI's RefSeq database [22] includes 8,181,910 non-redundant protein sequences. The main reason for this wide gap is the unavailability of the protein structure, which is used to assign the structural class, for the significant majority of the known protein sequences. To this end, an accurate and automated method for classification of sequences into the corresponding structural classes would provide assistance when the structural class is unknown for a given chain.

Template-based modeling, which is successfully used to predict the tertiary structure, is based on an assumption that similar sequences (usually defined as sequences with similarity of above 30%) share similar structures [23–25]. Prediction methods that rely on the sequence alignment [26, 27] usually perform relatively poorly when sequences with high identity are not available. More specifically, over 95% of protein chains characterized by low, 20-25%, pairwise identity, which is referred to as the twilight-zone similarity, have different structures [28], which substantially reduces accuracy of the corresponding predictions. We observe that about 40% of sequences for which the tertiary structure was deposited to Protein Data Bank (PDB) [29] in 2005 share twilight-zone pairwise similarity with any sequence deposited in the PDB before 2005 [30], which motivates development of the prediction methods for these challenging chains. Further motivation comes from the fact that finding similar folding patterns among the proteins characterized by low sequence identity is beneficial for the reconstruction of the tertiary structure [31, 32]. Researchers have observed that pairs of sequences with low identity may share similar folding patterns or overall structure [33–35] and they can be used to predict tertiary structure [3, 36, 37]. The accurate alignment of the distant homologues (proteins with similar structures and sequences that share low identity) is still a challenging problem in spite of many years of research in this area [36, 37]. We note that structurally similar proteins that share low sequence identity can be found based on coarse grained classifications such as the structural classes that are addressed in this work. We believe that the proposed method could find applications in the detection of remote homologues.

Protein structural class

Two databases that classify protein structures are SCOP (Structural Classification of Proteins) [20, 21] and CATH (Class, Architecture, Topology and Homologous superfamily) [38, 39]. The former database relies on a manual process to classify the structures while the latter applies a combination of automated and manual procedures. The first level of the classification hierarchy in both databases is the structural class. SCOP distinguishes seven classes where the four major classes, which cover almost 90% of all SCOP entries, are all-α, all-β, α+β and α/β. The two former classes include structures dominated by α-helices and β-strands, respectively. The two latter classes correspond to structures that include both helices and strands where in the case of the α+β class these secondary structures are segregated, whereas for the α/β class they are interspersed. The three remaining classes include multi-domain proteins, membrane and cell surface proteins and peptides, and small proteins. The multi-domain proteins consist of several domains where each domain may belong to a different class while the small proteins have short sequences and their secondary structures do not fit the definition of the other classes. We note that although membrane proteins are relatively common, their coverage in the SCOP database is relatively low as it is difficult to obtain their structure [40]. SCOP also includes four supplementary categories, i.e., coiled-coil proteins, designed proteins, low-resolution proteins, and peptides, but they have limited practical implications. Figure 1 shows representative structures for the seven classes in the SCOP database.

Figure 1

Cartoon structures of proteins that cover the seven structural classes defined in the SCOP database. Panel a shows the structure of the protein with PDB identifier 1mty, b of 1a8d, c of 2f62, d of 2bf5, e of 1vqq, f of 1u7g, and g of 4hir. Helices are shown in light grey, coils in dark grey, and strands in black.

The CATH database defines only four classes that include mainly α, mainly β, mixed α-β, and proteins with few secondary structures. In this work we address the SCOP-based classification because it further subdivides the mixed proteins, defines several important additional classes such as membrane and multi-domain proteins, and because most of the existing structural class prediction methods are also based on this definition of the structural classes. Moreover, the structural classes defined in CATH are relatively easy to predict based on the secondary structure content of a protein, which in turn could be predicted using existing content prediction methods [41, 42]. This is in contrast to the classification in the SCOP database where more complex information, such as the relative amount and spatial position of the secondary structures, is used to assign classes [43].

Related work

The manual assignment of structural classes performed in SCOP is based on the spatial arrangement of secondary structure segments, which is inspected using the tertiary structure. We aim at building an automated method which makes the class predictions based solely on the protein sequence. Prediction is typically performed in two steps: 1) the variable-length sequences are converted into fixed-length feature vectors; 2) the feature vectors are input into a classification algorithm to generate the class prediction.

Due to a relatively large existing body of research in this area the following review concentrates on recent methods. The reader is referred to a review by Chou [7] that provides further details on older methods and that motivates the development of the structural class prediction methods.

The majority of the developed methods use relatively simple features such as composition vector, pseudo amino acid (AA) composition [44], composition of short polypeptides, the sequence itself and other features obtained from the AA sequence [45–71]. Several recent methods use more advanced feature vectors [30, 72–78] which are based on the AA sequence and/or PSSM profile computed using PSI Blast [26]. A recently explored alternative is to construct features based on the predicted secondary structure. This approach was used in the SCPred algorithm [79], which to date provides favorable prediction quality on datasets characterized by the twilight-zone similarity.

A wide range of classification algorithms was used to perform the predictions. They include component coupling [70], neural network [80], Bayesian classifier [81], logistic regression [30, 58, 72, 73], decision tree [46, 54], covariant or linear discriminant algorithm [57, 64, 65, 77, 78], principal component analysis [55], nearest neighbor [52, 67–69], rough sets [49] and support vector machine (SVM) [45, 47, 48, 50, 53, 54, 61, 62, 66, 72, 74, 75, 79, 82]. Recent works also explored more complex classification models such as ensembles [72], bagging [54, 63], and boosting [56, 59, 71]. Overall, we observe that SVM is the most popular and the best-performing classifier for this task [79].

The prediction quality of these methods varies widely depending on the datasets [73]. The methods which were tested on datasets with relatively high sequence identity report accuracies of close to or over 90% [45, 47–50, 53–55, 57, 58, 61–66, 68–72, 74–77]. The tests on datasets characterized by the low, twilight-zone identity show accuracies between 50 and 70% [30, 45, 52, 56, 59–61, 67, 72, 73, 75, 78, 79, 81], with only one approach, namely SCPred, obtaining accuracies of 80% [79]. We concentrate on the latter problems as they are more challenging and have implications in the context of remote homology detection.

The above methods considered only the four major classes from the SCOP database, which was motivated by a relatively small number of proteins in the remaining classes. At the same time, recent years have seen a substantial increase in the size of the SCOP database, which doubled in size between 2003 and 2007 and which currently includes over 100,000 protein domains. Even when considering a small subset of the protein domains in SCOP which is characterized by the twilight-zone similarity, we note that the current SCOP includes a sufficient number of proteins for the smaller three classes to allow for the development of a prediction system.

There are only two methods that addressed prediction of the seven classes [83, 84]. The first method predicts the four main classes and multiple domain, small protein, and peptide classes [84]. This differs from prediction targets of MODAS which additionally considers membrane and cell surface proteins as a part of the peptide class. This method is shown to achieve accuracy of over 90% for a low-identity dataset by using a large library of reference functional sequence motifs from the InterPro database [85]. This resulted in the feature vector with 7,785 features where each feature denotes occurrence of a given motif in the input sequence. Although this method is characterized by good prediction quality, we note that it does not provide a web server, is difficult to implement due to the excessive number of used features, and was not redesigned in spite of the updates in the InterPro database (the current release 19 of InterPro includes 17,412 motifs while the authors used version 6.2 from April 2003). We also note that the usage of such a large number of features results in an ill-defined problem in which the number of classification instances (protein chains) is smaller than the number of features. The second, more recent method [83] uses a complex representation of the protein sequence that includes pseudo AA composition, evolutionary conservation information, and physicochemical properties of AAs, and the SVM classifier to perform predictions. It achieves accuracy of 57.4% for a dataset with the twilight-zone identity. We perform an empirical side-by-side comparison with this method.

Although structural class predictors usually do not consider membrane and multi-domain classes, such predictions could be addressed using methods designed specifically for these classes. We refer the reader to recent review articles concerning methods that are available for the prediction of membrane proteins [86–88] and for the domain prediction [89, 90]. These developments are motivated by the availability of specialized databases for the membrane [91] and multi-domain proteins [92]. The abovementioned methods could discriminate chains in the corresponding class from all other chains, and they could be used to either pre-filter the chains or post-process results of the proposed MODAS method. More specifically, once a given chain is known to be a membrane protein, specialized predictors could be used to further categorize its membrane protein type [14, 24]. Similarly, the predicted multi-domain proteins could be processed by the available methods to predict the domain boundaries [90].

Motivation and goals

All but two existing structural class predictors consider only the four major classes, while the remaining three classes are also important and their prediction should be addressed. For instance, while approximately 20 to 35% of the proteins encoded by an organism's genome are membrane proteins [93], they are not covered in the four major classes. As mentioned above, the main reason for their under-representation in the SCOP database is that they are difficult to crystallize and as a result only a small number of membrane protein structures are known [40]. We also note that the current methods are relatively weak in the context of the sequence representation. Most of the methods compute the representation directly from the sequence, only a handful of them use sequence-derived information such as multiple alignment [75, 79, 83] and predicted secondary structure [79], and there were no attempts to combine residue conservation computed from the alignment and the secondary structure. At the same time, the usage of the predicted secondary structure results in improved prediction quality for the low-identity datasets [79], and numerous prior studies have demonstrated that evolutionary information generated with PSI Blast [26] is more informative than the sequence itself [94–96]. Moreover, most of the existing predictors achieve good quality for datasets with high sequence similarity, while results on the datasets with the twilight-zone pairwise similarity are generally characterized by a relatively low, <70%, accuracy (with the exception of one method that obtains close to 80% accuracy). At the same time, a solution that accommodates low sequence identity could have important applications for the tertiary structure prediction [3, 35, 36]. Finally, the existing methods are fixed to a given set of classes, while a modular design would allow the user to choose how many and which classes should be considered for the prediction. The latter is a particularly attractive feature for a method that would address all 7 classes, i.e., the user could choose which subset of classes, including the four major classes, should be considered for a given prediction. We also observe that current methods use the same feature-based sequence representation for prediction of all classes. In the modular design a separate predictor is created for each class and the results of these predictors are combined together. This allows for the design of a specialized sequence representation for each class.

Our goal is to develop a novel, modular method that predicts the seven structural classes from the protein sequences. The proposed modular approach to structural class prediction (MODAS) exploits sequence and sequence-derived information to generate input for the classifier. More specifically, MODAS is the first to combine both the multiple sequence alignment profiles and the predicted secondary structure to generate features that are fed into a set of seven SVM classifiers. Our design concentrates on datasets that include sequences characterized by low, twilight-zone similarity and we aim at providing prediction quality that is competitive with or better than the quality offered by the existing methods.

Methods

Datasets

We use a total of 7 datasets to design and test the proposed method. We utilized version 1.73 of the ASTRAL database [97], which is a subset of the sequences from the SCOP database characterized by a certain similarity threshold, to derive two datasets. We selected the ASTRAL database with < 20% sequence similarity that includes 6264 sequences, of which 1280 belong to the all-α class, 1324 to all-β, 1495 to α+β, 1527 to α/β, 106 to multi-domain proteins, 138 to membrane and cell surface proteins and peptides, and 394 to the small proteins class. We randomly divided this set into two equal-size subsets, one that was used as the training set (ASTRALtraining) and the second that was used as the test set (ASTRALtest). The ASTRALtraining set was used to design the proposed method, which includes feature and classifier selection and parameterization of the classifiers. The ASTRALtest set was used to perform an independent (from the training set) validation of the proposed method. Both of these datasets are available at http://biomine.ece.ualberta.ca/MODAS/.
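As a rough illustration of the numbers above, the following sketch recomputes the class shares from the reported counts and performs a random split into two equal-size halves. The class labels, the integer identifiers and the fixed seed are placeholders chosen for this example; the article states only that the split was random.

```python
import random

# Class counts reported in the text for the ASTRAL 1.73 (<20% similarity) set.
CLASS_COUNTS = {
    "all-alpha": 1280,
    "all-beta": 1324,
    "alpha+beta": 1495,
    "alpha/beta": 1527,
    "multi-domain": 106,
    "membrane": 138,
    "small": 394,
}


def class_shares(counts):
    """Return the fraction of the dataset that falls into each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def split_in_half(items, seed=0):
    """Shuffle a copy of `items` and cut it into two equal-size halves.

    The seed is an assumption made for reproducibility of this sketch.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


shares = class_shares(CLASS_COUNTS)
# Hypothetical integer identifiers stand in for the actual ASTRAL entries.
training_ids, test_ids = split_in_half(range(sum(CLASS_COUNTS.values())))
```

With these counts the membrane class makes up roughly 2% of the data, and each half holds 3132 sequences.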

We also selected 4 widely used low sequence identity benchmark datasets to provide a comprehensive and unbiased comparison with the existing prediction methods. The D2230 dataset includes 2230 sequences extracted using ASTRAL version 1.63 using 20% identity threshold which was used to test the most recent method for prediction of the 7 classes [83]. We use this dataset to perform a side-by-side comparison with the method by Chen and colleagues [83]. The remaining 3 datasets are used to compare against methods that address prediction of the four major structural classes. The 25PDB dataset, which includes 1673 sequences which share twilight-zone pairwise similarity, was taken from [73] and two datasets D1189 and D675 were taken from [81] and [75] and include 1189 sequences with up to 40% pairwise identity and 675 sequences with up to 30% pairwise identity, respectively. The latter three datasets are the most commonly used benchmark sets that include low identity sequences and they allow for a side-by-side comparison with a wide selection of recent methods for the prediction of the four major structural classes.

Finally, we include one larger benchmark dataset, namely D498, which was proposed in [70] and which includes a set of sequences that were not filtered with respect to their similarity. We include this dataset to demonstrate the quality of the proposed method when compared with a wider range of predictors which were tested on datasets with unspecified sequence identity. We explore the distribution of the sequence identity in this dataset to compare it with the other 6 datasets. For each chain we compute the maximal sequence identity with all remaining sequences in the dataset. We chose the maximal values since the empirical tests are based on the jackknife strategy in which all but one sequence are used to predict the class for the remaining chain. We generate pairwise sequence alignments using the Smith-Waterman algorithm [98] with Gotoh's improvement [99] and for each sequence we report the highest obtained score. The number of matching residues in the alignment is divided by the length of the query sequence including the gaps/insertions; a result of 100% sequence identity means that there were no gaps/insertions and that the query sequence was a substring of one of the sequences in the dataset. Figure 2 shows the distribution of sequences in the D498 dataset based on the sequence identity. Almost 70% of sequences from this dataset have 100% sequence identity and around 89% have identity of above 90%. This means that, using the jackknife test, 89% of the tested sequences have at least one very similar sequence in the training part of the dataset. This explains the higher predictive performance on this dataset when compared with results on the remaining datasets with controlled, low sequence identity (see Results and Discussion section).

Figure 2

Distribution of sequences with respect to their maximal pairwise sequence identity in the D498 dataset.
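The identity measure used for the D498 analysis can be sketched as follows. The sketch starts from alignments that have already been computed (the aligner itself is not reproduced here) and assumes one reading of the denominator: the full query length plus the gaps opened in the query. The function names are hypothetical.

```python
def alignment_identity(query_aln, target_aln, query_len):
    """Identity of one local alignment, relative to the full query.

    query_aln and target_aln are equal-length aligned strings that use '-'
    for gaps. The denominator is the full query length plus the number of
    gaps opened in the query, which is an assumed interpretation of
    "length of the query sequence including the gaps/insertions".
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    matches = sum(
        1 for q, t in zip(query_aln, target_aln) if q == t and q != "-"
    )
    query_gaps = query_aln.count("-")
    return matches / (query_len + query_gaps)


def max_identity(query_len, alignments):
    """Highest identity over the alignments of one query to all other chains."""
    return max(alignment_identity(q, t, query_len) for q, t in alignments)
```

Under this reading a value of 1.0 is reached only when the whole query aligns without gaps and every aligned position matches, which agrees with the substring interpretation given in the text.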

Overall design

The input protein sequence is first processed by PSI Blast to obtain the position specific scoring matrix (PSSM) and by PSI Pred [100] to predict secondary structure. We selected PSI Pred due to its successful application in the SCPred method [79] and since this predictor enjoys a widespread use in prediction of a variety of related structural properties of proteins including template-based tertiary structure prediction [37], and prediction of beta-turns [101], residue depth [102], protein fold [31], and contact orders [103], to name just a few. Next, the sequence, the PSSM and the predicted secondary structure are converted into a set of features that are fed into seven classifiers (user can opt to use a subset of the classifiers), where each classifier corresponds to one of the seven SCOP classes. We performed feature selection to find a suitable set of features for each structural class. We also considered several different classifier types and selected the one that provides the best prediction quality for a given class. The seven classifiers generate a probability of classification into the corresponding class and these probabilities are aggregated to predict the final outcome. The aggregation is based on a simple max operator, i.e., we predict the class that corresponds to the highest probability. Although more complex aggregations could be conceived, this approach is motivated by the necessity to assure modularity of the predictor, i.e., the aggregation should work for every subset of the considered seven classes. The overall design of the proposed MODAS method is shown in Figure 3.

Figure 3

Diagram of the proposed MODAS method.
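The max-operator aggregation described above is simple enough to sketch in a few lines. The class labels and the dictionary-based interface are assumptions made for this example and are not taken from the MODAS implementation.

```python
CLASSES = (
    "all-alpha", "all-beta", "alpha+beta", "alpha/beta",
    "multi-domain", "membrane", "small",
)


def predict_class(probabilities, selected=CLASSES):
    """Return the selected class whose classifier reports the highest probability.

    probabilities: dict mapping a class name to the output of that class's
    classifier. selected: the subset of classes the user wants to consider.
    """
    if not selected:
        raise ValueError("at least one class must be selected")
    return max(selected, key=lambda name: probabilities[name])
```

Because the rule only compares the outputs of the selected classifiers, it works unchanged for any subset of the seven classes, which is the modularity requirement stated in the text.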

The design of the proposed method concerns development and selection of the features which best describe each of the classes and a classifier which provides the best predictive performance. The feature and classifier selection is based on 10-fold cross validation on the ASTRALtraining dataset to assure that the design is independent of the other datasets and, at the same time, that it generalizes to the other datasets. The methods were implemented in Java and we utilized the WEKA workbench [104] in this research.

Feature vector

The three sources of data used to generate the features include the protein sequence, the PSSM matrix generated with PSI Blast and the secondary structure predicted from the sequence using PSI Pred. The PSSM matrices were built using the nr (non-redundant) dataset [105], as of October 2008. The quality of the matrix, and consequently the quality of the proposed method, depends on the size of the dataset used. Prior results demonstrate that larger number of diverse sequences in the database leads to more accurate evolutionary information, which in turn was shown to improve secondary structure predictions [106]. This suggests that subsequent retraining of the MODAS method at a later time using updated, larger nr database may potentially lead to better predictive performance. Besides features that were based on counting individual AAs, the AAs were grouped according to their physicochemical properties including polarity (R group), hydrophobicity, structure-preserving mutations (exchange groups), and their ability to be electron donors or acceptors, see Table 1. We also used these groupings in connection with the predicted secondary structure, i.e., amino acids were grouped based on their secondary structure and a given property. Finally, we considered combining information coming from the predicted secondary structure with the multiple alignments.

Table 1 The property groups used to aggregate similar amino acids.

The features are divided into five sets: 1) features generated directly from the sequence; 2) features computed from the PSSM matrix; 3) features generated by combining information from PSSM and the predicted secondary structure; 4) features obtained from the predicted secondary structure, which are based on the features utilized in the SCPred method [79]; and 5) novel features based on the predicted secondary structure which describe collocation of helical and strand segments.

Features based on the AA sequence (39 features)

These features describe basic characteristics of the input sequence, such as length, AAs composition and composition of property groups. They include:

  • SeqLen - the length of a sequence. (1 feature)

  • Comp_AA_i: the number of AA_i in the sequence (also called composition of AA_i) normalized by the sequence length, where i = 1, 2, ..., 20 and AA_i stands for the ith AA type. (20 features)

  • Comp_GR_{GR_jk}: the number of AAs in the sequence belonging to GR_jk, where j ∈ {R group, Electronic group, Hydrophobicity group, Exchange group} and k is a particular subgroup (e.g., hydrophobic and hydrophilic), see Table 1, normalized by the sequence length. (18 features)
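The sequence-based features listed above can be sketched as follows. The two-way grouping in the example is a hypothetical stand-in used only to show the computation; the actual 18 property groups are those of Table 1.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical two-way grouping used only for illustration; the actual
# property groups are those listed in Table 1 of the article.
EXAMPLE_GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "other": set("DEGHKNPQRSTY"),
}


def sequence_features(seq, groups=EXAMPLE_GROUPS):
    """Compute SeqLen, Comp_AA_i and group compositions for one sequence."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    feats = {"SeqLen": n}
    # Composition of each of the 20 AA types, normalized by sequence length.
    for aa in AMINO_ACIDS:
        feats[f"Comp_AA_{aa}"] = seq.count(aa) / n
    # Composition of each property group, normalized by sequence length.
    for name, members in groups.items():
        feats[f"Comp_GR_{name}"] = sum(1 for aa in seq if aa in members) / n
    return feats
```

Every sequence, whatever its length, is mapped to a vector with the same number of entries, which is what the classifiers require.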

Features based on the PSSM matrix (196 features)

PSI Blast provides two position specific scoring matrices; one contains conservation scores of a given AA at a given position in a sequence, denoted as PSSMcons_lm, and the other provides the probability of occurrence of a given AA at a given position in the sequence, denoted as PSSMprob_lm, where l = 1, 2, ..., SeqLen denotes the position in the sequence and m = 1, 2, ..., 20 denotes one of the substitution positions that correspond to the twenty AAs (columns in the PSSM matrix). We normalized the conservation scores (PSSMcons_lm values) using max-min normalization where min and max equal -8 and 13, respectively. The PSSMprob_lm values are already normalized by PSI Blast. The matrix values were aggregated either horizontally (along m) or vertically (along l) to obtain a fixed-length feature vector. This feature set, which quantifies evolutionary information of individual AA types and grouping of AAs according to the property groups, includes the following features:

  • Ach_CS_{AA_i}: the sum of all normalized PSSMcons_lm values ("Ach_CS" stands for achieved conservation scores), where l includes only positions of AA_i (along the sequence, only the positions of AA_i residues were considered) and m = AA_i (the column that corresponds to AA_i), divided by the sequence length. (20 features)

  • Max_CS_{AA_i}: the sum of maximal, over m, PSSMcons_lm values, where l includes only positions of AA_i, divided by the sequence length. (20 features)

  • Max-Ach_CS_{AA_i}: the sum of differences between the maximal PSSMcons_lm value (over m) and the PSSMcons_li value, where l includes only positions of AA_i and column i corresponds to AA_i (the difference between the maximal and the achieved values), divided by the sequence length. (20 features)

  • Ach_Prob_{AA_i}: the sum of all normalized PSSMprob_lm values ("Ach_Prob" stands for achieved probability of occurrence), where l includes only positions of AA_i and m = AA_i, divided by the sequence length. (20 features)

  • Max_Prob_{AA_i}: the sum of maximal, over m, PSSMprob_lm values, where l includes only positions of AA_i, divided by the sequence length. (20 features)

  • Max-Ach_Prob_{AA_i}: the sum of differences between the maximal PSSMprob_lm value (over m) and the PSSMprob_li value, where l includes only positions of AA_i and column i corresponds to AA_i (the difference between the maximal and the achieved values), divided by the sequence length. (20 features)

  • CSSeq_{AA_i}: the sum of normalized PSSMcons_lm values where l = 1, 2, ..., SeqLen and m = AA_i, divided by the sequence length (average conservation score of AA_i for the whole sequence). (20 features)

  • CSSeq_GR_{GR_jk}: the sum of normalized PSSMcons_lm values where l = 1, 2, ..., SeqLen and m = GR_jk (all AA types that belong to GR_jk), divided by the sequence length. (18 features)

  • Ent_{AA_i}: the entropy of PSSMprob_lm values, for l = 1, 2, ..., SeqLen and m = AA_i. (20 features)

  • Avg_Prob_GR_{GR_jk}: the average of PSSMprob_lm values where l = 1, 2, ..., SeqLen and m = GR_jk (all AA types that belong to GR_jk), divided by the sequence length. (18 features)

The Ach_CS_{AA_i}, Max_CS_{AA_i}, Max-Ach_CS_{AA_i}, Ach_Prob_{AA_i}, Max_Prob_{AA_i}, and Max-Ach_Prob_{AA_i} features aggregate information along the sequence by the AA type. The CSSeq_{AA_i}, CSSeq_GR_{GR_jk}, Ent_{AA_i}, and Avg_Prob_GR_{GR_jk} features aggregate the values along the columns of the PSSM.
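A minimal sketch of four of the conservation-score features follows, written from the verbal definitions above. It assumes that the PSSM columns follow the alphabetical one-letter order in `AMINO_ACIDS` and that the normalized scores are used for all four features; both are assumptions of this example, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed column order of the PSSM; real PSI Blast output may differ.
COLUMN = {aa: m for m, aa in enumerate(AMINO_ACIDS)}


def normalize_cons(score, lo=-8, hi=13):
    """Max-min normalization of a raw conservation score to [0, 1]."""
    return (score - lo) / (hi - lo)


def pssm_features(seq, pssm_cons):
    """Compute Ach_CS, Max_CS, Max-Ach_CS and CSSeq for every AA type.

    seq: protein sequence; pssm_cons: one row of 20 raw conservation
    scores per sequence position.
    """
    n = len(seq)
    if n == 0 or len(pssm_cons) != n:
        raise ValueError("need one PSSM row per residue of a non-empty sequence")
    norm = [[normalize_cons(v) for v in row] for row in pssm_cons]
    feats = {}
    for aa in AMINO_ACIDS:
        # Rows at the positions where this AA type occurs in the sequence.
        rows = [norm[l] for l in range(n) if seq[l] == aa]
        achieved = sum(row[COLUMN[aa]] for row in rows)
        maximal = sum(max(row) for row in rows)
        feats[f"Ach_CS_{aa}"] = achieved / n
        feats[f"Max_CS_{aa}"] = maximal / n
        feats[f"Max-Ach_CS_{aa}"] = (maximal - achieved) / n
        # Column-wise aggregate over all positions of the sequence.
        feats[f"CSSeq_{aa}"] = sum(row[COLUMN[aa]] for row in norm) / n
    return feats
```

The probability-based counterparts (Ach_Prob, Max_Prob, Max-Ach_Prob) would follow the same pattern on the PSSMprob matrix without the normalization step.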

Features based on the PSSM matrix in combination with the predicted secondary structure (486 features)

The third feature set is analogous to the features based on the PSSM matrix, but instead of aggregating the values by AA type, they are aggregated either by the type of the secondary structure predicted with PSI Pred or by the combination of the AA type and the predicted secondary structure. These features quantify conservation of predicted secondary structures, as well as the conservation for individual AA types and grouping of AAs according to the property groups that are in a given predicted secondary structure. This feature set consists of:

  • Ach_CS_{AA i }, Max_CS_{AA i }, Max-Ach_CS_{AA i }, Ach_Prob_{AA i }, Max_Prob_{AA i }, and Max-Ach_Prob_{AA i } are redefined as Ach_CS_{SS n }, Max_CS_{SS n }, Max-Ach_CS_{SS n }, Ach_Prob_{SS n }, Max_Prob_{SS n }, and Max-Ach_Prob_{SS n }, respectively, where instead of using 20 AA i we aggregate by the predicted three state secondary structure SS n = {H, E, C}. (6*3 = 18 features)

  • Ach_CS_{AA i }, Max_CS_{AA i }, Max-Ach_CS_{AA i }, Ach_Prob_{AA i }, Max_Prob_{AA i }, and Max-Ach_Prob_{AA i } are redefined as Ach_CS_{SS n }_{AA j }, Max_CS_{SS n }_{AA j }, Max-Ach_CS_{SS n }_{AA j }, Ach_Prob_{SS n }_{AA j }, Max_Prob_{SS n }_{AA j }, and Max-Ach_Prob_{SS n }_{AA j }, respectively, where we aggregate PSSMcons lm /PSSMprob lm values by l that corresponds to positions of AA j that are predicted as SS n . (6*3*20 = 360 features)

  • CSSeq_GR_{GR jk } and Avg_Prob_GR_{GR jk } are redefined as CSSeq_GR_{GR jk }_SS_{SS n } and Avg_Prob_GR_{GR jk }_SS_{SS n }, respectively, where we aggregate PSSMcons lm /PSSMprob lm values by l that corresponds to a given SS n . (2*3*18 = 108 features)

Features based on the predicted secondary structure (144 features)

The fourth feature set, which was computed based on the output of PSI Pred, describes the content of the predicted secondary structures and distribution of the predicted secondary structures segments aggregated based on segments length and by grouping of AAs according to the property groups. This set consists of:

  • Content_{SS n }: the number of residues predicted as SS n where l = 1, 2...SeqLen, divided by the sequence length. (3 features)

  • Content_{SS n }_GR_{GR jk }: the number of residues predicted as SS n and that belong to GR jk where l = 1, 2...SeqLen, divided by the sequence length. (3*18 = 54 features)

  • SegCount_{E,H}_L{L i }: the number of helix or strand segments which contain at least L i = 2, 3, .. 20 AAs, divided by the total number of helix and strand segments in the input protein chain. (2*19 = 38 features)

  • SegCount_C_L{L i }: the number of coil segments which contain at least L i = 2, 3, .. 20 AAs, divided by the number of all segments in a protein (i.e., the sum of all coil, helix and strand segments). (19 features)

  • SegCount_{E,H}_P{P i }: the number of helix or strand segments which contain at least P i AAs where P i = 2,4,..,10% of the sequence length, divided by the total number of helix and strand segments in the input protein chain. (2*5 = 10 features)

  • SegCount_C_P{P i }: the number of coil segments which contain at least P i AAs where P i = 2,4,..,10% of the sequence length, divided by the number of all segments. (5 features)

  • NormSegCount_{SS n }: the total number of SS n segments divided by the total number of all secondary structure segments in the input protein chain. (3 features)

  • MaxSegLength_{SS n } = max Len(SEG: SEG(SS n )), the maximal SS n segment length. (3 features)

  • NormMaxSegLength_{SS n }: the maximal SS n segment length divided by the sequence length. (3 features)

  • AvgSegLength_{SS n } = avgLen(SEG: SEG(SS n )), the average SS n segment length. (3 features)

  • NormAvgSegLength_{SS n }: the average SS n segment length divided by the sequence length. (3 features)
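As an illustration of the content and segment features listed above, the following minimal sketch computes a few of them from a three-state secondary structure string. The function names, the feature-name formatting, and the example string are assumptions made for this illustration; they are not taken from the MODAS implementation.

```python
from itertools import groupby

def segments(ss):
    """Split a three-state string (H, E, C) into (state, length) runs."""
    return [(state, len(list(run))) for state, run in groupby(ss)]

def ss_features(ss, min_len=2):
    """Compute a few content and segment features from a predicted SS string."""
    segs = segments(ss)
    feats = {}
    for s in "HEC":
        lengths = [length for state, length in segs if state == s]
        feats[f"Content_{s}"] = ss.count(s) / len(ss)
        feats[f"NormSegCount_{s}"] = len(lengths) / len(segs)
        feats[f"MaxSegLength_{s}"] = max(lengths, default=0)
        feats[f"AvgSegLength_{s}"] = sum(lengths) / len(lengths) if lengths else 0.0
    # Helix/strand segments with at least min_len residues, normalized by
    # the total number of helix and strand segments in the chain.
    n_helix_strand = sum(1 for state, _ in segs if state in "HE")
    for s in "HE":
        long_segs = sum(1 for state, length in segs
                        if state == s and length >= min_len)
        feats[f"SegCount_{s}_L{min_len}"] = (
            long_segs / n_helix_strand if n_helix_strand else 0.0
        )
    return feats

example = "CCHHHHHHCCEEEECCCHHHC"  # made-up prediction, not real PSIPRED output
feats = ss_features(example, min_len=4)
```

Calling ss_features with different min_len values (2 through 20) would produce the family of SegCount_{E,H}_L{L i } features for one chain.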

Features based on the collocation of helix and strand segments in the predicted secondary structure (127 features)

The four main structural classes are based on the content and relative spatial position of the secondary structures. The preferred way to represent these collocations of the secondary structures would be to use 3D protein structure. However, since our input is only the sequence, we approximate this information using features that quantify collocation of helices (H) and strands (E) in the predicted secondary structure. We use the predicted secondary structure to annotate helix, coil and strand segments and to compute relative position of these segments in the sequence. The following features are computed:

  • HH = count(HH), the number of helix-coil-helix motifs (two helices separated by a coil) divided by the total number of the secondary structure segments in a protein. (1 feature)

  • EE = count(EE), the number of strand-coil-strand motifs (two strands separated by a coil) divided by the total number of the secondary structure segments in a protein. (1 feature)

  • HE = count(HE) + count(EH), the number of strand-coil-helix or helix-coil-strand motifs (strand and helix separated by a coil) divided by the total number of the secondary structure segments in a protein. (1 feature)

  • {HH,HE,EE}_L{L i }: the number of helix-coil-helix, helix-coil-strand/strand-coil-helix, or strand-coil-strand motifs which include at least L i = 2, 3, .., 20 residues in the middle coil, divided by the total number of the secondary structure segments in a protein. (3*19 = 57 features)

  • {HH,HE,EE}_P{P i }: the number of helix-coil-helix, helix-coil-strand/strand-coil-helix, or strand-coil-strand motifs which include at least P i = 2, 4, .., 10% of a sequence length residues in the middle coil, divided by the total number of the secondary structure segments in a protein. (3*5 = 15 features)

  • MaxHCH = max(HC..H: count(H)), the maximal number of helices among all helix-coil-helix-coil...coil-helix motifs, i.e., the maximal number of helix segments separated only by coils. (1 feature)

  • MaxECE = max(EC..E: count(E)), the maximal number of strands among all strand-coil-strand-coil...coil-strand motifs, i.e., the maximal number of strand segments separated only by coils. (1 feature)

  • AvgHCH: the average number of helices in all helix-coil-helix-coil...coil-helix motifs, divided by the total number of the secondary structure segments in a protein. (1 feature)

  • AvgECE: the average number of strands in all strand-coil-strand-coil...coil-strand motifs, divided by the total number of the secondary structure segments in a protein. (1 feature)

  • HCH_L{L i }: the number of helix-coil-helix-coil...coil-helix motifs with more than L i = 1, 2, .., 20 helices, divided by the total number of the secondary structure segments. (19 features)

  • HCH_P{P i }: the number of helix-coil-helix-coil...coil-helix motifs with more than P i = 2, 4, .., 10% of all helices in a protein, divided by the total number of the secondary structure segments. (5 features)

  • ECE_L{L i }: the number of strand-coil-strand-coil...coil-strand motifs with more than L i = 1, 2, .., 20 strands, divided by the total number of the secondary structure segments. (19 features)

  • ECE_P{P i }: the number of strand-coil-strand-coil...coil-strand motifs with more than P i = 2, 4, .., 10% of all strands in a protein, divided by the total number of the secondary structure segments. (5 features)
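The collocation features above can be sketched in a few lines once the predicted secondary structure is collapsed into a sequence of segments. The code below is only an illustration under stated assumptions: the function name and example strings are hypothetical, and a lone helix or strand is counted as a motif of size one for MaxHCH and MaxECE, which the text does not specify.

```python
from itertools import groupby

def collocation_features(ss):
    """HH, EE, HE, MaxHCH and MaxECE from a three-state (H, E, C) string."""
    states = [state for state, _ in groupby(ss)]  # one symbol per segment
    total = len(states)
    counts = {"HH": 0, "EE": 0, "HE": 0}
    # Scan every (segment, coil, segment) triple.
    for left, mid, right in zip(states, states[1:], states[2:]):
        if mid == "C" and left in "HE" and right in "HE":
            key = left + right if left == right else "HE"
            counts[key] += 1
    feats = {name: c / total for name, c in counts.items()}
    # Longest chain of same-type segments separated only by coils.
    for s, name in (("H", "MaxHCH"), ("E", "MaxECE")):
        best = run = 0
        for st in states:
            if st == s:
                run += 1
                best = max(best, run)
            elif st != "C":  # the other structured type breaks the chain
                run = 0
        feats[name] = best
    return feats

mixed = collocation_features("CCHHHHHHCCEEEECCCHHHC")          # made-up input
segregated = collocation_features("HHHHCCHHHHCCHHHHCEEEECCEEEE")  # made-up input
```

In the first made-up chain helices and strands alternate, so only HE motifs are found; in the second they are segregated, so HH and EE dominate, which mirrors the distinction between the α/β and α+β classes described in the text.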

Feature and classifier selection

Feature selection was performed to select the best subset of the considered features for each structural class. This is motivated by the fact that while the considered features are generic, the individual structural classes are likely characterized by a small and specific set of descriptors. In other words, while the features describe the sequence, conservation of residues and predicted secondary structure for every protein in the same way, the structural classes can be described by a subset of these features, i.e., for a specific class some features could be irrelevant and should be discarded to improve the efficiency of the prediction model. We considered a comprehensive set of eight feature selection methods which include four methods that select feature sets and four methods that perform feature ranking. The first group includes consistency subset selection [107], wrapper-based feature selection with Naïve Bayes and SVM classifiers [108], and Correlation-based Feature Subset selection [109] (CFS) methods. The latter group includes a filter-based ReliefF algorithm [110], and three methods that perform ranking based on Symmetrical Uncertainty [111], Chi-Squared (the chi-squared statistic with respect to the class) and Gain Ratio (measure based on entropy with respect to the class) criteria. The feature selection was performed based on tenfold cross validation on the ASTRALtraining dataset. In the case of the methods that select feature sets, individual features were ranked based on the number of folds in which they were selected. For the ranking methods the features were ranked based on their average rank over the ten folds.

We considered four classifiers which are based on complementary model types: nonlinear kernel-based SVM [112], probabilistic Naïve Bayes [113], linear Logistic regression [114], and instance-based k-Nearest Neighbor [115] (k-NN) with k = 3. The selection was also motivated by their prior successful applications in the context of the structural class predictions, i.e., Naïve Bayes based classifier was used in [81], logistic regression in [30, 58, 72, 73], nearest neighbor in [52, 67–69], and SVM in [45, 47, 48, 50, 53, 54, 61, 62, 66, 72, 74, 75, 79, 82].

The quality of the prediction was reported using several measures including overall accuracy (the number of correct predictions divided by the total number of test sequences), accuracy for each structural class (number of correct predictions for a given class divided by the number of sequences from this class), Matthews's correlation coefficient (MCC) for each structural class, and generalized squared correlation (GC2). The MCC values range between -1 and 1, where 0 represents random correlation, and bigger positive (negative) values indicate better (lower) prediction quality for a given class. Since MCC works only for binary classification, we also reported GC2, which is based on χ2 statistics. The GC2 values range between 0 and 1, where 0 corresponds to the worst classification (all predictions are incorrect) and 1 corresponds to the perfect classification. MCC and GC2 are described in detail in [116]. During the design we selected a classifier/feature subset combination that provides the best MCC value for a given class. We used MCC since this measure, in contrast to accuracy, takes into account the unbalanced nature of the datasets, i.e., while high accuracy could be obtained for a default classification in which small class is ignored (only large class is predicted), positive MCC values assure that both small and large classes are correctly predicted.
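The argument for preferring MCC over accuracy on unbalanced data can be checked with a small sketch. The function below computes the per-class accuracy as defined above and the standard one-vs-rest MCC; the class labels and counts are made up for illustration, and returning 0 when the MCC denominator vanishes is a common convention assumed here rather than something stated in the text.

```python
from math import sqrt

def per_class_scores(y_true, y_pred, target):
    """Per-class accuracy and one-vs-rest MCC for the class `target`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    tn = sum(t != target and p != target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    accuracy = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    return accuracy, mcc

# Made-up unbalanced data: 98 chains of a large class, 2 of a small one.
y_true = ["large"] * 98 + ["small"] * 2
default_pred = ["large"] * 100  # default classifier that ignores the small class
overall_acc = sum(t == p for t, p in zip(y_true, default_pred)) / len(y_true)
small_acc, small_mcc = per_class_scores(y_true, default_pred, "small")
perfect_acc, perfect_mcc = per_class_scores(y_true, y_true, "small")
```

Here the default classifier reaches 98% overall accuracy while its MCC for the small class is 0, which is the behavior the selection criterion is meant to penalize.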

For each structural class and each of the four considered classifiers we used the output of each of the eight feature selection methods to find the best subset of features, i.e., the subset of features that provides the highest MCC value for a given classifier. For the four selection methods that generate subsets of features, we considered different subsets based on the number of folds in which a given feature was selected. In other words, for each of the four methods we generated subsets of features that were included in at least 1 cross validation fold, at least 2 folds, ..., and at least 10 folds (total of 4 × 10 = 40 feature sets). In the case of the four feature ranking methods, we started with the highest ranked features and kept adding subsequent features as long as the MCC value for a given classifier kept increasing (total of 4 feature sets). Finally, for each of the 44 feature sets we compared results of the tenfold cross validation test on the ASTRALtraining dataset using each of the classifiers to select the setup with the highest MCC for a given structural class.
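The two ways of generating candidate feature sets can be sketched as follows. Both function names, the toy fold outputs, and the toy scoring function are hypothetical; in the described design the score would be the cross-validated MCC of a given classifier.

```python
from collections import Counter

def fold_count_subsets(selected_per_fold):
    """Subsets of features selected in at least k folds, for k = 1..n_folds."""
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    n_folds = len(selected_per_fold)
    return {
        k: sorted(f for f, c in counts.items() if c >= k)
        for k in range(1, n_folds + 1)
    }

def grow_ranked_subset(ranked, score):
    """Add ranked features one at a time while the score keeps increasing."""
    best, chosen = float("-inf"), []
    for feat in ranked:
        candidate = chosen + [feat]
        s = score(candidate)
        if s <= best:  # stop at the first feature that does not help
            break
        best, chosen = s, candidate
    return chosen, best

# Toy output of a subset-selection method over three folds.
subsets = fold_count_subsets([["a", "b"], ["a"], ["a", "c"]])

# Toy score that depends only on subset size (stand-in for cross-validated MCC).
toy_score = lambda subset: {1: 0.50, 2: 0.60, 3: 0.55, 4: 0.70}[len(subset)]
chosen, best = grow_ranked_subset(["f1", "f2", "f3", "f4"], toy_score)
```

Note that the greedy stopping rule ends at the first drop in the score, so in the toy example the later, higher score for four features is never reached.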

We note that although Naïve Bayes, logistic regression and k-NN do not require parameterization, SVM is sensitive to parameterization. We used SVM with linear kernel and cost parameter C set to 1 to find the best feature set for each structural class (this default setup allows for fast generation of the model), and later we used two different kernels, polynomial and RBF, and different values of C to parameterize the SVM for the selected feature sets. We performed a grid search (considering values of C and γ for the RBF kernel, and values of C and exponent for the polynomial kernel) and selected the configurations that provide the highest MCC values for the tenfold cross validation on the ASTRALtraining dataset.

Our resulting design shows that the best results for all seven classes were obtained with the SVM classifier. This is consistent with the successful prior use of this classifier for the prediction of the four major structural classes [45, 47, 48, 50, 53, 54, 61, 62, 66, 72, 74, 75, 79, 82]. Table 2 summarizes the selected classifiers, i.e., it lists the results of the parameterization of the SVM classifier, and the feature selection methods together with the number of the selected features for each of the seven considered structural classes. We observe that usage of a variety of feature selection methods was proven beneficial since five out of eight of them were used to derive the final feature sets.

Table 2 Results of the feature and classifier selection for the considered seven structural classes.

Classification

Once the user selects the classes that (s)he would like to consider, the input sequence is converted into the feature space and the corresponding feature sets are passed to the classifiers for each of the selected classes. Each of the classifiers returns a probability that the input sequence belongs to a given class. The prediction corresponds to the class that is associated with the highest probability. This type of aggregation allows the user to select any combination of the classes.
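A minimal sketch of this aggregation step is shown below. The per-class classifiers are represented by hypothetical callables that return a probability; in MODAS each of them would be an SVM that receives its own class-specific feature subset, which is not modeled here.

```python
def predict_class(features, classifiers, selected_classes):
    """Return the selected class with the highest probability, and all scores."""
    probs = {c: classifiers[c](features) for c in selected_classes}
    return max(probs, key=probs.get), probs

# Hypothetical stand-ins for the per-class classifiers (fixed outputs).
toy_classifiers = {
    "all-alpha": lambda f: 0.2,
    "all-beta": lambda f: 0.7,
    "membrane": lambda f: 0.9,
}

two_class, _ = predict_class({}, toy_classifiers, ["all-alpha", "all-beta"])
three_class, _ = predict_class({}, toy_classifiers, list(toy_classifiers))
```

Because the classifiers are independent of each other, restricting the prediction to a subset of classes only changes which probabilities are compared and requires no retraining.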

Results and Discussion

This section discusses the selected feature sets, reports results of the proposed MODAS method on the independent test set ASTRALtest and compares them with results provided by several competing solutions, and compares the results of the proposed method and over two dozen existing methods for the prediction of the structural classes on five benchmarking datasets including D2230, 25PDB, D1189, D675, and D498. We emphasize that all considered datasets, except D498, are characterized by twilight-zone pairwise sequence similarity (which is also true for the pair of the ASTRALtest and ASTRALtraining datasets). We report the overall accuracy, accuracies and MCC values for each structural class, and the GC2 values.

Discussion of the selected features

The selected features are summarized in Tables 3 and 4. The former table shows the number of selected features for each of the five feature sets and for each structural class. The latter table presents details related to features computed from the predicted secondary structure, focusing on different types of the secondary structures.

Table 3 Number of features selected for each structural class for different categories of features.
Table 4 Number of the selected features for the features computed from the predicted secondary structure.

We observe that only a few sequence-based features are used by the proposed MODAS method. More specifically, although this feature set includes 39 features, only between 0 and 8 of them are used by the seven classifiers. The most frequently used source of information is the PSSM in combination with the predicted secondary structure. For almost all classes, including all-α, all-β, α+β, multi-domain, membrane and small proteins, over half of the features are computed using PSSM. This confirms that the conservation of the residues provides higher quality information than their presence. In the case of the remaining α/β class the majority of features are based on the predicted secondary structure. We also note that a few other classes, such as all-α, α+β, multi-domain and membrane proteins, heavily utilize the information concerning the predicted secondary structure in connection with the PSSM. The popularity of the features derived from the secondary structure stems from the fact that the structural classes are de facto defined based on the secondary structures.

The predictor for the all-α class uses a large number of features from PSSM and PSSM combined with the predicted secondary structure. This shows that residue conservation is an important factor that distinguishes between all-α and other classes. We also observe that these features utilize information about both helix and strand segments, where the strand segments are likely used to indicate non all-α proteins. Finally, this feature set includes 8 features based on the helix-coil-helix motifs that occur in virtually all proteins from this class.

Most of the features for the all-β class are again based on the PSSM. This feature set also includes features that quantify the amount of helix-coil-helix (likely to exclude non all-β classes) and strand-coil-strand segments (which are specific to the proteins from the all-β class) and a relatively large number of coil-based features. The latter is likely due to the fact that proteins from the all-β class include a relatively large number of β-sheets, which incorporate a larger number of coils (when compared with other classes) that connect the individual strand segments that make up the β-sheet.

The α/β class incorporates a relatively large number of features that quantify the occurrence of the helix-coil-strand and strand-coil-helix motifs. This agrees with the definition of this class that incorporates structures in which helices and strands are interspersed. Such spatially scattered secondary structures are likely to also alternate in the sequence.

The largest number of features was selected for the α+β class. This is likely because this class is the hardest to predict among the four major classes, e.g., 17 out of 18 structural class prediction methods that were recently compared in [79] provide the lowest prediction quality for this class when compared with the predictions for the all-α, all-β and α/β classes. Most of the features utilized by the α+β classifier are based on the PSSM combined with the predicted secondary structure. All of the features that exploit collocation of the helix and strand segments are based on either collocation of helix (helix-coil-helix) or strand (strand-coil-strand) segments. This is motivated by the definition of this class that includes proteins in which secondary structures are segregated.

The multi-domain proteins have structures that combine characteristics of the four major structural classes since different domains may fold into structures characteristic of different classes. This is likely the reason why this class uses a relatively equal number of features coming from different sources, like the PSSM and the predicted secondary structure, and why the secondary structure based features equally cover all three structure types (coils, strands and helices).

The membrane proteins include long transmembrane α-helices and this is likely the reason why the corresponding classifier makes use of 24 out of 46 features that are based on the predicted helices. As in the case of most of the other classes, features used to classify membrane proteins also heavily rely on the residue conservation.

We note that although the small proteins class includes short protein chains, the feature that measures the sequence length was not selected for the corresponding classifier. This is likely because several other classes also include short chains, but their secondary structure fits the definition of a given class rather than being composed mostly of coils, which is characteristic of the small proteins class. The features for this class come from different sources including the sequence, the PSSM and the predicted secondary structure. We observe that helix/strand collocation based features were not selected for this class; again, this is likely because these proteins are mostly composed of coils.

We also discuss the most useful features for prediction of each of the considered seven structural classes. We select two representative features for each class and use a scatter plot of their values to explain their relation with the classes. The selection of the features is based on their correlation with the classes (which should be high) and correlation with each other (which should be relatively low to limit their overlap). The first feature was selected based on the largest values of its biserial correlation with the class labels (a given class vs. the remaining classes). The remaining features were ranked based on their biserial correlation coefficients and the top ranked feature for which the Pearson correlation coefficient with the first feature is smaller than 0.7 was selected as the second feature. We also compare the scatter plots for these two features with the scatter plots when using helix and strand content to discriminate between the classes. This is motivated by the fact that some older structural class assignment methods performed the class assignment using the secondary structure content rather than the spatial arrangement of the secondary structures which comes from the tertiary structure [43]. Figure 4 presents the corresponding 14 scatter plots.
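The selection of the two representative features can be sketched as below. This is an illustration under several assumptions: the point-biserial coefficient (Pearson correlation with a 0/1 class label) is used as a stand-in for the biserial correlation, absolute values are compared, and the feature names and values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def pick_two_features(columns, labels, max_overlap=0.7):
    """Top feature by correlation with the labels, plus the best-ranked
    remaining feature whose correlation with the first is below max_overlap."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], labels)),
                    reverse=True)
    first = ranked[0]
    for name in ranked[1:]:
        if abs(pearson(columns[name], columns[first])) < max_overlap:
            return first, name
    return first, None

labels = [1, 1, 1, 0, 0, 0]  # 1 = target class, 0 = remaining classes
columns = {                  # made-up feature values for six chains
    "feat_a": [0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
    "feat_b": [0.9, 0.8, 0.95, 0.1, 0.2, 0.15],  # nearly redundant with feat_a
    "feat_c": [0.6, 0.2, 0.7, 0.3, 0.5, 0.1],
}
pair = pick_two_features(columns, labels)
```

In this toy data feat_b ranks second but is skipped because it overlaps too strongly with feat_a, so the weaker but less redundant feat_c is chosen as the second feature.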

Figure 4

Scatter plots for two representative features for each structural class (left column) and helix and strand contents (right column) for a) all-α; b) all-β; c) α/β; d) α+β; e) multi-domain; f) membrane and cell surface proteins; and g) small proteins classes. The plots were computed on the ASTRALtraining dataset and they use markers with colors and shapes that indicate the class and number of protein chains for a given combination of the values of the two features, respectively. The larger the marker, the more chains are found for the corresponding values of the two features. The darker the shading of the marker, the larger the fraction of the chains that correspond to the target class for the given values of the two features.

The two representative features for the all-α class are CSSeq_SS_E_C (normalized conservation scores for the substitution into Cys for AAs that were predicted as strands) and SegCount_H_L16 (normalized count of long helical segments in the predicted secondary structure). We observe that proteins with high SegCount_H_L16 values and proteins with low values of CSSeq_SS_E_C likely belong to the all-α class. This is supported by the fact that all-α proteins are characterized by significant helix content and thus they include a relatively large number of long helices. The CSSeq_SS_E_C feature shows that all-α proteins include virtually no strands in which Cys is conserved. Costantini and colleagues have observed that Cys has a strong propensity to form strands and is more prevalent among the proteins from the all-β class [117], and thus proteins that include strands with conserved Cys are unlikely to belong to the all-α class. The right-hand-side plot in Figure 4a shows that the all-α proteins are characterized, as expected, by a high content of helices and a low content of strands. At the same time, we note that some non all-α proteins (right lower corner of the scatter plot) could be misclassified using these criteria, which shows that the two representative features used in the proposed method likely provide better discriminatory power.

The two features selected for the all-β class (Figure 4b) include HE (the number of strand-coil-helix or helix-coil-strand motifs in the predicted secondary structure) and CSSeq_SS_H_A (normalized conservation scores for the substitution into Ala for AAs that were predicted as helices). The proteins from this class have low CSSeq_SS_H_A and medium to low HE values for chains for which CSSeq_SS_H_A values are close to zero. The HE feature is motivated by the fact that all-β proteins include relatively large number of strands and a low number of helices and thus strand-coil-helix or helix-coil-strand motifs are less likely to occur in these proteins. The CSSeq_SS_H_A feature shows that the all-β class includes chains that have very few helices with conserved Ala. This is supported by the work in [117] which shows that Ala has strong propensity to form helices and occurs relatively more frequently in proteins from the all-α class, which suggests that chains that include helices with conserved Ala are unlikely to belong to the all-β class.

The proteins from the α/β class are characterized by average values of AvgHCH (the average number of helices in all helix-coil-helix-coil...coil-helix motifs in the predicted secondary structure) and high values of HE_L2 (the number of helix-coil-strand or strand-coil-helix motifs which include at least 2 residues in the middle coil) features. The HE_L2 feature indicates that the proteins from this class have the helices and strands interspersed in the sequence, and AvgHCH shows that they include neither secondary structures without consecutive helices nor ones with many consecutive helices. The latter shows that the α/β class includes proteins with helices, but they are less likely to form long helix-coil-helix-coil...coil-helix motifs.

The two representative features for the α+β class include SegCount_E_L2 (the number of strand segments which contain at least 2 AAs) and ECE_L2 (the number of strand-coil-strand-coil...coil-strand motifs with more than 2 strands). Proteins from this class have average to high values of both features, which is motivated by the observation that they have strands (the SegCount_E_L2 feature excludes beta bridges and includes extended strands that likely form sheets) and the strands and helices are segregated, i.e., that strands co-occur closely in the sequence, which results in high values of ECE_L2. We observe that usage of the content leads to a significant overlap between the proteins from the α/β and α+β classes, see the right-hand-side plots in Figures 4c and 4d. At the same time, the proposed method uses different features for different classes, which can potentially provide better discrimination between these two classes when compared to using the content. The representative features for the α/β and α+β classes quantify the spatial relation of the helices and strands (which is done based on their co-occurrence close by in the sequence) which, in our opinion, better captures the characteristics of these two classes when compared with the content.

The scatter plot for the multi-domain proteins class shows no clear trends since the number of proteins in this class is small, only 53 out of 3132 in the ASTRALtraining dataset, and since the best feature for this class has a relatively small biserial correlation value of 0.12. This is likely due to the significant overlap between this class and other classes, i.e., individual domains in these proteins belong to different structural classes. We observe that proteins from this class have a relatively high value of SegCount_C_P4 (the number of coil segments whose length is at least 4% of the sequence length) combined with a low value of SegCount_E_P4 (the number of strand segments whose length is at least 4% of the sequence length). This suggests that on average they include longer coil segments and few or no longer strands when compared with other classes. We note that similar structures occur also for chains from other classes, i.e., the markers in Figure 4e have only relatively light shading. We also observe that the usage of helix and strand contents results in the scatter plot with even lighter shading of the markers.

The membrane and cell surface proteins are best described using CSSeq_GR_R_PolarCharged (sum of the normalized conservation scores for the substitution into polar charged residues that include Asp, Glu, His, Lys, and Arg) and Max-Ach_Prob_SS_H_G (the difference between the maximal and the achieved probability of the occurrence of Gly residues predicted as helices) features (Figure 4f). These proteins are characterized by high values of Max-Ach_Prob_SS_H_G, which is motivated by the inclusion of transmembrane helices [118] and by frequent presence of Gly in membrane proteins [119]. This class is also associated with medium to low values of CSSeq_GR_R_PolarCharged, which is supported by prior research that shows that Asp, Arg, Lys, Gln, Asn, Glu, Pro, Ser, Thr, Gly, and His are characterized by low (in descending order) propensity to form membrane regions based on the membrane propensity scale from [120]; in other words, the existence of the conserved residues of the above type suggests that the corresponding chain is less likely to be associated with the membrane regions in the protein chain.

Lastly, the high values of Comp_C (content of the predicted coils) together with above average values of Comp_GR_E_Neutral (composition of the neutral residues that include Cys, Gly, His, Ser, and Trp) features are shown to be associated with the small proteins class. The former agrees with the strand and helix content scatter plot (see right-hand-side plot in Figure 4g) that shows that small proteins usually include only a few helix and strand structures. According to Costantini and coworkers Gly, His, and Ser are shown to be among the amino acids with high propensity to form coils [117], which is a likely reason why Comp_GR_E_Neutral feature was selected.

Results for the independent test set ASTRALtest

The proposed prediction system was trained using the ASTRALtraining dataset and tested using the ASTRALtest database. A summary presented in Table 5 shows results for three configurations of the proposed MODAS method that include prediction of the four major classes, six classes that exclude the small proteins class, and prediction of all seven classes. For each setup we use only the instances from the selected classes to perform the test.

Table 5 Experimental results for the test on the independent ASTRALtest dataset for the proposed MODAS method that considers the 4 major structural classes, 6 classes excluding the small proteins class, and all 7 considered classes.

The results show that the accuracy is around 83% for the prediction of the four major classes and close to 80% when considering the 7 classes. This moderate drop in accuracy is attributed to the predictions for the multi-domain proteins class which obtains the lowest accuracies. We note that positive MCC values indicate that the proposed model provides predictions that are always better than random. Most importantly, in spite of the twilight zone similarity between training and testing sequences we observe that the proposed method is characterized by good performance for all classes except the multi-domain proteins class, which is supported by the MCC and GC2 values of above 0.6 and 0.5, respectively. The all-α class is the easiest to predict. The corresponding predictions for all three configurations are characterized by accuracy of above 91% and MCC of 0.89 or higher. The predictions for the α/β and all-β classes have similar quality with accuracies ranging between 82 and 85% and MCC between 0.75 and 0.79. The predictions of the small proteins class are also characterized by a relatively high accuracy and MCC. We observe that inclusion of this class, see the results for the 6 and 7 classes in Table 5, results in a slight drop in the quality of the prediction of the all-α, all-β, and α+β classes. This suggests existence of an overlap between these classes and the small proteins class. The relatively poor scores for the multi-domain proteins class are likely due to the small size of this class and since proteins from this class consist of domains that likely belong to different structural classes. Although the accuracy of the prediction of the membrane proteins is at 58%, we emphasize that relatively high MCC value of 0.75 indicates that the proposed method performs well for this class. The results for this class should be considered successful given that this class is significantly underrepresented in the datasets, i.e., membrane proteins account for only 2.2% of proteins in both the ASTRALtraining and the ASTRALtest databases.

We also compare the results obtained by the proposed MODAS method on the ASTRALtest dataset with the results of two recent representative methods that were designed to work with low identity sequences, SCPred [79] and SCEC [75]. Both of these methods use SVM to perform predictions and they are shown to provide favorable prediction quality when compared with other existing structural class predictors (see results in the "Comparison with the existing structural class predictors" section). SCPred is the only existing method that uses predicted secondary structure to predict the structural classes and SCEC uses PSSM to compute the predictions. These two methods predict only the four major classes and thus we compare the performance considering only these classes. We removed sequences from the three minor classes and sequences with fewer than 30 residues from the training and test sets since SCEC cannot provide predictions for such short chains. The SCPred algorithm was trained on the original 25PDB dataset, as was done by the authors of this method [79], and we also retrained this method using the ASTRALtraining dataset. In the case of the SCEC algorithm we used the corresponding web server to perform predictions. We assumed that the user of the MODAS system may not know how many classes should be considered in the test and thus we included the results when prediction was made for only the 4 major classes, the 6 classes (excluding the small proteins), and all 7 classes. The results are presented in Table 6.

Table 6 Results of the experimental comparison of the proposed MODAS method and the competing SCEC and SCPRED methods on the ASTRALtest dataset with the four major structural classes.

The MODAS method is shown to provide favorable quality for the prediction of the 4 classes. The quality of the results generated by the proposed method is slightly lower when using predictions that consider more classes, but the overall accuracy and GC2 are still higher than those provided by both competing solutions, even when using the model that predicts all 7 classes. The accuracy improvements of the best MODAS model that predicts 4 classes over the best results from other methods equal 0.9%, 4.4%, 1.7%, and 7.2% for the all-α, all-β, α/β, and α+β classes, respectively. This translates into 0.9/(100-93.1) = 0.9/6.9 = 13%, 4.4/20.9 = 21%, 1.7/16.6 = 10%, and 7.2/35.7 = 20% error rate reductions, respectively, when compared with the error produced by the best performing competing method. The corresponding error rate reduction for the overall accuracy equals 3.9/(100-79.1) = 3.9/20.9 = 19%. The most encouraging improvements that are measured using MCC concern the α/β and α+β classes, where the MODAS method is better by at least 0.08 when compared with the best existing method. We also observe that SCPRED performs slightly better when trained on the bigger ASTRALtraining dataset. SCEC provides the lowest ranked predictions among the considered methods.
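
The error rate reductions quoted above can be reproduced with a few lines of code. This is a minimal sketch: the function name is an assumption made for illustration, and the baseline accuracies for the all-β, α/β, and α+β classes are back-computed from the denominators given in the text (20.9, 16.6, and 35.7) rather than read from Table 6.

```python
def error_rate_reduction(improvement: float, baseline_accuracy: float) -> float:
    """Relative reduction (in %) of the baseline error; inputs are percentages."""
    baseline_error = 100.0 - baseline_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline accuracy must be below 100%")
    return 100.0 * improvement / baseline_error


# (accuracy gain of MODAS, accuracy of the best competing method), both in %
cases = {
    "all-alpha": (0.9, 93.1),
    "all-beta": (4.4, 79.1),    # 100 - 20.9, inferred from the text
    "alpha/beta": (1.7, 83.4),  # 100 - 16.6, inferred from the text
    "alpha+beta": (7.2, 64.3),  # 100 - 35.7, inferred from the text
    "overall": (3.9, 79.1),
}
reductions = {
    name: error_rate_reduction(gain, accuracy)
    for name, (gain, accuracy) in cases.items()
}
for name, value in reductions.items():
    print(f"{name}: {value:.0f}%")
```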

Comparison with the existing structural class predictors

The side-by-side comparison with recently proposed structural class prediction methods is based on tests on three popular benchmark datasets, 25PDB, D1189 and D675, which are characterized by low sequence identity. These sets were used to test methods that predict the 4 major classes and thus the proposed MODAS method is also set up to predict these 4 classes. We also use the D2230 dataset to compare with the most recent structural class predictor that considers the 7 classes [83]. Following prior works in this area, we use the jackknife test to measure the performance. The selection of this test strategy is motivated by the work in [23, 121], which shows that the jackknife test is deemed the most objective as it always yields a unique result for a given dataset and that this test is increasingly used to examine the accuracy of various predictors. In this test all but one sequence are used to train the proposed classification system (using parameters and features identical to those discussed in the Materials and Methods section) and the remaining sequence is used to perform the test; this process is repeated so that each sequence from the dataset is used once as the test sequence.
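
The jackknife (leave-one-out) procedure described above can be sketched as follows. This is only an illustration under stated assumptions: a 1-nearest-neighbour rule stands in for the SVM-based classifiers used by MODAS, the two-dimensional feature vectors are made up, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]            # (feature vector, class label)
Predictor = Callable[[Sequence[float]], str]


def fit_nearest_neighbor(train: List[Sample]) -> Predictor:
    """Stand-in classifier (not the MODAS SVM): label of the closest sample."""
    def predict(x: Sequence[float]) -> str:
        closest = min(
            train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x))
        )
        return closest[1]
    return predict


def jackknife_accuracy(
    data: List[Sample], fit: Callable[[List[Sample]], Predictor]
) -> float:
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]   # all samples except the i-th
        predict = fit(train)
        if predict(features) == label:
            correct += 1
    return correct / len(data)


# Made-up features, e.g. (helix content, strand content).
toy = [
    ((0.9, 0.1), "all-alpha"),
    ((0.8, 0.0), "all-alpha"),
    ((0.1, 0.9), "all-beta"),
    ((0.0, 0.8), "all-beta"),
]
print(jackknife_accuracy(toy, fit_nearest_neighbor))
```

Because every sample is held out exactly once and no random split is involved, repeating the test on the same dataset with a deterministic classifier gives the same result, which is the property that motivates the choice of this test.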

Table 7, which concerns tests on the 25PDB dataset, shows that the proposed MODAS method outperforms all other methods. Only two methods provide an overall accuracy of over 65%, namely the different variants of the SCPRED method [79] and MODAS, and both of them use SVM classifiers and predicted secondary structure. This suggests that the predicted secondary structure provides a useful source of information and that SVM classifiers provide favorable prediction quality for this prediction task. Comparison with SCPRED reveals that the proposed method obtains higher overall accuracy and higher accuracy for the all-β and α/β classes. The error rate reduction obtained by MODAS when compared with the second-best SCPRED on this dataset equals 1.7/(100-79.7) = 8%. We note that the proposed predictor was designed to maximize the MCC values (the feature selection and classifier parameterization were performed to maximize the MCC values) and as a result it provides the best predictions for the 25PDB dataset according to this quality index. The biggest improvement, when compared with SCPRED, was obtained for the α/β class, which is likely due to the introduction of novel features that describe collocation of helix and strand segments in the predicted secondary structure.

Table 7 Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the 25PDB dataset.

Results shown in Tables 8 and 9, which concern jackknife tests on the D1189 and D675 datasets, respectively, are consistent with the results on the 25PDB dataset. The MODAS method outperforms all competing methods as measured by the overall accuracy. The only method that provides similar prediction quality is again SCPRED. Results show that the accuracy provided by MODAS is better than the accuracy of SCPRED by 2.9% and 0.5% on the D1189 and D675 datasets, respectively. The proposed method provides substantial improvements over SCPRED for the prediction of the α+β class. The SCEC predictor, which utilizes PSSM generated with PSI-BLAST as its input, provides the third best results on both of these datasets. This demonstrates that evolutionary information provides a better source of information for the prediction of the structural class when compared with the sequence of the input protein, which is used as the input by all lower ranked methods. We note that the size of the dataset used to build the PSSM would likely impact the prediction quality, as was demonstrated for secondary structure prediction [106]. A larger dataset may induce better prediction performance, which could explain a portion of the improvements of the MODAS method, which was trained using a relatively recent version of the nr database, when compared with other predictors, including SCEC and SCPRED, which used smaller datasets. We could not provide MCC and GC2 values for results on these two datasets (as well as for the D498 dataset) since they were not provided by the authors of the existing methods.

Table 8 Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the D1189 dataset.
Table 9 Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the D675 dataset.

We also compare MODAS with methods that were tested on datasets with unspecified sequence identity between the test and the training sequences. The results of the jackknife test on the D498 dataset are presented in Table 10. The proposed method again achieves the highest accuracy (96.8%) among all competing methods that were tested on this dataset. We observe that the lowest accuracy for this dataset is around 89%. The accuracy of 94.9% obtained by the third best SCEC method demonstrates that it is easier to obtain high predictive performance on this protein set than on the datasets with lower sequence identity, i.e., SCEC achieves 63-67% accuracy for the low-similarity datasets. Based on the observations from a recent study by Kurgan and Homaeian [73], the high levels of accuracy are most likely due to the relatively high pairwise sequence similarity of the D498 dataset (see the Datasets section). On the other hand, the differences between the accuracy on the low- and the high-similarity datasets for the SCPRED and MODAS methods are smaller than for SCEC. This is most likely because these methods were designed using low sequence identity datasets.

Table 10 Results of the experimental comparison between the proposed MODAS method and competing structural class prediction methods on the D498 dataset.

Table 11 compares the proposed method with the PseAA method [83] on the D2230 dataset when considering classification into the 7 classes. Although the authors of PseAA provided only the overall accuracy of their method on this dataset, we present all quality index values obtained by the proposed MODAS method. The overall accuracy of predictions generated by MODAS is better by 22.6% when compared with PseAA. This dataset includes 16.8%, 14.3%, 32%, 28.9%, 1.1%, 2.0%, and 4.9% sequences from the all-α, all-β, α/β, α+β, multi-domain, membrane and cell surface, and small protein classes, respectively. The accuracies obtained by MODAS show that our predictions are substantially better than random chance in spite of the heavily unbalanced nature of the dataset. We note that the quality of the predictions obtained on this dataset is consistent with the results on the other benchmark datasets that are presented above.
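
To give a sense of what "random chance" means for such an unbalanced dataset, the sketch below derives three naive reference accuracies from the class shares listed above. These baselines are not reported in the paper; they are an illustrative assumption about how chance-level performance could be quantified, and the ASCII class names are placeholders for the class labels.

```python
# Class shares (in %) of the D2230 dataset as listed in the text.
shares = {
    "all-alpha": 16.8,
    "all-beta": 14.3,
    "alpha/beta": 32.0,
    "alpha+beta": 28.9,
    "multi-domain": 1.1,
    "membrane and cell surface": 2.0,
    "small protein": 4.9,
}
total = sum(shares.values())

# Guess each of the 7 classes with equal probability.
uniform_guess = 100.0 / len(shares)

# Always predict the most frequent class.
majority_class = max(shares, key=shares.get)
majority_guess = shares[majority_class]

# Guess classes at random in proportion to their frequency.
proportional_guess = 100.0 * sum((s / total) ** 2 for s in shares.values())

print(f"uniform: {uniform_guess:.1f}%")
print(f"majority ({majority_class}): {majority_guess:.1f}%")
print(f"proportional: {proportional_guess:.1f}%")
```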

Table 11 Results of the experimental comparison between the proposed MODAS and PseAA methods on the D2230 dataset.

The high quality of the results provided by SCEC and SCPRED supports our choice to use the evolutionary information encoded in PSSM and the predicted secondary structure as inputs for the proposed MODAS method. The above results demonstrate that MODAS consistently, over multiple datasets, outperforms competing approaches and that it is capable of providing high quality predictions for both the 4 major classes and the 7 classes.

Conclusions

This work addresses the lack of structural class predictors that consider seven structural classes, as defined in SCOP, and that are characterized by high prediction quality when applied to problems that involve query sequences that share twilight-zone similarity with the sequences used to develop the prediction model. This is motivated by the fact that prediction for low-similarity sequences has applications in the detection of remote homologues.

We propose a prediction method that applies an SVM classifier to a set of features that are computed from the input protein sequence. Our design incorporates novel features that utilize sequence-derived information, which includes PSSM computed with PSI-BLAST and secondary structure predicted with PSIPRED. We performed a comprehensive feature selection and classifier selection and parameterization procedure to optimize the quality of the predictions. The proposed method is the first to provide a modular design in which a separate classifier is created for each class.
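
The modular idea, one classifier per class with the user choosing which classes are in play, can be sketched as below. This is not the MODAS implementation: the rule of picking the selected class with the highest score is an assumption, the toy scoring functions are hypothetical stand-ins for the per-class SVM outputs, and the two-element feature vector (helix content, strand content) is made up for illustration.

```python
from typing import Callable, Dict, Iterable, Sequence

ClassScorer = Callable[[Sequence[float]], float]


def predict_modular(
    features: Sequence[float],
    scorers: Dict[str, ClassScorer],
    selected_classes: Iterable[str],
) -> str:
    """Return the selected class whose dedicated scorer gives the highest score."""
    selected = list(selected_classes)
    if not selected:
        raise ValueError("at least one class must be selected")
    unknown = [c for c in selected if c not in scorers]
    if unknown:
        raise KeyError(f"no classifier available for: {unknown}")
    # Assumed combination rule: highest score among the selected classes wins.
    return max(selected, key=lambda c: scorers[c](features))


# Hypothetical per-class scorers over (helix content, strand content).
scorers: Dict[str, ClassScorer] = {
    "all-alpha": lambda f: f[0] - f[1],
    "all-beta": lambda f: f[1] - f[0],
    "alpha/beta": lambda f: min(f[0], f[1]),
}

query = (0.35, 0.30)
print(predict_modular(query, scorers, ["all-alpha", "all-beta", "alpha/beta"]))
print(predict_modular(query, scorers, ["all-alpha", "all-beta"]))
```

Restricting the selected classes changes the outcome for the same query, which is the practical consequence of keeping the classifiers separate.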

An extensive empirical evaluation of the proposed MODAS method, which includes tests on 5 twilight-zone and 1 high-similarity datasets and comparison with more than two dozen modern structural class predictors, shows that MODAS achieves the best overall accuracies for predictions of both the 4 major structural classes (all-α, all-β, α/β, and α+β) and the 7 classes (the 4 classes plus multi-domain, membrane and cell surface, and small protein classes). MODAS is shown to achieve accuracy of over 80% and GC2 scores of over 0.5. The main advantages of the proposed method include (1) the high quality of the predictions for problems involving low sequence similarity datasets; (2) availability of predictions for 7 structural classes (in contrast to predictions offered by the majority of the existing methods that consider only the 4 major classes); and (3) modularity, which allows the user to select any subset of the 7 classes that will be considered as the possible outcomes for the query sequence. In particular, we observe that MODAS provides accurate predictions for the membrane and cell surface proteins, which is an important class that is not considered by the majority of the existing predictors. The improved quality stems from the usage of the two important sequence-derived sources of information, the predicted secondary structure and the evolutionary information, and the development of novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. The results also suggest that the information extracted from the secondary structure that is predicted along the protein chain can be successfully used to predict structural classes that are defined based on the spatial arrangement of the secondary structures.

A web server that implements the MODAS method is available at http://biomine.ece.ualberta.ca/MODAS/. This server limits the number of input sequences to 10. For larger sequence sets, the interested user is asked to contact the corresponding author. The web server was trained on the 1.73 version of the ASTRAL database with less than 20% sequence similarity (i.e., the merged ASTRALtest and ASTRALtraining datasets).

Abbreviations

3D: three dimensional

AA: amino acid

CATH: Class, Architecture, Topology and Homologous superfamily

CFS: Correlation-based Feature Subset selection

GC2: generalized squared correlation

k-NN: k-Nearest Neighbor

MCC: Matthews's correlation coefficient

MODAS: MODular Approach to Structural class prediction

PSSM: position specific scoring matrix

SCOP: Structural Classification of Proteins

SVM: support vector machine

References

  1. Chou KC, Wei D, Du Q, Sirois S, Zhong W: Progress in computational approach to drug development against SARS. Curr Med Chem 2006, 13: 3263–3270.

  2. Chou KC: Structural bioinformatics and its impact to biomedical science. Curr Med Chem 2004, 11: 2105–2134.

  3. Bujnicki JM: Protein-structure prediction by recombination of fragments. Chembiochem 2006, 7(1):19–27. 10.1002/cbic.200500235

  4. Floudas CA: Computational methods in protein structure prediction. Biotechnol Bioeng 2007, 97(2):207–213. 10.1002/bit.21411

  5. Kurgan LA, Cios KJ, Zhang H, Zhang T, Chen K, Shen S, Ruan J: Sequence-based methods for real value predictions of protein structure. Current Bioinformatics 2008, 3(3):183–196. 10.2174/157489308785909197

  6. Rost B: Prediction in 1D: secondary structure, membrane helices, and accessibility. Methods Biochem Anal 2003, 44: 559–587.

  7. Chou KC: Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr Protein Pept Sci 2005, 6(5):423–436. 10.2174/138920305774329368

  8. Gromiha MM, Selvaraj S, Thangakani AM: Statistical Method for Predicting Protein Unfolding Rates from Amino Acid Sequence. J Chem Inf Model 2006, 46(3):1503–1508. 10.1021/ci050417u

  9. Galzitskaya OV, Reifsnyder DC, Bogatyreva NS, Ivankov DN, Garbuzynskiy SO: More compact protein globules exhibit slower folding rates. Proteins 2008, 70(2):329–332. 10.1002/prot.21619

  10. Gromiha MM: A statistical model for predicting protein folding rates from amino acid sequence with structural class information. J Chem Inf Model 2005, 45(2):494–501. 10.1021/ci049757q

  11. Huang LT, Gromiha MM: Analysis and prediction of protein folding rates using quadratic response surface models. J Comput Chem 2008, 29(10):1675–1683. 10.1002/jcc.20925

  12. Istomin AY, Jacobs DJ, Livesay DR: On the role of structural class of a protein with two-state folding kinetics in determining correlations between its size, topology, and folding rate. Protein Sci 2007, 16(11):2564–2569. 10.1110/ps.073124507

  13. Kuznetsov IB, Gou Z, Li R, Hwang S: Using evolutionary and structural information to predict DNA-binding sites on DNA-binding proteins. Proteins 2006, 64(1):19–27. 10.1002/prot.20977

  14. Gromiha MM, Suwa M: A simple statistical method for discriminating outer membrane proteins with better accuracy. Bioinformatics 2005, 21(7):961–968. 10.1093/bioinformatics/bti126

  15. He H, McAllister G, Smith TF: Triage protein fold prediction. Proteins 2002, 48(4):654–663. 10.1002/prot.10194

  16. Ding YS, Zhang TL, Gu Q, Zhao PY, Chou KC: Using Maximum Entropy Model to Predict Protein Secondary Structure with Single Sequence. Protein Pept Lett 2009, 16: 552–560. 10.2174/092986609788167833

  17. Zhang Z, Sun ZR, Zhang CT: A new approach to predict the helix/strand content of globular proteins. J Theor Biol 2001, 208(1):65–78. 10.1006/jtbi.2000.2201

  18. Chou KC: Energy-optimized structure of antifreeze protein and its binding mechanism. J Mol Biol 1992, 223: 509–517. 10.1016/0022-2836(92)90666-8

  19. Carlacci L, Chou KC, Maggiora GM: A heuristic approach to predicting the tertiary structure of bovine somatotropin. Biochemistry 1991, 30: 4389–4398. 10.1021/bi00232a004

  20. Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 2004, 32: D226–229. 10.1093/nar/gkh039

  21. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of protein database for the investigation of sequence and structures. J Mol Biol 1995, 247: 536–540.

  22. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, (35 Database):D61–65. 10.1093/nar/gkl842

  23. Ginalski K: Comparative modeling for protein structure prediction. Curr Opin Struct Biol 2006, 16(2):172–177. 10.1016/j.sbi.2006.02.003

  24. Ruan J, Chen K, Tuszynski JA, Kurgan LA: Quantitative analysis of the conservation of the tertiary structure of protein segments. Protein J 2006, 25: 301–315. 10.1007/s10930-006-9016-5

  25. Xiang Z: Advances in homology protein structure modeling. Curr Protein Pept Sci 2006, 7(3):217–227. 10.2174/138920306777452312

  26. Altschul SF, Madden TL, Schäffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389

  27. Yu YK, Gertz EM, Agarwala R, Schaffer AA, Altschul SF: Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 2006, 34: 5966–5973. 10.1093/nar/gkl731

  28. Rost B: Twilight zone of protein sequence alignments. Protein Eng 1999, 12(2):85–94. 10.1093/protein/12.2.85

  29. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235

  30. Kurgan LA, Chen K: Prediction of protein structural class for the twilight zone sequences. Biochem Biophys Res Co 2007, 357(2):453–460. 10.1016/j.bbrc.2007.03.164

  31. Reinhardt A, Eisenberg D: DPANN: improved sequence to structure alignments following fold recognition. Proteins 2004, 56: 528–538. 10.1002/prot.20144

  32. Tomii K, Hirokawa T, Motono C: Protein structure prediction using a variety of profile libraries and 3D verification. Proteins 2005, 61(S7):114–121. 10.1002/prot.20727

  33. Chou KC, Watenpaugh KD, Heinrikson RL: A Model of the complex between cyclin-dependent kinase 5(Cdk5) and the activation domain of neuronal Cdk5 activator. Biochem Biophys Res Commun 1999, 259: 420–428. 10.1006/bbrc.1999.0792

  34. Paiardini A, Bossa F, Pascarella S: Evolutionarily conserved regions and hydrophobic contacts at the superfamily level: The case of the fold-type I, pyridoxal-5'-phosphate-dependent enzymes. Protein Sci 2004, 13: 2992–3005. 10.1110/ps.04938104

  35. Zhang Y, Skolnick J: The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 2005, 102: 1029–1034. 10.1073/pnas.0407152101

  36. Dunbrack RL: Sequence comparison and protein structure prediction. Curr Opin Struct Biol 2006, 16(3):374–384. 10.1016/j.sbi.2006.05.006

  37. Wu S, Zhang Y: MUSTER: Improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins 2008, 72(2):547–556. 10.1002/prot.21945

  38. Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA: The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res 2009, (37 Database):D310-D314. 10.1093/nar/gkn877

  39. Orengo C, Michie A, Jones D, Swindells M, Thornton J: CATH: a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. 10.1016/S0969-2126(97)00260-8

  40. Carpenter EP, Beis K, Cameron AD, Iwata S: Overcoming the challenges of membrane protein crystallography. Curr Opin Struct Biol 2008, 18(5):581–586. 10.1016/j.sbi.2008.07.001

  41. Homaeian L, Kurgan L, Cios KJ, Ruan J, Chen K: Prediction of protein secondary structure content for the twilight zone sequences. Proteins 2007, 69(3):486–498. 10.1002/prot.21527

  42. Lee S, Lee BC, Kim D: Prediction of protein secondary structure content using amino acid composition and evolutionary information. Proteins 2006, 62: 1107–1114. 10.1002/prot.20821

  43. Kurgan LA, Zhang T, Zhang H, Shen S, Ruan J: Secondary structure based assignment of the protein structural classes. Amino Acids 2008, 35(3):551–564. 10.1007/s00726-008-0080-3

  44. Chou KC: Prediction of protein cellular attributes using pseudo amino acid composition. Protein Struct Funct Gene 2001, 43: 246–255. 10.1002/prot.1035

  45. Anand A, Pugalenthi G, Suganthan PN: Predicting protein structural class by SVM with class-wise optimized features and decision probabilities. J Theor Biol 2008, 253(2):375–380. 10.1016/j.jtbi.2008.02.031

  46. Cai Y, Feng K, Lu W, Chou K: Using LogitBoost classifier to predict protein structural classes. J Theor Biol 2006, 238: 172–176. 10.1016/j.jtbi.2005.05.034

  47. Cai Y, Liu X, Xu X, Chou K: Support vector machines for prediction of protein domain structural class. J Theor Biol 2003, 221: 115–120. 10.1006/jtbi.2003.3179

  48. Cai YD, Liu XJ, Xu XB, Chou KC: Prediction of protein structural classes by support vector machines. J Comput Chem 2002, 26(3):293–296. 10.1016/S0097-8485(01)00113-9

  49. Cao Y, Liu S, Zhang L, Qin J, Wang J, Tang K: Prediction of protein structural class with Rough Sets. BMC Bioinformatics 2006, 7: 20. 10.1186/1471-2105-7-20

  50. Chen C, Tian YX, Zou XY, Cai PX, Mo JY: Using pseudo-amino acid composition and support vector machine to predict protein structural class. J Theor Biol 2006, 243(3):444–448. 10.1016/j.jtbi.2006.06.025

  51. Chou KC: A Novel Approach to Predicting Protein Structural Classes in a (20–1)-D Amino Acid Composition Space. Proteins 1995, 21: 319–344. 10.1002/prot.340210406

  52. Costantini S, Facchiano AM: Prediction of the protein structural class by specific peptide frequencies. Biochimie 2009, 91(2):226–229. 10.1016/j.biochi.2008.09.005

  53. Ding YS, Zhang TL, Chou KC: Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network. Protein Pept Lett 2007, 14(8):811–815. 10.2174/092986607781483778

  54. Dong L, Yuan Y, Cai T: Using Bagging classifier to predict protein domain structural class. J Biomol Struct Dyn 2006, 24: 239–242.

  55. Du QS, Jiang ZQ, He WZ, Li DP, Chou KC: Amino Acid Principal Component Analysis (AAPCA) and its applications in protein structural class prediction. J Biomol Struct Dyn 2006, 23(6):635–640.

  56. Gu F, Chen H, Ni J: Protein structural class prediction based on an improved statistical strategy. BMC Bioinformatics 2008, 9(Suppl 6):5. 10.1186/1471-2105-9-S6-S5

  57. Jahandideh S, Abdolmaleki P, Jahandideh M, Asadabadi EB: Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. Biophys Chem 2007, 128(1):87–93. 10.1016/j.bpc.2007.03.006

  58. Jahandideh S, Abdolmaleki P, Jahandideh M, Hayatshahi SHS: Novel hybrid method for the evaluation of parameters contributing in determination of protein structural classes. J Theor Biol 2007, 244: 275–281. 10.1016/j.jtbi.2006.08.011

  59. Jin L, Fang W, Tang H: Prediction of protein structural classes by a new measure of information discrepancy. Comput Biol and Chem 2003, 27: 373–380. 10.1016/S1476-9271(02)00087-7

  60. Kedarisetti K, Kurgan LA, Dick S: A Comment on 'Prediction of protein structural classes by a new measure of information discrepancy'. Comput Biol and Chem 2006, 30: 393–394. 10.1016/j.compbiolchem.2006.06.003

  61. Li ZC, Zhou XB, Dai Z, Zou XY: Prediction of protein structural classes by Chou's pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis. Amino Acids 2009, 37(2):415–425. 10.1007/s00726-008-0170-2

  62. Li ZC, Zhou XB, Lin YR, Zou XY: Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 2008, 35(3):581–590. 10.1007/s00726-008-0084-z

  63. Niu B, Cai YD, Lu WC, Li GZ, Chou KC: Predicting protein structural class with AdaBoost Learner. Protein Pept Lett 2006, 13(5):489–492. 10.2174/092986606776819619

  64. Xiao X, Lin WZ, Chou KC: Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008, 29(12):2018–2024. 10.1002/jcc.20955

  65. Xiao X, Shao SH, Huang ZD, Chou KC: Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor. J Comput Chem 2006, 27(4):478–482. 10.1002/jcc.20354

  66. Zhang TL, Ding YS: Using pseudo amino acid composition and binary-tree support vector machines to predict protein structural classes. Amino Acids 2007, 33(4):623–629. 10.1007/s00726-007-0496-1

  67. Zhang TL, Ding YS, Chou KC: Prediction protein structural classes with pseudo amino acid composition: approximate entropy and hydrophobicity pattern. J Theor Biol 2008, 250: 186–193. 10.1016/j.jtbi.2007.09.014

  68. Zheng X, Li C, Wang J: An information-theoretic approach to the prediction of protein structural class. J Comput Chem 2009, in press.

  69. Liu T, Zheng X, Wang J: Prediction of protein structural class using a complexity-based distance measure. Amino Acids 2009, in press.

  70. Zhou GP: An intriguing controversy over protein structural class prediction. J Protein Chem 1998, 17: 729–738. 10.1023/A:1020713915365

  71. Feng KY, Cai YD, Chou KC: Boosting classifier for predicting protein domain structural class. Biochem Biophys Res Commun 2005, 334(1):213–217. 10.1016/j.bbrc.2005.06.075

  72. Kedarisetti K, Kurgan LA, Dick S: Classifier ensembles for protein structural class prediction with varying homology. Biochem Biophys Res Co 2006, 348(3):981–988. 10.1016/j.bbrc.2006.07.141

  73. Kurgan LA, Homaeian L: Prediction of structural classes for protein sequences and domains - impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recogn 2006, 39: 2323–2343. 10.1016/j.patcog.2006.02.014

  74. Chen C, Chen LX, Zou XY, Cai PX: Predicting protein structural class based on multi-features fusion. J Theor Biol 2008, 253(2):388–392. 10.1016/j.jtbi.2008.03.009

  75. Chen K, Kurgan L, Ruan J: Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J Comput Chem 2008, 29: 1596–1604. 10.1002/jcc.20918

  76. Gupta R, Mittal A, Singh K: A time-series-based feature extraction approach for prediction of protein structural class. EURASIP J Bioinform Syst Biol 2008, 35451.

  77. Xiao X, Wang P, Chou KC: Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008, 254(3):691–696. 10.1016/j.jtbi.2008.06.016

  78. Yang JY, Peng ZL, Yu ZG, Zhang RJ, Anh V, Wang D: Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J Theor Biol 2009, 257(4):618–626. 10.1016/j.jtbi.2008.12.027

  79. Kurgan LA, Cios KJ, Chen K: SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008, 9: 226. 10.1186/1471-2105-9-226

  80. Cai Y, Zhou G: Prediction of protein structural classes by neural network. Biochimie 2000, 82: 783–785. 10.1016/S0300-9084(00)01161-5

  81. Wang ZX, Yuan Z: How good is the prediction of protein structural class by the component-coupled method. Proteins 2000, 38: 165–175. 10.1002/(SICI)1097-0134(20000201)38:2<165::AID-PROT5>3.0.CO;2-V

  82. Cai Y, Liu X, Xu X, Zhou G: Support vector machines for predicting protein structural class. BMC Bioinformatics 2001, 2: 3. 10.1186/1471-2105-2-3

  83. Chen W, Zhang S, Yang H, Zhao K, Chou K: Prediction of seven protein structural classes by fusing multi-feature information including protein evolutionary conservation information. Proceedings of the Second International Conference on Bioinformatics and Biomedical Engineering: 16–18 May 2008; Shanghai 2008, 17–20.

  84. Chou KC, Cai Y: Predicting protein structural class by functional domain composition. Biochem Biophys Res Commun 2004, 321: 1007–1009. 10.1016/j.bbrc.2004.07.059

  85. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29: 37–40. 10.1093/nar/29.1.37

  86. Elofsson A, von Heijne G: Membrane protein structure: prediction versus reality. Annu Rev Biochem 2007, 76: 125–140. 10.1146/annurev.biochem.76.052705.163539

  87. Punta M, Forrest LR, Bigelow H, Kernytsky A, Liu J, Rost B: Membrane protein prediction methods. Methods 2007, 41(4):460–74. 10.1016/j.ymeth.2006.07.026


  88. Bigelow H, Rost B: Online tools for predicting integral membrane proteins. In Membrane Proteomics: Methods and Protocols. Volume 528. Edited by: Peirce MJ, Wait R. New York: Humana Press; 2009:3–23.


  89. Marsden RL, McGuffin LJ, Jones DT: Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci 2002, 11(12):2814–2824. 10.1110/ps.0209902


  90. Bryson K, Cozzetto D, Jones DT: Computer-assisted protein domain boundary prediction using the DomPred server. Curr Protein Pept Sci 2007, 8(2):181–188. 10.2174/138920307780363415


  91. Raman P, Cherezov V, Caffrey M: The membrane protein data bank. Cell Mol Life Sci 2006, 63(1):36–51. 10.1007/s00018-005-5350-6


  92. Majumdar I, Kinch LN, Grishin NV: A database of domain definitions for proteins with complex interdomain geometry. PLoS ONE 2009, 4(4):e5084. 10.1371/journal.pone.0005084


  93. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001, 305(3):567–580. 10.1006/jmbi.2000.4315


  94. Jones D: Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 2007, 23(5):538–544. 10.1093/bioinformatics/btl677


  95. Jones D: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292(2):195–202. 10.1006/jmbi.1999.3091


  96. Kim H, Park H: Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins 2003, 54(3):557–562. 10.1002/prot.10602


  97. Brenner S, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Res 2000, 28: 254–256. 10.1093/nar/28.1.254


  98. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5


  99. Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol 1982, 162: 705–708. 10.1016/0022-2836(82)90398-9


  100. Bryson K, McGuffin LJ, Marsden RL, Ward JJ, Sodhi JS, Jones DT: Protein structure prediction servers at University College London. Nucleic Acids Res 2005, 33(Web Server issue):W36–W38. 10.1093/nar/gki410

  101. Zheng C, Kurgan LA: Prediction of beta-turns at over 80% accuracy based on an ensemble of predicted secondary structures and multiple alignments. BMC Bioinformatics 2008, 9: 430. 10.1186/1471-2105-9-430


  102. Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan LA: Sequence based residue depth prediction using evolutionary information and predicted secondary structure. BMC Bioinformatics 2008, 9: 388. 10.1186/1471-2105-9-388


  103. Song J, Burrage K: Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics 2006, 7: 425. 10.1186/1471-2105-7-425


  104. Witten I, Frank E: Data Mining: Practical machine learning tools and techniques. 2nd edition. San Francisco: Morgan Kaufmann; 2005.


  105. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007, 35(Database issue):D61–D65. 10.1093/nar/gkl842

  106. Przybylski D, Rost B: Alignments grow, secondary structure prediction improves. Proteins 2002, 46: 197–205. 10.1002/prot.10029


  107. Liu H, Setiono R: A probabilistic approach to feature selection - A filter solution. In Proceedings of the 13th International Conference on Machine Learning: 3–6 July 1996; Bari. Edited by: Saitta L. San Francisco: Morgan Kaufmann; 1996:319–327.


  108. Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell 1997, 97(1–2):273–324. 10.1016/S0004-3702(97)00043-X


  109. Hall MA: Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning: 29 June - 2 July 2000; San Francisco. Edited by: Langley P. San Francisco: Morgan Kaufmann; 2000:359–366.


  110. Robnik-Sikonja M, Kononenko I: An adaptation of Relief for attribute estimation in regression. In Proceedings of the 14th International Conference on Machine Learning: 8–12 July 1997; Nashville. Edited by: Fisher DH. San Francisco: Morgan Kaufmann; 1997:296–304.


  111. Langley P: Selection of relevant features in machine learning. In Proceedings of the AAAI Fall Symposium on Relevance: 4–6 November 1994; New Orleans. Menlo Park: AAAI Press; 1994:140–144.


  112. Keerthi S, Shevade S, Bhattacharyya C, Murthy K: Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Comput 2001, 13(3):637–649. 10.1162/089976601300014493


  113. John G, Langley P: Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence: 18–20 August 1995; Montreal. Edited by: Besnard P, Hanks S. San Mateo: Morgan Kaufmann Publishers; 1995:338–345.


  114. Cessie S, Houwelingen J: Ridge estimators in logistic regression. Appl Stat 1992, 41(1):191–201. 10.2307/2347628


  115. Aha D, Kibler D: Instance-based learning algorithms. Mach Learn 1991, 6: 37–66.


  116. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H: Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 2000, 16: 412–424. 10.1093/bioinformatics/16.5.412


  117. Costantini S, Colonna G, Facchiano AM: Amino acid propensities for secondary structures are influenced by the protein structural class. Biochem Biophys Res Commun 2006, 342(2):441–451. 10.1016/j.bbrc.2006.01.159


  118. von Heijne G: Principles of membrane protein assembly and structure. Prog Biophys Mol Biol 1996, 66(2):113–139. 10.1016/S0079-6107(97)85627-1


  119. Amirova SR, Milchevsky JV, Filatov IV, Esipova NG, Tumanyan VG: Study and prediction of secondary structure for membrane proteins. J Biomol Struct Dyn 2007, 24(4):421–428.


  120. Punta M, Maritan A: A knowledge-based scale for amino acid membrane propensity. Proteins 2003, 50(1):114–121. 10.1002/prot.10247


  121. Chou KC, Shen HB: Recent progresses in protein subcellular location prediction. Anal Biochem 2007, 370: 1–16. 10.1016/j.ab.2007.07.006


Author information

Corresponding author

Correspondence to Lukasz Kurgan.

Additional information

Authors' contributions

MJM contributed to the conception of the proposed method, designed and implemented the feature sets and the classifiers, performed the tests, implemented the web server, contributed to the evaluation and interpretation of the results, and wrote the manuscript. LK contributed to the conception of the proposed method and the design of the feature sets and the classifier, helped in performing the tests, contributed to the evaluation and interpretation of the results, and wrote the manuscript. Both authors have read and approved the final version of the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Mizianty, M.J., Kurgan, L. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformatics 10, 414 (2009). https://doi.org/10.1186/1471-2105-10-414

  • DOI: https://doi.org/10.1186/1471-2105-10-414

Keywords