Abstract
Background
The structure of proteins may change as a result of the inherent flexibility of some protein regions. We develop and explore probabilistic machine learning methods for predicting a continuum secondary structure, i.e. assigning probabilities to the conformational states of a residue. We train our methods using data derived from high-quality NMR models.
Results
Several probabilistic models not only successfully estimate the continuum secondary structure, but also provide a categorical output on par with models directly trained on categorical data. Importantly, models trained on the continuum secondary structure are also better than their categorical counterparts at identifying the conformational state for structurally ambivalent residues.
Conclusion
Cascaded probabilistic neural networks trained on the continuum secondary structure exhibit better accuracy in structurally ambivalent regions of proteins, while sustaining an overall classification accuracy on par with standard, categorical prediction methods.
Background
Protein structures can be characterized by regular folding patterns. Descriptions at the level of local folding pattern (e.g., alpha helix or beta sheet) are known as the protein's secondary structure, as opposed to its full tertiary (3-dimensional) structure. It is common practice to describe each residue as belonging to one of eight secondary structure environment classes:
C_8 = {G, H, I, E, B, T, S, C},
or to one of three classes:
C_3 = {H, E, C}.
The set C_8 consists of the eight DSSP [1] classes: 3_10 helix (G), alpha helix (H), pi helix (I), turn (T), extended beta sheet (E), beta bridge (B), bend (S) and other/loop (C). In set C_3, class H contains the 3_10 helix and alpha helix classes, class E contains the extended beta sheet and beta bridge classes, and class C contains the remaining four DSSP classes.
In addition to the dominating covalent polypeptide backbone, the stability of a protein structure is determined by the collective strength of many covalent and ionic bonds, as well as van der Waals attractions. However, it is well established that protein structures are not entirely rigid. As the tertiary structure of a protein changes due to thermal motion or outside influences, a residue may also change secondary structure states. In stark contrast to the typical definition of secondary structure, in which a residue can only have a single state, continuum secondary structures allow a residue to be in all states, indicated by a probability distribution over the possible secondary structure states. It has been contended that specifying protein structure this way allows regions of transition in secondary structure (e.g., caps) to be characterised more accurately [2]. Moreover, a probabilistic representation of secondary structures sheds light on the conformational flexibility of proteins [2,3].
When solved by X-ray crystallography, a protein shows conformational variation mainly due to differing experimental conditions. On the other hand, the NMR solution of a protein structure always provides a number of models, with structural variation due, at least in part, to intrinsic motions of the protein [2]. Andersen and colleagues developed a scheme, DSSPCONT, where the secondary state probability distribution for each residue in a protein is estimated from the variation amongst an ensemble of NMR models of the protein. The DSSPCONT assignment thus directly distinguishes less flexible regions from more flexible ones [2].
In this work, we use the same NMR dataset as Andersen et al. [2] to develop probabilistic models that are able to predict the continuum secondary structure from the amino acid sequence. We test these models using a dataset of continuum secondary structures developed by us, having very low sequence similarity with the Andersen et al. dataset. Given a target protein sequence, our models are trained to predict, for each residue, the probability distribution over all possible secondary structure environments for that residue. Importantly, the probabilistic models are thus directly provided with prediction targets that reflect the variability of conformation.
There are a large number of prediction methods that take as input a protein sequence and predict the secondary structure of each of its residues. Current best methods (including PSIPRED, SSPro, PROFsec and others) achieve a 3-class accuracy (Q_3) of 75-80% [4-11]. These and other previous secondary structure prediction methods implicitly assume that each residue in the protein belongs to a definite secondary structure. The target secondary structures used by most models are categorically determined by DSSP [1] or STRIDE [12].
Most previous prediction methods provide continuous (rather than categorical) outputs, and it is tempting to interpret these as probabilities. What distinguishes our approach is that we train our models with probabilistic data, so it is entirely natural to interpret their predictions as probabilities. Previous approaches train models using categorical data, so non-categorical outputs often do not represent probabilities at all. In most cases (e.g., with neural nets trained on categorical data), non-categorical outputs represent the distance from an internal decision boundary, which may be correlated with the certainty of the prediction, but is not a probability in the strict sense of the word. It is also unclear whether such outputs carry direct physical or biological meaning (like thermal motion or conformational switching), or if they merely reflect the confidence that the model has in the prediction. Moreover, it is estimated that 5-15% of the current prediction errors can be attributed to the current rigid definition of secondary structure and how it is derived from experimental models [13].
In this work, we study three models of increasing complexity: Naive Bayes' Density Predictors (NBDP), Probabilistic Neural Networks (PNN), and Cascaded Probabilistic Neural Networks (CPNN). The first of these methods (NBDP) is the simplest to implement, but makes the most assumptions about the data. In particular, NBDP assumes that the identities of adjacent residues in the protein are independent given the secondary structure. This is obviously not true in general, but we include results using NBDP because it is often a surprisingly effective method for learning probability distributions [14]. The neural network models are known to be effective for categorical secondary structure prediction [4,6,15] and are thus explored here, too.
To quantify the accuracy of our models, we measure the divergence between the probabilities derived from high-quality NMR models and the predicted probabilities. In combination with each of the three model types, we also examine the effect of describing the amino acids in the input sequence in two different ways. We refer to the two residue description methods as the amino acid identity and PSI-BLAST profile methods, where the latter is employed in the top-performing categorical secondary structure predictors.
Our main concern is to establish how well the continuity of structure can be captured by machine learning models from limited datasets. We are specifically interested in whether secondary structure prediction can be improved by training the model with the fine-grained structural data from NMR.
To compare our work with previous studies, we also convert the continuum predictions made by our models into categorical predictions by selecting the class with the highest predicted probability. We compare these results to those of a categorical prediction method trained on data obtained by a similar conversion of the continuum secondary structure data to categorical form. The categorical method we compare to is a Cascaded Categorical Neural Network (CCNN), as employed by the top-performing PSIPRED [4] algorithm.
Results
The probabilistic models studied here are more accurate at predicting continuum secondary structure (residue class density) than models trained on categorical data. The difference in accuracy is most pronounced for residues with structural ambivalence. Furthermore, these probabilistic models can be used to predict categorical secondary structure (residue class) with accuracy comparable to a successful categorical method. These results hold when accuracy is measured by cross-validation on the training data as well as when validated with a test dataset containing only proteins with low sequence similarity to proteins in the training set, and for both the 3- and 8-class prediction problems.
Probabilistic models
The PNN and CPNN are the most successful models at predicting continuum secondary structure in our study. Accuracy is highest when the residues in the sequence are described using the PSI-BLAST profile method. The accuracy of the PNN and CPNN methods is also sensitive to the number of hidden nodes in the model and to the width of the sequence window presented to the model. The accuracy of the Naive Bayes' model is substantially inferior to that of the PNN and CPNN models.
We use the Kullback-Leibler (KL) divergence to measure the accuracy of our models at predicting continuum secondary structure (see Methods section). Accuracy increases with decreasing KL divergence. The predictive accuracy of the PNN and CPNN models generally improves as the number of hidden nodes increases (Table 1), although improvement is slight beyond 25 hidden nodes. The optimal window size is 15 residues for both the 3- and 8-class prediction problems. The KL divergence of the best PNNs is 0.49 and 0.88 for the 3- and 8-class problems, respectively. The CPNN improves this accuracy to 0.47 and 0.84, respectively.
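As a concrete illustration, the KL divergence between a target distribution and a predicted distribution can be computed as in the following sketch (the smoothing constant `eps` is our assumption; the paper does not state how zero probabilities are handled):

```python
import math

def kl_divergence(target, predicted, eps=1e-10):
    """Kullback-Leibler divergence D(target || predicted) for one residue.

    `target` and `predicted` are probability vectors over the k secondary
    structure classes; `eps` guards against log(0). Lower is better.
    """
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted))

def mean_kl(targets, predictions):
    """Average KL divergence over all residues in a dataset."""
    return sum(kl_divergence(t, p)
               for t, p in zip(targets, predictions)) / len(targets)
```

A perfect prediction yields a divergence of zero, and the divergence grows as the predicted distribution drifts from the target.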
Table 1. Cross-validated density prediction accuracy of PNN and CPNN models. Average KL divergence for probabilistic and cascaded probabilistic neural network models predicting continuum 3- and 8-class secondary structure from PSI-BLAST encoded sequence data. Window size and number of hidden nodes are varied. All predictions are for 10-fold cross-validation on the training set (set174). When standard errors are given in parentheses, the predicted value is the mean of five randomized repeats of cross-validation.
For the 3-class problem, the best Naive Bayes' model (with an 11-residue window) achieves a KL divergence of 0.74. For the 8-class problem, KL divergence is 1.19 for models with 7- and 9-residue windows. The NBDP model is far inferior to the other models when residues are described using the amino acid identity method (Table 2), and fails almost totally with the PSI-BLAST profile description method (data not shown). The optimal window size for NBDP, of approximately 10 residues, is smaller than for the PNN model.
Table 2. Cross-validated density prediction accuracy of NBDP models. Average KL divergence for the Naive Bayes' Density Predictor is shown for the 3- and 8-class prediction tasks with varying sequence window sizes. The residues in the window were described using the amino acid identity method. All predictions are for 10-fold cross-validation on the training set (set174). Best results are shown in bold.
To give the reader a better qualitative feeling for the meaning of various KL divergences, we show the output of the most accurate 3-class NBDP and CPNN models on the Ras-binding domain of c-Raf1 (PDB:1RFA) in Figure 1. The average KL divergence of the NBDP prediction on this sequence is 0.67, slightly worse than the average 3-class prediction accuracy for the PNN model (see Table 1), and about average for the NBDP model (see Table 2). The average divergence of the CPNN prediction for this sequence is 0.27, slightly more accurate than the overall average achieved by the CPNN model on all test sequences (see Table 1). The data in the figure are for NBDP using the amino acid identity residue description method, and for CPNN using the PSI-BLAST profile method.
Figure 1. Example 3class continuum secondary structure predictions. The 3class predictions of the best NBDP and CPNN models for positions 1–50 of protein PDB:1RFA are plotted. The target (known) probabilities are plotted as a dotted black line. The dashed red line is the NBDP predictions and the solid blue line is the CPNN predictions.
Comparing with categorical models
Even though our probabilistic models are not explicitly trained to produce categorical output, they perform competitively with a state-of-the-art classification method. We train Cascaded Categorical Neural Networks (CCNNs) (similar to PSIPRED) to predict the categorical targets using a configuration identical to the best CPNN in this study (30 hidden nodes, 15-residue window). These CCNNs are trained using categorical data derived from the continuum data, as described in the Methods section.
The classification accuracy of the probabilistic model (CPNN) is comparable to that of the categorical model (CCNN) in the 3-class problem using several popular accuracy metrics, including Q_3 and SOV (Table 3). We observe that, with a Q_3 of 77.3, the CPNN is on par with the CCNN (Q_3 = 77.2). (Q_3 measures classification accuracy on a scale of 0 (worst) to 100 (best).) Similarly, the two models have segment-overlap-based SOV measures [16] of 73.8 (CPNN) and 72.8 (CCNN). (SOV measures a segment-based precision of prediction ranging from 0 (worst) to 100 (best).) Compared with the cascaded probabilistic model (CPNN), the PNN model has similar Q_3 accuracy but notably inferior SOV. The best Naive Bayes' Density Predictors with Boolean input features manage a Q_3 of 61.2 and an SOV of 52.9, significantly worse than the other probabilistic models.
Table 3. Cross-validated classification accuracy of all models. Average accuracy of categorical prediction in the 3- and 8-class problems is given as measured by the accuracy metric Q_k, the Matthews correlation, r(), and SOV. All predictions are for 10-fold cross-validation on the training set (set174). When standard errors are given in parentheses, the predicted value is the mean of five randomized repeats of cross-validation. The best results are shown in bold.
As an independent test, we also test the models on all sequences in the small CAFASP3 dataset used to benchmark a range of public predictors [17]. None of the sequences in CAFASP3 are included in our training set (set174). We find that both CPNN and CCNN achieve a Q_3 of 76.2, only slightly worse than their accuracies measured by cross-validation on the training set (Table 3). For comparison, Q_3 accuracies reported for other categorical models in the Eyrich study [17] ranged from 67.5 to 78.9 (78.6 for PSIPRED). Similarly, the CPNN and CCNN models have SOV measures of 73.5 and 73.9, respectively, which are slightly better than their cross-validated accuracies. The classification accuracy of the CPNN is slightly lower than the best results reported in the literature, but this is to be expected because our training set is considerably smaller than those used in many previous studies [7]. It also bears noting that the probabilistic model (CPNN) is not specifically trained to produce categorical targets.
Conversely, many models trained on categorical data also offer continuous predictions. For the CCNN fitted with the softmax output function, we can evaluate its ability to directly produce an output close to the continuum secondary structure. On the 3-class problem, the CCNN achieves a KL divergence (averaged over all residues) that is nearly identical to that of the CPNN (Table 4, columns labeled "entropy ≥ 0.0"). The cross-validated KL divergence on all residues is 0.48 (SE = 0.002) for the CCNN, which is close to that of the CPNN (0.47). Likewise, the CCNN and CPNN have very similar KL divergence on the test dataset (set286): 0.51 and 0.50, respectively. On the 8-class problem, the CCNN and CPNN have very similar overall KL divergence on the test dataset (0.99 vs 0.98), but the CCNN appears slightly inferior when cross-validated on the training set: 0.87 (SE = 0.003) and 0.84, respectively. We note that the categorical model (CCNN) is not specifically trained to produce continuous targets.
Table 4. Density prediction accuracy (KL divergence) for structurally ambivalent residues. Average KL divergence of prediction of continuum secondary structure for residues that have a structural ambivalence equal to or exceeding an entropy of 0.0 (all residues), 0.3 and 0.5. "CV": average (standard error) of five randomized repeats of 10-fold cross-validation on the training set (set174). "test": average error on the test dataset (set286).
To investigate the qualitative difference between models that are trained on probabilistic targets and categorical targets, we focus on residues in "fuzzy" regions. In particular, we identified "fuzzy" residues as those with an observed secondary structure state entropy of 0.3 or above (15% of all residues), and "very fuzzy" residues as those with target entropy of 0.5 or above (8% of all residues). We investigate the accuracy of the models created by cross-validation on the full training set on each of these high-entropy residue subsets (Tables 4 and 5).
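The selection of structurally ambivalent residues can be sketched as follows; note that the paper does not state the logarithm base used for the entropy thresholds, so natural logarithms here are an assumption:

```python
import math

def state_entropy(probs):
    """Shannon entropy of a residue's secondary structure distribution
    (natural logarithms assumed; zero-probability terms contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def fuzzy_residues(target_distributions, threshold=0.3):
    """Indices of residues whose structural ambivalence (entropy of the
    target distribution) meets or exceeds the threshold."""
    return [i for i, t in enumerate(target_distributions)
            if state_entropy(t) >= threshold]
```

A residue fixed in one state has entropy 0 and is excluded, while a residue spread evenly over the three classes has the maximum entropy of ln 3 ≈ 1.10.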
Table 5. Classification accuracy (Q_3) for structurally ambivalent residues. Average accuracy as measured by Q_3 of 3-class categorical prediction of residues that have a structural ambivalence equal to or exceeding an entropy of 0.0 (all residues), 0.3 and 0.5. "CV": average (standard error) of five randomized repeats of 10-fold cross-validation on the training set (set174). "test": average error on the test dataset (set286).
Both probabilistic models (PNN and CPNN) have significantly lower KL divergence on residues with structural entropy greater than 0.3 when compared with the categorical model (CCNN). This conclusion holds when KL divergence is measured using cross-validation on the training set as well as when measured on the test dataset (Table 4, last four columns). The probabilistic models have significantly superior (lower) KL divergence than the categorical model for both the 3- and 8-class problems. For example, the cross-validated 3-class KL divergence for residues with entropy at least 0.5 is 0.59 for the CCNN, but only 0.53 for the CPNN. Similarly, the average 3-class KL divergences on the test dataset residues with entropy at least 0.5 are 0.58 and 0.53, respectively, for the CCNN and CPNN models. The best probabilistic model (CPNN) is superior to the best categorical model (CCNN) for predicting continuum secondary structure for test dataset residues with entropy greater than 0.1 (Figure 2). This result holds for both the 3- and 8-class problems.
Figure 2. KL divergence as a function of test dataset residue entropy. The dashed red and solid blue lines show the KL divergence of predictions on the test dataset (set286) made by the CCNN and CPNN models, respectively. Residues are binned by secondary structure entropy, and the mean KL divergence of residues in a bin is plotted at the midpoint of the bin. Error bars show plus and minus one standard error around each mean. The numbers of residues in the bins for the 3class problem are (in order of increasing entropy) 26357, 948, 839, 525, 14, 1036, 489, 4 and 2. For the 8class problem bin occupancies are 25657, 1878, 777, 1728, 127, 43, 4 and 0.
For the 3-class problem, KL divergence on the test dataset is only slightly higher than the cross-validated value (Table 4). This shows that overfitting is not a serious problem with the 3-class models. On the other hand, the 8-class models show significantly worse KL divergence on the test dataset than during cross-validation. This may be caused by overfitting, given the small size of the training dataset (approximately 17,000 residues) compared to the number of parameters in the models (approximately 9,000). However, since overfitting does not occur with the (equally complex) 3-class models, it is likely that the more complex output space in the 8-class problem is the true culprit.
For classifying structurally ambivalent residues (Table 5, last four columns), the probabilistic networks (PNN and CPNN) are on par with the categorical neural network (CCNN).
Three-class classification accuracy (Q_3) on residues with entropy at least 0.3 is 55.7% (SE = 0.22) and 55.4% (SE = 0.17) for the CPNN and CCNN models, respectively. It is worth noting that the probabilistic networks achieve good classification despite not being specifically trained for this task.
Discussion
Our results support the existence of higher-order dependencies between the residues within the input window, as the Naive Bayes' models and single-layer Probabilistic Neural Networks perform relatively poorly.
In agreement with previous work on both neural networks and support vector machines [4], we note that the PSI-BLAST profile description of the sequence data works much better than the amino acid identity method for all types of Probabilistic Neural Networks. However, the opposite holds for the Naive Bayes' Density Predictor. With the PSI-BLAST profile description method, the performance of the NBDP drops considerably. On closer inspection, the class-conditioned distributions of input values are strongly overlapping and consequently result in poor discrimination.
To investigate whether the input values follow a non-Gaussian distribution, we also tried dividing each numeric input value into 5 and 10 bins. The performance of the Naive Bayes' Density Predictor then drops even further. The naive assumption of independence among the very large number of random variables (resulting from the profile description method) is clearly violated. Each residue profile column reflects a single piece of information involving all twenty amino acids, and the random variables making up the column are highly dependent. This fact is most likely responsible for the failure of the NBDP with the PSI-BLAST profile residue description method.
Even though the average KL divergence between the probabilistic target and predicted values is almost the same for both the CCNN and CPNN, they seem to handle structurally ambivalent residues differently. The discrepancy as measured on these challenge subsets indicates that training with continuum data results in more accurate prediction for residues with high structural ambivalence.
Our simulations indicate that it is not sufficient to train on the categorical targets if structurally ambivalent states need to be characterised precisely. This precision may be particularly important for identifying conformational flexibility, e.g. Young et al. [18] rely on a reliability index of a categorical prediction to identify conformational switches.
Conclusion
The models we present are adapted to predict a continuum secondary structure, i.e. to predict the probability of a residue belonging to any of the three or eight secondary structure classes. The probabilities derive from NMR models that capture some aspects of protein flexibility, in contrast to most categorical predictors, which are trained on categorical data usually derived from X-ray crystal structures.
Cascaded Probabilistic Neural Networks using a 15-residue input window (involving primary sequence data only) are able to produce 3-class predictions that, on average, measure 0.47 using the Kullback-Leibler divergence from target distributions. A similar model produces 8-class secondary structure predictions measuring an average KL divergence of 0.84 from the observed targets.
To illustrate the performance and utility of probabilistic models, we also convert the probabilistic predictions to categorical classifications and note that the probabilistic models are on par with models that are directly trained on categorical data. In particular, structurally ambivalent residues (e.g. caps of regular folds) are predicted more accurately by the best probabilistic models than by their categorical counterparts. So far, the scarcity of NMR data renders the continuum secondary structure prediction less accurate for classification than categorical models directly trained on much larger sets of crystallographic data.
Methods
Overview
A typical machine learning approach might view the secondary structure environment of a protein residue as a multinomial random variable, C, having k possible values. Our view is that the secondary structure environment of a residue does not have a single, fixed value, but may be in any one of k classes. The occupation of the structural classes follows a multinomial distribution, and is measured by the variation among NMR models. We start from a training set, S, of examples, each of the form E = <X, T>. The goal is to use the training set to learn a function Y = f(X) that approximates the posterior probabilities, T = <T_1, T_2, ..., T_k>, where
T_j = Pr(C = j | X), 1 ≤ j ≤ k.
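For illustration, one way such a target vector T could be estimated from an ensemble of NMR models is by counting the state assigned to the residue in each model. This is a simplified sketch of the idea; the actual DSSPCONT assignment scheme [2] is more elaborate:

```python
from collections import Counter

def target_distribution(states, classes=("H", "E", "C")):
    """Estimate a residue's class distribution from the secondary structure
    state it takes in each NMR model of an ensemble.

    `states` lists the (already 3-class-mapped) state of one residue in
    each model, e.g. ["H", "H", "C", "H"] for a 4-model ensemble.
    """
    counts = Counter(states)
    return [counts[c] / len(states) for c in classes]
```

A residue assigned H in 3 of 4 models and C in the remaining one would receive the target vector <0.75, 0.0, 0.25>.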
In common with many earlier secondary structure prediction methods, we use a sliding window approach. That is, the prediction for a particular residue will be based on a description (see below) of that residue and some number of residues on either side of it in the protein sequence. The sequence window,
X = <X_1, X_2, ..., X_{w-1}, X_w>,
is used to describe the residue at the center of a w-residue window. The entry X_i in this vector is a description of the residue at the ith position (along the sequence) in the window.
Our approach outputs a probability vector,
Y = <Y_1, ..., Y_k>
in the k-class problem, where Y_j is the probability that the central residue in sequence window X is in the jth environment of C_3 (3-class problem) or C_8 (8-class problem). (For convenience, we use integers to refer to the environment classes, substituting the integer j for the jth entry in either the 3- or 8-class sets.) Since Y is a probability vector, we have the constraints:
0 ≤ Y_j ≤ 1, 1 ≤ j ≤ k,
and
Σ_{j=1}^{k} Y_j = 1.
Describing residues
As noted above, X_i is a description of the residue at the ith position in the current sequence window. We study two ways to describe residues.
The amino acid identity method sets X_i to the name of the amino acid at position i in the sequence window. Thus, this method of description treats each X_i as a variable with nominal values. For convenience, we let the names of the 20 amino acids plus the end-of-sequence marker be represented by the set A = {1, ..., 21}.
The PSI-BLAST profile method (first successfully applied by Jones [4]) requires that the target protein first be (multiply) aligned with orthologous sequences. X_i is set to the log-odds score vector (over the 20 possible amino acid characters) derived from the multiple alignment column corresponding to position i in the window. This method of description treats each X_i as a 21-dimensional vector of real values, the extra dimension being used to indicate if X_i is off the end of the actual protein sequence (0 for within sequence, 0.5 for outside). The log-odds alignment scores are obtained by running PSI-BLAST against GenBank's standard non-redundant protein sequence database for three iterations. The elements in PSI-BLAST position-specific scoring matrices are divided by 10 so that most values fall between -1.0 and 1.0. The variation we use was successfully applied in the prediction of protein B-factor profiles [19].
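The sliding window input under the amino acid identity description (with the one-hot encoding used later by the neural networks) can be sketched as follows; the function name and alphabet ordering are our own:

```python
def encode_window(sequence, center, w=15, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """One-hot encode a w-residue window centred on `center` (amino acid
    identity method). Positions off either end of the sequence map to a
    21st end-of-sequence symbol."""
    half = w // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0.0] * 21
        if 0 <= pos < len(sequence) and sequence[pos] in alphabet:
            one_hot[alphabet.index(sequence[pos])] = 1.0
        else:
            one_hot[20] = 1.0          # end-of-sequence marker
        vec.extend(one_hot)
    return vec                          # length 21 * w
```

For w = 15 the resulting input vector has 21 · 15 = 315 components; with the PSI-BLAST profile method the one-hot blocks would simply be replaced by the scaled profile columns.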
Models and algorithms
We mainly explore two well-known machine learning methods: naive Bayes and probabilistic neural networks. We also develop and evaluate a cascaded variant of the probabilistic neural network (cf. the layered architecture in [15]). Finally, we construct a cascaded categorical neural network as a representative categorical model.
The Naive Bayes' Density Predictor
The classical Naive Bayes' algorithm assumes (naively) that the input features (X_{i}, 1 ≤ i ≤ w) are independent random variables given the class of the residue. Despite this simplifying assumption, NB classifiers often perform surprisingly well on empirical data [14]. The key step in the NB algorithm is to estimate the classconditional probability of each feature given each class,
p_ij(x) = Pr(X_i = x | C = j), for 1 ≤ i ≤ w and 1 ≤ j ≤ k.
That is, p_ij(x) is the probability that X_i = x given that the class is j. The joint class-conditional probability is computed from these, using the assumption of independence, as

Pr(X | C = j) = ∏_{i=1}^{w} p_ij(X_i).   (1)
The probabilities we are after, the posterior probabilities of the classes given the data, are obtained using Bayes' rule:

Pr(C = j | X) = Pr(X | C = j) Pr(C = j) / Σ_{j'=1}^{k} Pr(X | C = j') Pr(C = j').
NB classifiers are usually trained using examples of X labelled with a single class. We use an extension of the algorithm that allows training using examples labelled with probability vectors, T, describing the posterior probabilities. We call this model a Naive Bayes' Density Predictor (NBDP). As mentioned above, we use two methods for describing the input sequences. We will now describe how we estimate the classconditional probabilities for these two different input feature encoding methods.
With the amino acid identity method, the X_i are nominal random variables that take 21 possible values, x ∈ {1, 2, ..., 21}. Let X_i(E) and T_j(E) be the values of variables X_i and T_j for training example E = <X, T>. The class-conditional probability for each combination of i, j, and x is estimated using the weighted maximum likelihood estimate

p_ij(x) = Σ_{E∈S: X_i(E)=x} T_j(E) / Σ_{E∈S} T_j(E).
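The weighted maximum likelihood estimation for the amino acid identity features can be sketched in code as follows; the pseudocount `alpha` is our assumption, since the paper does not specify whether any smoothing is applied:

```python
def estimate_class_conditionals(examples, w, k, n_values=21, alpha=1e-3):
    """Weighted maximum likelihood estimates of p_ij(x) = Pr(X_i = x | C = j).

    `examples` is a list of (X, T) pairs: X is a length-w tuple of nominal
    values in 0..n_values-1, T a length-k probability vector. Each example
    contributes its class probability T_j as a fractional count.
    """
    counts = [[[alpha] * n_values for _ in range(k)] for _ in range(w)]
    for X, T in examples:
        for i, x in enumerate(X):
            for j in range(k):
                counts[i][j][x] += T[j]
    # normalise over x for each window position i and class j
    return [[[c / sum(row) for c in row] for row in pos] for pos in counts]
```

The returned table is indexed as `p[i][j][x]`, matching p_ij(x) in the text.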
With the PSI-BLAST profile method for encoding sequences, the X_i are themselves vectors of continuous random variables. In particular, each feature vector, X_i, is a vector of 21 real-valued subfeatures. That is, X_i = <v_1, v_2, ..., v_21>. We make a further (naive) assumption of independence: all of the 21·w subfeature random variables are independent given the class. This allows us to simply expand the product in Equation 1 to be over all of the components of all the vectors in X. We then make an assumption that is commonly used with naive Bayes' classifiers when the input features are continuous: that each subfeature, X, is a Gaussian random variable when conditioned on the class [20]. That is,
Pr(X = x | C = j) = g(x; μ, σ), 1 ≤ j ≤ k,
where g() is the Gaussian density function. We estimate the mean, μ, and standard deviation, σ, from the training set data using the weighted maximum likelihood estimates

μ = Σ_{E∈S} T_j(E) X(E) / Σ_{E∈S} T_j(E)

and

σ² = Σ_{E∈S} T_j(E) (X(E) − μ)² / Σ_{E∈S} T_j(E),
where X(E) is defined analogously to X_{i}(E), above.
Probabilistic neural networks
We also explore the use of Probabilistic Neural Networks, for which weight parameters, W, are adapted to produce probability distributions in accordance with the observed data [21]. We use a network with at most one hidden layer. We ask: what is the most likely explanation (in terms of our representation W) of our training set data S? That is, we maximise

Pr(W | S) = Pr(S | W) Pr(W) / Pr(S)
(maximum a posteriori). More specifically, since Pr(W) is the same for all configurations (all weights are initialised using a zero-mean Gaussian) and since Pr(S) does not depend on W, we maximise only the likelihood Pr(S | W). The optimisation is standardly implemented by gradient descent on the relative cross entropy [21].
The softmax output function is used, ensuring that all outputs sum to 1. Neural networks require a lengthy training period and the setting of a learning rate. In preliminary trials we noted that the presentation of 20,000 sequences was sufficient to ensure convergence to a low training error. To achieve a stable descent on the error surface, the learning rate was set to 0.001 (with a rate of 0.01, fluctuations occurred).
With the amino acid identity description method, feature X_i = a is encoded as a 21-dimensional vector, V = <v_1, v_2, ..., v_21>, with v_a = 1 and v_j = 0 for j ≠ a. For the PSI-BLAST profile description method, feature X_i = <v_1, v_2, ..., v_21> is already encoded as a 21-dimensional real-valued vector. For each training example, X, the w encoded feature vectors are concatenated to create a single input vector of length 21·w.
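A minimal sketch of such a network, with a softmax output trained against probabilistic targets, is given below. The tanh hidden units, initialisation scale, and the restriction to the output-layer update are our simplifications, not details from the paper:

```python
import math, random

def softmax(z):
    """Normalise raw outputs into a probability vector."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class PNN:
    """Single-hidden-layer probabilistic neural network (simplified sketch)."""
    def __init__(self, n_in, n_hidden, k, seed=0):
        rng = random.Random(seed)
        # zero-mean Gaussian initialisation, as described in the text
        self.W1 = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.W2 = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(k)]

    def forward(self, x):
        h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.W1]
        y = softmax([sum(w * v for w, v in zip(row, h)) for row in self.W2])
        return y, h

    def train_step(self, x, target, lr=0.001):
        """One gradient-descent step on the cross entropy between the softmax
        output and the probabilistic target; only the output-layer update is
        shown (for softmax + cross entropy, dE/dz_j = y_j - t_j)."""
        y, h = self.forward(x)
        delta = [yj - tj for yj, tj in zip(y, target)]
        for j, row in enumerate(self.W2):
            for m in range(len(row)):
                row[m] -= lr * delta[j] * h[m]
```

Because the target is itself a probability vector rather than a one-hot class label, the same cross-entropy gradient applies unchanged to continuum training data.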
Cascaded probabilistic neural networks
Similar to many successful categorical secondary structure predictors [4,15], we here investigate a layered architecture consisting of two coupled probabilistic neural networks (see Figure 3). The first is a sequence-to-structure network (a PNN as described above); the second is a structure-to-structure network, using consecutive predictions from the first-level network to predict the structure of the middle residue. We use the best PNN as our first-level network. New second-level networks are trained after the first-level networks, using the same learning method and parameters.
Figure 3. The architecture of the Cascaded Probabilistic Neural Network.
Categorical models: Cascaded categorical neural network
To put our work in a broader context, we explore the Cascaded Categorical Neural Network (CCNN), a model representing the state of the art in secondary structure classification [4]. The CCNN is essentially the same neural network model as that employed in PSIPRED [4]. Moreover, with the exception of the transformation of training targets and model outputs explained below, the CCNN is identical to the CPNN.
The probabilistic target data is converted to categorical targets by choosing the majority class (the class with the highest probability):

T_{cat} = argmax_{c} T(c)

The categorical target data is used to train CCNNs.

In order to compare our probabilistic models with categorical prediction methods, we convert their outputs to categorical predictions. The probabilities predicted by the probabilistic models and the CCNN (fitted with the softmax output function) are converted by assigning the class corresponding to the output with the highest activity:

Y_{cat} = argmax_{c} Y(c)
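This conversion is a simple argmax over the output vector; a sketch for the three-class case (the class ordering is an assumption for illustration):

```python
import numpy as np

def to_categorical(probs, classes=("H", "E", "C")):
    """Convert a probability vector over conformational states into a
    categorical label by choosing the class with the highest probability."""
    return classes[int(np.argmax(probs))]
```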
Training and testing the models
For training our models, we created a non-redundant dataset of continuum secondary structure data for 174 protein chains (set174). This dataset was derived from a dataset used by Andersen et al. [2] for studying protein continuum secondary structures. All model design and testing of different model parameters described in this manuscript was done using only this dataset.

During model development, we used cross-validation on the training set to measure the models' predictive accuracy. After all model development was complete, we used two additional datasets for independent validation of the various models. We used the sequences in CAFASP3 [17] to evaluate categorical prediction accuracy. Although none of the CAFASP3 sequences are included in our training set, that set is relatively small, so we chose not to remove sequences that might be homologous to our training set. Instead, we developed an independent set of continuum secondary structures for 286 sequences (set286) for evaluating the accuracy of the models in both the probabilistic and categorical prediction tasks.
Our training dataset (set174) was derived from Andersen and colleagues' [2] dataset containing 210 structurally non-homologous protein chains representing different families defined by FSSP [22]. The continuum secondary structure data in this dataset came from a selection of higher-quality NMR protein chains, and was based on at least 10 NMR models for each chain. We applied Hobohm and Sander's redundancy reduction method [23] to select the largest set of representative chains with sequence identities less than 25% according to CLUSTALW [24]. This resulted in a dataset containing continuum secondary structure data for 174 protein chains comprising approximately 17,000 residues. This dataset was further divided into ten random groups with equal numbers of protein chains for cross-validation tests.
We created a test dataset (set286) designed to have very low homology with our training set and very low redundancy within itself. We based the test dataset on all the NMR protein structures in the Protein Data Bank [25] published since 2003, after the publication of the Andersen and colleagues dataset. To ensure low homology with the training set, we removed all sequences from the test dataset with greater than 25% sequence identity to any sequence in the training set. To reduce the within-set redundancy, we then used Hobohm and Sander's method to select the largest subset of sequences with pairwise identity of no more than 25%. We also removed all sequences shorter than 50 amino acids. These steps yielded a test dataset comprising 286 chains containing a total of 30214 residues. Their continuum secondary structures were obtained from the DSSPcont website [26].
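The redundancy reduction used for both datasets can be sketched as a greedy variant of Hobohm and Sander's selection (in the spirit of their "algorithm 2"): repeatedly discard the sequence with the most neighbours above the identity threshold until no pair exceeds it. The input format (pairwise identities from an all-against-all comparison, e.g. via CLUSTALW) is an assumption.

```python
def reduce_redundancy(names, identity, threshold=0.25):
    """Greedy redundancy reduction sketch. `identity` maps
    frozenset({a, b}) to the pairwise sequence identity of chains a and b;
    chains sharing > threshold identity with a kept chain are pruned."""
    kept = set(names)

    def neighbours(s):
        # Chains still kept that are too similar to s.
        return [t for t in kept
                if t != s and identity.get(frozenset((s, t)), 0.0) > threshold]

    while True:
        worst = max(kept, key=lambda s: len(neighbours(s)))
        if not neighbours(worst):
            return kept          # no remaining pair exceeds the threshold
        kept.discard(worst)      # drop the most-connected chain
```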
All models (including both probabilistic and categorical models) are evaluated in two ways. Firstly, we perform 10fold crossvalidation on the training set (set174). Secondly, we measure the accuracy on the sequences in the test dataset (set286).
We use the Kullback-Leibler (KL) divergence [27,28] to measure the accuracy of our continuum secondary structure predictions. The KL divergence is the standard means of measuring the distance between two probability distributions, and is defined as

D(T || Y) = Σ_{i=1}^{k} T_{i} log(T_{i}/Y_{i})

where k is the number of classes to which an input can belong, T is the target probability vector, and Y is the predicted probability vector. A KL divergence value of 0 indicates perfect agreement between the two distributions; larger values indicate greater divergence between them.
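The measure is straightforward to compute; a small sketch (the `eps` guard against a zero predicted probability is an implementation detail, not part of the definition):

```python
import math

def kl_divergence(T, Y, eps=1e-12):
    """Kullback-Leibler divergence D(T || Y) between target distribution T
    and predicted distribution Y over the k conformational states.
    Zero target terms contribute nothing; eps guards against log(0)."""
    return sum(t * math.log(t / max(y, eps)) for t, y in zip(T, Y) if t > 0)
```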
To evaluate the performance of the categorical versions of each of our models, we use several distinct accuracy metrics: Q_{k}, the correlation coefficient and SOV. Each of these metrics is based on counting the number of times a sample of a known class is assigned to the correct or incorrect class. We use the quantities true positives, tp(C), the number of test samples in class C predicted to be in class C; true negatives, tn(C), the number of test samples not in class C predicted not to be in class C; false negatives, fn(C), the number of test samples in class C predicted not to be in class C; and false positives, fp(C), the number of test samples not in class C predicted to be in class C. The Q_{k} metric defines the accuracy of a k-class model as

Q_{k} = 100 · Σ_{C} tp(C) / N

where the sum runs over the k classes and N is the total number of test samples. The Matthews correlation coefficient is defined as

MCC(C) = [tp(C)·tn(C) − fp(C)·fn(C)] / √[(tp(C) + fp(C))·(tp(C) + fn(C))·(tn(C) + fp(C))·(tn(C) + fn(C))]
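Both metrics can be sketched directly from these counts; the confusion-table representation below is an illustrative choice, not the paper's implementation:

```python
import math

def q_k(confusion):
    """Q_k: percentage of samples assigned to the correct class.
    `confusion[t][p]` counts samples of true class t predicted as class p."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in confusion)
    return 100.0 * correct / total

def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient for one class, from its four counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```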
Finally, we also report SOV [16], a segment-based, standard measure of secondary structure prediction accuracy designed to capture the "usefulness" of the predictions. We used software provided by the authors to compute SOV, and refer the reader to their paper for the details of its definition.
For cross-validation, the training dataset is divided randomly into ten roughly equal-sized subsets; in each of ten separate training sessions, one subset is held out for testing, ensuring that each sample appears as a test case exactly once. For each run of cross-validation, we compute the mean KL divergence (averaged over every residue in the test subsets) and a single value for each categorical accuracy metric. To measure the standard error of the categorical metrics, we average their values over five independent cross-validation runs. The standard error is the sample standard deviation divided by the square root of the number of samples (five in this case).
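The standard error computation is simple but worth pinning down, since it uses the sample (n − 1) standard deviation:

```python
import math

def standard_error(values):
    """Standard error of the mean: the sample standard deviation divided by
    the square root of the number of samples (here, cross-validation runs)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sd / math.sqrt(n)
```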
Predictive accuracy on the test dataset (set286) is measured using the models created and trained on the training dataset during the cross-validation runs. Each categorical accuracy measure is computed independently for each model and then averaged. We report the mean KL divergence, averaged over all residues in the test dataset. The KL divergence for a single residue is computed by first averaging the predictions of all models (of a given type) for that residue, and then computing the KL divergence between the averaged prediction and the target distribution.
We define the target entropy as

H(T) = −Σ_{i=1}^{k} T_{i} log_{k} T_{i}

Similarly, the predicted entropy is based on the model's output:

H(Y) = −Σ_{i=1}^{k} Y_{i} log_{k} Y_{i}

Taking logarithms to base k normalises the entropy so that its maximum value is 1; high entropy means high variability.
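A sketch of this normalised entropy, using base-k logarithms so that the uniform distribution scores exactly 1:

```python
import math

def normalised_entropy(p):
    """Entropy of a distribution over k states, with base-k logarithms so
    the maximum (for a uniform distribution) is 1; zero terms are skipped."""
    k = len(p)
    return -sum(x * math.log(x, k) for x in p if x > 0)
```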
Authors' contributions
MB designed, programmed, performed and analysed most simulations, and drafted an early version of the manuscript. ZY initiated the project, participated in its design and prepared the datasets. TLB contributed technical and methodological ideas, developed the NBDP, and wrote and revised significant portions of the manuscript. All authors read and approved the final manuscript.
Acknowledgements
MB and ZY are supported by a UQ Early Career Researcher grant. TLB acknowledges the ARC Centre for Bioinformatics. The authors also wish to acknowledge the Queensland Parallel Supercomputing Foundation for providing facilities to run extensive simulations. Thanks to Shane Zhang for help in preparing the datasets. Thanks to Thomas Huber for reviewing the Background section of the manuscript. We would also like to thank the reviewers for many helpful comments.
References

Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22:2577-2637.

Andersen CAF, Palmer AG, Brunak S, Rost B: Continuum secondary structure captures protein flexibility. Structure 2002, 10:175-184.

Carter P, Andersen CAF, Rost B: DSSPcont: continuous secondary structure assignments for proteins. Nucleic Acids Research 2003, 31(13):3293-3295.

Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology 1999, 292:195-202.

Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Genetics 2002, 47:228-235.

Nordahl Petersen T, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O: Prediction of protein secondary structure at 80% accuracy. Proteins: Structure, Function, and Genetics 2000, 41:17-20.

Rost B: Protein secondary structure prediction continues to rise. Journal of Structural Biology 2001, 134(2-3):204-218.

Hua S, Sun Z: A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. Journal of Molecular Biology 2001, 308(2):397-407.

Ward JJ, McGuffin LJ, Buxton BF, Jones DT: Secondary structure prediction with support vector machines. Bioinformatics 2003, 19(13):1650-1655.

Solis AD, Rackovsky S: On the use of secondary structure in protein structure prediction: a bioinformatic analysis. Polymer 2004, 45(2):525-546.

Guermeur Y, Pollastri G, Elisseeff A, Zelus D, Paugam-Moisy H, Baldi P: Combining protein secondary structure prediction models with ensemble methods of optimal complexity. Neurocomputing 2004, 56:305-327.

Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins: Structure, Function, and Genetics 1995, 23(4):566-579.

Kihara D: The effect of long-range interactions on the secondary structure formation of proteins. Protein Science 2005, 14(8):1955-1963.

Domingos P, Pazzani M: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 1997, 29(2-3):103-130.

Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology 1993, 232:584-599.

Zemla A, Venclovas C, Fidelis K, Rost B: A modified definition of SOV, a segment-based measure for protein secondary structure prediction assessment. Proteins 1999, 34:220-223.

Eyrich VA, Przybylski D, Koh IYY, Grana O, Pazos F, Valencia A, Rost B: CAFASP3 in the spotlight of EVA. Proteins: Structure, Function, and Genetics 2003, 53:548-560.

Young M, Kirshenbaum K, Dill K, Highsmith S: Predicting conformational switches in proteins. Protein Science 1999, 8(9):1752-1764.

Yuan Z, Bailey TL, Teasdale R: Prediction of protein B-factor profiles. Proteins: Structure, Function, and Bioinformatics 2005, 58(4):905-912.

John GH, Langley P: Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann Publishers; 1995.

Jordan MI, Bishop C: Neural networks. In CRC Handbook of Computer Science. Edited by Tucker AB. Boca Raton, FL: CRC Press; 1997.

Holm L, Sander C: Touring protein fold space with Dali/FSSP. Nucleic Acids Research 1998, 26:318-321.

Hobohm U, Scharf M, Schneider R, Sander C: Selection of representative protein data sets. Protein Science 1992, 1:409-417.

Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P: The Protein Data Bank. Nucleic Acids Research 2000, 28:235-242.

DSSPcont [http://cubic.bioc.columbia.edu/services/DSSPcont]

Baldi P, Brunak S, Chauvin Y, Andersen CAF, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412-424.

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis. Cambridge, England: Cambridge University Press; 1998.