State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, Peking University, Beijing, 100871, China

Center for Theoretical Biology, Peking University, Beijing, 100871, China

Department of Mathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA

Abstract

Background

Protein secondary structure prediction methods based on probabilistic models such as the hidden Markov model (HMM) appeal to many because they provide meaningful information about the sequence-structure relationship. At present, however, the prediction accuracy of pure HMM-type methods is much lower than that of machine-learning methods such as neural networks (NN) or support vector machines (SVM).

Results

In this paper, we report a new probabilistic method for protein secondary structure prediction, based on dynamic Bayesian networks (DBN). The new method models the PSI-BLAST profile of a protein sequence with a multivariate Gaussian distribution, and simultaneously takes into account the dependency between the profile and the secondary structure and the dependency between profiles of neighboring residues. In addition, a segment length distribution is introduced for each secondary structure state. Tests show that the DBN method achieves a significant improvement in accuracy over other pure HMM-type methods. Further improvement is obtained by combining the DBN with an NN, a method called DBNN, which shows better Q_{3} accuracy than many popular methods and is competitive with the current state of the art. The most interesting feature of DBN/DBNN is that a significant further improvement in prediction accuracy is achieved when it is combined with other methods by a simple consensus.

Conclusion

The DBN method, using a Gaussian distribution for the PSI-BLAST profile and a high-order dependency between profiles of neighboring residues, produces significantly better prediction accuracy than other HMM-type probabilistic methods. Owing to their different natures, the DBN and the NN combine to form a more accurate method, DBNN. Future improvement may be achieved by combining DBNN with a method of the SVM type.

Background

Over the past decades, the prediction accuracy of protein secondary structure has gradually improved, largely due to the successful application of machine-learning tools such as the neural network (NN) and the support vector machine (SVM). Qian and Sejnowski designed one of the earliest NN methods; today, the Q_{3} accuracy of a modern NN- or SVM-based method can reach over 76%.

In contrast to NN and SVM, probabilistic methods for protein secondary structure prediction, such as those based on the hidden Markov model (HMM), have had very limited accuracy.

It would be interesting to break this apparent asymmetry in accuracy between machine-learning methods and probabilistic-model methods. A probabilistic model is of a somewhat different nature from machine-learning tools and provides a complement to them. Thus, combining the two kinds of model is likely to produce a consensus prediction with better accuracy than that of either individual program.

In this paper we introduce a new probabilistic model, the dynamic Bayesian network (DBN), for protein secondary structure prediction. A DBN is a directed graphical model of a stochastic process, often regarded as a generalized HMM capable of describing correlation structure in a more flexible way. Our DBN, and its combination with a neural network (DBNN), show better Q_{3} accuracy than many other popular methods and are competitive with the current state of the art. The most interesting feature of DBN/DBNN is that a significant further improvement in prediction accuracy is achieved when they are combined with other methods by a simple consensus.

Results and Discussion

Training and testing datasets

Three public datasets are employed for training and testing, including CB513.

Furthermore, we have built a fourth dataset based on the known tertiary structural similarity from the SCOP database.

For all the datasets described above, the secondary structure is assigned by the DSSP program.

Window sizes

The window sizes, denoted L_{AA} and L_{SS} for profile and secondary structure respectively, describe the range of dependency of the current site on its neighbors. The correlation between the Q_{3} accuracy of DBN and the window sizes is studied via a set of seven-fold cross-validation tests of DBN_{sigmoid} (see Methods) on SD576 with different window sizes. Due to limitations in computational resources, the upper bounds of L_{AA} and L_{SS} are set to 5 and 4, respectively.

As the figure shows, Q_{3} improves significantly when L_{SS} > 0 and saturates when L_{SS} > 1, which indicates a strong short-range dependency between the profile of a residue and the secondary structure states of its neighbors. A similar phenomenon occurs for the dependency between profiles of neighboring sites. Note that a model with either L_{AA} = 0 or L_{SS} = 0 is a special case of DBN, in which the distribution of the profile of each residue is independent of neighboring profiles or of neighboring secondary structure states, respectively. As a result, its topology differs from that of the full DBN (L_{AA} > 0 and L_{SS} > 0) by the removal of the corresponding auxiliary nodes (see the illustration of the DBN model).

The influence of window sizes on the Q_{3} of DBN

**The influence of window sizes on the Q_{3} of DBN**.

Illustration of the DBN model

**Illustration of the DBN model**. (a) An example of a PSSM, where rows represent residue sites and columns represent amino acids. The "SS" column contains the secondary structure of each site, classified as H (helix), E (sheet), and C (coil). (b) A graphical representation of the DBN. Shaded nodes represent observable random variables, while clear nodes represent variables that are hidden during prediction. The arcs with arrows represent dependencies between nodes. The contents of the composite nodes are derived as illustrated by the dashed-line connections, where the subscript indicates the residue site; a more detailed description of all the nodes can be found in the text. L_{AA} and L_{SS} are the window sizes for profile and secondary structure, respectively (in this example, L_{AA} = 4 and L_{SS} = 2). (c) A reduced version of (b) with L_{AA} = 0 and L_{SS} = 0.

Our results are in partial agreement with the conclusions of Crooks and Brenner, who claimed that each amino acid is dependent on the neighboring secondary structure states but essentially independent of neighboring amino acids.

The tests suggest that the best model is (L_{AA} = 4, L_{SS} = 4), for which Q_{3} reaches about 77.5%. However, this model proves very time-consuming, so we choose the more economical setting (L_{AA} = 4, L_{SS} = 3), which offers a similar Q_{3}.

The accuracy improvements through combinations

All the basic DBN- and NN-based models described in Methods are tested on the SD576 dataset, and the results are shown in the table below. DBN_{linear} (the combination of DBN_{linear+NC} and DBN_{linear+CN}) and DBN_{sigmoid} (the combination of DBN_{sigmoid+NC} and DBN_{sigmoid+CN}) significantly improve the performance in all measures, indicating that the two directions of the sequence (i.e. from N-terminus to C-terminus and the reverse) contain complementary information. In addition, the combination of the two PSSM-transformation strategies (i.e. the combination of DBN_{linear} and DBN_{sigmoid} to produce DBN_{final}) also contributes to the accuracy, increasing both Q_{3} and SOV.

Performance of basic DBN and NN models and their combinations tested on SD576.

| Model | Q_{3} (%) | SOV (%) | C_{H} | C_{E} | C_{C} |
|---|---|---|---|---|---|
| DBN_{linear+NC} | 75.1 | 74.0 | 0.69 | 0.60 | 0.55 |
| DBN_{linear+CN} | 74.6 | 73.3 | 0.68 | 0.61 | 0.53 |
| DBN_{linear} | 77.0 | 75.8 | 0.72 | 0.64 | 0.58 |
| DBN_{sigmoid+NC} | 75.8 | 74.5 | 0.72 | 0.60 | 0.56 |
| DBN_{sigmoid+CN} | 74.6 | 73.3 | 0.69 | 0.61 | 0.54 |
| DBN_{sigmoid} | 77.4 | 75.9 | 0.74 | 0.64 | 0.59 |
| DBN_{final} | 78.2 | 76.8 | 0.74 | 0.65 | 0.60 |
| NN_{linear} | 77.6 | 73.2 | 0.72 | 0.64 | 0.60 |
| NN_{sigmoid} | 77.1 | 71.0 | 0.72 | 0.63 | 0.59 |
| NN_{final} | 77.8 | 73.3 | 0.73 | 0.64 | 0.60 |
| DBNN | 80.0 | 78.1 | 0.77 | 0.68 | 0.63 |

All eleven models listed in the table are described in Methods. The average results of seven-fold cross-validation are shown.

The table also shows that DBN_{final} has improved by 3.5% over NN_{final} in SOV.

Finally, the combination of all the basic DBN- and NN-based models, which produces the resulting DBNN, achieves a further improvement in accuracy, increasing both Q_{3} and SOV over DBN_{final} (see the table).

Secondary structure segment length distributions

To study the significance of the secondary structure segment length distributions introduced in the DBN models, we define a degenerate DBN (denoted DBN_{geo}), which has the same structure as DBN_{final} except that D_{max} = 1 [see Eq. (10)]. As described in Methods, D_{max} = 1 implies a geometric distribution for the segment lengths. The segment length distributions of the secondary structure predicted by both DBN_{final} and DBN_{geo} are calculated and compared to the true distributions observed in the SD576 dataset, as shown in the figure below. The shortest segments are predicted poorly by both DBN_{final} and DBN_{geo}, but longer segments are predicted correctly by both models. Generally speaking, DBN_{final} performs better than DBN_{geo}: its prediction for segments of 3 and 5–7 residues is much better than that of DBN_{geo}.

Segment length distributions of helices, sheets, and coils

**Segment length distributions of helices, sheets, and coils**. (a) The observed distributions calculated directly from the SD576 dataset. The inset shows lin-log plots of the distributions, where the lines are fitted exponential tails for the three types of secondary structure segments. (b) The comparison between the distribution of helices observed in the dataset and those predicted by DBN_{final} and DBN_{geo}. (c) The same comparison between observation and prediction for sheets. (d) The same comparison between observation and prediction for coils.

For sheets, both DBN_{final} and DBN_{geo} miss the rich population of one-residue segments and over-predict segments of 3–5 residues. DBN_{geo} predicts a spurious peak at segments of 3 and 4 residues, which is absent in the true distribution. In contrast, DBN_{final} gives a distribution closer to the observation, with the peak located at segments of about 5 residues. For coils, DBN_{final} and DBN_{geo} perform very similarly: both under-predict segments of 1 and 2 residues and over-predict those of 3 and 4 residues. However, DBN_{final} predicts a much better distribution for long coils (over 8 residues) than DBN_{geo}.
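The comparisons above rest on segment length histograms, which are straightforward to extract from a three-state string; a minimal sketch (the helper name `segment_length_counts` is ours, not the paper's):

```python
from itertools import groupby
from collections import Counter

def segment_length_counts(ss):
    """Count segment lengths per state in a three-state string over H/E/C."""
    counts = {"H": Counter(), "E": Counter(), "C": Counter()}
    for state, run in groupby(ss):
        counts[state][len(list(run))] += 1
    return counts

# H-segments of lengths 4 and 3, one E-segment of 4, two C-segments of 2
counts = segment_length_counts("HHHHCCEEEECCHHH")
```

Normalizing each `Counter` by its total gives the length distribution compared against the observed one in the figure.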

It is interesting to ask whether we can modify the segment length distribution f_{α}(n) so as to further improve DBN_{final}. The modified model, denoted DBN_{mod}, is constructed as follows: take the corrected distribution

f_{α}^{new}(n) = f_{α}^{old}(n) · f_{α}^{obs}(n) / f_{α}^{pre}(n),

where f_{α}^{old}(n) is the distribution originally used by DBN_{final}, f_{α}^{pre}(n) is the distribution predicted by DBN_{final}, f_{α}^{obs}(n) is the distribution observed in the dataset, and n runs from 1 to D_{max}. The quantity f_{α}^{new}(n) is then normalized and used as the segment length distribution of DBN_{mod}. The three models, DBN_{final}, DBN_{geo}, and DBN_{mod}, are tested on SD576, and the performance on segment length distribution prediction is measured by the "relative entropy"

D_{α} = Σ_{n=1}^{D_{max}} f_{α}^{obs}(n) log_{2} [ f_{α}^{obs}(n) / f_{α}^{pre}(n) ],

where f_{α}^{obs}(n), f_{α}^{pre}(n), and D_{max} have the same definitions as above.

The results presented in the table below show that DBN_{geo} has much higher relative entropies than the other two models, indicating a strong deviation of its predicted distributions from the observed ones. Note that the Q_{3} and SOV of DBN_{geo} are also much lower than those of DBN_{final}. DBN_{mod} shows the lowest relative entropies for all three secondary structure states, with almost the same Q_{3} and SOV as DBN_{final}.
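The relative entropy above is the Kullback-Leibler divergence in bits between the observed and predicted length distributions; a sketch with illustrative distributions:

```python
import math

def relative_entropy_bits(obs, pre):
    """KL divergence D(obs || pre) in bits. obs and pre map segment
    length -> probability; lengths with zero mass in either distribution
    contribute nothing and are skipped."""
    total = 0.0
    for n, p in obs.items():
        q = pre.get(n, 0.0)
        if p > 0 and q > 0:
            total += p * math.log2(p / q)
    return total

# identical distributions have zero divergence
d = {1: 0.5, 2: 0.5}
assert relative_entropy_bits(d, d) == 0.0
```

A prediction that shifts mass away from the observed lengths raises the divergence, matching the ranking DBN_{geo} > DBN_{final} > DBN_{mod} reported in the table.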

Performance of DBN_{geo}, DBN_{final}, and DBN_{mod} tested on SD576.

| Model | Q_{3} (%) | SOV (%) | Helix | Sheet | Coil | Average |
|---|---|---|---|---|---|---|
| DBN_{geo} | 76.7 | 74.3 | 0.247 | 0.170 | 0.290 | 0.236 |
| DBN_{final} | 78.2 | 76.8 | 0.236 | 0.096 | 0.210 | 0.181 |
| DBN_{mod} | 78.2 | 76.3 | 0.214 | 0.038 | 0.110 | 0.121 |

The seven-fold cross-validation results for the three models with different segment length distributions are explained in the text. The performance is measured by Q_{3}, SOV, and the relative entropy (in bits) between observed and predicted segment length distributions for helix, sheet, and coil, together with their average; DBN_{final} and DBN_{mod} show visible improvement over DBN_{geo}.

Comparison between DBN and leading HMM-type methods

The DBN method (DBN_{final}) developed in this work is also evaluated on the widely used CB513 dataset, and its performance is compared to two recently published HMM-type methods, denoted here HMMCrooks and HMMChu. As the table below shows, DBN_{final} improves on both methods in all measures. Specifically, DBN_{final} improves Q_{3} by 3.5% over HMMCrooks and 4.1% over HMMChu, and improves SOV and the correlation coefficients as well. DBN_{final} is particularly good at the prediction of helices and sheets compared to these two methods.

Comparative performance of DBN_{final} and DBN_{diag} against leading HMM-type methods tested on CB513.

| Method | Q_{3} (%) | SOV (%) | C_{H} | C_{E} | C_{C} |
|---|---|---|---|---|---|
| HMMCrooks | 72.8 | -- | -- | -- | -- |
| HMMChu | 72.2 | 68.3 | 0.61 | 0.52 | 0.51 |
| DBN_{diag}/ErrSig | 72.5/0.42 | 65.9/0.63 | 0.66/0.01 | 0.55/0.01 | 0.51/0.01 |
| DBN_{final}/ErrSig | 76.3/0.41 | 72.7/0.63 | 0.71/0.01 | 0.61/0.01 | 0.57/0.01 |

DBN_{final} and DBN_{diag} are methods developed in this work, and their descriptions can be found in the text. Entries marked "--" could not be obtained from the literature. HMMChu has been trained and tested on the CB480 dataset (a reduced version of CB513), while all other methods have been trained and tested on the CB513 dataset. The average results of seven-fold cross-validation are shown.

The improvements made by DBN_{final} are believed to be mainly due to the use of a conditional linear Gaussian distribution to model the PSI-BLAST profile of each residue, in which the correlation between the 20 entries of the profile is taken into account (see Methods). In contrast, both HMMCrooks and HMMChu employ a multinomial distribution to model the profile, which lacks this correlation information. To verify this, we built a reduced model (denoted DBN_{diag}) that has an architecture similar to DBN_{final} but uses only a diagonal covariance matrix for the profile distribution [Eq. (7)], so that the correlation between entries of the profile is ignored. We have tested this model on the CB513 dataset, and the results (see the table) show that the Q_{3} of DBN_{diag} drops to 72.5%, similar to those of HMMCrooks and HMMChu, which highlights the importance of the non-diagonal entries of the covariance matrix.

Comparison between DBNN and other popular methods

CB513 dataset

The best model developed in this work, DBNN, is then tested on the CB513 dataset and compared to other popular methods, namely SVM, PMSVM, SVMpsi, JNET, YASSPP, and SPINE (see the table below). DBNN shows the highest Q_{3} accuracy among all the methods mentioned, with improvements ranging from 0.3% to 4.6%. Since the ErrSig is 0.41/0.40, the improvement made by DBNN is significant with respect to all methods except YASSPP. In addition, DBNN has a better C_{H}, while YASSPP has a better C_{C}.

Comparative performance of DBNN against other popular methods tested on CB513.

| Method | Q_{3} (%) | SOV (%) | C_{H} | C_{E} | C_{C} |
|---|---|---|---|---|---|
| SVM | 73.5 | -- | 0.65 | 0.53 | 0.54 |
| PMSVM | 75.2 | -- | 0.71 | 0.61 | 0.61 |
| SVMpsi | 76.6 | 73.5 | 0.68 | 0.60 | 0.56 |
| JNET | 76.9 | -- | -- | -- | -- |
| YASSPP | 77.8 | 75.1 | 0.58 | 0.64 | 0.71 |
| ^{†}SPINE | 76.8 | -- | -- | -- | -- |
| DBNN/ErrSig | 78.1/0.41 | 74.0/0.62 | 0.74/0.01 | 0.64/0.01 | 0.60/0.01 |
| ^{†}DBNN/ErrSig | 78.0/0.40 | 74.0/0.62 | 0.74/0.01 | 0.64/0.01 | 0.60/0.01 |

The description of DBNN can be found in Methods. Entries marked "--" could not be obtained from the literature. JNET has been trained and tested on the CB480 dataset (a reduced version of CB513), while all other methods have been trained and tested on the CB513 dataset. Methods marked with "†" have been evaluated using ten-fold cross-validation, while the others have been evaluated using seven-fold cross-validation.

EVA dataset

DBNN is also compared to some live prediction servers by using the EVAc6 dataset and the EVA website. The methods selected for comparison are Prospect, PROF_king, SAM-T99, PROFsec, PHDpsi, and PSIPRED; since these methods have been evaluated on different subsets of EVAc6, each comparison is carried out on the largest set of common sequences (see the table below).

Comparative performance of DBNN and consensus methods against other leading methods tested on EVAc6.

| Method | Q_{3} (%) | SOV (%) | C_{H} | C_{E} | C_{C} |
|---|---|---|---|---|---|
| **Subset 1 (80 chains)** | | | | | |
| Prospect | 71.1 | 68.7 | 0.59 | 0.69 | 0.49 |
| DBNN/ErrSig | 78.8/1.34 | 74.8/1.74 | 0.72/0.03 | 0.64/0.04 | 0.62/0.02 |
| **Subset 2 (175 chains)** | | | | | |
| PROF_king | 71.7 | 66.9 | 0.62 | 0.68 | 0.49 |
| DBNN/ErrSig | 77.3/0.86 | 71.9/1.27 | 0.71/0.02 | 0.64/0.03 | 0.57/0.02 |
| **Subset 3 (179 chains)** | | | | | |
| SAM-T99 | 77.1 | 74.4 | 0.66 | 0.68 | 0.53 |
| DBNN/ErrSig | 77.3/0.86 | 71.9/1.28 | 0.71/0.02 | 0.64/0.02 | 0.57/0.02 |
| **Subset 4 (212 chains)** | | | | | |
| PSIPRED | 77.8 | 75.4 | 0.69 | 0.74 | 0.56 |
| PROFsec | 76.7 | 74.8 | 0.68 | 0.72 | 0.56 |
| PHDpsi | 75.0 | 70.9 | 0.66 | 0.69 | 0.53 |
| DBNN/ErrSig | 77.8/0.79 | 72.4/1.16 | 0.71/0.02 | 0.65/0.02 | 0.58/0.01 |
| **Subset 5 (73 chains)** | | | | | |
| SAM-T99 | 76.3 | 72.9 | 0.71 | 0.64 | 0.56 |
| PSIPRED | 75.8 | 72.1 | 0.70 | 0.64 | 0.57 |
| PROFsec | 75.3 | 73.0 | 0.68 | 0.61 | 0.54 |
| PHDpsi | 73.3 | 69.2 | 0.66 | 0.56 | 0.52 |
| PROF_king | 70.7 | 64.9 | 0.63 | 0.57 | 0.50 |
| DBNN/ErrSig | 76.4/1.48 | 72.4/2.06 | 0.73/0.04 | 0.67/0.04 | 0.59/0.03 |
| CM1/ErrSig | 77.2/1.14 | 73.2/1.87 | 0.73/0.04 | 0.66/0.04 | 0.58/0.02 |
| CM2/ErrSig | 77.7/1.17 | 73.4/1.78 | 0.74/0.04 | 0.67/0.04 | 0.60/0.02 |
| CM3/ErrSig | 78.1/1.17 | 74.4/1.76 | 0.75/0.04 | 0.67/0.04 | 0.60/0.02 |

DBNN and the three consensus methods (CM1, CM2, and CM3) developed in this work are compared with other leading methods on five subsets of EVAc6; each comparison is carried out on the maximum number of common sequences. The results of the six existing methods (Prospect, PROF_king, SAM-T99, PROFsec, PHDpsi, and PSIPRED) are obtained directly from the EVA website.

The table shows that, on every subset, DBNN achieves a Q_{3} equal to or higher than that of every other existing method. In addition, the ErrSigs indicate that the improvements made by DBNN over Prospect, PROF_king, and PHDpsi are significant. DBNN also achieves the highest C_{H} among all the existing methods.

Significance of the differences in Q_{3} and SOV between pairs of methods tested on subset 5 of EVAc6.

Each entry is the calculated significance score of method Y (row) against method X (column); positive values indicate that Y outperforms X.

Q_{3}:

| Method Y \ Method X | PROF_king | SAM-T99 | PSIPRED | PROFsec | PHDpsi | DBNN | CM1 | CM2 | CM3 |
|---|---|---|---|---|---|---|---|---|---|
| PROF_king | -- | -4.70 | -3.99 | -3.56 | -1.88 | -4.52 | -6.19 | -6.93 | -6.88 |
| SAM-T99 | 4.70 | -- | 0.50 | 0.93 | 2.45 | -0.16 | -1.41 | -2.09 | -3.02 |
| PSIPRED | 3.99 | -0.50 | -- | 0.53 | 2.18 | -0.63 | -2.01 | -2.62 | -3.38 |
| PROFsec | 3.56 | -0.93 | -0.53 | -- | 2.31 | -0.94 | -2.87 | -3.22 | -3.72 |
| PHDpsi | 1.88 | -2.45 | -2.18 | -2.31 | -- | -2.48 | -4.55 | -5.11 | -5.10 |
| DBNN | 4.52 | 0.16 | 0.63 | 0.94 | 2.48 | -- | -0.91 | -1.61 | -2.50 |
| CM1 | 6.19 | 1.41 | 2.01 | 2.87 | 4.55 | 0.91 | -- | -1.65 | -2.82 |
| CM2 | 6.93 | 2.09 | 2.62 | 3.22 | 5.11 | 1.61 | 1.65 | -- | -1.48 |
| CM3 | 6.88 | 3.02 | 3.38 | 3.72 | 5.10 | 2.50 | 2.82 | 1.48 | -- |

SOV:

| Method Y \ Method X | PROF_king | SAM-T99 | PSIPRED | PROFsec | PHDpsi | DBNN | CM1 | CM2 | CM3 |
|---|---|---|---|---|---|---|---|---|---|
| PROF_king | -- | -4.05 | -3.89 | -3.80 | -1.99 | -3.69 | -5.30 | -5.66 | -5.86 |
| SAM-T99 | 4.05 | -- | 0.54 | -0.06 | 2.43 | 0.36 | -0.20 | -0.35 | -1.21 |
| PSIPRED | 3.89 | -0.54 | -- | -0.62 | 1.77 | -0.19 | -0.97 | -1.22 | -2.57 |
| PROFsec | 3.80 | 0.06 | 0.62 | -- | 2.93 | 0.37 | -0.15 | -0.28 | -1.12 |
| PHDpsi | 1.99 | -2.43 | -1.77 | -2.93 | -- | -1.67 | -3.30 | -3.30 | -3.82 |
| DBNN | 3.69 | -0.36 | 0.19 | -0.37 | 1.67 | -- | -0.58 | -0.83 | -1.83 |
| CM1 | 5.30 | 0.20 | 0.97 | 0.15 | 3.30 | 0.58 | -- | -0.27 | -2.03 |
| CM2 | 5.66 | 0.35 | 1.22 | 0.28 | 3.30 | 0.83 | 0.27 | -- | -2.55 |
| CM3 | 5.86 | 1.21 | 2.57 | 1.12 | 3.82 | 1.83 | 2.03 | 2.55 | -- |

All the above evaluation work suggests that the prediction accuracy of any individual program is reaching a limit, with no Q_{3} better than 78% (see the tables). To go further, we construct a series of consensus methods by a simple weighted vote: CM1 combines the five existing methods available on subset 5 of EVAc6 (PROF_king, SAM-T99, PSIPRED, PROFsec, and PHDpsi); CM2 is CM1 plus DBN_{final}; and CM3 is the same as CM2 except that DBNN takes the place of DBN_{final}. The weight for the vote of each method is set to be the success rate of the method for each type of secondary structure, derived from an individual evaluation of its own. The CM-series are evaluated on subset 5 of EVAc6. The results shown in the tables indicate that the consensus methods improve both Q_{3} and SOV.
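The weighted vote described above can be sketched as follows; the predictions and per-state weights here are hypothetical, not taken from the paper:

```python
def consensus_vote(predictions, weights):
    """Weighted per-residue vote over three states. predictions: list of
    secondary structure strings, one per method; weights[m][s]: success
    rate of method m for state s. Ties resolve in H, E, C order."""
    result = []
    for i in range(len(predictions[0])):
        score = {"H": 0.0, "E": 0.0, "C": 0.0}
        for m, pred in enumerate(predictions):
            s = pred[i]
            score[s] += weights[m][s]   # each method votes with its weight
        result.append(max(score, key=score.get))
    return "".join(result)

# three hypothetical methods voting on a three-residue fragment
preds = ["HHC", "HEC", "EEC"]
w = [{"H": 0.8, "E": 0.6, "C": 0.7},
     {"H": 0.7, "E": 0.7, "C": 0.6},
     {"H": 0.6, "E": 0.8, "C": 0.7}]
```

At the second position the two E votes (0.7 + 0.8) outweigh the single H vote (0.8), so the consensus flips to E.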

By itself, the consensus of existing methods contributes only a modest improvement in Q_{3} to CM1. On the other hand, the inclusion of DBN or DBNN (in CM2 and CM3, respectively) gives rise to a significantly better Q_{3} than all individual methods, including SAM-T99. This is further confirmed by a direct comparison between CM3 and CM1, in which significant improvements in both Q_{3} and SOV are observed.

Conclusion

A new probabilistic method for protein secondary structure prediction, based on dynamic Bayesian networks, is developed and evaluated by several measures; it shows significantly better prediction accuracy than previous pure HMM-type methods such as HMMCrooks and HMMChu. The improvement is mainly due to the use of a multivariate Gaussian distribution for the PSI-BLAST profile of each residue and to the explicit modeling of the dependency between profiles of neighboring residues. In addition, owing to the introduction of secondary structure segment length distributions into the model, DBN reproduces the observed segment length distributions much better than a model restricted to geometric lengths.

The essentially different natures of DBN and NN inspire a model that combines the two, forming DBNN, with significant further improvements in both Q_{3} and SOV.

An interesting feature of our work, compared to NN or SVM, is that it provides a set of distributions that have specific meanings and that can be studied further to improve our understanding of the model's behavior behind the prediction. An example is provided by the secondary structure segment length distributions used by the DBN, which are modeled explicitly up to a maximum length D_{max} and by a geometric tail beyond it.

It appears that the limits of secondary structure prediction are being reached, as no new method over the past decade has shown any major improvement since PSIPRED. All of the top methods are between 77% and 80% accurate in terms of Q_{3}, depending on the data set used. This implies that the complexity of the sequence-structure relationship is such that any single tool, when it attempts to extract (during learning) and extrapolate (during prediction) that knowledge, can represent only some facets of the relationship, not the whole. Further hope lies in the possibility that more facets are covered by new models, and that new models are integrated with existing ones. The consensus methods reported above are a simple step in that direction; more sophisticated strategies for combining multiple scores can be sought in the future.

Methods

Generation of the PSI-BLAST profile

Each protein sequence in the datasets described above is used as a query to search against the NR database with PSI-BLAST, and the resulting position-specific scoring matrix (PSSM) is taken as the profile of the sequence.

Transformation of the PSSM

Similar to other secondary structure prediction methods, we transform the raw PSSM scores into the range [0, 1] before feeding them to the models. Two transformation functions are used: one maps the scores linearly onto [0, 1] [Eq. (3)] and is referred to as the "linear transformation"; the other passes each score through the logistic function g(x) = 1/(1 + e^{-x}) [Eq. (4)] and is referred to as the "sigmoid transformation".
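The two transformations can be sketched as below. The logistic form of the sigmoid transformation is standard; the clamp bounds of the linear transformation are an illustrative assumption, since the exact constants of Eq. (3) are not reproduced here:

```python
import math

def sigmoid_transform(score):
    """Sigmoid transformation of a raw PSSM log-odds score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def linear_transform(score, lo=-5.0, hi=5.0):
    """Linear transformation onto [0, 1]. The clamp bounds lo/hi are an
    illustrative choice, not necessarily the constants of Eq. (3)."""
    if score <= lo:
        return 0.0
    if score >= hi:
        return 1.0
    return (score - lo) / (hi - lo)
```

Both maps are monotone, so they preserve the ranking of amino acid preferences at each site while differing in how strongly extreme scores are compressed.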

Assessment of the prediction accuracy

Several measures are adopted to assess the performance of our methods in a comprehensive way. The first is the overall three-state prediction accuracy, Q_{3}, defined by

Q_{3} = 100 · (number of correctly predicted residues) / (total number of residues).

The second is the per-state Matthews correlation coefficient,

C_{α} = (p_{α}n_{α} - u_{α}o_{α}) / sqrt[(p_{α}+u_{α})(p_{α}+o_{α})(n_{α}+u_{α})(n_{α}+o_{α})],

where p_{α} is the number of residues correctly predicted to be secondary structure of class α, n_{α} is the number of residues correctly not predicted to be secondary structure of class α, u_{α} is the number of residues observed but not predicted to be secondary structure of class α, and o_{α} is the number of residues predicted but not observed to be secondary structure of class α. In addition, the segment overlap measure (SOV) and the significant error margin ErrSig (the standard deviation divided by the square root of the number of proteins) are reported.
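The two measures above can be computed directly from aligned predicted and observed strings; a minimal sketch:

```python
import math

def q3(pred, obs):
    """Overall three-state accuracy, in percent."""
    correct = sum(p == o for p, o in zip(pred, obs))
    return 100.0 * correct / len(obs)

def matthews(pred, obs, state):
    """Matthews correlation coefficient for one state (H, E, or C),
    returning 0.0 when the denominator vanishes."""
    tp = sum(p == state and o == state for p, o in zip(pred, obs))
    tn = sum(p != state and o != state for p, o in zip(pred, obs))
    fp = sum(p == state and o != state for p, o in zip(pred, obs))
    fn = sum(p != state and o == state for p, o in zip(pred, obs))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Here tp, tn, fn, and fp play the roles of p_{α}, n_{α}, u_{α}, and o_{α} in the formula above.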

The dynamic Bayesian network

DBN is a directed graphical model in which nodes represent random variables and arcs represent dependencies between nodes. The architecture of our DBN model is illustrated in the figure above. For each residue site i, an observation node (written x_{i} here) stores the transformed profile of the residue, and an auxiliary node (written a_{i}) stores a replica of the profiles of the preceding residues x_{i-1}, x_{i-2}, ..., x_{i-L_{AA}}, where L_{AA} is the profile window size indicating the range of the dependency between profiles. In this way, the dependencies between site i and its neighboring sites i-1, i-2, ..., i-L_{AA} are summarized into a single connection to x_{i}, simplifying the topology of the graph. The state space of a_{i} is 21·L_{AA}-dimensional: 20·L_{AA} dimensions store the profiles of the past residues, and the extra L_{AA} dimensions represent the "over-terminus" state.

A second node (written s_{i} here) describes the secondary structure state of residue i, and a companion node (written t_{i}) plays a role similar to that of the auxiliary profile node, describing the joint configuration of the secondary structure states of residues i-1, i-2, ..., i-L_{SS}, where L_{SS} is the secondary structure window size indicating the range of the dependency. This companion node is introduced to simplify the topology of the graph while keeping a long-range dependency between the profile of site i and the secondary structure states of its predecessors. Its state space consists of the 4^{L_{SS}} tuples over the four elements O, H, E, and C: the 3^{L_{SS}} tuples over {H, E, C} cover the joint past secondary structure states, and the remaining tuples involve the "over-terminus" element O.
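The size of this joint state space is easy to verify by enumeration; a small check using the L_{SS} = 2 example from the figure caption:

```python
from itertools import product

L_SS = 2  # secondary structure window size, as in the figure's example

# all joint configurations of the past L_SS secondary structure states,
# including the "over-terminus" element O
tuples = list(product("OHEC", repeat=L_SS))
pure = [t for t in tuples if "O" not in t]  # tuples over {H, E, C} only

assert len(tuples) == 4 ** L_SS   # 16 configurations in total
assert len(pure) == 3 ** L_SS     # 9 of them involve no terminus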

Two further nodes (written d_{i} and e_{i} here) are introduced to mimic a duration-HMM; they take D_{max} and two possible values, respectively. Specifically, d_{i} represents the distance (measured in residues) from position i to the end of the current secondary structure segment, so that d_{i} = 1 at a segment's last residue. Taken literally, this requires that the maximum length of segments not exceed D_{max}. In order to cope with longer segments, a modified definition of d_{i} is introduced: when the length of a segment is at most D_{max}, d_{i} is set as described above; when the length exceeds D_{max}, for example D_{max}+3, d_{i} is set to D_{max} for the first four residues of the segment and to D_{max}-1, D_{max}-2, ..., 1 for the rest. In this way, the lengths of segments longer than D_{max} are modeled by a geometric distribution (see below). The value of the node e_{i} is deterministically dependent on d_{i}: if d_{i} > 1, e_{i} = 1; if d_{i} = 1, e_{i} = 2.
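The modified counter definition can be sketched directly; `duration_counters` is our illustrative helper, following the counting rule described above:

```python
def duration_counters(seg_len, d_max):
    """Counter values along one secondary structure segment. Segments no
    longer than d_max count down from their length to 1; longer segments
    hold d_max for the first seg_len - d_max + 1 residues, then count
    down d_max - 1, ..., 1 for the rest."""
    if seg_len <= d_max:
        return list(range(seg_len, 0, -1))
    hold = seg_len - d_max + 1
    return [d_max] * hold + list(range(d_max - 1, 0, -1))
```

For a segment of length D_{max}+3 this yields D_{max} at the first four residues and then counts down, exactly as in the text.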

Each node described above is assigned a specific conditional probability distribution (CPD) function according to the connection pattern shown in the figure. Writing x_{i} for the profile node, a_{i} for the past-profile node, s_{i} for the secondary structure node, and t_{i} for the past-secondary-structure node, the node a_{i} is a "root" node whose CPD simply carries the profiles of the preceding residues forward. The CPD of x_{i} is a conditional linear Gaussian:

P(x_{i} = y | a_{i} = u, s_{i} = α, t_{i} = γ) = N(y; w_{α,γ}u + c_{α,γ}, Σ_{α,γ}),

where N(y; μ, Σ) represents a Gaussian distribution with mean μ and covariance Σ, u is a 21·L_{AA}-dimensional vector, α runs over the three secondary structure states, and γ runs over the 4^{L_{SS}} tuples formed by the four elements O, H, E, and C (O represents the "over-terminus" state). The distribution is characterized by the mean μ_{α,γ} = w_{α,γ}u + c_{α,γ}, where w_{α,γ} is a 20 × 21·L_{AA} matrix and c_{α,γ} is a 20-dimensional vector, and by the covariance Σ_{α,γ}. The subscripts of w_{α,γ}, c_{α,γ}, and Σ_{α,γ} indicate their dependence on the states of s_{i} and t_{i}.
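The conditional linear Gaussian density can be evaluated as below; this is a toy 2-D version, since the model's actual dimensions (20-D profiles conditioned on a 21·L_{AA}-dimensional vector) are far larger:

```python
import math

def clg_density(y, u, w, c, cov):
    """Density of a conditional linear Gaussian N(y; w u + c, cov),
    written out for a 2-D observation y (the model itself uses 20-D
    profile vectors). w has one row per output dimension."""
    mu = [sum(wij * uj for wij, uj in zip(row, u)) + ci
          for row, ci in zip(w, c)]
    d = [y[0] - mu[0], y[1] - mu[1]]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))
```

The density is maximal when y equals the conditional mean w u + c; an off-diagonal covariance is what lets the model capture correlations between profile entries, which DBN_{diag} discards.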

In outline, the remaining CPDs are as follows. The CPD of s_{i} is a transition distribution conditioned on the preceding state and on the duration nodes: within a segment (e_{i-1} = 1) the state is simply copied, while at a segment boundary (e_{i-1} = 2) a new state is drawn. The CPD of t_{i} is deterministic, shifting the tuple of past states by one position and appending s_{i-1}. The CPD of d_{i} is likewise deterministic within a segment, counting down by one at each residue; at a segment boundary a new value is drawn according to f_{α}(n), the segment length distribution of the new state α, defined for n = 1, ..., D_{max}. Here q_{α} is the probability for d_{i} to maintain the value D_{max} given s_{i} = α and d_{i-1} = D_{max}. Under this scheme, the probability of producing a segment of length n > D_{max} is proportional to (1-q_{α})·q_{α}^{n-D_{max}}, i.e. it follows a geometric distribution. The validity of using such a distribution for segments longer than D_{max} is supported by the observed distributions, which suggest that D_{max} should be 13: beyond this length, all the distributions are fitted well by exponential functions (see the inset of the segment length distribution figure). Finally, the CPD of e_{i} is the deterministic function of d_{i} described above.

Note that the CPDs of the nodes at the first residue site have definitions similar to those given above, adapted to the absence of preceding residues.

The parameters of the CPDs described above are derived by applying the maximum likelihood (ML) method to the training set. In prediction, the marginal probability distribution of the secondary structure state of each residue is computed, and the state with the maximum probability is taken as the prediction for that residue.

The neural network

The typical three-layer feed-forward back-propagation architecture is used in our NN-based models. A sliding-window training and testing strategy is employed, with an optimal window size of 15 derived from an empirical evaluation of window sizes from 7 to 19. The momentum term and learning rate of the network are set to 0.9 and 0.005, respectively, and the number of hidden units is set to 75.
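The sliding-window input can be sketched as follows; the zero-padding at the chain termini is our assumption about the encoding details:

```python
def window_input(profile, i, window=15):
    """Flatten a window of profile rows centered at residue i. Each
    profile row is a list of 20 transformed PSSM values; positions
    beyond the chain termini are zero-padded."""
    half = window // 2
    ncols = len(profile[0])
    rows = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(profile):
            rows.extend(profile[j])
        else:
            rows.extend([0.0] * ncols)   # padding outside the chain
    return rows

prof = [[0.1] * 20 for _ in range(30)]   # a toy 30-residue profile
vec = window_input(prof, 0)
assert len(vec) == 15 * 20               # 300 inputs per residue
```

With a window of 15 and 20 profile columns, each residue thus contributes a 300-dimensional input vector to the network.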

Training and combinations

Training is done in two different ways, depending on the dataset involved. For the datasets CB513 and SD576, the standard seven-fold cross-validation is employed.
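Seven-fold cross-validation partitions the chains into seven disjoint parts, each serving once as the test set; a generic sketch (the paper's actual chain partition is not specified here):

```python
def seven_fold_splits(items, folds=7):
    """Yield (train, test) lists for k-fold cross-validation: the items
    are partitioned into `folds` disjoint parts, and each part serves as
    the test set once while the remaining parts form the training set."""
    parts = [items[i::folds] for i in range(folds)]
    for k in range(folds):
        test = parts[k]
        train = [x for i, part in enumerate(parts) if i != k for x in part]
        yield train, test
```

Averaging the per-fold Q_{3} and SOV over the seven runs gives the figures reported in the tables.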

Note that the DBN and NN models are usually trained on the same training set, both to allow comparison and so that they can later be combined to form DBNN. However, the detailed training process of the DBN differs somewhat from that of the NN, owing to the different architectures of the models. The DBN takes two sets of data as input, one for the profile and the other for the secondary structure; each set is a sliding window with the "current" residue located at its right end. The correlation information between the "current" residue and its neighbors is stored in the data, but depends on the direction in which the window slides (from N-terminus to C-terminus or the reverse). We therefore run the DBN model in both directions and then average the results (see below). The NN, in contrast, takes a single sliding window with the "current" residue located at its center. Finally, the training for DBNN is simply the training of DBN and NN on the same dataset.

When a sequence is selected for either training or testing, the original PSSM generated by PSI-BLAST can be transformed into the range [0, 1] by two strategies: the linear transformation [Eq. (3)] or the sigmoid transformation [Eq. (4)]. In addition, as mentioned above, the two directions, from N-terminus to C-terminus (NC) and the reverse (CN), give rise to different correlation structures, so we treat them separately. As a result, four basic DBN models are generated, corresponding to the four combinations: (i) DBN_{linear+NC}, (ii) DBN_{linear+CN}, (iii) DBN_{sigmoid+NC}, and (iv) DBN_{sigmoid+CN}, where the subscripts are self-explanatory. The NN, on the other hand, is split into two kinds according to the PSSM transformation, and the corresponding models are denoted NN_{linear} and NN_{sigmoid}, respectively.

The six basic models described above are believed to contain complementary information and need to be combined to form three final models. Two strategies for forming the final models are used. The first is a simple averaging of the output scores and is used to form the two architecture-based final models, DBN_{final }and NN_{final}. It is done in two steps. One first averages the outputs of DBN_{linear+NC }and DBN_{linear+CN }to form DBN_{linear}, and of DBN_{sigmoid+NC }and DBN_{sigmoid+CN }to form DBN_{sigmoid}. Then, DBN_{linear }and DBN_{sigmoid }are further combined to form DBN_{final}. Similarly, NN_{linear }and NN_{sigmoid }are combined to form NN_{final}.
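The first (averaging) strategy can be sketched as follows, assuming each model outputs a per-residue score triple over (H, E, C):

```python
def average_and_predict(scores_a, scores_b):
    """Average two models' per-residue scores over (H, E, C) and pick
    the highest-scoring state at each position."""
    states = "HEC"
    pred = []
    for sa, sb in zip(scores_a, scores_b):
        avg = [(x + y) / 2.0 for x, y in zip(sa, sb)]
        pred.append(states[avg.index(max(avg))])
    return "".join(pred)

# two hypothetical models scoring a two-residue fragment
a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
b = [[0.4, 0.4, 0.2], [0.1, 0.3, 0.6]]
```

Applying the same averaging twice, first within each transformation and then across transformations, yields DBN_{final} and NN_{final}.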

The second strategy uses a new neural network with the same architecture as the basic NN models, except that it takes as inputs the outputs of all six basic models (DBN_{linear+NC}, DBN_{linear+CN}, DBN_{sigmoid+NC}, DBN_{sigmoid+CN}, NN_{linear}, and NN_{sigmoid}). This final model is named DBNN, and it shows the best performance among all the models mentioned above.

Availability

All the code and datasets described above are available from our homepage.

Authors' contributions

ZSS and HQZ supervised the whole process of the work. XQY wrote the code and performed the tests. XQY, HQZ, and ZSS drafted the manuscript.

Acknowledgements

We acknowledge the support by the National Natural Science Foundation of China (No. 10225210 and No. 30300071), and the National Basic Research Program of China (973 Program) under grant No. 2003CB715905.