Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK

e-Science Regional Knowledge Centre, Eötvös Loránd University, Pázmány Péter sétány 1/a. 1117 Budapest, Hungary

Abstract

Background

Comparative methods have been the standard techniques for

Results

Stochastic sequence alignment methods define a posterior distribution over possible multiple alignments. They can highlight the most likely alignment and, beyond that, give posterior probabilities for each alignment column. We carried out a comprehensive study on the HOMSTRAD database of structural alignments, predicting secondary structures in four different ways. We showed that alignment posterior probabilities correlate with the reliability of secondary structure predictions, though the strength of the correlation differs between protocols. The correspondence between the reliability of secondary structure predictions and alignment posterior probabilities is closest to the identity function when the secondary structure posterior probabilities are calculated from the posterior distribution of multiple alignments. The largest deviation from the identity function was obtained when predicting secondary structures from a single optimal pairwise alignment. We also showed that alignment posterior probabilities correlate with the 3D distances between the C_{α} atoms of amino acids in superimposed tertiary structures.

Conclusion

Alignment posterior probabilities can be used to a

Background

Due to the increasing speed and number of genome sequencing projects, the gap between the number of known structures and the number of known protein sequences keeps increasing. As a result, demand for reliable computational methods today is higher than ever, while

The central assumption of comparative bioinformatics methods for proteins is that the structures of proteins are more conserved than their amino-acid sequences. This allows homology modelling, namely, mapping the structure of a sequence onto homologous sequences. As insertions and deletions separating two homologous sequences accumulate, homologous characters in the two sequences will occupy different positions, which causes a non-trivial problem of identifying homologous positions. This problem can be solved by sequence alignment algorithms

The relationship between gap-penalties and similarity scores can be set such that they maximise the number of correctly aligned positions in a benchmark set of alignments

The uncertainty in the sequence alignment can be slightly reduced when more than two sequences are aligned simultaneously, and hence much effort has been put into developing accurate multiple sequence alignment methods. Although efficient algorithms exist for any type of pairwise alignment problem, the multiple sequence alignment problem is hard. It has been proved that the optimal multiple sequence alignment problem under the sum-of-pairs scoring scheme is NP-hard

Iterative approaches have been introduced for score-based methods in the eighties

The Markov chain Monte Carlo (MCMC) method represents a third way to attack the multiple stochastic alignment problem. It was first introduced for assessing the Bayesian distribution of evolutionary parameters of the TKF91 model aligning two sequences

Since the above-mentioned methods for multiple stochastic sequence alignment have been introduced only recently, no large-scale, comprehensive analysis of their performance in protein structure prediction has been published yet. In this paper, we present a survey of how stochastic alignment methods can be used for protein secondary structure prediction. The prediction can be based on pairwise or multiple alignments, and in both cases either a single, optimal alignment or the whole posterior distribution of alignments is used for prediction. We ask how much one can gain by involving more sequences and the posterior distribution of the alignments in the secondary structure prediction.

Results

Implementation of the methods

We implemented a stochastic pairwise and a stochastic multiple sequence alignment method in the Java programming language (see Additional file

Structure Projector package. Java source code and public licence in a tar.gz archive

The stochastic pairwise alignment method was tested on all the possible 9494 pairs of sequences belonging to the same family. The analysis took two days on an Intel Xeon 3.0 GHz computer with SUSE Linux 9.3 operating system and JVM 1.5.0. The most time-consuming part of the analysis was the Maximum Likelihood parameter optimisation, which took approximately 90% of the total running time.

12 families have been selected for testing the stochastic multiple sequence alignment method, see Table

Table 1. Selected families from the HOMSTRAD database for testing the performance of stochastic multiple sequence alignment methods

| Family name | Class | Number of sequences | Average length | Average sequence identity |
|---|---|---|---|---|
| Xylose isomerase | Alpha beta barrel | 6 | 388 | 69% |
| Annexin | All alpha | 6 | 317 | 57% |
| Calcium-binding protein – parvalbumin-like | All alpha | 7 | 107 | 56% |
| Starch binding domain | All beta | 8 | 105 | 52% |
| Glycosyl hydrolase family 22 (lysozyme) | Alpha+beta | 12 | 126 | 51% |
| Legume lectin | All beta | 12 | 234 | 50% |
| Papain family cysteine proteinase | Alpha+beta | 13 | 223 | 40% |
| Subtilase | Alpha/beta | 11 | 294 | 40% |
| Src homology 2 domains | Alpha+beta | 11 | 105 | 35% |
| C-type lectin | Alpha+beta | 8 | 126 | 27% |
| Halo-peroxidase | Alpha/beta | 9 | 286 | 25% |
| Response regulator receiver domain | Alpha/beta | 13 | 122 | 25% |

Maximum Posterior Decoding estimations for the multiple sequence alignment of the subtilase family in the HOMSTRAD database. The two estimations were given based on samples from two Markov chains with different starting points. The similarity between the two independent estimations shows good convergence and mixing of the Markov chain.

Post-processing the data

Secondary structure predictions have been given in four ways:

• Based on the Viterbi alignment (referred to as "Viterbi"). In this case, the most likely – a.k.a. Viterbi – alignment was obtained for all pairs of sequences and was used to map the secondary structure of one of the sequences onto the other sequence.

• Based on the posterior distribution of pairwise alignments using the Forward-Backward algorithm ("Forward"). In this case, the posterior probabilities that two amino acids are aligned together were obtained for all pairs of sequences and all pairs of amino acids. The secondary structure of one of the sequences was mapped onto the other sequence in a fuzzy way using the posterior probabilities.

• Based on the Maximum Posterior Decoding estimation from samples of a Markov chain Monte Carlo (MCMC) stochastic multiple alignment ("MPD"). In this case, the Maximum Posterior Decoding (MPD) alignments were predicted from MCMC samples and were used to map the secondary structure of one of the sequences onto the other sequences. The MPD alignment maximizes the product of the posterior probabilities of its alignment columns. See the Methods section for an explanation of why the MPD alignment can be estimated more accurately from MCMC samples than the Viterbi alignment.

• Based on the posterior distribution of multiple alignments obtained by MCMC stochastic multiple alignment ("Bayesian"). In this case, the posterior probabilities that two amino acids are aligned together were estimated from the MCMC samples for all pairs of sequences that can be chosen from a multiple alignment and all pairs of amino acids. The secondary structure of one of the sequences was mapped onto the other sequence in a fuzzy way using the posterior probabilities.
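The difference between mapping through a single alignment and the "fuzzy" mapping through alignment posteriors can be illustrated with a minimal Python sketch. It is not taken from the Structure Projector package: the function names, the one-letter structure labels and the dense matrix layout `posterior[i][j]` are assumptions made for illustration, and the fuzzy rule (summing the alignment posteriors of all partner residues that carry a given structure label) is one plausible reading of the description above.

```python
from collections import defaultdict


def map_single_alignment(aligned_pairs, known_structure):
    """Transfer structure labels through one fixed alignment.

    aligned_pairs: (i, j) index pairs, residue i of the query aligned
                   to residue j of the sequence with known structure.
    known_structure: one label per residue of the known sequence
                     (hypothetical encoding, e.g. 'H', 'E', 'G', '-').
    """
    return {i: known_structure[j] for i, j in aligned_pairs}


def map_fuzzy(posterior, known_structure):
    """Transfer structure labels through alignment posteriors.

    posterior[i][j] is assumed to hold the posterior probability that
    query residue i is aligned to residue j of the known sequence.
    Returns one {label: probability} dictionary per query residue.
    """
    result = []
    for row in posterior:
        probs = defaultdict(float)
        for j, p in enumerate(row):
            # Every possible partner contributes its alignment posterior
            # to the label it carries.
            probs[known_structure[j]] += p
        result.append(dict(probs))
    return result
```

With a single alignment each residue receives at most one label, whereas the fuzzy mapping returns a probability for every label that any possible partner carries.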

Amino acids were divided into 100 categories based on their alignment posterior probabilities in the case of the Viterbi estimation, and on their posterior structure prediction probabilities (see Methods, Eqn. 1) in the case of the Forward estimation. The 100 categories were evenly distributed on the [0, 1] interval. For each category and each of the three general types of secondary structure (alpha helices, beta sheets and 3_{10} helices), the percentage of correctly estimated secondary structure types was calculated and plotted on Fig.
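The binning procedure can be sketched as follows. This is a minimal illustration rather than the original post-processing code; the function name `reliability_curve` and the input layout (one probability and one correctness flag per prediction) are assumptions.

```python
def reliability_curve(probabilities, correct, n_bins=100):
    """Fraction of correct predictions in evenly spaced probability bins.

    probabilities: posterior probability attached to each prediction.
    correct: True where the predicted structure type matches the known one.
    Returns a list of length n_bins; empty bins are reported as None.
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(probabilities, correct):
        # A probability of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        if ok:
            hits[b] += 1
    return [hits[b] / counts[b] if counts[b] else None for b in range(n_bins)]
```

Plotting the returned fractions against the bin centres gives a curve that lies on the identity function when the posterior probabilities are perfectly calibrated.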

Posterior probabilities of correctly predicting secondary structure types with stochastic pairwise alignment methods as a function of alignment posterior probabilities. The black diagonal shows the identity function. The statistics were generated on the whole HOMSTRAD database. 'Viterbi' means estimation based on a single, optimal alignment obtained by the Viterbi algorithm; 'Forward' means estimation based on the posterior distribution of alignments obtained by the Forward algorithm.

Amino acids of the selected 12 families were divided into 10 categories based on their alignment posterior probabilities in the case of the MPD estimation, and on their posterior structure prediction probabilities (see Methods, Eqn. 2) in the case of the Bayesian estimation. The 10 categories were evenly distributed on the [0, 1] interval. For each category and each of the three general types of secondary structure, the percentage of correctly estimated secondary structure types was calculated and plotted on Fig.

Posterior probabilities of correctly predicting secondary structure types with stochastic multiple sequence alignment methods as a function of alignment posterior probabilities. The black diagonal shows the identity function. The statistics have been generated on 12 families from the HOMSTRAD database, see Table 1.

For a fair comparison, we repeated the pairwise sequence comparison protocols on the selected 12 families; the resulting statistics are shown on Fig.

Posterior probabilities of correctly predicting secondary structure types with stochastic pairwise alignment methods as a function of alignment posterior probabilities. The black diagonal shows the identity function. The statistics have been generated on 12 families from the HOMSTRAD database, see Table 1.

Our results indicate that methods predicting secondary structures based on a single alignment are over-pessimistic about their performance on alpha helices and beta sheets, namely, the posterior probabilities associated with the prediction are lower than the actual probability that the prediction is correct. Methods that predict structures based on the whole distribution of sequence alignments are less pessimistic: the alignment posterior probabilities better approximate the observed probabilities that the prediction is correct. All pairwise alignment methods proved to be over-optimistic in estimating the reliability of their predictions for alpha helices and beta sheets with posterior probability above 0.8.

Assessing the correctness of 3_{10} helix predictions turned out to be the hardest among all secondary structure types. Every method except the Bayesian estimation on multiple sequence alignments is strongly over-optimistic about its power to predict 3_{10} helices. MPD is less optimistic than the pairwise methods.

Among all methods studied, Bayesian estimation based on multiple alignments was the only one able to correctly estimate its own prediction power for all secondary structure types, including 3_{10} helices, which makes MCMC-based multiple alignment methods strong candidates for becoming a fundamental tool in protein structure prediction.

To show that alignment posterior probabilities correlate not only with the quality of secondary structure predictions but also with similarities in the 3D structures, we calculated the 3D distances between the C_{α} atoms of each aligned pair of amino acids from the HOMSTRAD superimposed 3D structures. The alignment posterior probabilities were divided evenly into 10 categories, and the average 3D distance as well as the lower and upper quartiles were plotted for each category.
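A minimal Python sketch of this summary is given below. It assumes that the C_{α} coordinates have already been extracted from the superimposed structures; the function name, the record layout and the choice of the "inclusive" quartile method are illustrative assumptions, not details taken from the original analysis.

```python
import math
import statistics


def distance_summary(records, n_bins=10):
    """Mean and quartiles of C-alpha distances per posterior category.

    records: iterable of (posterior, xyz_a, xyz_b), where xyz_a and xyz_b
             are the coordinates of the two aligned C-alpha atoms in the
             superimposed structures.
    Returns one (mean, lower_quartile, upper_quartile) tuple per category,
    or None for categories with fewer than two observations.
    """
    bins = [[] for _ in range(n_bins)]
    for p, xyz_a, xyz_b in records:
        b = min(int(p * n_bins), n_bins - 1)
        bins[b].append(math.dist(xyz_a, xyz_b))  # Euclidean distance

    summary = []
    for distances in bins:
        if len(distances) < 2:
            summary.append(None)
            continue
        q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
        summary.append((statistics.fmean(distances), q1, q3))
    return summary
```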

Fig. _{10 }helices.

3D distances between the C_{α} atoms of aligned amino acids as a function of pairwise alignment posterior probabilities. The 3D distances were calculated from the HOMSTRAD pdb files containing the superimposed structures of sequence families. Pairwise alignments were obtained by the Viterbi algorithm on the entire HOMSTRAD database (black) as well as on the 12 selected families described in Table 1 (light green). Boxes show the average distances; lines show the range between the lower and upper quartiles.

3D distances between the C_{α} atoms of aligned amino acids as a function of multiple alignment posterior probabilities. The 3D distances were calculated from the HOMSTRAD pdb files containing the superimposed structures of sequence families. Multiple alignments are MPD estimations for the 12 selected families described in Table 1, based on MCMC samples. Boxes show the average distances; lines show the range between the lower and upper quartiles.

Sensitivity of secondary structure predictions as a function of alignment posterior probabilities. Sensitivity is defined as

Discussion

Comparing predictions on different secondary structure types

The differences between the predictions of different secondary structure elements can be explained by their general attributes. Alpha helices are typically formed by 10 amino acids or more. Substitutions are frequent in alpha helices, and they are surrounded by loop sequences where insertions and deletions often occur. Stochastic alignment methods therefore register some uncertainty, which yields relatively low posterior probabilities when aligning these regions. However, since alpha helices are relatively long and the substitutions that occur in them rarely change the chemical behaviour of the affected amino acids, the long runs of chemically similar amino acids in the two sequences to be aligned give a strong statistical signal that helps align alpha helices.

Beta sheet elements are typically shorter than alpha helices, and are also surrounded by unstructured fragments that accumulate insertions and deletions, which also yields relatively low alignment posterior probabilities. However, beta sheet elements are more likely to be misaligned, since their short length keeps them from carrying the strong statistical signal that alpha helices carry.

The 3_{10} helices are the least conserved secondary structure elements. Even if the actual amino acid sequence does not change, mutations in other parts of the sequence might induce a conformational change that can shift the 3_{10} helix or transform it into a different structure type, see for example, Fig. _{10 }structure is mapped onto other sequences that do not contain this secondary structure motif. The fact that different secondary structure motifs can build up the same region of a functional protein implies that the given region might not be crucial to maintaining the structure and function of the protein, and thus mutations can accumulate in its vicinity. Stochastic multiple sequence alignment can reveal the uncertainty in aligning that region, which explains why multiple alignment methods are better at estimating their own predictive power on 3_{10} helices.

Part of the HOMSTRAD subtilase alignment in JOY format. In the middle of the alignment, the TSA motif appears both as an alpha helix and as a 3_{10} helix.

There is a similar explanation for the over-optimism in the region of 0.8 and higher posterior probabilities in the case of alpha helices and beta sheets: slight structural changes might shift the position where an alpha helix or a beta sheet starts or ends, even if the amino acids at the positions in question do not change. Fig.

Comparing predictions of different protocols

Predictions based on a single, optimal pairwise or multiple alignment are over-pessimistic: alignment columns from both the Viterbi alignments and the MPD multiple alignments are labelled with posterior probabilities that are typically lower than the actual probability that the secondary structure predictions are correct for these columns. When the whole posterior ensemble of alignments is the basis of the secondary structure prediction, the posterior probabilities are closer to the actual probabilities that the prediction is correct. One main difference between the two strategies (prediction based on a single optimal alignment and prediction based on the posterior distribution of alignments) is that in the latter case posterior probabilities of all secondary structure types are given for each amino acid, while in the former case the Viterbi or MPD alignment assigns at most one secondary structure element to each amino acid. This suggests the hypothesis that prediction methods based on the posterior distribution of alignments are less over-pessimistic because they include false positive predictions with small posterior probabilities, which are not part of a Viterbi or MPD alignment-based estimation.

To test this hypothesis, we predicted alpha helices and beta sheets from the posterior distribution of pairwise alignments in an alternative way. In this alternative prediction, each amino acid was assigned to at most one secondary structure element, the one with maximal posterior probability. If the posterior probability of not harbouring a secondary structure type was maximal, no secondary structure was associated with the amino acid in question.
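The alternative assignment rule can be sketched in a few lines of Python. The sketch assumes that each amino acid comes with a dictionary of posterior probabilities for the structured types only, and that the remaining probability mass corresponds to not harbouring a secondary structure; the function name and the `none_label` placeholder are hypothetical.

```python
def assign_most_probable(structure_probs, none_label="-"):
    """Keep at most one secondary structure type per amino acid.

    structure_probs: one {label: posterior} dictionary per amino acid,
                     covering structured types only (assumed layout).
    The probability of having no secondary structure is taken to be the
    remaining mass, 1 - sum(posteriors).
    Returns a list of (label, posterior) pairs.
    """
    assignments = []
    for probs in structure_probs:
        best_label = none_label
        best_p = 1.0 - sum(probs.values())
        for label, p in probs.items():
            if p > best_p:
                best_label, best_p = label, p
        assignments.append((best_label, best_p))
    return assignments
```

Any type with posterior probability above 0.5 is necessarily selected by this rule, which is why the two protocols can only differ below 0.5.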

The correlation between alignment posterior probabilities and probabilities of correctly predicting a secondary structure type is necessarily the same under the two protocols if the posterior probability is greater than 0.5, since an event with probability greater than 0.5 must be the most likely event. The two types of curves diverge just below 0.5 (data not shown), and the second prediction protocol (considering at most one secondary structure type prediction for an amino acid) becomes less over-pessimistic than the other protocol. This means that there are more true positive predictions than false positive predictions with non-maximal posterior probabilities.

This result is the opposite of what our hypothesis suggested. We therefore also plotted the number of false positive and true positive predictions for each secondary structure type and prediction method, see Fig.

Number of true positive and false positive predictions as function of the alignment posterior probabilities.

Correlation between 3D structure similarities and alignment posterior probabilities

High alignment posterior probabilities indicate that the aligned residues are close to each other in the superimposed 3D structures. The average 3D distance between the aligned residues increases as the alignment posterior probability decreases. However, the distribution of residue distances becomes flatter for small alignment posterior probabilities, namely, a small alignment posterior probability does not necessarily mean that the aligned residues are far from each other. For example, an alignment posterior probability of 0.5 in a pairwise alignment means that there is still about a 25% probability that the aligned residues are closer to each other than the average distance between amino acids that are aligned with more than 0.9 posterior probability. The distance distribution is even flatter in the case of multiple alignments. One possible explanation is that the alignment posterior probabilities are calculated for multiple alignment columns, while distances are calculated for all possible pairs of amino acids in alignment columns. A small alignment posterior probability indicates possible differences in the 3D structures; however, some of the 3D structures might still be similar. Averaging the 3D distances within alignment columns naturally makes the distribution more centred (data not shown).

Conclusion

In this paper, we studied how posterior probabilities of aligning characters in pairwise or multiple alignments might indicate whether secondary structure predictions based on the alignments in question are correct. We found that pairwise alignment methods are over-pessimistic when predicting alpha helices and beta sheets, namely, posterior probabilities of alignment columns are lower than the actual probability that the structure prediction based on the alignment column is correct. They are over-optimistic when predicting 3_{10} helices, i.e., posterior probabilities for these alignment columns are greater than the probabilities that the secondary structure prediction for these amino acids is correct. Multiple alignment methods estimate the reliability of their secondary structure predictions slightly better: they are less over-optimistic on 3_{10} helix predictions.

Secondary structure predictions can be given based on single, optimal pairwise or multiple alignments and also based on the posterior ensemble of alignments. In the latter case, posterior probabilities are closer to the probabilities that the secondary structure prediction is correct, especially when the structure prediction is based on the posterior distribution of multiple sequence alignments.

The multiple sequence alignment is the Holy Grail of bioinformatics

It is worth mentioning that the alignment methods we applied in this work do not consider any information about how secondary structures evolve. It is well-known that different secondary structure elements follow different substitution processes, and this difference in the substitution pattern can be used for secondary structure prediction

The running time of the methods increases with the complexity of the background models, and analyses utilising such combined methods currently take too long to be applicable for everyday use on personal computers. However, the speed of processors keeps increasing exponentially following Moore's law, and will soon reach a level at which it no longer poses a barrier to such combined approaches. There are also promising ways to improve the running time of the methods. MCMC is becoming the standard approach for statistical multiple alignment, and current implementations make use of only very basic tricks, like the alignment window cut algorithm described in the Methods section. Several groups are working on making MCMC alignment methods more efficient and faster mixing, and significant improvements are expected in the coming years.

Methods

The HOMSTRAD database _{10 }helix or none). We predicted the secondary structures of the sequences as described below.

Pairwise sequence alignments

The stochastic model

We used a simplified version of the TKF92 model

The TKF92 model [9], presented as a Hidden Markov Model.

Predicting secondary structure based on a single optimal alignment ("Viterbi")

For each family in the HOMSTRAD database, each pair of sequences has been aligned using the above described pair-HMM. Since the jumping probabilities in the pair-HMM are interdependent via common parameters, the usual EM algorithm

Predicting secondary structure based on the distribution of alignments ("Forward")

We also predicted secondary structures based on the distribution of alignments in the stochastic model. Posterior probabilities for each pair of characters from the two sequences have been obtained with the Forward and Backward algorithms using the Maximum Likelihood parameters. The posterior probability _{s}(_{i}) that a particular amino acid _{i }from sequence

where _{i}, _{j}) is the posterior probability that characters _{i }and _{j }are aligned, and _{j }is

Multiple sequence alignment methods

Bayesian model for sequence alignments, evolutionary trees and model parameters

The transducer theory

Markov chain Monte Carlo inferring of sequence alignments

Since the joint distribution of alignments, trees and parameters is a high-dimensional distribution that is too complicated for direct analytical inference, Markov chain Monte Carlo

The Markov chain performs a random walk on the space comprising the following components:

• Edge lengths of the tree

• Model parameters

• Extended alignment, described above

• Tree topology

We applied Metropolis-Hastings moves to change one of the components randomly, each component selected with a fixed, prescribed probability that was chosen to maximise the mixing of the Markov chain. Standard techniques were used for modifying edge lengths and parameters in the model, for a reference, see
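The general shape of such a sampler can be sketched in Python as follows. This is a toy illustration and not the MCMC code used in this work: the state is a pair of real numbers standing in for the real components, the target is a standard normal density in each coordinate, and the names `shift_component`, `toy_log_posterior` and the selection weights are assumptions. Only the component selection with fixed probabilities and the Metropolis-Hastings acceptance rule mirror the description above.

```python
import math
import random


def metropolis_hastings_step(state, log_posterior, moves, weights, rng):
    """One Metropolis-Hastings step with a randomly selected move type."""
    # Select which component to change, with fixed prescribed probabilities.
    propose = rng.choices(moves, weights=weights, k=1)[0]
    # Each move returns the candidate state and the log Hastings ratio
    # log q(state | candidate) - log q(candidate | state).
    candidate, log_hastings = propose(state, rng)
    log_ratio = log_posterior(candidate) - log_posterior(state) + log_hastings
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return candidate
    return state


def shift_component(index, width):
    """Symmetric uniform random-walk move on one coordinate (toy move)."""
    def propose(state, rng):
        candidate = list(state)
        candidate[index] += rng.uniform(-width, width)
        return tuple(candidate), 0.0  # symmetric proposal
    return propose


def toy_log_posterior(state):
    """Unnormalised log density of independent standard normals."""
    return -0.5 * sum(x * x for x in state)


def run_chain(n_steps, seed=0):
    rng = random.Random(seed)
    moves = [shift_component(0, 1.0), shift_component(1, 1.0)]
    weights = [0.7, 0.3]  # hypothetical selection probabilities
    state = (3.0, -3.0)
    samples = []
    for _ in range(n_steps):
        state = metropolis_hastings_step(
            state, toy_log_posterior, moves, weights, rng)
        samples.append(state)
    return samples
```

In a real alignment sampler the moves would act on edge lengths, model parameters, the extended alignment and the tree topology, and asymmetric proposals would return a non-zero log Hastings ratio.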

Changing the alignment is the most time-consuming event, since the running time of proposing a new alignment is proportional to the product of the lengths of the aligned sequences. A possible solution is modifying only a part of the alignment ("subalignment"), which decreases the running time of this type of proposal. Although it also decreases the mixing of the Markov chain, the overall performance of the Markov chain in terms of total computational time improves

**a)** If the window borders are indicated by the first and last ancestral Felsenstein wildcard within the window (indicated as underlined), a proposed alignment could lead to a situation from which the original alignment could not be obtained by the same rules. **b)** If the window borders are indicated by neighbouring ancestral Felsenstein wildcards that are not within the window and will not be realigned, no possible alignment will lead to such a situation; the original alignment will always be proposed back with a positive probability.

Sequences are iteratively realigned on the selected subtree within the selected window. In each iteration, the new alignment is drawn by the Forward-Backward sampling algorithm

The pair-HMM that is used to realign sequences of the selected subtree. In all runs,

We used nearest neighbour interchanges (NNI) for altering the topology as described in

Effect of a single NNI step on a rooted subtree. A, C, E and F may or may not be leaf nodes.

Because the MCMC analysis is time-consuming, we selected 12 families from the HOMSTRAD database, see Table

Predicting secondary structures based on the MPD estimation of multiple sequence alignment ("MPD")

In an earlier work

Predicting secondary structures based on the posterior distribution of multiple alignments ("Bayesian")

We also predicted secondary structures based on all the alignments sampled from the Markov chain. The estimation for the posterior probability for a particular amino acid _{i }from sequence A having a secondary structure

where _{k}(_{i}) is the amino acid in sequence _{i }is aligned in the _{s,x }is 1 if the known secondary structure of character
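As an illustration of how pairwise alignment posteriors can be estimated from sampled multiple alignments, the following Python sketch counts how often two residues share a column across the samples. The representation of a sampled alignment (a list of columns, each a dictionary from sequence name to residue index, with gaps simply absent) and the function name are assumptions made for this example and do not reflect the data structures of the original software.

```python
from collections import Counter


def pair_posteriors(samples, seq_a, seq_b):
    """Estimate P(residue i of seq_a is aligned to residue j of seq_b).

    samples: list of sampled multiple alignments; each alignment is a list
             of columns, and each column maps a sequence name to the index
             of the residue it holds (sequences with a gap are omitted).
    Returns {(i, j): relative frequency among the samples}.
    """
    counts = Counter()
    for alignment in samples:
        for column in alignment:
            if seq_a in column and seq_b in column:
                counts[(column[seq_a], column[seq_b])] += 1
    n = len(samples)
    return {pair: c / n for pair, c in counts.items()}
```

The resulting frequencies can then be combined with the known secondary structure of one sequence in the same fuzzy way as in the pairwise case.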

Authors' contributions

IM proposed the research, contributed to the MCMC code and wrote some parts of the software for posterior analysis. AN wrote the majority of the MCMC code and the majority of the software for posterior processing. BD wrote the software for pairwise analysis. JH encouraged the research and wrote the manuscript.

Acknowledgements

This research was supported by BBSRC grant BB/C509566/1. IM was also supported by a Bolyai postdoctoral fellowship and an OTKA grant F61730. The authors would like to thank the two anonymous referees for their valuable comments.