Abstract
Background
Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It has been used essentially in almost all bioinformatics tasks such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields.
Results
We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residueresidue contact information into the currently most accurate posterior probabilitybased MSA methods to improve the accuracy of multiple sequence alignments. The method is different from the multiple sequence alignment methods (e.g. 3DCoffee) that use the tertiary structure information of some sequences since the structural information of our method is fully predicted from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact map to multiple sequence alignment is novel. The rigorous benchmarking of our method to the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves the multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools without using this information, such as MSAProbs, ProbCons, Probalign, Tcoffee, MAFFT and MUSCLE. And the performance of the method is comparable to the stateoftheart method PROMALS of using structural features and additional homologous sequences by slightly lower scores.
Conclusion
MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/ webcite.
Background
Aligning multiple evolutionarily related protein sequences is a fundamental technique for studying protein function, structure, and evolution. Multiple sequence alignment methods are often an essential component for solving challenging bioinformatics problems such as protein function prediction, protein homology identification, protein structure prediction, protein interaction study, mutagenesis analysis, and phylogenetic tree construction. During the last thirty years or so, a number of methods and tools have been developed for multiple sequence alignment, which have made fundamental contributions to the development of the bioinformatics field.
State of the art multiple sequence alignment methods adapt some popular techniques to improve alignment accuracy, such as iterative alignment [1], progressive alignment [2], alignment based on profile hidden Markov models [3], and posterior alignment probability transformation [4,5]. Some alignment methods, such as 3DCoffee [6] and PROMALS3D [7], use 3D structure information to improve multiple sequence alignment, which cannot be applied to the majority of protein sequences without tertiary structures. In order to overcome this problem, we have developed a method to incorporate secondary structure, relative solvent accessibility, and contact map information predicted from protein sequences into multiple sequence alignment. Predicted secondary structure information has been used to improve pairwise sequence alignment [8,9], but few attempts had been made to use predicted secondary structure information in multiple sequence alignment [1015]. To the best of our knowledge, applying predicted relative solvent accessibility and residueresidue contact map to multiple sequence alignment is novel.
In order to use the predicted structural information to advance the state of the art of multiple sequence alignment, we first compared the existing multiple sequence alignment tools [16,4,537] on the standard benchmark data sets such as BAliBASE [38], SABmark [39] and OXBENCH [40], which showed that MAFFT [30], Tcoffee [31], MSAProbs [4], and ProbCons [5] yielded the best performance. Then we developed MSACompro, a new multiple sequence alignment method, which effectively utilizes predicted secondary structure, relative solvent accessibility, and residueresidue contact map together with posterior alignment probabilities produced by both pair hidden Markov models and partition function as in MSAProbs [4]. The assessment results of MSACompro compared to the benchmark data sets from BAliBASE, SABmark and OXBENCH showed that incorporating predicted structural information has improved the accuracy of multiple sequence alignment over most existing tools without using structural features and sometimes the improvement is substantial.
Method
Following the general scheme in MSAProbs [4], MSACompro has five main steps: (1) compute the pairwise posterior alignment probability matrices based on both pairHMM and partition function, considering the similarity in amino acids, secondary structure, and relative solvent accessibility; (2) generate the pairwise distance matrix from both the pairwise posterior probability matrices constructed in the first step and the pairwise contact map similarity matrices; (3) construct a guide tree based on pairwise distance matrix, and calculate sequence weights; (4) transform all the pairwise posterior matrices by a weighting scheme; (5) perform a progressive alignment by computing the profileprofile alignment from the probability matrices of all sequence pairs, and then an iterative alignment to refine the results from progressive alignment. Our method is different from MSAProbs in that it adds secondary structure and solvent accessibility information to the calculation of the posterior residueresidue alignment probabilities and computes the pairwise distance matrix with the help of predicted residueresidue contact information.
Construction of pairwise posterior probability matrices based on amino acid sequence, secondary structure and solvent accessibility information
For two protein sequences X and Y in a sequence group S to be aligned, we denote X = (x_{1}, x_{2},......,x_{n}_{1}), Y = (y_{1}, y_{2},......,y_{n}_{2}), where x_{1}, x_{2},......, x_{n}_{1 }and y_{1}, y_{2},......,y_{n}_{2 }are lists of the residues in X and Y, respectively. n_{1 }is the length of sequence X, and n_{2 }is the length of sequence Y. Suppose x_{i }is the ith amino acid in sequence X, and y_{j }is the jth amino acid in sequence Y. We let aln denote a global alignment between X and Y, ALN the set of of all the possible global alignments of X and Y, and aln* ∈ ALN the true pairwise alignment of X and Y. The posterior probability that the ith residue in X (x_{i}) is aligned to the jth residue (y_{j}) in Y in aln* is defined as:
P(aln  X, Y) denotes the probability that aln is the true alignment aln*: Thus, the posterior probability n_{1 }× n_{2 }matrix P_{XY }is a collection of all the values p(x_{i }~ y_{j }∈ aln*  X, Y) (p(x_{i }~ y_{j}) for short) for 1 ≤ x_{i }≤ n_{1}, 1 ≤ y_{j }≤ n2. The calculation process of the pairwise posterior probability matrix is described as follows.
As in MSAProbs, two different methods (a pair hidden Markov model and a partition function) are used to compute the pairwise posterior probability matrices ( and ), respectively. The first kind of pairwise probability matrix is calculated by a partition function (F) of alignments based on dynamic programming. F(i, j) denotes the probability of all partial global alignments of X and Y ending at position (i, j). F_{M }(i, j) is the probability of all partial global alignments with x_{i }aligned to y_{j}, F_{y}(i, j), is the probability of all partial global alignments with y_{j }aligned to a gap, and F_{X}(i, j) is the probability of all partial global alignments with x_{i }aligned to a gap. Accordingly, the partition function can be calculated recursively as follows:
Subject to the constraint W_{1 }+ W_{2 }+ W_{3 }= 1.
In the formula above, s(x_{i}, y_{j}) is the amino acid similarity score between x_{i }and y_{j}. One element of the substitution matrix s, SS(ss(x_{i}), ss(y_{j})) is the similarity score between the secondary structure (ss(x_{i})) of residue x_{i }in protein X and that of residue y_{j }in protein Y according to the secondary structure similarity matrix SS, SA(sa(x_{i}), sa(y_{j})) is the similarity score between the relative solvent accessibility (sa(x_{i})) of residue x_{i }in protein X and that of residue y_{j }in protein Y according to the solvent accessibility similarity matrix SA. W_{1}, W_{2}, W_{3 }are weights used to control the influence of the amino acid substitution score, secondary structure similarity score, and solvent accessibility similarity score. The secondary structure and solvent accessibility can be automatically predicted by SSpro/ACCpro [41] (http://sysbio.rnet.missouri.edu/multicom_toolbox/ webcite) using a multithreading technique implemented in MSACompro, or alternatively be provided by a user. The values of the three weights are set to 0.4, 0.5, and 0.1 by default, and can be adjusted by users. The ensembles of bidirectional recurrent neural network architectures in ACCpro are used to discriminate between two different states of relative solvent accessibility, higher or lower than the accessibility cutoff  25% of the total surface area of a residue [42], corresponding to e or b. As in MSAprobs, β is a parameter measuring the deviation between suboptimal and optimal alignments, gap(gap ≤ 0) is the gap open penalty, and ext(ext ≤ 0) is the gap extension penalty.
We used the Gonnet 160 matrix as a substitution matrix to generate the similarity scores between two amino acids in proteins [43]. The 3 × 3 secondary structure similarity matrix SS contains the similarity scores of three kinds of secondary structures (E, H, C) as follows:
, where two identical secondary structures receive a score of 1 and different ones receive a score of 0.
The 2 × 2 solvent accessibility similarity matrix SA contains the similarity scores of two kinds of relative solvent accessibilities (e, b) as follows:
, where two identical solvent accessibilities receive a score of 1 and different ones receive a 0. It is worth noting that we used the simple identity scoring matrix for secondary structure and solvent accessibility here. Employing more advance scoring matrices defined in [44] may lead to further improvement. Each posterior residueresidue alignment probability element in the first kind of posterior probability matrix () can be calculated from the partition function as:
, where denotes the partition function of all the reverse alignments starting from the position (n_{1}, n_{2}) till position (i, j) with x_{i }aligned to y_{j}.
As in MSAProbs, the second kind of pairwise probability matrix is calculated by a pair hidden Markov model (HMM) combining both Forward and Backward algorithm [4,5,45]. The pairwise probabilities can be generated under the guidance of pair HMM involving state emissions and transitions. is only derived from protein sequences without using secondary structure and solvent accessibility, which is different from PROMALS [15] that lets HMM emit both amino acids and secondary structure alphabets.
The final posterior probability matrix P_{XY }is calculated as the root mean square of the corresponding values in and as follows.
where p^{1}(x_{i }~ y_{i}) and p^{2}(x_{i }~ y_{i}) denote a posterior probability element in two kinds of posterior probability matrices ( and ), respectively.
Construction of pairwise distance matrices based on pairwise posterior probabilities and pairwise contact map scores
The posterior probability matrix P_{XY }is used as a scoring function to generate a pairwise global alignment between sequences X and Y. The optimal global alignment score Opt(X,Y) of the global alignment is computed according to an optimal subalignment score matrix AS. The optimal subalignment score AS(i, j) denotes the score of the optimal subalignment ending at residues i and j in X and Y. The AS matrix is recursively calculated as:
AS (n_{1}, n_{2}) is the optimal score of the full global alignment between X and Y, which is denoted as Optscore(X,Y).
In addition to the optimal alignment score, we introduce a contact map score, CMscore(X, Y), for the optimal pairwise alignment of X and Y, assuming that the spatially neighboring residues of two aligned residues should have a higher tendency to be aligned together. CMscore(X, Y) is calculated from the contact map correlation score matrix CMap_{XY }based on the residueresidue contact map matrices CMap_{X }and CMap_{Y }of X and Y.
Assuming the optimal global alignment of X and Y is represented as,
we can generate a new alignment after removing the pairs containing gaps:
, which can be denoted as
, where n is the length of the new alignment without gaps
From this alignment, we can construct two contact map matrices, CMap_{X }and CMap_{Y}, shown below:
is the contact probability score between amino acid and in protein sequence X, and is the contact probability score between amino acid and in protein sequence Y. The residueresidue contact probabilities are predicted from the sequence by NNcon [46] (http://sysbio.rnet.missouri.edu/multicom_toolbox/ webcite). The contact map correlation score matrix CMap_{XY }is calculated as the multiplication of CMap_{X }and CMap_{Y}:
is the contact map score for an aligned residue pair (amino acid in protein X and amino acid in protein Y). The contact map score for the global alignment of two sequences X and Y is calculated as
In practice, we only need to calculate the diagonal values in CMap_{XY}.
Finally, we define the pairwise distance between sequences X and Y as
, where W_{4 }+ W_{5 }= 1. The weights W_{4 }and W_{5 }are used to control the influence of sequences X and Y.
Construction of guide tree and transformation of posterior probability
Akin to MSAProbs [4], a guide tree is constructed by the UPGMA method that uses the linear combinatorial strategy [47]. The distance between a new cluster Z formed by merging clusters X and Y, and another cluster W is calculated as (10):
In which Num(X) is the number of leafs in cluster X.
After the guide tree is constructed, sequences are weighted according to the schemes inferred in [4].
To reduce the bias of sampling similar sequences, we use a weighted scheme to transform the former posterior probability as
w_{X }and w_{Y }are, respectively, the weight of sequences X and Y, w_{Z }is the weight of a sequence Z other than X or Y in the given group of sequences, and wN is the sum of sequence weights in dataset S.
Combination of progressive and iterative alignment
We first use the guide tree to generate a multiple sequence alignment by progressively aligning two clusters of the most similar sequences together. As in MSAProbs [4], we also apply a weighted profileprofile alignment to align two clusters of sequences. The sequence weights are the same as in the previous step. The posterior alignment probability matrix of two clusters/profiles is averaged from the probability matrices of all sequence pairs (X, Y), where x and y are from the two different clusters. Formula (5) used to generate the global profileprofile alignment is based on the posterior alignment probability matrices of the profiles. In order to further improve the alignment accuracy, we then use a randomized iterative alignment to refine the initial alignment. This randomized iterative refinement randomly partitions the given sequence group S into two separate groups, and performs a profileprofile alignment of the two groups. The iterative refinement can be completed after 10 iterations by default, or a fixed number of iterations set by users. Generally speaking, the final progressive alignment orders sequences along the guide tree from closely related to distantly related. To improve the alignment accuracy, a final iterative alignment is applied to refine the results from progressive alignment. In addition, a multithread technology based on OpenMP is also used to improve the efficiency of the program [48].
Results and discussion
Evaluation of MSACompro and other tools on the standard benchmarks
We tested MSACompro in comparison to three benchmarks: BAliBASE, SABmark and OXBENCH, and evaluated the alignment results in terms of sumofpairs (SP) score and true column (TC) score. The SP score is the number of correctly aligned pairs of residue in the test alignment divided by the total number of aligned pairs of residues in core blocks of the reference alignment [49]. The TC score is the number of correctly aligned columns in the test alignment divided by the total number of aligned columns in core blocks of the reference alignment [49]. We used the application bali_score provided by BAliBASE 3.0 to calculate these scores. We compared MSACompro to 11 other MSA tools which do not have access to the structural information, including ClustalW 2.0.12, DIALIGNTX 1.0.2 [27], FSA 1.15.5, MAFFT 6.818, MSAProbs 0.9.4, MUSCLE 3.8.31, Opal 0.2.0, POA 2, Probalign 1.3, Probcons and Tcoffee 8.93. It is worth noting that a fair comparison between our method with these multiple sequence alignment methods without using structural features is not possible because these methods use less input information. So, the goal of comparison is to present the idea that structural informationbased alignment may contain valuable information that is not available in sequencebased multiple sequence alignments and can therefore be a supplement to sequencebased alignments. And to make the evaluation more fair and comprehensive, we also compared MSACompro with four tools which use structural information, including MUMMALS 1.01 [14], PROMALS [15] and PROMALS3D [7].
To understand how various parameters of MSACompro affect alignment accuracy, some experiments were carried out to evaluate these variants based on two algorithm changes: (1) combining amino acids, secondary structure, and relative solvent accessibility information into the partition function calculation using respective weights for each of them; (2) computing the pairwise distance from both the pairwise posterior probability matrices and the pairwise contact map similarity matrices by introducing the weight wc for contact map information. To optimize the parameters, we used BAliBASE 3.0 data sets as training sets, and SABmark 1.65 and OXBENCH data sets as testing sets. Firstly, we focused on the effect of secondary structure and solvent accessibility information by testing different values of weight w_{1 }for amino acid similarity and weight w_{2 }for secondary structure information on BAliBASE 3.0 data sets. MSACompro worked wholly the best if the weight w_{1 }for amino acid similarity and the weight w_{2 }for secondary structure information were 0.4 and 0.5, respectively. Since the sum of w_{1}, w_{2 }and w_{c }is 1, we can deduce that w_{c }is 0.1 if w_{1 }and w_{2 }are 0.4 and 0.5. Then we focused on the effect of residueresidue contact map information under two different scenarios: using secondary structure and relevant solvent accessibility information by keeping the w_{1}, w_{2}, and w_{3 }at their optimum values (0.4, 0.5, 0.1), or excluding that information by setting both w_{2 }and w_{3 }as 0. Evaluation results on BAliBASE 3.0 database were found to improve the most when w_{c }is 0.9 by integrating both secondary structure and relevant solvent accessibility information. Additionally, to avoid overfitting, we tested MSACompro against SABmark 1.65 and OXBENCH data sets using this set of parameters independently, and found that a significant improvement was also gained in comparison to other leading protein multiple sequence alignment tools. More details can be found in the next section, "A comprehensive study on the effect of predicted structural information on the alignment accuracy". Consequently, the weights w_{1}, w_{2}, w_{3 }and w_{c }are respectively set at 0.4, 0.5, 0.1 and 0.9 in MSACompro by default. All other tools were also evaluated under default parameters.
Firstly, we evaluated these methods on BAliBASE [16]  the most widely used multiple sequence alignment benchmark. The latest version, BAliBASE 3.0, contains 218 reference alignments, which are distributed into five reference sets. Reference set 1 is a set of equaldistant sequences, which are organized into two reference subsets, RV11 and RV12. RV11 contains sequences sharing >20% identity and RV12 contains sequences sharing 20% to 40% identity. Reference set 2 contains families with >40% identity and a significantly divergent orphan sequence that shares <20% identity with the rest of the family members. Reference set 3 contains families with >40% identity that share <20% identity between each two different subfamilies. Reference set 4 is a set of sequences with large N/Cterminal extensions. Reference set 5 is a set of sequences with large internal insertions. Tables 1, 2, and 3 report the mean SP scores and TC scores of MSACompro and the tools without using structural information for the six subsets and the whole database. All the scores in the tables are multiplied by 100, and the highest scores in each column are marked in bold. The results show that MSACompro received the highest SP and TC scores on the whole database and all the subsets except for the SP score for the subset RV40. In some cases, MSACompro's improvement was substantial.
Table 1. Total SP scores on the fulllength BAliBASE 3.0 subsets.
Table 2. Total TC scores on the fulllength BAliBASE 3.0 subsets.
Table 3. Overall mean SP and TC scores on the fulllength BAliBASE 3.0 subsets.
Secondly, we evaluated MSACompro and other tools without the help of structural information on the SABmark database [4], which is a very challenging data set for multiple sequence alignment according to a comprehensive study [50]. SABmark is an automatically generated data set consisting of two sets. One set is from SOFI [51] and the other is from the ASTRAL database [52], which contains remote homologous sequences in twilightzone or superfamily. Since some pairwise reference alignments in SABmark are not generally consistent with multiple alignments, a subset of SABmark, 1.65 called SABRE [53], has been widely used as a multiple sequence alignment benchmark database. SABRE was constructed by identifying mutually consistent columns (MCCs) in the pairwise reference structure alignment. MCCs are considered similar to BAliBASE core blocks. SABRE contains 423 out of 634 SABmark groups that have eight or more MCCs. Table 4 shows the overall mean SP and TC scores of the alignments. The mean SP and TC scores of MSACompro are 8.3 and 9.1 points higher than those of the second bestperformer, MSAProbs, demonstrating that incorporating predicted structural features into multiple sequence alignments can substantially improve alignment accuracy for even remotely related homologous sequences. Figure 1 shows an example comparison between the alignments generated by our method, MSACompro, and MSAProbs from the SABRE database. The SP and TC scores significantly improved from 0.307 to 0.853 and 0 to 0.780, respectively. This case demonstrates that taking predicted structural information can help avert aligning unmatched regions, especially when the sequence similarity is unrecognizable.
Table 4. Overall mean SP and TC scores on the SABmark 1.65.
Figure 1. an example in SABRE database comparing the alignments generated by our method and MSAProbs. The reference alignment and resulting alignments generated by both methods are respectively shown in the figure. The correct alignment regions significantly improved by our MSACompro after taking structural information are marked in red rectangles. In contrast, the corresponding incorrect alignment regions generated by MSAProbs are represented in green rectangles. The predicted secondary structure and solvent accessibility information for the correctly aligned regions are shown in circles.
Thirdly, we also assessed all the tools without using the structural information on the OXBENCH database [54]. OXBENCH is also a popular benchmark database generated by the AMPS multiple alignment method from the 3Dee database of protein structural domains [55]. The conserved columns in OXBENCH can be considered similar to BAliBASE core blocks. The mean SP and TC scores over the whole database are shown in Table 5. The results show that MSACompro improves the alignment accuracy over all other methods.
Table 5. Overall mean SP and TC scores on the OXBENCH. Bold denotes highest scores.
Finally, we also compared the SP scores and TC scores of MSACompro and other tools which adopt the structural information on the six subsets of BAliBASE database, SABmark database and OXBENCH database. Tables 6 and 7 demonstrate the SP and TC scores across the three databases. The results show that MSACompro gained the highest scores on three out of six subsets of BAliBASE and achieved the third highest scores on other data sets, which are lower than PROMALS3D that used true experimental structures as input and PROMALS that used both predicted secondary structures and additional homologous protein sequences found by PSIBLAST search's on a large protein sequence database [15]. Overall, MSACompro performed similarly as PROMALS, whereas the latter has an advantage on a remote homologous protein sequence data set SABmark since it directly incorporates additional homologous protein sequences to improve the alignment of remotely related target sequences during the progressive alignment process. Moreover, the accuracy of MSACompro on the BAliBASE 3.0 data sets seems to be higher than the published results of another alignment tool of using secondary structure information  DIALIGNSEC [12], which was not directly tested in our experiment because it is only available as a web server other than a downloadable software package. Therefore, MSACompro is useful to improve the accuracy of multiple sequence alignment in general and particularly for most cases in reality where experimental structures are not available.
Table 6. Total SP scores of the tools which use the structural information on BAliBASE 3.0 subsets, SABmark data sets and OXBENCH data sets.
Table 7. Total TC scores of the tools which use the structural information on BAliBASE 3.0 subsets, SABmark data sets and OXBENCH data sets.
In order to check if alignment score differences between MSACompro and the other alignment methods are statistically significant, we carried out the Wilcoxon matchedpair signedrank test [56] on both SP and TC scores of these methods on the three data sets. The pvalues of alignment score differences calculated by the Wilcoxon matchedpair signedrank test are reported in Table 8. Generally speaking, the alignment scores of MSACompro are significantly higher than all the alignment methods without using structural information and MUMMALS of using structural information in all but one case according to the significance threshold of 0.05. The exception is that MSACompro's TC score is higher than MSAProbs on the BAliBASE, but not statistically significant. However, the alignment scores of MSACompro are mostly statistically lower than the other two alignment methods (PROMALS or PROMALS3D) of using predicted structural features, more homologous sequences, or tertiary structures.
Table 8. The statistical significance (i.e. pvalues) of SP and TC alignment score differences between MSACompro and the other tools on three benchmark data sets.
In addition to alignment accuracy, alignment speed is also a factor to consider in timecritical applications. Because it is difficult to rigorously compare the speed of different methods due to the difference in implementation and inputs, we only report the roughly estimated running time of the different methods on BAliBASE based our empirical observations. The fastest methods are ClustalW, MAFFT, MUSCLE, and POA, which used less than one hour. The mediumspeed methods that used a few hours to less than one day include FSA, Opal, Probalign, MSAProbs, ProbCons, Tcoffee, MUMMALS, and DIALIGNTX. The more time demanding methods are MSACompro, PROMALS, and PROMALS3D because they need to generate extra information for alignment. We ran both PROMALS and MSACompro on the BAliBASE 3.0 database on an 4 eightcore (i.e. 32 CPU cores) Linux server to calculate their running time. It took about 4 days and 6 hours for PROMALS to run on the whole BAliBASE 3.0 data sets, and about 9 hours and 13 minutes for MSACompro to run on the same data sets. MSACompro was faster because it used a multiplethreading implementation to call SSpro/ACCpro to predict secondary structure and solvent accessibility in parallel. Out of about 9 hours and 13 minutes, about four hours and 17 minutes were used by MSACompro to align sequences if secondary structure and solvent accessibility information was provided. However, if only one CPU core is used, it took around 6 days and 14 hours for SSpro and ACCpro called by MSACompro to predict secondary structure and solvent accessibility information alone, which is timeconsuming. Therefore, MSACompro will be slower than PROMALS if it runs a single CPU core, but faster on multiple (> = 3) CPU cores. As for PROMALS3D, it used about 9 days to extract tertiary structure information and make alignments.
A comprehensive study of the effect of predicted structural information on the alignment accuracy
To understand the impact of predicted secondary structure, relative solvent accessibility, and contact map on the accuracy of multiple sequence alignment, we tested their effects on alignments individually or in combination by adjusting the values of their weights used in the partition function (i.e. for secondary structure and solvent accessibility) or in the distance calculation (i.e. for contact map).
I. Effect of secondary structure information
We studied the effect of secondary structure information by adjusting the values of w_{1 }(weight for amino acid sequence information) and w_{2 }(weight for secondary structure information), the sum of which was kept as 1, and setting the values of w_{3 }(weight for relative solvent accessibility) and w_{c }(weight for contact map) to 0. The results for different w_{2 }values on the SABmark data sets are shown in Table 9. The highest score is denoted in bold and by a superscript of star, and the second highest is denoted in bold. The results show that incorporating secondary structure information always improves alignment accuracy over the baseline established without using secondary structure information (w_{2 }= 0). The highest accuracy is achieved when w_{2 }is set to .5, at which point the score is 8 points greater than the baseline. w_{2 }= 1 means that only secondary structure is used to calculate the posterior alignment probability in the partition function (i.e. equation set (2)), but amino acid sequence similarity is still used to calculate the other posterior alignment probability by the pair Hidden Markov Models. Figures 2 and 3 plot the SP and TC scores against weight values in Table 9 and Table 10, respectively.
Table 9. SP scores for different weights of secondary structures on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Figure 2. the 2D plot of SP scores against w_{2 }on the SABmark dataset.
Figure 3. the 2D plot of TC scores against w_{2 }on the SABmark dataset.
Table 10. TC scores for different weights of secondary structures on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
II. Effect of relative solvent accessibility information
Similarly, we studied the effect of relative solvent accessibility on the SABmark by adjusting the values of w_{1 }and w_{3 }and setting the values of w_{2 }and w_{c }to 0. The SP and TC scores with respect to different weight values are shown in Tables 11 and 12, respectively. The scores are also plotted against the weights in Figures 4 and 5, respectively. The highest SP and TC scores were achieved when w_{3 }was set to 0.5 or 0.6.
Table 11. SP scores for different weights of relative solvent accessibility on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Table 12. TC scores for different weights of relative solvent accessibility on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Figure 4. the 2D plot of SP scores against w_{3 }on the SABmark dataset.
Figure 5. the 2D plot of TC scores against w3 against the SABmark dataset.
III. Effect of residueresidue contact map information
We investigated the effect of contact map information on the BAliBASE 3.0 data set by adjusting w_{c }and setting w_{2 }and w_{3 }to 0. We used NNcon to successfully predict the contact maps for subset RV11, RV30, 42 out of 44 alignments in RV12, 38 out of 40 in RV20, 33 out of 46 in RV40, and 14 out of 16 in RV50. We tested the MSACompro method against this data with contact predictions. Tables 13 and 14 show the SP and TC scores for different w_{c }values on the subsets of the BAliBASE dataset. The results show that using contact information improved the alignment accuracy on some, but not all, subsets.
Table 13. SP scores for different weights for contact map on the BAliBASE3.0 database. Red color highlights the improved scores on each BAliBASE subset. Bold denotes the increased scores.
Table 14. TC scores for different weights for contact map on the BAliBASE 3.0 database. Red highlights the improved scores on each BAliBASE subset. Bold denotes the increased scores.
IV. Effect of combining secondary structure and solvent accessibility information
We adjusted the values of w1 (weight for amino acid), w2 (weight for secondary structure) and w3 (weight for relative solvent accessibility) simultaneously to investigate the effect of using secondary structure and relative solvent accessibility together. SP and TC scores on different parameter combinations are shown in Tables 15 and 15. The highest score is denoted in bold and by a superscript of 1, the second in bold and by a superscript of 2, and the third in bold and by a superscript of 3. The results show that the highest scores are achieved when w1 ranges from 0.4 to 0.5, w2 from 0.4 to 0.5, and w3 from 0.1 to 0.2. Also, using both secondary structure and solvent accessibility improves alignment accuracy over using either one. The best alignment score, which uses both secondary structure and solvent accessibility, is >8 points higher than the baseline approach, which does not use them. The changes of SP scores and TC scores with respect to the weights are visualized by the 3D plots in Figures 6 and 7. We conducted similar experiments on BAliBASE 3.0 and OXBENCH and got the similar results (data not shown).
Table 15. SP scores for different weight combinations (w_{1 } amino acid, w_{2 } secondary structure, w_{3 } solvent accessibility) on the SABmark 1.65 dataset.
Table 16. TC scores scores for different weight combinations (w_{1 } amino acid, w_{2 } secondary structure, w_{3 } solvent accessibility) on the SABmark 1.65 dataset.
Figure 6. 3D plot of SP scores against secondary structure weight w_{2 }and relative solvent accessibility weight w_{3}.
Figure 7. 3D plot of TC scores against secondary structure weight w_{2 }and relative solvent accessibility weight w_{3}.
V. Effect of using contact map information together with secondary structure and solvent accessibility information
In order to study whether or not contact information can be used effectively with secondary structure and solvent accessibility, we adjusted the weight w_{c }for contact information, while keeping the w1, w2, and w3 at their optimum values (0.4, 0.5, and 0.1 respectively). Tables 17 and 18 report the SP and TC scores on the BAliBASE 3.0 data set for different w_{c }values from no contact information (w_{c }= 0) to maximum contact information (w_{c }= 1). The results show that the improvement caused by contact information seems not to be substantial and significant.
Table 17. SP scores for different contact map weight w_{c }on the BAliBASE3.0 database while keeping the weights for amino acid, secondary structure, solvent accessibility to 0.4, 0.5, and 0.1, respectively.
Table 18. TC scores for different contact map weight w_{c }on the BAliBASE3.0 database while keeping the weights for amino acid, secondary structure, solvent accessibility to 0.4, 0.5, and 0.1, respectively.
Conclusion
In this work, we designed a new method to incorporate predicted secondary structure, relative solvent accessibility, and residueresidue contact information into multiple protein sequence alignment. Our experiments on three standard benchmarks showed that the method improved multiple sequence alignment accuracy over most existing methods without using secondary structure and solvent accessibility information. However, the performance of the method is comparable to PROMALS and PROMALS3D by slightly lower scores on some subsets and behind it by a large margin on SABMARK probably because these two methods used homologous sequences or tertiary structure information in addition to secondary structure information. Since multiple sequence alignment is often a crucial step for bioinformatics analysis, this new method may help improve the solutions to many bioinformatics problems such as protein sequence analysis, protein structure prediction, protein function prediction, protein interaction analysis, protein mutagenesis and protein engineering.
Authors' contributions
JC and XD designed the algorithm. XD implemented the algorithm and carried out the experiments. XD and JC analyzed the data. XD and JC wrote the manuscript. XD and JC approved it.
Acknowledgements
The work was partially supported by a NIH R01 grant (no. 1R01GM093123) to JC. We thank Angela Zhang for English editing.
References

Barton GJ, Sternberg MJ: A strategy for the rapid multiple alignment of protein sequences. confidence levels from tertiary structure comparisons.
J Mol Biol 1987, 198:327337. PubMed Abstract  Publisher Full Text

Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees.
J Mol Evol 1987, 25:351361. PubMed Abstract  Publisher Full Text

Krogh A, et al.: Hidden markov models in computational biology: applications to protein modeling.

Liu YC, Schmidt B, DouglasLM : MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities.
Bioinformatics 2010, 26(16):19581964. PubMed Abstract  Publisher Full Text

Do CB, et al.: ProbCons: probabilistic consistencybased multiple sequence alignment.
Genome Res 2005, 15:330340. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Poirot O, Suhre K, Abergel C, Eamonn OT, Notredame C: 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment.

Pei J, Kim B, Grishin NV: PROMALS3D: a tool for multiple sequence and structure alignment.
Nucleic Acids Res 2008, 36(7):22952300. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Söding J, Biegert A, Lupas AN: The HHpred interactive server for protein homology detection and structure prediction.
Nucleic Acids Research 2005, 33:W244W248. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Söding J: Protein homology detection by HMMHMM comparison.
Bioinformatics 2005, 21:951960. PubMed Abstract  Publisher Full Text

Heringa J: Two strategies for sequence comparison: profilepreprocessed and secondary structureinduced multiple alignment.
Comput Chem 1999, 23:341364. PubMed Abstract  Publisher Full Text

Kim NK, Xie J: Protein multiple alignment incorporating primary and secondary structure information.

Amarendran RS, Suvrat H, Rasmus S, Peter M, Eduardo C, Burkhard M: DIALIGNTX and multiple protein alignment using secondary structure information at GOBICS.
Nucleic Acids Research 2010, 38(suppl 2):W19W22. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhou HY, Zhou YQ: SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures.
Bioinformatics 2005, 21:36153621. PubMed Abstract  Publisher Full Text

Pei J, Grishin NV: MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information.
Nucleic Acids Res 2006, 34(16):43644374. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pei J, Grishin NV: PROMALS: towards accurate multiple sequence alignments of distantly related proteins.
Bioinformatics 2007, 23:802808. PubMed Abstract  Publisher Full Text

Brudno M, Steinkamp R, Morgenstern B: The CHAOS/DIALIGN www server for multiple alignment of genomic sequences.

Larkin M, et al.: Clustal W and Clustal X version 2.0.
Bioinformatics 2007, 23(21):29472948. PubMed Abstract  Publisher Full Text

Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs.
Nucleic Acids Res 2003, 31:34973500. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ: Multiple sequence alignment with Clustal X.
Trends Biochem Sci 1998, 23:403405. PubMed Abstract  Publisher Full Text

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools.
Nucleic Acids Res 1997, 25:48764882. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Higgins DG, Thompson JD, Gibson TJ: Using CLUSTAL for multiple sequence alignments.
Methods Enzymol 1996, 266:383402. PubMed Abstract

Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice.
Nucleic Acids Res 1994, 22:46734680. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Higgins DG: CLUSTAL V: multiple alignment of DNA and protein sequences.
Methods Mol Biol 1994, 25:307318. PubMed Abstract

Higgins DG, Bleasby AJ, Fuchs R: CLUSTAL V: improved software for multiple sequence alignment.
Comput Appl Biosci 1992, 8:189191. PubMed Abstract

Higgins DG, Sharp PM: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer.
Gene 1988, 73:237244. PubMed Abstract  Publisher Full Text

Bailey TL, Noble WS: Searching for statistically significant regulatory modules.

Amarendran RS, Kaufmann M, Morgenstern B: DIALIGNTX: greedy and progressive approaches for segmentbased multiple sequence alignment.
Algorithms for Molecular Biology 2008, 3:6. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Amarendran RS, Jan WM, Kaufmann M, Morgenstern B: DIALIGNT: An improved algorithm for segmentbased multiple sequence alignment.
BMC Bioinformatics 2005, 6:66. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L: Fast Statistical Alignment.
PLoS Computational Biology 2009, 5:e1000392. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.
Nucleic Acids Res 2002, 30(14):305966. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Notredame C, Higgins D, Heringa J: TCoffee: A novel method for multiple sequence alignments.
JMB 2000, 302:205217. Publisher Full Text

Brudno M, Do CB, Cooper G, Michael FK, Davydov E, Eric DG, Sidow A, Batzoglou S: LAGAN and MultiLAGAN: efficient tools for largescale multiple alignment of genomic DNA.

Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Research 2004, 32(5):179297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
BMC Bioinformatics 2004, 5(1):113. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Chikkagoudar S, Roshan U, Livesay DR: eProbalign: generation and manipulation of multiple sequence alignments using partition function posterior probabilities.
Nucleic Acids Research 2007, 35:W675W677. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sze SH, Lu Y, Yang Q: A polynomial time solvable formulation of multiple sequence alignment.
Journal of Computational Biology 2006, 13:309319. PubMed Abstract  Publisher Full Text

Roshan U, Livesay DR: Probalign: multiple sequence alignment using partition function posterior probabilities.
Bioinformatics 2006, 22(22):271521. PubMed Abstract  Publisher Full Text

Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark.
Proteins: Structure, Function, and Bioinformatics 2005, 61:127136. Publisher Full Text

Walle V, et al.: Alignma new algorithm for multiple alignment of highly divergent sequences.
Bioinformatics 2004, 20:14281435. PubMed Abstract  Publisher Full Text

Raghava GP, et al.: OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.
BMC Bioinformatics 2003, 4:47. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Cheng J, Randall A, Sweredoski M, Baldi P: SCRATCH: a Protein Structure and Structural Feature Prediction Server.
Nucleic Acids Research 2005, 33(Web Server):7276. Publisher Full Text

Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins.
Proteins 2002, 47:142153. PubMed Abstract  Publisher Full Text

Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database.
Science 1992, 256:14431445. PubMed Abstract  Publisher Full Text

Kawabata T, Nishikawa K: Protein structure comparison using the Markov transition model of evolution.
Proteins 2000, 41:108122. PubMed Abstract  Publisher Full Text

Durbin R, et al.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press Cambridge, UK; 1998.

Tegge AN, Wang Z, Eickholt J, Cheng J: NNcon: Improved Protein Contact Map Prediction Using 2DRecursive Neural Networks.
Nucleic Acids Research 2009, 37:w515w518. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sneath PHA, Sokal RP: Numerical taxonomy. In Freeman. San Francisco,USA; 1973.

OpenMP tutorial [https://computing.llnl.gov/tutorials/openMP] webcite

Thompson JD, Frederic P, Olivier P: A comprehensive comparison of multiple sequence alignment programs.
Nucleic Acids Research 1999, 27:26822690. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Walle V, et al.: Alignma new algorithm for multiple alignment of highly divergent sequences.
Bioinformatics 2004, 20:14281435. PubMed Abstract  Publisher Full Text

Boutonnet NS, et al.: Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins.
Protein Eng 1995, 8:647662. PubMed Abstract  Publisher Full Text

Brenner SE, et al.: The ASTRAL compendium for protein structure and sequence analysis.
Nucleic Acids Res 2000, 28:254256. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Edgar RC [http://www.drive5.com/bench] webcite

Raghava GP, et al.: OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.
BMC Bioinformatics 2003, 4:47. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Poirot O, Suhre K, Abergel C, Eamonn OT, Notredame C: 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment.

Wilcoxon F: Probability tables for individual comparisons by ranking methods.
Biometrics 1947, 3:119122. PubMed Abstract  Publisher Full Text