Skip to main content
  • Research article
  • Open access
  • Published:

Predicting and improving the protein sequence alignment quality by support vector regression

Abstract

Background

For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significantly depending on our choice of various alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment.

Results

In this work, we develop a method to predict the quality of the alignment between a query and a template. We train the support vector regression (SVR) models to predict the MaxSub scores as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into a (n + 1)-dimensional feature vector, then it is used as an input to predict the alignment quality by the trained SVR model. Performance of our work is evaluated by various measures including Pearson correlation coefficient between the observed and predicted MaxSub scores. Result shows high correlation coefficient of 0.945. For a pair of query and template, 48 alignments are generated by changing alignment options. Trained SVR models are then applied to predict the MaxSub scores of those and to select the best alignment option which is chosen specifically to the query-template pair. This adaptive selection procedure results in 7.4% improvement of MaxSub scores, compared to those when the single best parameter option is used for all query-template pairs.

Conclusion

The present work demonstrates that the alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are not suitable for structure prediction due to poor alignment accuracy. This is implemented as a part in FORECAST, the server for fold-recognition and is freely available on the web at http://pbil.kaist.ac.kr/forecast

Background

As the number of protein sequences is exponentially growing, knowledge on their structures and functions is lagging far behind the growth rate of the number of new protein sequences because the experiments to determine structures and functions are difficult and time-consuming. One way to resolve this problem is computational methods such as structure and function prediction. In the case of protein structure prediction, computational methods fall into two categories; ab initio folding method and comparative modeling. Ab initio folding method is based on physical principles and does not require prior knowledge on protein structures, but comparative modeling [1] has shown superior performance throughout recent experiments assessing the effectiveness of structure prediction methods such as CASP (Critical Assessment of Structure Prediction) [2].

The first step in comparative modeling is the fold recognition in which one searches for homologous proteins with known structure and chooses the best one that can be used as a template. After this process, the alignment between the selected template and the query protein is generated. Finally the alignment is used to build the 3-dimensional structure models by using 3D model building tools such as MODELLER [1, 3]. High-quality query-template alignments are, therefore, essential for successful homology modeling. Thus, there are two factors that essentially determine the quality of predicted protein structures; good templates and high quality query-template alignments. There have been many approaches to increase the performance of fold recognition. Progress in fold recognition has made it possible to increase the structural coverage of newly sequenced genomes [4] and to improve our ability to predict the protein structures as demonstrated in recent CASP experiments.

Importance of alignment accuracy for comparative modeling has been already addressed [5]. Among many sequence alignment methods, the easiest way is to use sequence-sequence alignments such as Smith-Waterman [6] or BLAST algorithm [7]. Other ways are to utilize evolutionary information: profile-sequence alignments such as PSI-BLAST [8] and sequence-profile alignments such as IMPALA [9]. To get better alignments, it has been shown in many studies that using profiles of both the query and the template, named profile-profile alignment, are superior to sequence-profile methods and profile-sequence methods [10]. Even though profile-profile alignments are better, they do not always provide the optimal alignments [11]. Profile-profile alignments can be carried out in many different ways [12–14] and the alignment results change as alignment options vary. There is no single best profile-profile method and the universal alignment option that always generates the optimal alignment.

To overcome this problem, some methods such as Consensus [15], ESyPred3D [16], Multiple Mapping Method (MMM) [17], and methods using genetic algorithm [18, 19] have used population of suboptimal alignments. ESyPred3D fixes the redundant results from suboptimal alignments and finds optimal alignments by moving anchor point. Consensus make alignments by consensus of several alignments based on the consensus strength and by discarding the residues where alternative alignments differ. These two methods use limited number of alternative alignments. On the other hands, other two methods have used genetic algorithm to generate sub alignments as many as possible. After sets of model structure are constructed from alignments, score of each model is calculated by fitness function such as atom-atom potential [20] and Z-score [21]. However, these approaches take longer time, and alignments made by crossover are likely to be biologically meaningless. MMM, the recent study, focused on minimizing alignment errors based on its own scoring function by combining differently alignment segments from alternative alignments. MMM outperformed other methods and showed significant improvements.

We introduce here a novel method not only to predict the alignment quality but also to improve the alignment quality by support vector regression (SVR) [22]. Machine learning technique such as the artificial neural network (ANN) or support vector machine (SVM) [23] has been a popular tool for fold recognition, but is only available for feature vectors of fixed length. A new method in which all templates in template library have feature vectors of different lengths with profile-profile alignments scores has been recently developed [24]. In our work, a modified version has been used. Among many different kinds of measures for the alignment quality, MaxSub [25], which has been used as a measure in assessment experiments of structure prediction such as CASP [26], CAFASP [27], and LiveBench [28], is used to represent a measure of alignment quality. MaxSub is a good measure of alignment quality in that it is a normalized single numeric and reflects structure-level quality.

Our attempt to develop a method to predict the alignment quality is not entirely new. A related work [29] has been published, but the alignment quality prediction was not their final research goal. Rather, in the work by Xu [29], the predicted alignment quality was used to improve performance of fold recognition. In the present work, we develop a highly accurate method to predict the alignment quality, and we utilize the method not only to maximize the alignment quality and but also to choose good templates. In our work, an alignment of a query protein against its template of length n is converted into a feature vector of length n + 1 composed of profile-profile alignment scores and the length of the query protein. The predicted MaxSub score is calculated by the SVR model specifically built for that template. The test results show highly accurate regression performance. For a pair of a query and a template, various alignments are generated by using many different combinations of alignment parameters. The SVR model for the template is then used to find the optimal alignment parameters which are specific to that pair. We name this method 'adaptive selection' method. The adaptive selection method outperforms the method which uses the universal alignment option for large-scale testing set.

Results and discussion

Performance measures of SVRs

Alignments are converted into (n + 1) dimensional feature vectors which are input of SVRs where n is the length of the templates (Figure 1). In order to evaluate the performance of the method, trained SVR models are evaluated for the testing set. The correlation between observed and predicted MaxSub values is presented in the density map (Figure 2a). Each column in the figure2a is normalized independently by dividing the number of alignments with a specific range of MaxSub scores by the total number of alignments in that column. The number of alignments in each column is plotted on Figure 2b. The highest density is represented by black squares; the lowest density is represented by white squares. The Pearson correlation coefficient is calculated from the pairs of predicted MaxSub scores and observed MaxSub scores. The calculated correlation coefficient is 0.945. A previous related work [29] has reported the correlation coefficient of 0.71, which is lower than that of the present method. However, because the testing set and the measure of alignment quality in the previous work (the measure of alignment quality was calculated by comparing the sequence alignment and the structural alignments generated by SARF [30] that were assumed to be the gold standard) are different from those used in this work, direct comparison between the two methods may not have much meaning, although much higher correlation coefficient of our work seems to suggest that the present method is apparently better at predicting the alignment quality than the previous method. The good correlation coefficient and the density diagram with good diagonal shape imply that the MaxSub scores as a measure of alignment quality can be accurately predicted. Moreover, the results suggest that for each query-template pair it is possible to find its own optimal alignment parameters that would maximize the alignment quality.

Figure 1
figure 1

Generation of the input feature vectors from alignments. (a) The sequence of a template of length n is aligned to the sequences of examples by profile-profile alignment method. (b) Each alignment is transformed to (n + 1)-dimensional feature vector composed of the alignment scores at n positions and the total alignment score. (c) These feature vectors are used to train SVR model for the target template.

Figure 2
figure 2

Performance of SVR models. (a) Correlations between observed and predicted MaxSub scores with correlation coefficient of 0.9453. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution.

In addition to the Pearson correlation coefficient, three different measures of errors are also calculated. The first one is the mean absolute error (MAE) which is given by

MAE = 1 N ∑ i = 1 N | y i − o i | MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeeyta0KaeeyqaeKaeeyrauKaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqWGobGtaaGcdaaeWbqaamaaemaabaGaemyEaK3aaSbaaSqaaiabdMgaPbqabaGccqGHsislcqWGVbWBdaWgaaWcbaGaemyAaKgabeaaaOGaay5bSlaawIa7aaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaa@43EA@

where y i is the predicted value, o i is the observed value, and N the total number of the predictions. The normalized MAE (NMAE) is defined as follows

NMAE = 1 N ∑ i = 1 N | y i − o i | o i . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeeOta4Kaeeyta0KaeeyqaeKaeeyrauKaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaacqWGobGtaaGcdaaeWbqcfayaamaalaaabaWaaqWaaeaacqWG5bqEdaWgaaqaaiabdMgaPbqabaGaeyOeI0Iaem4Ba82aaSbaaeaacqWGPbqAaeqaaaGaay5bSlaawIa7aaqaaiabd+gaVnaaBaaabaGaemyAaKgabeaaaaaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoakiabc6caUaaa@4952@

The last one is the root-mean-square error (RMSE) given by

RMSE = 1 N ∑ i = 1 N ( y i − o i ) 2 . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaeeOuaiLaeeyta0Kaee4uamLaeeyrauKaeyypa0ZaaOaaaeaajuaGdaWcaaqaaiabigdaXaqaaiabd6eaobaakmaaqahabaWaaeWaaeaacqWG5bqEdaWgaaWcbaGaemyAaKgabeaakiabgkHiTiabd+gaVnaaBaaaleaacqWGPbqAaeqaaaGccaGLOaGaayzkaaWaaWbaaSqabeaacqaIYaGmaaaabaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaleqaaOGaeiOla4caaa@45BD@

MAE, NMAE, and RMSE are 0.0775, 1.877, and 0.0969, respectively, also shown in Table 1 and distributions of MAE and NMAE are shown in Figure 2c and Figure 2d, respectively. MAE is always lower than 0.2 for all the range of observed MaxSub scores when the window size is set to 0.5.

Table 1 Performance of SVR models for overall test set and at three levels of SCOP hierarchy. Pearson stands for Pearson correlation coefficient. MAE, NMAE, and RMSE are types of error

Adaptive selection of the alignment options having the best MaxSub score

The ultimate objective of predicting alignment quality is to find the best alignment. One straightforward, although not the best, way to do this is to choose a set of the optimal alignment parameters, such as gap opening penalty, gap extension penalty, baseline score, and the amount of secondary structure term, that would yield the best alignments overall. However, as seen in Table 2 where the average MaxSub scores for the alignments generated with various different combination of the alignment parameters are shown, there is no such single set of parameters that are universally optimal for all query-template pairs. For example, for the query-template protein pairs that are related at the family level, the optimal alignment parameters are 9, 1, 1, and 0.5 for gap opening penalty, gap extension penalty, baseline score, and the secondary structure information, respectively, while those parameters change to 12, 2, 0, and 2 for the protein pairs that are related at the fold level. Overall, the maximum of average MaxSub scores is 0.2386 with the optimal alignment parameters of 9, 1, 1, and 1, which interestingly are not the optimal parameters for the protein pairs related at any level of similarity.

Table 2 Average MaxSub scores of the alignments generated by using various combinations of alignment parameters for the protein pairs related at the three SCOP levels. Open, Extension, and Baseline column shows gap open penalty, gap extension penalty and baseline value, respectively. '2nd' stands for the weight of predicted secondary structure. The best option showing highest MaxSub at each level is bolded.

The results suggest the following alignment strategy. Instead of using single universal set of alignment parameters for all query-template pairs, by simply picking up a different set of the alignment parameters that are uniquely optimal for a query-template pair, the alignment can be improved. If we do so, as seen in Table 3, the average of the overall MaxSub scores improves from 0.2386 to 0.2887 (0.0501 point improvement, corresponding to roughly 21% improvement).

Table 3 Comparison of average MaxSub scores. The values in the first row "Overall best option" are retrieved from Table 2.

Obviously, we do not know a priori which set of alignment parameters is optimal for a given query-template pair because the structure of a query protein is not known. Therefore, here we propose the 'adaptive selection' method. The adaptive selection procedure is carried out as follows. (1) Generate the alignments using many different combinations of alignment parameters. (2) Predict MaxSub scores of alignments using the trained SVR models. (3) Select the alignment that gives the highest predicted MaxSub score.

When we follow the adaptive selection procedure, the average of actual MaxSub scores of the alignments selected by the adaptive selection procedure improves to 0.2563 (Table 3), which corresponds to 0.0177 point or 7.42% improvement, compared to the single best option procedure. This improvement is statistically significant (p-value < 10-300 calculated by Wilcoxon signed rank test [31]). It also indicates that the adaptive selection method can scoop roughly 35.3% (0.0177 vs. 0.0501) of the maximum improvement that can be achievable by selecting the optimal alignment parameters unique to each query-template pair. Moreover, it also implies that it is possible to improve the alignment quality even more by developing more accurate alignment quality prediction method.

Performance at three levels of SCOP hierarchy

In this section, we describe performance at three levels of SCOP hierarchy (family, superfamily, and fold) to closely examine where the improvement is achieved. All the experiments carried out in the previous section are done for testing sets at the three different levels.

The density diagram in Additional file 1 shows the correlation at the family level. It looks similar to Figure 2a except that it shows weak correlation in low MaxSub score region. The reason seems to be that alignments of pairs at the family level likely have high MaxSub scores, and SVR models have not experienced sufficient alignments that have low MaxSub scores during the training stage. The correlation coefficient, MAE, NMAE and RMSE is 0.9185, 0.0630, 0.3112 and 0.0936, respectively (Table 1). Additional file 1 shows the number of alignments in different regions of observed MaxSub score. Additional file 2 shows the correlation at the superfamily level. It shows rather weak correlation in high MaxSub score region. The correlation coefficient, MAE, NMAE and RMSE is 0.8318, 0.0773, 1.8344 and 0.0962, respectively (Table 1). Contrary to the case of the family level, there are not many examples in high observed MaxSub region, which is the reason for weak correlation in high score region. The density map in Additional file 3 represents the correlation at the fold level. The correlation coefficient, MAE, NMAE and RMSE is 0.6106, 0.0848, 2.6738 and 0.0988, respectively (Table 1). Like the case of the superfamily level, it seems to show weak correlation at high score region.

In Table 2, the MaxSub scores are presented at three different levels. The averages are 0.6090, 0.1713, and 0.0651, and the values for best options are 0.6183, 0.1888, and 0.0793 at the level of family, superfamily, and fold, respectively. These values are also compared with corresponding scores achieved by adaptive selection method (Table 3). It is also observed that adaptive selection method shows higher performance at the every SCOP level as for overall testing set showing an improvement of 1.16, 12.7, and 20.2% at the family, superfamily and fold level, respectively.

To check diversity of test set, sequence identities of query-template pairs are presented in Fig at each SCOP level, family (Figure 3a), superfamily (Figure 3b), and fold (Figure 3c). The average values of sequence identity are 30.95%, 13.03%, and 11.51% at each SCOP level, respectively. Except for some pairs in the test set at family level, the sequence identities of almost all pairs are under 35%, "twilight zone [32]." The distribution tells our results are not based on high sequence identity.

Figure 3
figure 3

Distribution of sequence identities on the test set. Distribution of sequence identities of the query-template pairs on the test set at (a) family (b) superfamily (c) fold level.

Alignments of pairs that are not related

All protein pairs in the testing set used in the experiments share the similar structure at least at the fold level (see Methods). It is, therefore, necessary to check whether the trained SVR models show reliable performance for proteins which do not share the same fold. In order to check this, 10 unrelated proteins per each template are randomly selected, aligned against the templates, and transformed into feature vectors. The vectors are then applied to SVR models of the templates to predict MaxSub scores. All the observed MaxSub scores are zero without exception. Thus all the predicted values should be zero in ideal situation. Histogram of predicted values is shown in Figure 4. Unfortunately, most predicted values are not zero. The mean is 0.1979 and the standard deviation is 0.1257. We can infer here that SVR models predict the MaxSub scores larger than the true values in low MaxSub score region. The histogram shows that the true MaxSub scores of alignments predicted to have MaxSub score near 0.1 might be zero. Thus, if a predicted MaxSub is low and is not zero, it should be carefully examined.

Figure 4
figure 4

Histogram of predicted MaxSub scores of the alignments of the pairs that are not related at the fold level.

Alignments of the pairs whose MaxSub scores are zero despite being in the same family

It is expected that two proteins in the same SCOP family have a similar 3D structure. There are, however, many alignments of the pairs in the same family for which observed MaxSub scores are zero (Additional file 1). When MaxSub score is zero, the alignment is completely incorrect by definition [25]. For these pairs, we check how much improvement can be achieved by adaptive selection method. Figure 5 shows histogram of MaxSub scores which is given by adaptive selection method for the alignments of those pairs. For about 37.3% of all pairs, there is no improvement, while about 62.7% of pairs achieve some improvement. In other words, around 63% of completely incorrect alignments between a pair of protein related at the family level are corrected into partially corrected alignments by changing alignment options by adaptive selection method.

Figure 5
figure 5

Histogram of MaxSub scores by adaptive selection method for the alignments of the pairs sharing the same family whose MaxSub score is zero when single best alignment option method is used.

Then, what are the reasons that remaining 37.3% of pairs gain no improvement? The most obvious one is regression error. Adaptive selection method might wrongly select an option due to regression error although there is another option that might give improved MaxSub score. When we examine the data, it appears that 17.9% constitute this type. Second, it may result from the limitation of profile-profile alignments. It has been well known that profile-profile alignment is not always the optimal alignment when compared to the structure alignment. It may fail to align a query against a particular template with any alignment options due to problem of alignment method itself. The third reason may be the lack of alignment options in our method. Although 48 options are used in our work, they may not be sufficient because the options used here do not cover all possible cases. For example, to align a particular pair of proteins, abnormally large gap open penalty might be necessary.

The fourth reason may be the limitation of MaxSub score as a measure of alignment quality. There have been a number of assessment methods for alignment quality. It has been controversial what evaluation method is the best. There are many alternative measures such as GDT_TS [33, 34], LGscore [35] and MAMMOTH [36]. Another aspect is that MaxSub score is basically sequence-dependent assessment. In sequence-dependent assessment, only corresponding residues in alignment are compared. It is stricter than sequence-independent assessment [37, 38] for alignments which are slightly shifted from the optimal alignment, which might make MaxSub scores of some alignments become zero. Our method might be improved by combining these sequence-dependent and sequence-independent methods.

Finally, some template structures may not be good for predicting the structure of a query protein, even though they are in the same family with a query protein. One example of this case is an alignment of a query protein, d1tsk__, against a template, d1chl__, both of which belong to the same family (g.3.7.2). All MaxSub scores of the alignments generated by using all 48 options are zero. To check whether it is caused by the problem of profile-profile method, we perform the structural alignment by CE algorithm [39], and we find that the MaxSub score of this structural alignment is also zero. Figure 6 shows a superposition of these two proteins. It can be inferred that there are bad templates for structure prediction although they are the same family member with a query protein. It might be caused by strict definition of MaxSub. However, in the view of MaxSub, the template d1chl__ is apparently a bad one for the query.

Figure 6
figure 6

Superposition of 2 SCOP domains. Superposition of SCOP domain d1tsk__ (bright) onto d1chl__ (dark), both of which belong to the same family (g.3.7.2).

Such alignments are tested by the fold recognition method developed in the previous study [24] to see their fold recognition scores. The raw SVM outputs are converted into posterior probabilities [40], ranging from zero to one, and the distribution of these probabilities is shown in Figure 7. The distribution exhibits two peaks, near zero and one. If we choose decision-threshold as 0.5, roughly 15% of pairs are classified into protein pairs sharing the same family. Let us consider a situation where one tries to predict the protein structure and chooses the templates by means of fold-recognition score only. For some cases, if a certain template is selected simply because it is predicted to be homologous at the family level, the final result of structure prediction might be failed due to wrong selection of the template. Adaptive selection method may help to filter this sort of templates out and can prevent ones from selecting these bad templates.

Figure 7
figure 7

Distribution of posterior probabilities of outputs of SVM for fold-recognition.

Benchmark test

The benchmark test of adaptive selection method is carried out on 62 targets of CASP7. We use EsyPred3D and Multiple Mapping Method (MMM) for the comparing. Both are publicly available web servers, and alignments and 3D models are provided. We used the default options of the servers.

Out of all 88 targets of CASP7, 77 targets have significantly close template in SCOP 1.69 according to the result of fold search by Proteinsilico [41]. The templates of 62 targets of those are trained in our dataset, and these target-template pairs are used in the benchmarking.

Table 4 shows the performances of MMM, EsyPred3D, and adaptive selection method. The greatest values of MaxSub, Mammoth Z-score, TM-score [42], GDT_TS for each pair are bolded. Our method gives better alignments having larger MaxSub than other two methods on average (0.264 vs. 0.203 and 0.182). In the aspect of other measures the adaptive method also shows the best performance. In addition, the values of our method are statistically significant according to p-values calculated by Wilcoxon signed rank test [31] with significance level 0.05.

Table 4 Alignment performances on CASP 7 using MMM, ESyPred3D, and Adaptive method. The highest value for each pair is bolded. P-values are calculated by Wilcoxon signed rank test.

Conclusion

In the process of protein sequence alignment, generally only one particular set of alignment parameters is used throughout the all protein pairs, regardless of their evolutionary relationship. In some cases, many alignments are generated using many different combinations of alignment parameters, and then the potentially optimal alignment is chosen purely based on experience or intuition. In this work, however, we select the alignment parameters which are predicted to give the highest MaxSub score specific to a pair of a query and a template. Our work is distinguishable to other efforts to improve the quality of protein sequence alignments in that we directly predict alignment quality with quite good accuracy. By predicting the alignment quality and then choosing the optimal alignment parameters based on the prediction, we show that the alignment quality can be improved significantly. Our method can be utilized to select not only the optimal alignment parameters for a chosen template but also good templates with which the structure of a query protein can be best predicted.

In summary, we develop a method to predict the MaxSub score as an alignment quality of a given profile-profile alignment between a query and a template. The alignment between a query protein and a template of length n is transformed into a (n + 1)-dimensional feature vector. These feature vectors are used to train the SVR models for the templates. We rigorously test the performance of the method using various evaluation measures such as Pearson correlation coefficient, MAE, NMAE, and RMSE. Results show the high correlation coefficient of 0.945 and low prediction errors. Trained SVR models are then applied to select the best alignment option which is chosen specifically to the pair of a query and a template. This adaptive selection procedure results in 7.4% improvement of MaxSub scores, compared to the scores when single best option is used for the all query-template pairs.

Methods

Data

To make a template library, classification by the SCOP version 1.69 [43] is used. First, the fold library composed of ~11,130 domains is constructed using domain subsets with less than 90% sequence identity to each other prepared by ASTRAL Compendium [44]. We choose the folds containing at least 20 members for training and testing the SVR models. A total of 7509 domains in 122 folds are selected as a result. Two thirds are used to train and the rest is used to test. To estimate the performance, we employ the three-fold cross-validation procedure.

MaxSub score as alignment quality (target of each SVR)

Conventionally, the alignment quality is calculated by comparing the sequence alignment and the structural alignments generated by various structure alignment programs such as SARF [30], CE and MAMOTH, assuming that the structure alignments are the gold standard. A problem of this approach is that depending on the specific choice of structure alignment program, the structure alignments can vary significantly, especially for distant homolog pairs. A different approach is that first the structure prediction model of a query protein is quickly generated by directly copying C-α positions of all aligned residues of the template protein using the sequence alignment, and then the protein structure model quality measure such as MaxSub [25] or TM-score [42] is calculated and used as a alignment quality score. The second approach is more relevant to the present study, because the main focus of this work is how to generate good sequence alignments that would eventually lead to better structure models. Specifically, we use MaxSub [25], a popular model quality measure which finds the largest subset of Cα atoms of a model that superimpose well over the experimental structure. At the stage of training, each alignment is converted into a structure model of the query protein. MaxSub score is then calculated using the model derived from the alignment and the correct structure, with d parameter set to 3.5 Å which has been found to be a good choice for the evaluation of fold-recognition models [25]. We have also considered to use TM-score [42], another popular model quality measure, as the alignment quality measure. However, it turned out that the correlation between MaxSub scores and TM-scores was as high as 0.95. Therefore, we expect that our specific choice of MaxSub score as the alignment quality measure does not affect the performance of our method and the main conclusion of this work.

Profile-profile alignments and SVR feature vectors

To train SVR models for all templates in the training set, feature vector scheme developed in previous work [24] is adopted with slight modification. We first generate all-against-all alignments within the set sharing the same fold by profile-profile alignment scheme with 48 different combinations of alignment parameters (gap open-penalty, gap extension-penalty, base-line score, and weight of predicted secondary structure). The profile-profile alignment score to align the position i of a query q and the position j of a template t is given by

m i j = ∑ k = 1 20 [ f i k q S j k t + S i k q f j k t ] + s ij + b MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemyBa02aaSbaaSqaaiabdMgaPjabdQgaQbqabaGccqGH9aqpdaaeWbqaamaadmaabaGaemOzay2aa0baaSqaaiabdMgaPjabdUgaRbqaaiabdghaXbaakiabdofatnaaDaaaleaacqWGQbGAcqWGRbWAaeaacqWG0baDaaGccqGHRaWkcqWGtbWudaqhaaWcbaGaemyAaKMaem4AaSgabaGaemyCaehaaOGaemOzay2aa0baaSqaaiabdQgaQjabdUgaRbqaaiabdsha0baaaOGaay5waiaaw2faaaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaeGOmaiJaeGimaadaniabggHiLdGccqGHRaWkcqqGZbWCdaWgaaWcbaGaeeyAaKMaeeOAaOgabeaakiabgUcaRiabdkgaIbaa@59BD@

where f i k q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOzay2aa0baaSqaaiabdMgaPjabdUgaRbqaaiabdghaXbaaaaa@317A@ , f j k t MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaemOzay2aa0baaSqaaiabdQgaQjabdUgaRbqaaiabdsha0baaaaa@3182@ , S i k q MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aa0baaSqaaiabdMgaPjabdUgaRbqaaiabdghaXbaaaaa@3154@ and S j k t MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4uam1aa0baaSqaaiabdQgaQjabdUgaRbqaaiabdsha0baaaaa@315C@ are the frequencies and the position-specific score matrix (PSSM) scores of amino acid k and at position i of a template q and position j of a template t, respectively. For the secondary structure score (sij), a positive score is added (subtracted) if the predicted secondary structure of the query protein at the position i is the same (different) type of secondary structure of the template protein at position j. Finally, the constant baseline score (b) is added to the alignment score.

The frequency matrices and PSSMs are generated by running PSI-BLAST [8] with default parameters except for the number of iterations (j = 11) and the E-value cutoff (h = 0.001). For each template of length n in the training set, alignments with the other templates in the training set are generated. Then, these alignments are transformed, respectively, into (n + 1)-dimensional feature vectors,

(sa1, sa2, ..., sai, ..., san, queary _lenth)

where saiis the profile-profile alignment score at position i of a given template [45] and query_length is the length of the query protein (Figure 1). If gaps occur, fixed negative scores are arbitrarily assigned. This is the modified version of [24]. The difference is that we use query_length instead of total alignment score. Since the size of the vector, n is dependent on the length of template protein, we make the same number of SVRs for all templates.

SVR training

Only templates sharing at least the same fold with a target template are trained. To learn as many alignment examples as possible, 48 alignments are made per each pair of a query and a template (Table 2). Gap open penalty ranging from 5 to 13 is used; gap extension is one or two; baseline value is zero or one. The parameter for the predicted secondary structure information content is also varied. The input and the target of SVR are derived from the previous two sections. We would like to emphasize that there is no correct alignment example. Regression is basically a real value prediction. In training step for each input-target data of training sample, SVR models are trained with radial basis function (RBF) kernel without attempting serious performance optimization by SVMlight version 6.01 with the parameter gamma of 0.001 [46].

Availability and requirements

The method is implemented in the platform-independent web server, FORECAST as a part. It is freely available without any restriction at http://pbil.kaist.ac.kr/forecast

References

  1. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000, 29: 291-325. 10.1146/annurev.biophys.29.1.291.

    Article  CAS  PubMed  Google Scholar 

  2. Kinch LN, Wrabl JO, Krishna SS, Majumdar I, Sadreyev RI, Qi Y, Pei J, Cheng H, Grishin NV: CASP5 assessment of fold recognition target predictions. Proteins. 2003, 53 Suppl 6: 395-409. 10.1002/prot.10557.

    Article  PubMed  Google Scholar 

  3. Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234 (3): 779-815. 10.1006/jmbi.1993.1626.

    Article  CAS  PubMed  Google Scholar 

  4. McGuffin LJ, Street SA, Bryson K, Sorensen SA, Jones DT: The Genomic Threading Database: a comprehensive resource for structural annotations of the genomes from key organisms. Nucleic Acids Res. 2004, 32 (Database issue): D196-9. 10.1093/nar/gkh043.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Jones DT: Progress in protein structure prediction. Curr Opin Struct Biol. 1997, 7 (3): 377-387. 10.1016/S0959-440X(97)80055-3.

    Article  CAS  PubMed  Google Scholar 

  6. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147 (1): 195-197. 10.1016/0022-2836(81)90087-5.

    Article  CAS  PubMed  Google Scholar 

  7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.

    Article  CAS  PubMed  Google Scholar 

  8. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics. 1999, 15 (12): 1000-1011. 10.1093/bioinformatics/15.12.1000.

    Article  CAS  PubMed  Google Scholar 

  10. Wallner B, Fang H, Ohlson T, Frey-Skott J, Elofsson A: Using evolutionary information for the query and target improves fold recognition. Proteins. 2004, 54 (2): 342-350. 10.1002/prot.10565.

    Article  CAS  PubMed  Google Scholar 

  11. Ohlson T, Wallner B, Elofsson A: Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins. 2004, 57 (1): 188-197. 10.1002/prot.20184.

    Article  CAS  PubMed  Google Scholar 

  12. Heger A, Holm L: Exhaustive enumeration of protein domain families. J Mol Biol. 2003, 328 (3): 749-767. 10.1016/S0022-2836(03)00269-9.

    Article  CAS  PubMed  Google Scholar 

  13. Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 2000, 9 (2): 232-241.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  14. Yona G, Levitt M: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol. 2002, 315 (5): 1257-1275. 10.1006/jmbi.2001.5293.

    Article  CAS  PubMed  Google Scholar 

  15. Prasad JC, Comeau SR, Vajda S, Camacho CJ: Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics. 2003, 19 (13): 1682-1691. 10.1093/bioinformatics/btg211.

    Article  CAS  PubMed  Google Scholar 

  16. Lambert C, Leonard N, De Bolle X, Depiereux E: ESyPred3D: Prediction of proteins 3D structures. Bioinformatics. 2002, 18 (9): 1250-1256. 10.1093/bioinformatics/18.9.1250.

    Article  CAS  PubMed  Google Scholar 

  17. Rai BK, Fiser A: Multiple mapping method: a novel approach to the sequence-to-structure alignment problem in comparative protein structure modeling. Proteins. 2006, 63 (3): 644-661. 10.1002/prot.20835.

    Article  CAS  PubMed  Google Scholar 

  18. John B, Sali A: Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res. 2003, 31 (14): 3982-3992. 10.1093/nar/gkg460.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Contreras-Moreira B, Fitzjohn PW, Bates PA: In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling. J Mol Biol. 2003, 328 (3): 593-608. 10.1016/S0022-2836(03)00309-7.

    Article  CAS  PubMed  Google Scholar 

  20. Robson B, Osguthorpe DJ: Refined models for computer simulation of protein folding. Applications to the study of conserved secondary structure and flexible hinge points during the folding of pancreatic trypsin inhibitor. J Mol Biol. 1979, 132 (1): 19-51. 10.1016/0022-2836(79)90494-7.

    Article  CAS  PubMed  Google Scholar 

  21. Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci. 2002, 11 (2): 430-448. 10.1110/ps.25502.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  22. Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing. 2004, 14 (3): 199-222. 10.1023/B:STCO.0000035301.49549.88.

    Article  Google Scholar 

  23. Vapnik VN: Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. 1998, New York , Wiley, xxiv, 736 p.-

    Google Scholar 

  24. Han S, Lee BC, Yu ST, Jeong CS, Lee S, Kim D: Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics. 2005, 21 (11): 2667-2673. 10.1093/bioinformatics/bti384.

    Article  CAS  PubMed  Google Scholar 

  25. Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics. 2000, 16 (9): 776-785. 10.1093/bioinformatics/16.9.776.

    Article  CAS  PubMed  Google Scholar 

  26. Bradley P, Chivian D, Meiler J, Misura KM, Rohl CA, Schief WR, Wedemeyer WJ, Schueler-Furman O, Murphy P, Schonbrun J, Strauss CE, Baker D: Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins. 2003, 53 Suppl 6: 457-468. 10.1002/prot.10552.

    Article  PubMed  Google Scholar 

  27. Fischer D, Elofsson A, Rychlewski L, Pazos F, Valencia A, Rost B, Ortiz AR, Dunbrack RL: CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins. 2001, Suppl 5: 171-183. 10.1002/prot.10036.

    Article  CAS  PubMed  Google Scholar 

  28. Rychlewski L, Fischer D: LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci. 2005, 14 (1): 240-245. 10.1110/ps.04888805.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  29. Xu J: Fold recognition by predicted alignment accuracy. IEEE/ACM Trans Comput Biol Bioinform. 2005, 2 (2): 157-165. 10.1109/TCBB.2005.24.

    Article  CAS  PubMed  Google Scholar 

  30. Alexandrov NN: SARFing the PDB. Protein Eng. 1996, 9 (9): 727-732. 10.1093/protein/9.9.727.

    Article  CAS  PubMed  Google Scholar 

  31. Wilcoxon F: Individual Comparisons by Ranking Methods. Biometrics Bulletin. 1945, JSTOR, 1 (6): 80-83. 10.2307/3001968.

  32. Rost B: Twilight zone of protein sequence alignments. Protein Eng. 1999, 12 (2): 85-94. 10.1093/protein/12.2.85.

    Article  CAS  PubMed  Google Scholar 

  33. Kryshtafovych A, Venclovas C, Fidelis K, Moult J: Progress over the first decade of CASP experiments. Proteins. 2005, 61 Suppl 7: 225-236. 10.1002/prot.20740.

    Article  PubMed  Google Scholar 

  34. Zemla A, Venclovas C, Moult J, Fidelis K: Processing and analysis of CASP3 protein structure predictions. Proteins. 1999, Suppl 3: 22-29. 10.1002/(SICI)1097-0134(1999)37:3+<22::AID-PROT5>3.0.CO;2-W.

    Article  CAS  PubMed  Google Scholar 

  35. Cristobal S, Zemla A, Fischer D, Rychlewski L, Elofsson A: A study of quality measures for protein threading models. BMC Bioinformatics. 2001, 2: 5-10.1186/1471-2105-2-5.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  36. Ortiz AR, Strauss CE, Olmea O: MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 2002, 11 (11): 2606-2621. 10.1110/ps.0215902.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  37. Lackner P, Koppensteiner WA, Domingues FS, Sippl MJ: Automated large scale evaluation of protein structure predictions. Proteins. 1999, Suppl 3: 7-14. 10.1002/(SICI)1097-0134(1999)37:3+<7::AID-PROT3>3.0.CO;2-V.

    Article  CAS  PubMed  Google Scholar 

  38. Marchler-Bauer A, Bryant SH: A measure of progress in fold recognition?. Proteins. 1999, Suppl 3: 218-225. 10.1002/(SICI)1097-0134(1999)37:3+<218::AID-PROT28>3.0.CO;2-X.

    Article  CAS  PubMed  Google Scholar 

  39. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11 (9): 739-747. 10.1093/protein/11.9.739.

    Article  CAS  PubMed  Google Scholar 

  40. Platt JC: Probabilities for SV Machines. Advances in large margin classifiers. Edited by: Smola AJ, Bartlett PJ, Scholkopf B, Schuurmans D. 2000, Cambridge, Mass. , MIT Press, 61-74.

    Google Scholar 

  41. Fold Search of CASP7 Target against SCOP 1.69. [http://www.proteinsilico.org/ROKKO/casp7/native/casp7_zscore_ce.html]

  42. Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins. 2004, 57 (4): 702-710. 10.1002/prot.20264.

    Article  CAS  PubMed  Google Scholar 

  43. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247 (4): 536-540. 10.1006/jmbi.1995.0159.

    CAS  PubMed  Google Scholar 

  44. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res. 2004, 32 (Database issue): D189-92. 10.1093/nar/gkh034.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  45. Tress ML, Jones D, Valencia A: Predicting reliable regions in protein alignments from sequence profiles. J Mol Biol. 2003, 330 (4): 705-718. 10.1016/S0022-2836(03)00622-3.

    Article  CAS  PubMed  Google Scholar 

  46. Joachims T: Making large-scale support vector machine learning practical. Advances in kernel methods : support vector learning. Edited by: Schölkopf B, Burges CJC, Smola AJ. 1999, Cambridge, Mass. , MIT Press, 169-184.

    Google Scholar 

Download references

Acknowledgements

This work is supported by Ministry of Science and Technology of Korea (grant number: 2007-03994) and KISTI Supercomputing Center (KSC-2007-S00-1025). ML was supported by Kim Bo Jeong Basic Science Scholarship of KAIST and thanks Byung-chul Lee for his kind help in making the figure showing a superposition of two proteins.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dongsup Kim.

Additional information

Authors' contributions

ML wrote the code for the analysis, carried out the training and testing SVRs, and drafted the manuscript. CSJ wrote the code for profile-profile alignment and implemented the code which generates input feature vectors for SVRs. DK participated in the design of the work and collaborated in writing the manuscript. All authors have read and approved the manuscript.

Electronic supplementary material

12859_2007_1843_MOESM1_ESM.eps

Additional file 1: Performance of SVR models at the family level. (a) Correlations between observed and predicted MaxSub scores at the family level. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution (EPS 95 KB)

12859_2007_1843_MOESM2_ESM.eps

Additional file 2: Performance of SVR models at the superfamily level. (a) Correlations between observed and predicted MaxSub scores at the superfamily level. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution (EPS 96 KB)

12859_2007_1843_MOESM3_ESM.eps

Additional file 3: Performance of SVR models at the fold level. (a) Correlations between observed and predicted MaxSub scores at the fold level. Adjacent color bar shows the mapping of relative density. (b) Plot of frequency distribution. (c) Plot of MAE distribution. (d) Plot of NMAE distribution. (EPS 96 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Lee, M., Jeong, Cs. & Kim, D. Predicting and improving the protein sequence alignment quality by support vector regression. BMC Bioinformatics 8, 471 (2007). https://doi.org/10.1186/1471-2105-8-471

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-8-471

Keywords