Open Access Research article

Benchmarking consensus model quality assessment for protein fold recognition

Liam J McGuffin

Author Affiliations

The School of Biological Sciences, University of Reading, Whiteknights, Reading RG6 6AS, UK

BMC Bioinformatics 2007, 8:345  doi:10.1186/1471-2105-8-345

The electronic version of this article is the complete one and can be found online at:

Received: 14 June 2007
Accepted: 18 September 2007
Published: 18 September 2007

© 2007 McGuffin; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Selecting the highest quality 3D model of a protein structure from a number of alternatives remains an important challenge in the field of structural bioinformatics. Many Model Quality Assessment Programs (MQAPs) have been developed which adopt various strategies in order to tackle this problem, ranging from the so called "true" MQAPs capable of producing a single energy score based on a single model, to methods which rely on structural comparisons of multiple models or additional information from meta-servers. However, it is clear that no current method can separate the highest accuracy models from the lowest consistently. In this paper, a number of the top performing MQAP methods are benchmarked in the context of the potential value that they add to protein fold recognition. Two novel methods are also described: ModSSEA, which is based on the alignment of predicted secondary structure elements, and ModFOLD, which combines several true MQAP methods using an artificial neural network.


The ModSSEA method is found to be an effective model quality assessment program for ranking multiple models from many servers, however further accuracy can be gained by using the consensus approach of ModFOLD. The ModFOLD method is shown to significantly outperform the true MQAPs tested and is competitive with methods which make use of clustering or additional information from multiple servers. Several of the true MQAPs are also shown to add value to most individual fold recognition servers by improving model selection, when applied as a post filter in order to re-rank models.


MQAPs should be benchmarked appropriately for the practical context in which they are intended to be used. Clustering based methods are the top performing MQAPs where many models are available from many servers; however, they often do not add value to individual fold recognition servers when limited models are available. Conversely, the true MQAP methods tested can often be used as effective post filters for re-ranking few models from individual fold recognition servers and further improvements can be achieved using a consensus of these methods.


It is clear that one of the remaining challenges hindering the progress of protein fold recognition and comparative modelling is the selection of the highest quality 3D model of a protein structure from a number of alternatives [1]. The identification of appropriate templates used for building models has been significantly improved both through profile-profile alignments and meta-servers, to the extent that traditional threading methods are becoming less popular for fold recognition. Increasingly, for the majority of sequences with unknown structures, the problem is no longer one of template identification; rather it is the selection of the sequence to structure alignment that produces the most accurate model.

A number of methods have been developed over recent years in order to estimate the quality of models and improve selection. A popular technique has been to use methods such as PROCHECK [2] and WHATCHECK [3] in order to evaluate stereochemical quality following comparative modelling. These methods were developed in order to check the extent to which a model deviates from real X-ray structures based on a number of observed measures. However, such evaluations are often insufficient to differentiate between stereochemically correct models. Traditionally, a variety of energy-based programs have been developed more specifically for the discrimination of native-like models from decoy structures. These programs were based either on empirically derived physical energy functions or statistical potentials derived from the analysis of known structures [4]. For some time, methods such as PROSAII [5] and VERIFY3D [6] have been in popular use for rating model quality. More recently, methods such as PROQ [7], FRST [8] and MODCHECK [9] have proved to be more effective at enhancing model selection.

During the 4th Critical Assessment of Fully Automated Structure Prediction (CAFASP4), such methods were collectively termed as Model Quality Assessment Programs (MQAPs) and a number of them were evaluated in a blind assessment [10]. For the purposes of CAFASP4, an MQAP was defined as a program which took as its input a single model and which outputted a single score representing the quality of that model. Developers were encouraged to submit MQAPs as executables, which were subsequently used to evaluate models by the assessors.

More recently, quality assessment (QA) was incorporated as a new "manual" prediction category in the 7th Critical Assessment of Techniques for Protein Structure Prediction (CASP7) [11]. The QA category was divided into two sub-categories: QMODE 1, referring to the prediction of the overall model quality, and QMODE 2, in which the quality of individual residues in the model was predicted. In the QMODE 1 category, the format of the new experiment allowed users to run their methods in-house and then submit a list of server models with their associated predicted model quality scores. While this new format had certain advantages, it also allowed more flexibility in the type of methods which could be used for quality assessment. For example, this format allowed methods to be used which could not be evaluated as "true" MQAPs in the original sense, such as meta-server approaches which may have used the clustering of multiple models or incorporated additional information about the confidence of models from the fold recognition servers.

In this paper, several of the top performing MQAPs are benchmarked in order to gauge their value in the enhancement of protein fold recognition. A number of top performing "true" MQAP methods are compared against some of the best clustering and meta-server approaches. In addition, two novel methods, which can be described as true MQAPs according to the original definition, are also benchmarked. The first is the ModSSEA method, which is based on the secondary structure element alignment (SSEA) score previously benchmarked [12] and incorporated into versions of mGenTHREADER [13] and nFOLD [14]. The second is ModFOLD, which combines the output scores from the ProQ methods [15], the MODCHECK method [9] and the ModSSEA method using an artificial neural network.

Results and discussion

Measurement of the correlation of predicted and observed model quality

The official CASP7 assessment of MQAP methods in the QMODE1 category involved measuring the performance of methods based on the correlation coefficients between predicted and observed model quality scores. In this section, the analysis is repeated both on a global and target-by-target basis. In Figure 1, each point on the plot represents a model submitted by a server to the CASP7 experiment. The models from all targets have been pooled together and so the "global correlation" is shown. The ModFOLD output score is clearly shown to correlate well with observed mean model quality score.

Figure 1. Predicted model quality scores versus observed model quality scores. The ModFOLD scores are plotted against the observed combined model quality scores ((TM-score+MaxSub+GDT)/3), for models submitted by the automated fold recognition servers to the CASP7 tertiary structure category (TS1 and AL1 models have been included).

In Table 1, the global measures of Spearman's rank correlation coefficients (ρ) between predicted and observed model quality scores are shown for a number of the top performing MQAP methods. The Spearman's rank correlation is used in this analysis, as the data are not always found to be linear and normally distributed. The results shown here confirm the results in the official CASP7 assessment and show the LEE method and the ModFOLD method outperforming the other methods tested at CASP7 in terms of the global measure of correlation. Interestingly, the 3D-Jury method, which was not entered in the official assessment, is shown to outperform the LEE method based on all observed model quality scoring methods. The ModFOLD consensus approach appears to be working in this benchmark, as it is shown to outperform the individual constituent methods (MODCHECK, PROQMX, PROQLG and ModSSEA). The ModSSEA method, which was not individually benchmarked in the official assessment, also appears to be competitive with the established individual "true" MQAPs, which are capable of producing a single score based on a single model.

Table 1. Global measures of the Spearman's rank correlation coefficients (ρ)

The results in Table 2 again show the Spearman's rank correlation coefficients for each method, but in this instance the rho values are calculated for each target separately and then the mean overall rho value is taken. It is clear that the ordering of methods has changed and this was also shown to occur in the official assessment. The 3D-Jury method and the LEE method are still ranked as the top performing methods but there is a re-ordering of the other methods. Contrary to the results shown in Table 1, it would appear that there is no value from using the consensus approach of the ModFOLD method. How can these contradictory results be explained?

Table 2. Target-by-target measures of the Spearman's rank correlation coefficients (ρ)
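The difference between the global and the target-by-target measures can be illustrated with a minimal sketch. The function names and the toy records used with it are hypothetical and are not taken from the CASP7 data. Spearman's ρ is computed here as the Pearson correlation of tie-averaged ranks, which is one standard formulation and not necessarily the implementation used in this study.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors.

    Assumes both inputs contain at least two distinct values; otherwise the
    variance is zero and the division fails.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def global_and_per_target_rho(records):
    """records: list of (target_id, predicted, observed) tuples.

    Returns (rho over all pooled models, mean of the per-target rho values).
    """
    pooled = spearman_rho([p for _, p, _ in records], [o for _, _, o in records])
    by_target = defaultdict(list)
    for target, pred, obs in records:
        by_target[target].append((pred, obs))
    per_target = [
        spearman_rho([p for p, _ in pairs], [o for _, o in pairs])
        for pairs in by_target.values()
    ]
    return pooled, mean(per_target)
```

For example, with the toy records ("A", 0.1, 0.1), ("A", 0.2, 0.2), ("B", 0.8, 0.9) and ("B", 0.9, 0.8), the pooled ρ is 0.8 whereas the mean per-target ρ is 0, because the pooled measure is dominated by the separation between easy and hard targets.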

The results in Figure 1 appear to show a roughly linear relationship between the predicted and observed model quality scores with few outliers based on the global measure where the models are pooled together for all targets. However, when the results are examined for individual targets (Figure 2) the relationship is often non-linear, the data are not always normally distributed and there are often a proportionately greater number of outliers which can influence the rho values. In developing MQAPs for the improvement of fold recognition, the primary goal is to select the highest quality model possible from a number of alternative models. Does the measurement of the correlation coefficient on a target-by-target basis always help us to distinguish the best method for selecting the top model?

Figure 2. Examples showing the difficulty with relying on correlation coefficients as performance measures. Predicted model quality scores are plotted against the observed combined model quality scores on a target-by-target basis, for models submitted by the automated fold recognition servers to the CASP7 tertiary structure category (AL and TS models are included). a) The scaled MODCHECK scores are compared with the ModSSEA scores for the target T0304 models. The Spearman's rank correlation coefficient (ρ) between the MODCHECK scores and observed model quality scores is 0.66 and the observed model quality of the top ranked model (m) is 0.27 (the data point is circled in blue). The correlation coefficient for the ModSSEA method is lower (ρ = 0.50); however, the quality of the top ranked model is higher (m = 0.34) (the data point is circled in red). b) The ProQ scores are compared with the ModSSEA scores for the target T0283 models. For ProQ, ρ = 0.50 and m = 0.01, whereas for ModSSEA, ρ = 0.40 and m = 0.48. c) The scaled MODCHECK scores are compared with the ModFOLD scores for the target T0289 models. For MODCHECK, ρ = 0.61 and m = 0.13, whereas for ModFOLD, ρ = 0.53 and m = 0.47. d) The ProQ scores are compared with the ModFOLD scores for the target T0321 models. For ProQ, ρ = 0.48 and m = 0.11, whereas for ModFOLD, ρ = 0.17 and m = 0.24.

In Figure 2 (a-d), the scores from ModSSEA and ModFOLD are compared against MODCHECK and ProQ for four example CASP7 template based modelling targets. In these examples the rho values are higher for the MODCHECK and ProQ methods; however, it can be seen that the observed quality scores for the top ranked models (which have been denoted m here) are higher for the ModFOLD and ModSSEA methods. Of course, there are also several cases where the rho values for MODCHECK and ProQ are lower yet the m scores are higher than those of either ModFOLD or ModSSEA. Indeed, by testing on a target-by-target basis, it was found that, for each individual CASP7 target, the MQAP with the highest correlation coefficient between observed and predicted model quality was most often not the method with the highest observed quality of the top ranked model.

From the scatter plots in Figure 2 it is apparent that the correlation between observed and predicted model quality may not necessarily be the best measure of performance if we are interested in methods which can identify the highest quality models. In real situations, developers and users of fold recognition servers would arguably be most concerned with the selection of the best model from a number of alternatives for a given target. The comparison of correlation coefficients should not necessarily replace the individual examination of the data. However, the individual examination of data for each method and for each individual target may not always be practical. It is therefore suggested that, when benchmarking MQAPs for fold recognition, a more appropriate measure of usefulness would be simply the observed model quality of the top ranked model for each target (m).

Measurement of the observed model quality of the top ranked models (m)

Table 3 shows the cumulative model quality scores that can be achieved if each MQAP method is used to rank the top models from all servers for each target (results are highlighted in bold). In other words, the m scores are taken from each MQAP for each target and then the scores are added together. Higher cumulative observed model quality scores (Σm) can be achieved using the ModFOLD and ModSSEA methods than using the other true MQAPs, which are capable of producing a single score based on a single model (ProQ, ProQ-LG, PROQ-MX and MODCHECK).
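The calculation of m and Σm described above can be sketched as follows. This is a minimal illustration under assumed data structures (a dictionary mapping each target to a list of (predicted, observed) score pairs); the function names are hypothetical. The last function gives the upper bound obtained if the best model were always selected.

```python
def top_ranked_quality(models):
    """Return m: the observed quality of the model with the highest predicted score.

    models: list of (predicted_score, observed_quality) pairs for one target.
    If several models share the highest predicted score, the first one is used.
    """
    return max(models, key=lambda pair: pair[0])[1]


def cumulative_quality(targets):
    """Return the sum of m over all targets.

    targets: dict mapping target id -> list of (predicted, observed) pairs.
    """
    return sum(top_ranked_quality(models) for models in targets.values())


def max_achievable_quality(targets):
    """Return the cumulative score if the best model were chosen for every target."""
    return sum(max(obs for _, obs in models) for models in targets.values())
```

With two toy targets, T1 = [(0.9, 0.3), (0.5, 0.6)] and T2 = [(0.2, 0.1), (0.7, 0.5)], the MQAP picks models of quality 0.3 and 0.5 (Σm = 0.8), while the maximum achievable score is 1.1.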

Table 3. Cumulative observed model quality scores for each MQAP (TS1 and AL1 models)

The methods which rely on the comparison of multiple models and/or additional information from multiple servers (3D-Jury, LEE and Pcons) are shown to greatly outperform the individual true MQAPs, however the consensus approach taken by ModFOLD is shown to be competitive.

The cumulative model quality scores of the TS1 or AL1 models from each fold recognition server are also shown in Table 3. The 3D-Jury, Pcons, LEE and ModFOLD methods achieve a higher cumulative score than all fold recognition servers except the Zhang-Server. It must be noted that the cumulative scores which can be achieved by ranking models using any of the existing MQAP methods are still far lower than the maximum achievable MQAP score obtained if the best model were to be consistently selected for each target. Table 4 shows the cumulative observed model quality scores if MQAP methods are used to rank all models from all servers. For all of the methods, except the 3D-Jury method, there is a reduction in the cumulative observed model quality. The LEE method outperforms the Pcons method but the relative performance of all other methods is unchanged. However, are the differences in m scores from the different MQAP methods significant?

Table 4. Cumulative observed model quality scores for each MQAP (all models)

Often the differences observed between methods in terms of cumulative observed model quality scores (Σm) may not be significant. The results in Tables 5, 6 and 7 are provided to demonstrate that the rankings between methods shown in Tables 3 and 4 are only relevant if a significant difference is observed according to the Wilcoxon signed rank sum tests. The p-values for Wilcoxon signed rank sum tests comparing the MQAP methods are shown in Tables 5, 6 and 7. The null hypothesis is that the observed model quality scores of the top ranked models (m) from method x are less than or equal to those of method y. The alternative hypothesis is that the m scores for method x are greater than those of method y.
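A one-sided test of this kind can be sketched as below. This is an illustrative implementation only: it assumes the large-sample normal approximation with no continuity or tie correction, and it drops zero differences. The text does not state which variant or software was used for the published p-values, and the function name is hypothetical.

```python
from statistics import NormalDist


def wilcoxon_greater(x, y):
    """Approximate one-sided p-value for H1: paired m scores in x exceed those in y.

    x, y: equal-length lists of m scores for the same targets, in the same order.
    Assumption: normal approximation, no continuity or tie correction.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0  # no evidence either way
    # Rank the absolute differences, sharing the mean rank among ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    # W+ is the sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    return 1.0 - NormalDist().cdf((w_plus - mean_w) / sd_w)
```

A small p-value rejects the null hypothesis that method x selects top models of equal or lower quality than method y. For small numbers of targets an exact test would be preferable to this approximation.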

Table 5. Calculated p-values for Wilcoxon signed rank sum tests (TM-score)

Table 6. Calculated p-values for Wilcoxon signed rank sum tests (MaxSub)

Table 7. Calculated p-values for Wilcoxon signed rank sum tests (GDT)

The top models selected using the 3D-Jury method are shown to be of significantly higher quality (p < 0.01) than those selected using any other method according to the TM-score, MaxSub score and GDT score. The top models selected using the ModFOLD method are of significantly higher quality than those of PROQ-MX, PROQ-LG and MODCHECK according to the TM-score (p < 0.01), MaxSub score (p < 0.05) and GDT score (p < 0.01) (Tables 5, 6 and 7). According to the MaxSub score, the top models selected by both LEE and Pcons are of significantly higher quality (p < 0.05) than those selected by ModFOLD (Table 6).

However, there is no significant increase in the quality of the top models selected by Pcons over those selected by ModFOLD according to the TM-score (Table 5). In addition, there is no significant increase in the quality of models selected by the LEE method over the ModFOLD method according to the GDT score (Table 7). Variation in the predicted secondary structures or other input parameters would explain the observed differences between the in-house version of ProQ-LG and the ProQ scores downloaded from the CASP7 website; however, the overall difference between scores is not shown to be significant (Tables 5, 6 and 7).

The ModSSEA method was developed independently for the CASP7 experiment, prior to the publication of the comparable method developed by Eramian et al. [16]. Although the two methods are similar in that they both compare the DSSP assigned secondary structure of the model against the PSIPRED predicted secondary structure of the target, they differ in their scoring. The two methods were found to show differences in cumulative observed model quality scores (a mean difference of 1.08), however none of these were found to be significant according to the Wilcoxon signed rank sum test with each measure of observed model quality: using the TM-score the p-value was 0.1765, using the MaxSub score the p-value was 0.1625 and using the GDT score the p-value was 0.1355.

Measurement of the confidence in the true MQAP output scores

One of the advantages of the so called "true" MQAPs (e.g. ProQ, MODCHECK, ModSSEA and ModFOLD) over clustering methods (e.g. 3D-Jury and LEE) and those which also use information from multiple fold recognition servers (e.g. Pcons), is that they provide a single consistent and absolute score for each individual model. This means that the models from different protein targets can be directly compared with one another on the same predicted model quality scale. Conversely, with clustering methods the scores for a given model are potentially variable as they are dependent on the relationship between many models of the same target protein. Similarly, the information which can be obtained from multiple fold recognition servers may vary from target to target. Therefore, the predicted model quality scores between different targets may not be directly comparable as they do not directly relate to model quality.

The consistency of the output scores from the true MQAPs is useful in the context of the structural annotation of proteomes, where it is important to be able to estimate the coverage of modelled proteins at a particular level of confidence. In order to be able to measure the confidence of a prediction we must be able to directly compare model quality scores from different protein targets. In Figure 3, the confidence in output scores from the 5 true MQAPs is compared by ranking all models according to predicted model quality and then plotting the number of true positives versus false positives, according to observed model quality, as the output scores decrease. A TM-score of 0.5 is used as a stringent cut-off to define false positives. Models above this cut-off are likely to share the same fold as the native structure [17]. A higher true positive rate is shown for the ModFOLD method than for the other MQAP methods tested at low rates of false positives. This indicates that we can have a higher confidence in the ModFOLD output score over the other true MQAP methods, implying that the ModFOLD method should be more useful in the context of proteome annotation using fold recognition. In other words, a higher coverage of high quality models can be selected with a lower number of errors.

Figure 3. A benchmark of the consistency of the ModFOLD predicted model quality score. The proportion of true positives is plotted against the proportion of false positives. The CASP7 fold recognition server models (21714 models from 87 targets; see Methods) were ranked by decreasing predicted model quality score using ModFOLD and the different MQAP methods that make up the ModFOLD method. False positives were defined as models with TM-scores ≤ 0.5, indicating models that have a different fold to the native structure. True positives were defined as models with TM-scores > 0.5, indicating models that share the same fold as the native structure [17]. The plot shows the proportion of true positives in the region of ≤ 10% false positives.
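The ranking procedure behind this type of plot can be sketched as follows. The sketch pools (predicted score, TM-score) pairs from all targets, sorts them by decreasing predicted score and accumulates the proportions of true and false positives using the TM-score cut-off of 0.5 given above. The function names and the input layout are assumptions made for illustration.

```python
def tp_fp_curve(models, cutoff=0.5):
    """Return a list of (false positive rate, true positive rate) points.

    models: list of (predicted_score, tm_score) pairs pooled over all targets.
    A model is a true positive if tm_score > cutoff, otherwise a false positive.
    """
    ranked = sorted(models, key=lambda pair: pair[0], reverse=True)
    total_tp = sum(1 for _, tm in ranked if tm > cutoff)
    total_fp = len(ranked) - total_tp
    if total_tp == 0 or total_fp == 0:
        raise ValueError("need at least one true positive and one false positive")
    tp = fp = 0
    curve = []
    for _, tm in ranked:  # walk down the list as the predicted score decreases
        if tm > cutoff:
            tp += 1
        else:
            fp += 1
        curve.append((fp / total_fp, tp / total_tp))
    return curve


def tp_rate_at(curve, max_fp_rate=0.1):
    """Highest true positive rate reached without exceeding max_fp_rate."""
    return max((tp for fp, tp in curve if fp <= max_fp_rate), default=0.0)
```

Because the true MQAP scores are on a common scale across targets, a single pooled ranking of this kind is meaningful; the same would not necessarily hold for scores that depend on the set of models available for each target.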

Benchmarking on standard decoy sets

It could be argued that data sets such as the CASP7 server models provide a more appropriate and larger test set for benchmarking MQAP methods, particularly in the practical context of fold recognition. Methods such as ModFOLD are often developed and tested for the selection of the best real fold recognition model rather than for the detection of the native fold amongst a set of artificial decoys.

However, in order to enable direct comparisons with additional published methods, benchmarking was carried out using three commonly used standard decoy sets from the Decoys 'R' Us [18] database (4state_reduced [19], lattice_ssfit [20] and LMDS [21]) and the results are shown in Table 8. The ModFOLD method appears to be competitive with other MQAPs using the standard decoy sets according to standard measures of performance such as the rank and Z-score of the native structure (see Tosatto's recent paper for a comparison of methods using these sets and scoring [8]). However, due to the smaller number of targets in these sets it is often not possible to calculate significant differences between the methods. It is also observed that the relative performance of methods appears to be dependent on which dataset is used, although it is not possible to draw sound conclusions from these data.
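The rank and Z-score of the native structure can be illustrated with a short sketch. Conventions differ between studies, so the choices here are assumptions: higher scores indicate better predicted quality, the Z-score is taken relative to the mean and population standard deviation of the decoy scores only, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def native_rank_and_z(native_score, decoy_scores):
    """Return (rank, z) of the native structure within one decoy set.

    Assumptions: higher score = better model; rank 1 means no decoy scores
    higher than the native; z is measured in decoy standard deviations.
    Fails if all decoy scores are identical (zero standard deviation).
    """
    rank = 1 + sum(1 for s in decoy_scores if s > native_score)
    z = (native_score - mean(decoy_scores)) / pstdev(decoy_scores)
    return rank, z
```

For energy-like scores, where lower values are better, both the comparison and the sign of the Z-score would be reversed.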

Table 8. Benchmarking based on three standard decoy sets from the Decoys 'R' Us database

Measurement of the added value of re-ranking few models from individual servers

It is clear from the cumulative observed model quality scores (Σm) in Tables 3 and 4 and Wilcoxon signed rank sum tests (Tables 5, 6 and 7) that if we have many models from multiple servers then the best MQAP methods to use are those which carry out comparisons between multiple models for the same target (e.g. 3D-Jury). However, what if only few models are available from an individual server? Can developers and users of individual fold recognition servers gain any added value from re-ranking their models using an MQAP method?

Figure 4 shows the difference in observed mean model quality score, or the "added value", obtained if the ModFOLD method is used to select the best model out of the 5 submitted by each individual server compared against using the 3D-Jury clustering approach. For most of the fold recognition servers tested, the model quality scores can be improved if ModFOLD is used as a post filter in order to re-rank models. However, on average the model quality score is decreased if a clustering approach, such as 3D-Jury, is used to re-rank models from the individual servers.

Figure 4. The added value of re-ranking models. The difference in the cumulative observed model quality score of the top ranked models is shown after the 5 models for each target provided by each server are re-ranked using the ModFOLD or 3D-Jury methods. Each bar represents Σ(mi-mj), where mi is the observed model quality of the top ranked model after the 5 server models are re-ranked and mj is the observed model quality of the original top ranked model submitted by the server. N.B. Only the common subset of servers which had submitted 5 models for all targets is included in the plot. The error bars show the standard error of the mean observed quality. Overall there is a mean increase of 0.44 in the cumulative observed model quality of the top ranked models if the ModFOLD method is used to re-rank the models provided by individual servers; however, there is a mean decrease of 0.56 if models are re-ranked using the 3D-Jury method (see Table 9). On the x axis, the first asterisk indicates a fold recognition server where the quality of the top ranking model can be significantly improved. An additional asterisk indicates a significant improvement of the ModFOLD method over the 3D-Jury method.
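The quantity Σ(mi-mj) for a single server can be computed as in the following sketch. The input layout (a dictionary mapping each target to the server's models as (server rank, MQAP score, observed quality) tuples) and the function name are assumptions made for illustration.

```python
def added_value(targets):
    """Return the sum over targets of (mi - mj) for one server.

    targets: dict mapping target id -> list of
             (server_rank, mqap_score, observed_quality) tuples.
    mj is the observed quality of the server's own top model (rank 1);
    mi is the observed quality of the model with the highest MQAP score.
    """
    total = 0.0
    for models in targets.values():
        original_top = min(models, key=lambda m: m[0])[2]  # mj
        reranked_top = max(models, key=lambda m: m[1])[2]  # mi
        total += reranked_top - original_top
    return total
```

A positive result means that re-ranking improved the server's cumulative score, and a negative result means that the server's own ranking was better on balance.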

Table 9. The added value of re-ranking models measured by cumulative observed model quality

In the case of the CaspIta-FOX server, the cumulative quality score of the top selected models can be improved from 41.67 to 43.88, using ModFOLD, which would improve the overall ranking of the method by 8 places in Table 3. The Zhang-Server score can also be marginally improved upon from 53.00 to 53.23 if ModFOLD is used to re-rank models. Several individual servers can also be improved using the 3D-Jury method; however, for the majority of servers, there is less benefit to be gained from re-ranking very few models using the clustering approach.

On average the cumulative observed model quality score of an individual server is improved by 0.44 if the ModFOLD method is used to re-rank the 5 submitted models (Table 9). Table 9 also shows that on average the quality score of the top selected model is improved for individual servers using the ProQ, ProQ-LG and MODCHECK methods, confirming our previous results [9]. The ProQ-MX, ModSSEA and 3D-Jury methods on average show an overall decrease in the quality of the top selected models from each server, if these methods are used as post filters to re-rank models.

Table 10 shows the proportion of servers which can be improved by using each MQAP method to re-rank submitted models, according to each observed model quality score. The ModFOLD method is shown to improve ~66% (23/35) of the servers tested according to all measures of observed model quality and the ProQ method improves ~69% (24/35), according to the combined observed model quality score.

Table 10. The added value of re-ranking measured by the proportion of improved servers

What if we were also to use the information from the original server ranking in addition to the MQAP scores? Can further improvements to model ranking be made by using this information as an additional weighting to the MQAP score? The results in Table 11 and Table 12 show the additional improvement to model rankings made by combining the information from the original server ranking with that of the MQAP score. In this benchmark, models initially ranked by a server as the top model achieve a higher additional score than models initially ranked last. A useful additional score was found to be (6-r)/40, where r is the initial server ranking of the model between 1 and 5 (e.g. the additional score for a TS1 model would be 0.125, a TS2 model would have an additional score of 0.1 etc.).
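The weighting described above can be written out directly. In this sketch the additional score (6-r)/40 is simply added to the MQAP score before the models are re-ranked; the function names and the (rank, score) input layout are hypothetical.

```python
def server_rank_bonus(r):
    """Additional score (6 - r) / 40 for a model initially ranked r (1 to 5)."""
    if not 1 <= r <= 5:
        raise ValueError("initial server ranking must be between 1 and 5")
    return (6 - r) / 40


def pick_top_model(models, use_server_rank=True):
    """Return the initial server rank of the model selected as the top model.

    models: list of (server_rank, mqap_score) pairs for one target and server.
    With use_server_rank=False the MQAP score alone decides the selection.
    """
    def combined(model):
        r, score = model
        return score + (server_rank_bonus(r) if use_server_rank else 0.0)

    return max(models, key=combined)[0]
```

For instance, given models (1, 0.50) and (2, 0.52), the MQAP score alone selects the second model, whereas the combined scores (0.625 and 0.62) retain the server's first choice. The bonus therefore only overturns the server's ranking when the MQAP score difference exceeds the difference in bonus.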

Table 11. The added value of re-ranking with weighted scores (cumulative observed model quality)

Table 12. The added value of re-ranking with weighted scores (proportion of improved servers)

Table 11 shows that on average the cumulative observed model quality score for an individual server can be increased by 0.69, if the initial ranking score is added to the ModFOLD score and used as a post filter to re-rank models. The number of servers improved using the combined score also increases to 74% (26/35) (Table 12). For all other MQAP methods the scores are also improved by using information from the server in addition to the MQAP scoring. This is a similar technique to that used in the Pcons method, albeit used here with a more basic scoring scheme and benchmarked on the few models produced by individual servers, rather than many models from multiple servers.

This is a stringent benchmark as there are few models to choose from each individual server. This means that there is less information to be gained from a comparison of the structural features shared between models. Therefore, the clustering approach (3D-Jury) does not perform well at this task. The ModSSEA method also performs badly at this task as it is also dependent on differentiating models based on structural features. If there is conservation of secondary structure among the top few models from the same server, then the ModSSEA method will perform badly. Indeed, many servers already include secondary structure scores and so the top models provided by the same server are often likely to share similar secondary structures. The value of randomly selecting the top models (through the assignment of a random score between 0 and 1) has also been included in Tables 9 to 12. A random selection of the top model on average shows a marked decrease in model quality, as the probability of correctly selecting the top model out of 5 for a given target is 0.2.


The consensus MQAP method (ModFOLD) is shown to be competitive with methods which use clustering of multiple models or information from multiple servers (LEE and Pcons) according to the cumulative observed model quality scores of the top ranked models (Σm). Furthermore, according to this benchmark the ModFOLD method significantly outperforms some of the best "true" MQAP methods tested here (ProQ-MX, ProQ-LG and MODCHECK), all of which produce single consistent scores based on a single model.

Benchmarking based on correlation coefficients is not always helpful in measuring the usefulness of MQAP methods. There is not always a linear relationship between the MQAP score and the observed model quality score and scores for an individual target may not be normally distributed. Even with the non-parametric test, outliers can affect the results and so the correlation coefficient should not replace the individual examination of the data. It is therefore proposed that simply measuring the observed model quality scores of the top ranked model (m) on a target by target basis, or the cumulative scores (Σm) over all targets, may be more useful for benchmarking MQAPs in the context of protein fold recognition, followed by measures of the statistical significance. In practical terms, predictors require the best model to be selected for a given target and so m is an appropriate measure of the performance of an MQAP method in this context.

If there are many models available from multiple fold recognition servers then clustering models using the 3D-Jury approach is demonstrably the most effective tested method for ranking models. However, the method can perform poorly when there are very few models available and often no value is added by re-ranking models from an individual server. Furthermore, methods such as 3D-Jury, LEE and Pcons may not produce consistent scores and therefore scores of models from different targets cannot be directly compared against one another. Clustering methods, such as 3D-Jury, are also computationally intensive and the CPU time required for calculating a score increases quadratically with the number of available models.

The so called "true" MQAP methods tested here (ModFOLD, ModSSEA, MODCHECK and the ProQ methods) are less computationally intensive as they consider only the individual model when producing a score. Therefore, the computational time for these methods scales linearly with the number of available models. They are also demonstrated here to add value to predictions when used as a post filter to re-rank even very few models from individual fold recognition servers.

In the context of a CASP assessment it is clear that the MQAP methods that make use of clustering of multiple models are currently superior to true MQAP methods that score individual models. Server developers wishing to perform well in CASP will therefore be more likely to use and develop the former methods as they will have access to many models produced by many different servers. However, in a practical context, experimentalists may have collected only very few models from the limited number of publicly accessible servers which remain available outside the context of CASP. Therefore, experimentalists would be advised to consider using the true MQAP methods in order to rank their models prior to investing valuable time in the laboratory. Nevertheless, there is clearly room for further improvement in the selection of the highest quality models, both for the true MQAP methods and for the methods which make use of clustering and multiple servers. This is evidenced by the gap between the scores achieved by each method and the maximum possible score that could be achieved by consistently selecting the highest quality model.


A number of the top performing Model Quality Assessment Programs (MQAPs) were benchmarked using the fold recognition models submitted by servers in the CASP7 experiment. Several of the "true" MQAP methods, which can produce a single score based on a single model alone (MODCHECK and three versions of ProQ), were benchmarked against those methods which make use of the clustering of multiple models or information from multiple servers in order to calculate scores (3D-Jury, LEE and Pcons). In addition, two new true MQAP approaches were tested: ModSSEA, based on secondary structure element alignments, and ModFOLD, a consensus of MODCHECK, ModSSEA and the ProQ methods.


The ProQ [7] and MODCHECK [9] methods have been shown previously to be amongst the most effective of the "true" MQAP methods according to benchmarking carried out in a previous study [9]. Executables for each program were downloaded [22] and run in-house individually on the test data (see below), using the default parameters. The ProQ method produced two output scores per model, ProQ-MX and ProQ-LG, which were benchmarked separately. The ProQ scores from the version submitted for the CASP7 model quality assessment (QMODE 1) category were also downloaded via the CASP7 results website [23].

ModSSEA

The ModSSEA method was developed as a novel model quality assessment program based on secondary structure element alignments (SSEA). The ModSSEA score was determined in essentially the same way as the SSEA score, which has been previously benchmarked [12-14]; however, the PSIPRED [24] predicted secondary structure of the target protein was aligned against the DSSP [25] assigned secondary structure of the model, as opposed to the secondary structure of a fold template. The ModSSEA score was incorporated along with the MODCHECK and ProQ scores into the ModFOLD method described below.
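To make the idea of aligning secondary structure elements concrete, the following is a rough sketch and not the ModSSEA implementation. The element scoring scheme, the absence of a gap penalty, the normalisation by mean length and all function names are assumptions made for illustration. It also assumes that both the PSIPRED prediction and the DSSP assignment have already been reduced to a three-state string of H (helix), E (strand) and C (coil).

```python
from itertools import groupby

def to_elements(ss):
    """Collapse a per-residue string such as 'CCHHHHCEEE' into a list of
    (state, length) secondary structure elements."""
    return [(state, len(list(run))) for state, run in groupby(ss)]

def element_score(a, b):
    """Assumed scoring: identical states score the shorter length, coil
    against helix or strand scores half of it, helix against strand scores 0."""
    (state_a, len_a), (state_b, len_b) = a, b
    shorter = min(len_a, len_b)
    if state_a == state_b:
        return float(shorter)
    if "C" in (state_a, state_b):
        return 0.5 * shorter
    return 0.0

def ssea_score(predicted_ss, model_ss):
    """Globally align the two element lists by dynamic programming (no gap
    penalty assumed) and normalise by the mean sequence length."""
    p, m = to_elements(predicted_ss), to_elements(model_ss)
    # dp[i][j] is the best score for the first i elements of p and j of m.
    dp = [[0.0] * (len(m) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(m) + 1):
            dp[i][j] = max(
                dp[i - 1][j],                      # skip an element of p
                dp[i][j - 1],                      # skip an element of m
                dp[i - 1][j - 1] + element_score(p[i - 1], m[j - 1]),
            )
    mean_length = (len(predicted_ss) + len(model_ss)) / 2.0
    return dp[-1][-1] / mean_length if mean_length else 0.0
```

Under these assumptions, a model whose assigned secondary structure matches the prediction exactly scores 1.0, and an all-helix prediction compared with an all-strand model scores 0.0.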

ModFOLD

Predictions for the CASP7 model quality assessment (QMODE 1) category were generated using the ModFOLD method. The method was loosely based on the nFOLD protocol [14] and combined the output from a number of model quality assessment programs (MQAPs) using an artificial neural network. The scaled output scores from the in-house versions of MODCHECK [9], ProQ-LG, ProQ-MX [7] and ModSSEA were used as inputs to a feed-forward back-propagation network. The neural network was then trained to discriminate between models based on the TM-score [26]. The neural network architecture used for ModFOLD simply consisted of four input neurons, four hidden neurons and a single output neuron. The models for the training set were built from mGenTHREADER [27] alignments to > 6200 fold templates using an in-house program, which simply mapped aligned residues in the target to the full backbone coordinates of the template and carried out renumbering. The target-template pairs were then generated from an all-against-all comparison of the sequences from a non-redundant fold library. Sequences within the training set had BLAST [28] E-values > 0.01 and < 30% identity to one another.

The four selected MQAPs were used to predict the quality of each of the structural models in the training set. The resulting MQAP scores were scaled to the range 0–1 and were fed into the input layer. The network was trained using the observed quality of each model, which was calculated using the TM-score. The resulting neural network weight matrix was saved and subsequently used to provide in-house consensus predictions of model quality.
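A network of the shape described (four inputs, four hidden neurons, one output) might be sketched as follows. This is a minimal illustration rather than the ModFOLD code: the sigmoid activations, min-max scaling, learning rate, weight initialisation, class name and toy training data are all assumptions, since the text specifies only the architecture, the 0–1 input range and the TM-score training target.

```python
import math
import random

def scale01(values):
    """Min-max scale raw MQAP scores to the range 0-1 (assumed scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class ConsensusNet:
    """Hypothetical 4-4-1 feed-forward network; the last weight in each
    row is a bias term."""

    def __init__(self, n_in=4, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update on squared error for a single model."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden-layer deltas use the output weights from before this update.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]

def total_loss(net, data):
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in data)

# Invented toy data: four scaled MQAP scores per model and an observed TM-score.
data = [
    ([0.90, 0.80, 0.85, 0.90], 0.80),
    ([0.20, 0.10, 0.30, 0.20], 0.20),
    ([0.50, 0.60, 0.40, 0.50], 0.50),
    ([0.10, 0.30, 0.20, 0.10], 0.15),
]

net = ConsensusNet()
loss_before = total_loss(net, data)
for _ in range(2000):
    for x, t in data:
        net.train_step(x, t)
loss_after = total_loss(net, data)
```

After training, `net.forward(scores)[1]` gives the consensus prediction for a new model, and the weight lists `w_hidden` and `w_out` play the role of the saved weight matrix.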

Pcons and LEE

The Pcons and LEE groups were the overall top performing groups at CASP7 according to the official assessment. The Pcons method has been described previously [15] and is widely used as a consensus fold recognition server. From the CASP7 abstracts it is understood that the method used by the LEE group was based on a combination of the clustering of models, an artificial neural network and energy functions. As the methods produced by these groups could not be tested in-house, the scores submitted by these groups for the CASP7 model quality assessment (QMODE 1) category were downloaded via the CASP7 results website [23].

3D-Jury

The 3D-Jury method [29] is a popular and effective method of clustering models which was not tested in the CASP7 model quality assessment category. However, the simplicity of the approach allows it to be run in-house easily for comparison against the leading methods. Therefore, for each target, the models were also scored using an in-house approach similar to that of the 3D-Jury method [29]. TM-scores, rather than MaxSub scores, were used to determine the similarities between models, as this was found to give a marginally better performance.
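The clustering idea can be illustrated with a simplified sketch in which each model is scored by its mean similarity to every other model for the same target. This is not the in-house implementation: the averaging rule, the function names and the toy similarity function (a stand-in for a real TM-score comparison, assumed here to be symmetric) are assumptions for illustration only.

```python
from itertools import combinations

def jury_scores(models, similarity):
    """Score each model by its mean similarity to all other models.
    `models` maps a model name to its structure; `similarity` is any
    symmetric function returning a value in the range 0-1."""
    names = list(models)
    totals = {name: 0.0 for name in names}
    # Every unordered pair is compared once: n * (n - 1) / 2 comparisons,
    # which is why the cost grows quadratically with the number of models.
    for a, b in combinations(names, 2):
        s = similarity(models[a], models[b])
        totals[a] += s
        totals[b] += s
    n_others = len(names) - 1
    if n_others < 1:
        return {name: 0.0 for name in names}
    return {name: totals[name] / n_others for name in names}

def toy_similarity(x, y):
    """Placeholder for a structural comparison such as the TM-score:
    similarity decays with the mean absolute coordinate difference."""
    mean_diff = sum(abs(p - q) for p, q in zip(x, y)) / len(x)
    return 1.0 / (1.0 + mean_diff)

# Invented 'structures': two similar models and one outlier.
models = {
    "model_a": (0.0, 0.0),
    "model_b": (0.1, 0.0),
    "model_c": (5.0, 5.0),
}
scores = jury_scores(models, toy_similarity)
```

With only one model available, no comparison is possible and the sketch returns 0.0, which mirrors why clustering adds little when very few models are on hand.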

Testing Data

The fold recognition server models for each CASP7 target were downloaded via the CASP7 website [30]. The individual MQAPs which make up ModFOLD were used to evaluate every server model (both AL and TS) for each CASP7 target. The ModFOLD predictions were then submitted to assessors prior to the expiry date for each target and therefore prior to the release of each experimental structure. After the CASP experiment, 87 of the non-cancelled official targets that had published experimental structures released into the PDB (as of 26/11/06) were used to provide a common set of models in order to benchmark the performance of each method.

In addition, several standard test sets were downloaded from the Decoys 'R' Us [18] database (4state_reduced [19], lattice_ssfit [20] and LMDS [21]) so that ModFOLD and ModSSEA may be compared with additional published methods. The ability of methods to identify the native structure from each set of decoys was tested using standard measures.
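The text does not specify which standard measures were used. As an assumption, two measures that are commonly reported for decoy sets are sketched below: the rank of the native structure among the decoys and its Z-score relative to the decoy score distribution. The function names are hypothetical and higher scores are assumed to indicate better quality.

```python
from statistics import mean, pstdev

def native_rank(native_score, decoy_scores):
    """Rank of the native structure (1 = best) when higher scores are better."""
    return 1 + sum(1 for s in decoy_scores if s > native_score)

def native_z_score(native_score, decoy_scores):
    """Number of standard deviations by which the native score exceeds
    the mean decoy score; 0.0 if the decoy scores do not vary."""
    sd = pstdev(decoy_scores)
    if sd == 0:
        return 0.0
    return (native_score - mean(decoy_scores)) / sd
```

A rank of 1 together with a large positive Z-score would indicate that a method clearly separates the native structure from its decoys.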

Measuring observed model quality

The TM-score program [26] was used to generate the TM-scores, MaxSub scores [31] and GDT scores [32], which were used to measure the observed model quality for each individual model. A combined score was also calculated for each individual model, i.e. the TM-score, MaxSub and GDT scores were calculated for the model and the mean of the three scores was taken.
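The combined score is a simple average, sketched below. The function name is hypothetical, and the sketch assumes that all three scores are expressed on the same 0–1 scale before averaging.

```python
def combined_score(tm_score, maxsub, gdt):
    """Mean of the three observed quality scores for a single model.
    Assumes all three values lie on a common 0-1 scale."""
    return (tm_score + maxsub + gdt) / 3.0
```

If a GDT value were reported as a percentage, it would need to be divided by 100 before being passed to this function.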

The ModFOLD server

The ModFOLD predictions were carried out entirely automatically for all targets throughout the CASP7 experiment. A web server has been implemented for the ModFOLD method, which is freely available for academic use [33]. The server accepts gzipped tar files of models – similar to the official CASP7 tarballs – and returns predictions in the CASP QA (QMODE1) format via email.
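Preparing a gzipped tar file of models for upload could look like the sketch below, which uses Python's standard tarfile module. The directory layout (one directory per target containing the model files) and the function name are assumptions, as the exact layout expected by the server is not described here.

```python
import tarfile
from pathlib import Path

def pack_models(model_dir, archive_path):
    """Write every file in model_dir to a gzipped tar archive, stored
    under a top-level directory named after model_dir (assumed layout)."""
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.iterdir()):
            if path.is_file():
                tar.add(path, arcname=f"{model_dir.name}/{path.name}")
    return archive_path
```

The archive should be written outside `model_dir` so that it is not picked up as one of the models.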

Authors' contributions

LJM carried out the entire study.

Acknowledgements

This work was supported by a Research Councils United Kingdom (RCUK) Academic Fellowship.

References

  1. Fischer D: Servers for protein structure prediction.

    Curr Opin Struct Biol 2006, 16(2):178-182.

  2. Laskowski RA, Rullmann JA, MacArthur MW, Kaptein R, Thornton JM: AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR.

    J Biomol NMR 1996, 8(4):477-486.

  3. Hooft RW, Vriend G, Sander C, Abola EE: Errors in protein structures.

    Nature 1996, 381(6580):272.

  4. Lazaridis T, Karplus M: Effective energy functions for protein structure prediction.

    Curr Opin Struct Biol 2000, 10(2):139-145.

  5. Sippl MJ: Recognition of errors in three-dimensional structures of proteins.

    Proteins 1993, 17(4):355-362.

  6. Eisenberg D, Luthy R, Bowie JU: VERIFY3D: assessment of protein models with three-dimensional profiles.

    Methods Enzymol 1997, 277:396-404.

  7. Wallner B, Elofsson A: Can correct protein models be identified?

    Protein Sci 2003, 12(5):1073-1086.

  8. Tosatto SC: The victor/FRST function for model quality estimation.

    J Comput Biol 2005, 12(10):1316-1327.

  9. Pettitt CS, McGuffin LJ, Jones DT: Improving sequence-based fold recognition by using 3D model quality assessment.

    Bioinformatics 2005, 21(17):3509-3515.

  10. CAFASP4

  11. CASP7

  12. McGuffin LJ, Bryson K, Jones DT: What are the baselines for protein fold recognition?

    Bioinformatics 2001, 17(1):63-72.

  13. McGuffin LJ, Jones DT: Improvement of the GenTHREADER method for genomic fold recognition.

    Bioinformatics 2003, 19(7):874-881.

  14. Jones DT, Bryson K, Coleman A, McGuffin LJ, Sadowski MI, Sodhi JS, Ward JJ: Prediction of novel and analogous folds using fragment assembly and fold recognition.

    Proteins 2005, 61 Suppl 7:143-151.

  15. Wallner B, Fang H, Elofsson A: Automatic consensus-based fold recognition using Pcons, ProQ, and Pmodeller.

    Proteins 2003, 53 Suppl 6:534-541.

  16. Eramian D, Shen MY, Devos D, Melo F, Sali A, Marti-Renom MA: A composite score for predicting errors in protein structure models.

    Protein Sci 2006, 15(7):1653-1666.

  17. Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score.

    Nucleic Acids Res 2005, 33(7):2302-2309.

  18. Samudrala R, Levitt M: Decoys 'R' Us: a database of incorrect conformations to improve protein structure prediction.

    Protein Sci 2000, 9(7):1399-1401.

  19. Park B, Levitt M: Energy functions that discriminate X-ray and near native folds from well-constructed decoys.

    J Mol Biol 1996, 258(2):367-392.

  20. Xia Y, Huang ES, Levitt M, Samudrala R: Ab initio construction of protein tertiary structures using a hierarchical approach.

    J Mol Biol 2000, 300(1):171-185.

  21. Keasar C, Levitt M: A novel approach to decoy set generation: designing a physical energy function having local minima with native structure characteristics.

    J Mol Biol 2003, 329(1):159-174.

  22. MQAP Downloads

  23. CASP7 Results

  24. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices.

    J Mol Biol 1999, 292(2):195-202.

  25. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features.

    Biopolymers 1983, 22(12):2577-2637.

  26. Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality.

    Proteins 2004, 57(4):702-710.

  27. McGuffin LJ, Smith RT, Bryson K, Sorensen SA, Jones DT: High throughput profile-profile based fold recognition for the entire human proteome.

    BMC Bioinformatics 2006, 7:288.

  28. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.

    J Mol Biol 1990, 215(3):403-410.

  29. Ginalski K, Elofsson A, Fischer D, Rychlewski L: 3D-Jury: a simple approach to improve protein structure predictions.

    Bioinformatics 2003, 19(8):1015-1018.

  30. CASP7 Server Models

  31. Siew N, Elofsson A, Rychlewski L, Fischer D: MaxSub: an automated measure for the assessment of protein structure prediction quality.

    Bioinformatics 2000, 16(9):776-785.

  32. Zemla A, Venclovas C, Moult J, Fidelis K: Processing and analysis of CASP3 protein structure predictions.

    Proteins 1999, Suppl 3:22-29.

  33. The ModFOLD server