Protein tertiary structure prediction is a fundamental problem in computational biology and identifying the most native-like model from a set of predicted models is a key sub-problem. Consensus methods work well when the redundant models in the set are the most native-like, but fail when the most native-like model is unique. In contrast, structure-based methods score models independently and can be applied to model sets of any size and redundancy level. Additionally, structure-based methods have a variety of important applications including analogous fold recognition, refinement of sequence-structure alignments, and de novo prediction. The purpose of this work was to develop a structure-based model selection method based on predicted structural features that could be applied successfully to any set of models.
Here we introduce SELECTpro, a novel structure-based model selection method derived from an energy function comprising physical, statistical, and predicted structural terms. Novel and unique energy terms include predicted secondary structure, predicted solvent accessibility, predicted contact map, β-strand pairing, and side-chain hydrogen bonding.
SELECTpro participated in the new model quality assessment (QA) category in CASP7, submitting predictions for all 95 targets and achieved top results. The average difference in GDT-TS between models ranked first by SELECTpro and the most native-like model was 5.07. This GDT-TS difference was less than 1% of the GDT-TS of the most native-like model for 18 targets, and less than 10% for 66 targets. SELECTpro also ranked the single most native-like first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets, more often than any other method. Because the ranking metric is skewed by model redundancy and ignores poor models with a better ranking than the most native-like model, the BLUNDER metric is introduced to overcome these limitations. SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein, where it outperforms the benchmarked method (I-TASSER).
SELECTpro is an effective model selection method that scores models independently and is appropriate for use on any model set. SELECTpro is available for download as a stand alone application at: http://www.igb.uci.edu/~baldig/selectpro.html webcite. SELECTpro is also available as a public server at the same site.
Selecting the most native-like model from a set of possible models is a crucial task in protein structure prediction. A variety of Model Quality Assessment Programs (MQAPs) have been developed that assign numeric scores to models in a set, and then use the scores to rank the models and ultimately select a single model. MQAP methods can be divided roughly into three categories based on the type of information they use: evolutionary methods use sequence or profile similarity between target sequence and template, consensus methods use similarity between models, and structure-based methods use model coordinates . Each category of methods has inherent strengths and weaknesses.
Evolutionary methods can provide quality scores that have been shown to correlate with structural similarity to native . However, for lower confidence alignments the scores do not correlate well with structural similarity. Furthermore, identification of the best template and specific alignment can be difficult. In addition, models built from multiple templates or template-free methods cannot be scored appropriately by evolutionary methods alone.
Consensus methods take advantage of the observation that similar models produced by different predictors tend to be more accurate than those that are structural outliers. In practice, consensus methods outperform the methods they draw from, and they rarely pick a very poor model. The disadvantage, however, is that when the best model is a structural outlier it will be overlooked for lack of popularity . Also, consensus methods are not appropriate for selecting from small sets of structurally diverse models, especially in the extreme case of a two-model set.
While consensus methods depend on similarity between models, structure-based methods calculate scores on each model independently. For this reason, structure-based methods can be applied to model sets of any size and diversity, and will produce the same score for a model regardless of the other models in the set. Structure-based methods can also be used for template-free modeling [3-6] and model refinement procedures [7,8]. One weakness of high resolution structure-based methods, including protein free energy approximation functions [9-12] and physics based approaches [13,14], is their sensitivity to local structural irregularities such as steric clashes and chain breaks, which can significantly bias scores on otherwise accurate models. Even slight differences in model backbones can produce significantly different scores . Lower resolution structure-based methods, such as statistical potentials [6,16,17], are more robust to backbone variation, but are sensitive to extended low contact-order regions in the models.
Here we describe SELECTpro, a novel structure-based MQAP that combines high and low resolution energy terms into a model selection method that is effective on model sets of variable size, diversity, and target difficulty. Most of our assessment is calculated from the CASP7 model quality assessment category (QA) results published online . The QA category provides a framework for the unbiased evaluation of MQAPs on ensembles of models produced by diverse automated prediction methods.
Results and discussion
We analyze the CASP7 quality assessment category predictions with a focus on the quality of the model ranked first by each predictor and the recovery of the most native-like model in the set. Only SetAll is used in the assessment of the quality of the model ranked first by each group (Table 1). The results are very similar when using SetComplete (data not shown) because QA groups rarely rank an incomplete model first.
Table 1. Quality of Model Ranked First (MQA1) Relative to Most Native-Like Model (Mmax)
The assessment of the recovery of the most native-like model, is performed on both SetAll and SetComplete (Table 2) because the few cases where an incomplete model is the most native-like have a significant effect on the average recovery metrics of all QA groups. Incomplete and irregular models are especially challenging for structure-based methods. A comparison of the average Pearson Correlation on SetAll and SetComplete, highlights these issues (Table 3). The frequency of recovering the most native-like model is calculated on SetComplete (Figure 1).
Table 2. Recovery of Top GDT-TS Model (Mmax)
Table 3. Correlation of Selected Groups
Figure 1. Recovery of Mmax using SetComplete. (A) number of targets where Mmax is ranked first (top of dark blue bar), in the top five (top of gray bar), and in the top ten (top of white bar). (B) number of targets where ΔGDTBLUNDER% is less than 10% (top of light blue bar) and less than 20% (top of purple bar). Only the first ten groups are shown in both graphs.
The utility of SELECTpro for selecting the best model from a small set is demonstrated by selecting from the five models submitted for each target by the top automated predictors. These small set selection results are calculated using SetAll (Figure 2). SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein and compared to I-TASSER (Figure 3).
Figure 2. Reranking models from top servers. Each server predictor submitted five models per target, with the highest confidence model ranked first. (A) the number of targets where each server's highest GDT-TS model is ranked first is shown with gray bars, and black bars when the models are reranked with SELECTpro. (B) shows the change in average GDT-TS for each group when SELECTpro is used to select model 1. P-values of paired t-tests are shown above the horizontal axis when SELECTpro demonstrates improved model selection and statistically significant improvements (p < .05) are in bold.
Figure 3. Large Decoy Set Model Selection. Large decoy set model selection with SELECTpro on I-TASSER benchmark set. This set of 16 small proteins was used as one of the benchmark sets for evaluating the I-TASSER method . The complete decoy sets can be downloaded from . Each protein has from 12500 to 20000 decoy models. For each protein different symbols are used to indicate the GDT-TS of Mmax (□), SELECTpro's MQA1 (×), and I-TASSER's MQA1 (+).
To make fair comparisons to groups participating on only a subset of targets, common subset comparisons between SELECTpro and each of these groups are included in Tables 1 and 2. Only groups participating on at least half of the targets are included, and for groups with multiple submissions only the best one is shown. In the results tables any value that is better than SELECTpro is underlined.
For multiple domain targets, the sum of GDT-TS over all domains is used as the GDT-TS of the model. Since the QA predictions correspond to the entire structures, it is impossible to fairly assess the domains independently.
To assess the significance of the summary statistics compared in Table 1, Table 2, and Figure 2, we performed paired t-tests between SELECTpro each other group on common subsets of targets (or targets and models when appropriate). All p-values from the tests appear in the tables and figure, but only statistically significant p-values (p < .05) are shown in bold.
The following notations are used throughout the results section:
• Mmax: The model with the highest GDT-TS among all server models.
• MQA1: The model with the highest QA score.
• NT: The number of targets a group made valid predictions on.
• ND: The number of domains a group made valid predictions on.
The recovery of Mmax by a QA predictor can only be evaluated if Mmaxwas scored by the predictor. In most cases QA predictors did not provide scores for all available server models, and frequently there is no score for Mmax. For example, predictor 016_1 (AMBER/PB) made submissions on 86 targets, but Mmax is only scored for 53 of these targets – so only these targets (NT = 53) can be evaluated for this predictor.
Quality of Model Ranked First (MQA1) Relative to Most Native-Like Model (Mmax)
In this section on the assessment of the model ranked first, and the corresponding Table 1, we use the following three metrics:
• ΔGDTQA1 = GDT-TS(Mmax) - GDT-TS(MQA1) : The GDT-TS difference between Mmax and MQA1 measures how much is lost by selecting MQA1 rather than Mmax for a single target.
• ΔGDTQA1% = ΔGDTQA1/GDT-TS(Mmax) : The GDT-TS difference percentage allows for comparison across targets with different numbers of domains and difficulty levels.
The columns of Table 1 are: (1) group number; (2) number of targets the group made predictions on; (3) number of targets such that ΔGDTQA1 = 0; (4) number of targets such that ΔGDTQA1% < 1%; (5) number of targets such that ΔGDTQA1% < 10%; and (6) . The common subset results section has an additional column for the p-value of the paired t-test using ΔGDTQA1. The rows are sorted first by the number of targets and then by . Of the groups participating on all 95 targets, SELECTpro has the lowest average ΔGDTQA1, with a value of 5.07, followed closely by group 713_1 (Circle-QA), with a value of 5.44. Predictor 038_1 (GeneSilico) has an average ΔGDTQA1 of 5.75, with predictions on 85 targets. In common subset comparisons with these two groups SELECTpro is not significantly better, with p-values of .25 and .12 respectively. In common subset comparisons with all remaining groups SELECTpro is significantly better.
Another way to assess the quality of MQA1 over many targets is to count the number of targets such that MQA1 is the best model, or nearly the best, in the set. A method that performs very well on most targets, but very poorly on a few, would still be recognized by this criteria. SELECTpro recovers the best model for 12 targets, selects a model with ΔGDTQA1% < 1% for 18 targets, and selects a model with ΔGDTQA1% < 10% for 66 targets. Group 091_1 (Ma-OPUS) also performs well, with 11, 18, and 61 targets in the respective categories. Only the 60 targets with ΔGDTQA1% < 10% of predictor 038_1 (GeneSilico) on its 85 target subset are better than SELECTpro in common subset comparison (58 for SELECTpro).
The BLUNDER Measure Recovery of Mmax
How well does a QA predictor recover Mmax? The traditional metric to assess Mmax recovery is the rank of Mmax, and the average rank over many targets (). While rank captures some important information, it ignores the redundancy of models and the quality of models ranked better than Mmax. Consider the following hypothetical situation: group A ranks Mmax 10th and all nine models ranked above it are redundant with ΔGDT of ~2.0, group B ranks Mmax 5th and the four models ranked above it are diverse with a ΔGDT between 10.0 and 20.0. Which group has done a better job of recovering Mmax? In this example, the rank metric favors group B, although group A ranks only a single redundant model above Mmax. In addition, the models ranked better than Mmax by group A have only slightly lower GDT-TS than Mmax, while the models ranked better than Mmax by group B are significantly worse than Mmax. To address these weaknesses of the rank metric, we introduce the BLUNDER metric, which focuses on the worst model ranked better than Mmax (the most embarrassing blunder). This measure is not affected by model redundancy and measures the quality of models ranked above Mmax. The BLUNDER metric is defined using the following notation, and used in the assessment of the recovery of Mmax and the corresponding Table 2 and Figure 1:
• MBLUNDER: The model with the minimum GDT-TS among models ranked better than Mmax.
• ΔGDTBLUNDER = GDT-TS(Mmax) - GDT-TS(MBLUNDER) : The GDT-TS difference between Mmax and MBLUNDER measures the size of the worst blunder.
• ΔGDTBLUNDER% = ΔGDTBLUNDER/GDT-TS(Mmax) : The ΔGDTBLUNDER percentage allows for comparison across targets with different numbers of domains and difficulty levels.
Figure 1 contains graphs of the frequency of recovering Mmax using the rank (A) and ΔGDTBLUNDER% (B) measures on SetComplete. SELECTpro ranks Mmax first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets. SELECTpro's ΔGDTBLUNDER% values are less than 10% of GDT-TS(Mmax) for 40 targets and less than 20% for 63 targets. These results are best among all QA participants. The average Mmax recovery results are summarized in Table 2. The results columns are (1) average rank () and (2) average ΔGDTBLUNDER () on SetAll and SetComplete. The common subset results section also includes a column for the p-value of a paired t-test using ΔGDTBLUNDER (p-value). Rows are sorted separately for each dataset by NT first and then . On SetComplete SELECTpro has a of 10.4. In common subset comparisons one group has a lower : group 091_1 (Ma-OPUS) with of 16.8 on 94 targets compared to 17.4 for SELECTpro. On SetAll SELECTpro did not submit a score for Mmax of target T0356 (HHpred2_TS1) due to a processing error. In order to make complete common subset comparisons when possible we added in the SELECTpro score for HHpred2_TS1. SELECTpro ranks it 86th and ΔGDTBLUNDER = 50.0. Both results are significantly worse than the SELECTpro averages.
Pearson Correlation for Individual Proteins
The assessor evaluation of the quality assessment category  focused on the Pearson Correlation between the QA scores and GDT-TS. Here we use the Pearson Correlation only to highlight some of the difficulties for structure-based methods in dealing with incomplete models, as well as basic non-protein like structural features. Approximately half of the models in SetAll are incomplete, with backbone coordinates missing for one or more residues.
Incomplete models present a challenge to SELECTpro and other structure-based methods because the scores for each model are only comparable when calculated on coordinates for the same set of residues. Another issue is that some complete models have severe chain-breaks, severe steric clashes, or significant portions modeled only as extended chains. These local problems can overwhelm the energy of what may otherwise be a good model. Consensus methods do not suffer from these local structure problems. Given this rationale, one would expect structure-based methods to see the most improvement in terms of average Pearson Correlation on SetComplete relative to SetAll. Table 3 shows the average Pearson Correlation of five selected groups. Predictors 713_1 (Circle-QA), 633_1 (ProQ), and SELECTpro are structure-based MQAPs, while 634_1 (Pcons) is a consensus method and 556_1 (LEE) scored structures based on the GDT-TS similarity to their human Model 1 CASP7 prediction . As expected, the structure-based MQAPs improve more than the structural similarity-based methods. The even greater increase in Pearson Correlation for SELECTpro can be accounted for by the failure to generate appropriate complete models for some of the incomplete models resulting in QA scores calculated on extended chains.
Reranking Top Server Group Models
Predictors in CASP may submit up to five models, but CASP evaluation focuses on the model designated as Model 1. Clearly, the selection of Model 1 is critical in the CASP setting and for protein structure prediction in general. Figure 2 contains the results when SELECTpro is used to rerank the five models submitted by each of the top ten servers from CASP7, compared to each server's results. In the following assessment Mmax-g is the model with the highest GDT-TS of the five models submitted by a server. Figure 2 (A) shows that SELECTpro recovers Mmax-g more frequently than 8 of the top 10 server groups; in addition, when SELECTpro is used to select Model 1 the average GDT-TS increases for 7 of 10 sever groups; however, the increase is only statistically significant for 3 groups. SELECTpro improves using both criteria for the top 3 server groups (Zhang-Server, Pmodeller6, and ROBETTA). These results highlight the utility of SELECTpro for the task of model selection. The comparisons made here are fair because structure-based methods can be applied in the server setting to any number of models.
Large Decoy Set Model Selection
Here we analyze SELECTpro's model selection capability on the large decoy sets for 16 small proteins from a recent I-TASSER benchmark set . The I-TASSER prediction method generates 12500 to 20000 different backbone conformations. The complete decoy sets can be downloaded from . The consensus method SPICKER  is used to cluster the models and a centroid model is built from the first cluster. A second round of simulation resolves the steric clashes in the centroid model and results in the final predicted model. The centroid model and final model are not part of the decoy set. In order to make a fair model selection comparison the decoy model closest to the centroid is used as I-TASSER's MQA1.
On the benchmark set SELECTpro has an average GDT-TS of 63.7, while I-TASSER has an average GDT-TS of 62.1. SELECTpro's average ΔGDTQA1 is 9.2 and I-TASSER's ΔGDTQA1 is 10.7. Figure 3 displays the GDT-TS results for the individual proteins in the benchmark set. Different symbols are used to indicate the GDT-TS of Mmax (□), the GDT-TS of SELECTpro's MQA1 (×), and the GDT-TS of I-TASSER's MQA1 (+) for each protein. A paired t-test of the hypothesis that SELECTpro and I-TASSER's mean performance are equal produces a p-value of .19, which is not statistically significant, but does give some evidence that SELECTpro can select a very good model from a large set of decoys at least well as an established method that utilizes consensus methods.
A MQAP that can select the most native-like model from a set of possibilities has a variety of applications in protein structure prediction. The new quality assessment category introduced in CASP7 allows for the unbiased assessment of MQAPs on the models produced by automated predictors. This category allows researchers to focus on the model scoring aspect of protein structure prediction.
The results presented in this work demonstrate that SELECTpro, a structure-based model selection method, consistently selects one of the best models from the large diverse sets of models produced by automated predictors, across all levels of target difficulty. On these large diverse sets of models, SELECTpro also recovers the single most native-like model well compared to other methods. On the small sets of five models submitted for each target by the top automated predictors, in most cases SELECTpro selects better models than the predictors themselves.
Since SELECTpro and other structure-based methods score models independently, they can be incorporated into the model selection pipelines of individual protein structure prediction servers. For this reason, it may help predictors if the CASP organizers distinguished methods that score models independently from those that do not.
Consensus and structure-based methods can be combined to achieve improved results. For example, the meta-server method Pmodeller  combines consensus (Pcons ) and structure-based methods (ProQ ) to predict protein structures more accurately than either method in isolation. The assessment of the QA category by CASP assessors recognized the consensus method Pcons (group 634_1) for the high Pearson Correlation between their scores and model GDT-TS on most targets . In their own assessment the authors of Pcons recognized that while consensus methods perform well in most cases, "when most of the models are incorrect and the few correct models are outliers a consensus based approach cannot be expected to make an optimal choice."  For instance, they identified three particular targets in CASP7 where their consensus method failed: T0283, T0350, and T0351 . The Pcons average ΔGDTQA1 on these three targets is 30.8. The same research group's structure-based method ProQ (group 633_1) has an average ΔGDTQA1 of 17.2. In contrast, on these three targets SELECTpro has an average ΔGDTQA1 of only 7.1. This example highlights the potential of combining SELECTpro with existing model selection methods.
SELECTpro has been made publicly available as a server, where users may submit from 2 to 100 models for evaluation. In addition to the global confidence scores, the scores of individual energy terms are also returned to the user by email for each model submitted. SELECTpro is one of several protein structure tools in the SCRATCH suite of predictors , and is available through: http://www.igb.uci.edu/~baldig/selectpro.html webcite.
All of the comparative analysis in this work is performed on the server models and quality assessment predictions submitted in the CASP7  experiment. The CASP QA experiment is particularly relevant for the evaluation of model selection methods for several reasons: (1) the QA predictors were blind to the true structures at the time of prediction making it impossible for methods to be tuned to improve results; (2) the set of proteins is diverse: the 95 targets range in size from 68 to 530 amino acids, come from a variety of organisms, and span the full range of prediction difficulty; (3) each target has more than 200 predicted models that contain the types of errors that occur in automated structure prediction; (4) the protein set is not selected by any of the participating QA groups; (5) the models are scored by a variety of methods and the results are publicly available. We perform analysis on the set of all models (SetAll) and a subset of models (SetComplete) that are complete and free of gross structural irregularities, as described below. All of the ABIpro models and some of the 3Dpro models were optimized using the exact energy function of SELECTpro. These models are removed because of the obvious bias towards these models. In recent CASP experiments the GDT-TS  has been used as the primary automatic structural similarity measure. The published GDT-TS values from the CASP7 website are the only structural similarity measure used in this work.
The SetAll dataset consists of the server models with a GDT-TS value published on the CASP7 website, a total of 23,423 models. To calculate a score on a protein model SELECTpro requires the backbone coordinates (N, Cα, C) for all model residues as input. A total of 8,812 models in SetAll have only a Cα trace or have no coordinates for one or more residues. Modeller8v1 [28-30] was used to generate complete models from the incomplete ones, and then the complete models were scored by SELECTpro. In most cases the complete models were built appropriately from the incomplete models; however, in some cases the final model was a fully extended chain due to an error in our application of Modeller. We failed to identify this problem until after the completion of the CASP7 competition. The SELECTpro scores versus GDT-TS scores for all models of target T0305 are displayed in plot A of Figure 4. The circled outliers with very low confidence scores and high GDT-TS scores are models that were incomplete and the complete models generated by Modeller were fully extended chains. The Pearson correlation on the set of all models for T0305 is .641. The SELECTpro scores versus GDT-TS scores for complete models only are displayed in plot B of Figure 4, and the Pearson correlation is .966.
Figure 4. SetAll versus SetComplete. Plots of SELECTpro scores versus GDT-TS scores for T0305 models from SetAll (A) and SetComplete (B). The Pearson correlation is .641 for SetAll and .996 for SetComplete. This large difference is mainly due to the extended chain models (circled in plot A) scored by SELECTpro due to an error in our use of Modeller to generate complete models from incomplete ones.
The scores produced by SELECTpro are comparable on complete models of the same sequence. There is no standard for the handling of incomplete models and we assume that participating groups took a variety of approaches. Using only complete models ensures that the MQAP scores are calculated from the same coordinates. Thus, the models retained in SetComplete are screened first for completeness. Models missing backbone coordinates for one or more residues are removed. This leaves 14,611 models.
Structure-based MQAPs are susceptible to local structural irregularities in models, and will tend to score such models poorly. This is why methods developed to select near-native models from sets of decoys remove such models from consideration . We apply additional filters (described below) for Cα-Cα clashes, Cα-Cα chain breaks, and expanded termini to remove an additional 1,217 models leaving 13,494 more plausible models in SetComplete.
The Cα-Cα clash model filter enforces a squared difference penalty for Cα-Cα distances less than 3.6 Ǻ. The distance between the Cα atoms of residue i and j is denoted by ri,Cα,j,Cα and N is the protein length. The constant 13.52 in the threshold below corresponds to two severe clashes where ri,Cα,j,Cα = 1.0 Ǻ. Models with a sum of squared differences greater than 13.52 per 100 residues are filtered out.
The Cα-Cα chain break model filter enforces a squared difference penalty for ri,Cα,i+1,Cα distances greater than 4.0 Ǻ. The constant 16.0 in the threshold below corresponds to a single chain break where ri,Cα,i+1,Cα = 8.0 Ǻ. Models with a sum of squared differences greater than 16.0 per 100 residues are filtered out.
The expanded termini filter removes models where a large portion of the structure is modeled as expanded chain with no non-local interactions. The screening procedure is: scan from the N-terminus until three consecutive residues have a contact number of at least 10, and repeat from the C-terminus. The contact number of a residue is defined here as the number of other Cβ atoms within 10 Ǻ of the residue's Cβ . If the sum of low contact number termini residues is at least 20% of N, the model is filtered out.
In the reduced representation the heavy backbone atoms, carbonyl oxygen, amide hydrogen (N, Cα, C, O, H), and Cβ are represented explicitly. For glycine residues a pseudo Cβ is calculated. The side-chain atoms are represented by a single united point (centroid) [32,33]. The centroid is calculated as the mean of the position of the heavy side-chain atoms. For glycine and alanine the centroid (CT) is set to the Cβ atom. Only the heavy backbone atoms (N, Cα, C) are used as input to SELECTpro and the positions of additional atoms and centroids are calculated from these.
All heavy-atom representation
In the all heavy-atom representation the centroid is removed and the heavy side chain atoms are represented explicitly. The side-chains are initially placed onto the backbone of the reduced representation in their most likely conformation according to the SCWRL backbone-dependent rotamer library . The side-chain placements are then optimized using the SELECTpro all-atom energy terms (described below) in conjunction with the rotamer library.
Energy Functions Overview
EREDUCED is the combined energy calculated from the reduced representation. EREDUCED is a linear combination of predicted (EPRED-SS, EPRED-SA, EPRED-CM), physical (EVDW-REP), and statistical (ECT-REP, ESTAT-ENV, ESTAT-PW-CI, ESTAT-PW-CD, EROG) terms:
EALL-ATOM consists of the energy terms that depend on the all heavy-atom representation. EALL-ATOM is a linear combination of the following physical terms:
EFINAL is the sum of EREDUCED and EALL-ATOM, and is used for the final scoring of models by SELECTpro. The individual energy terms are outlined briefly below and the detailed description of the novel terms follow in the remainder of this section. Underlined terms are adapted from previously described energy terms their details are included in the Appendix.
The parameter weights were determined by repeatedly varying individual weights and maximizing the sum of the GDT-TS of the lowest EFINAL models on a training set built from CASP6 protein domains. For each CASP6 protein domain a set of 500 decoy models was generated using fragment assembly with the RMSD to native as the dominant term in the objective function .
EPRED-SS: predicted secondary structure
EPRED-ACC: predicted solvent accessibility
EPRED-CM: predicted contact map
EBETA: sheet formation
BB-REP: backbone repulsion
CT-REP: centroid repulsion
STAT-ENV: residue environment potential 
STAT-PW-CD: context dependent pair-wise potential 
ESC-HB: side-chain hydrogen bonding
LEN-JONES: van der Waals forces 
SOLVATION: solvation effects 
ELECTRO: electrostatic interactions
Throughout this work the convention of all capital letters referring to global energy and all lower case referring to local energy is used. For instance, EPRED-CM refers to the global contact map energy and Epred-cm(i,j) refers to the contact map energy between residues i and j.
Parameter notation used in energy equations
ri,x,j,y: distance between atom x of residue i and atom y of residue j
rx,y: distance between atom x and atom y
vi,x,j,y: vector from atom x of residue i to atom y of residue j
ui,x,j,y: unit vector calculated from vi,x,j,y
Ni: number of residues in contact with residue i, with contact defined as ri,Cβ,j,Cβ < 10 Ǻ
phii: Phi angle of residue i
psii: Psi angle of residue i
Protein specific input parameters
aai: amino acid type of residue i
ssi: predicted secondary structure of residue i (H,E,C)
acci: predicted solvent accessibility of residue i ('e': exposed, '-', buried)
cmapi,j: predicted contact/non-contact between residues i and j, with contact defined as ri,Cα,j,Cα < 12 Ǻ
Protein independent parameters
Ivalue: ideal parameter value for a given calculation
σvalue: standard deviation value for a given calculation
vdwx: van der Waals radius of atom x
vdwx+y: vdwx + vdwy
Ωstat-env: pre-calculated statistics for use in ESTAT-ENV
Ωstat-pw-oi: pre-calculated statistics for use in ESTAT-PW-CI
Ωstat-pw-od: pre-calculated statistics for use in ESTAT-PW-CD
Dmin,pw-od: minimum interaction distance for centroid pairs used in ESTAT-PW-CD
Dmax,pw-od: maximum interaction distance for centroid pairs used in ESTAT-PW-CD
Dmin-CT: minimum distances between centroids of amino acid pairs observed in pdb_select25 .
Reduced Representation Energy Term Details
The details of how the novel reduced representation energy terms are calculated are presented in this section. The predicted structural terms EPRED-SS, EPRED-ACC, and EPRED-CM and the β-strand pairing term, EBETA, are novel and unique to SELECTpro. Additional reduced representation terms are adapted from previously published work and their details are included in the Appendix.
Predicted structural features overview
The predicted structural feature predictions used in EPRED-SS, EPRED-ACC, and EPRED-CM come from the SCRATCH suite of predictors . Each predictor is trained in a supervised fashion using curated non-redundant datasets extracted from the PDB . The secondary structure (SSpro ) and solvent accessibility (ACCpro ) predictors use ensembles of 1D-RNN (one dimensional-recursive neural network) architectures . The contact map predictor (CMAPpro ) uses ensembles of 2D-RNN architectures .
EPRED-SS: predicted secondary structure
The predicted secondary structure term EPRED-SS penalizes deviation of the torsion angles from the torsion angle parameters for helices and strands predicted by SSpro. There is no penalty for predicted coils. The parameter values for helix residues are: IHφ = -65.3, σHφ = 11.9, IHψ = -39.4, σHψ = 11.3. The parameter values for strand residues are: IEφ = -135.0, σEφ = 15.0, IEψ = 135.0, σEψ = 15.0. Only torsion angles that are more than two standard deviations from the ideal are penalized, with the penalty defined as follows:
The definition of Epred-strand(j) is equivalent to Epred-helix(i), but with IEφ, σEφ, IEψ and σEψ in place of the corresponding helical values.
EPRED-ACC: predicted solvent accessibility
The solvent accessibility predictor ACCpro predicts the percent of solvent accessibility in 5% increments for each residue. Using 25% exposure as a binary threshold the accuracy of the predictor is ~77% . The binary exposure ('e')/burial ('-') prediction is used as the predicted solvent accessibility for EPRED-ACC. In the reduced representation the solvent accessibility of residue i is estimated by its contact number (Ni), where Ni > 16 is considered buried . If the predicted status of a residue is not realized in the model, the penalty is calculated as:
EPRED-CM: predicted contact map
The contact map predictor CMAPpro predicts the probability of contact or non-contact between Cα atoms, with a contact threshold of 12 Å. The strategy utilized to infer predicted contacts from the probability matrix  results in maps that are sparse when compared to those of real proteins; thus, unrealized contacts are penalized while non-contacts are not. The constant 1.0 is added to the penalty to ensure that all unrealized contacts make a significant contribution to EPRED-CM.
The predicted contact map can help identify the highest GDT-TS models in the set, even when they are not highly similar to native. A good example of this is CASP7 target T0304 is a 122 residue α/β protein where the highest GDT-TS model in the set is Zhang-Server_TS1 (GDT-TS = 45.55). Most secondary structure predictors (including SSpro) failed to predict the first two strands making this target especially difficult. No QA method ranked the highest GDT-TS model first; however, SELECTpro ranked it second and the model ranked first by SELECTpro (T0304.Zhang-Server_TS4) has the second highest GDT-TS. These models have the lowest EPRED-CM of any models in the set, but the native structure has an even lower EPRED-CM. Figure 5 compares the native and predicted contact maps for target T0304.
Figure 5. Contact map comparison. True contact map of target T0304 in lower left, predicted contacts upper right. Contact is defined as Cα atoms within 12 Å. For predicted contacts with a sequence separation of at least six, 651 of 915 (71%) are correct.
EBETA: strand pairing
The formation of hydrogen bonds between the residues of β-strand partners is a major determinant of the tertiary structure of β and α/β proteins. The β hydrogen bonding treatment described here favors realistic strand pairing and sheet formation. The treatment also efficiently accommodates bulges in strands because it does not force the register between two paired strands. EBETA is the global strand pairing energy that penalizes the hydrogen bonding of β residues between strand pairs. Ebeta-sp(βk→βw) is the strand pairing energy of strand βk to strand βw. Ebeta-sp is only commutative if the two strands have the same length. Ebeta-hb(i,j) is the hydrogen bonding penalty between residues i and j.
Ebeta-sp is calculated for all possible strand pairings, but only the two lowest energies from each strand are used in EBETA. Other strand-strand interactions are ignored. In the equations below S is the set of all strands in the protein, βm1 is the strand with the minimum pairing energy from βk, and βm2 is the strand with the next lowest pairing energy from βk. If the strand count is less than six at least two of the strands must be edge strands. This is accounted for by only considering the single best strand partner for two strands.
In the equations for Ebeta-sp below, Sk is the set of all residues in strand βk. Each time Ebeta-hb is calculated the pair (i,j) is chosen with i from Sk and j from Sw, such that Ebeta-hb is minimized. Then residue i is removed from Sk, and residue j is removed from Sw. Ebeta-hb is calculated once for each residue in Sk. If Sk has more residues than Sw each unpaired residue is given maximum penalty of Ebeta-hb.
Between two anti-parallel strand partners, only every other pair of residues is hydrogen bonded. For the pairs that are not hydrogen bonded, a pseudo-bonding calculation is used. The hydrogen bonding energy and pseudo-bonding energy are both calculated and the minimum of the two is used in Ebeta-hb(i,j).
If residues i and j are paired in parallel strands, either i forms hydrogen bonds with j-1 and j+1, or j forms hydrogen bonds with i-1 and i+1. No hydrogen bonds are formed between the atoms of residues i and j. The hydrogen bonding energy is calculated for both possible conformations and only the minimum of the two is used in Ebeta-hb(i,j).
Φ(a→d) is the directional energy calculation for a single hydrogen bond where a is the index of the acceptor residue and d is the index of the donor residue. Three geometrical measures are used to estimate the strength of hydrogen bonds: the distance between the acceptor and the hydrogen atoms (ra,O,d,H), the angle at the acceptor atom (ua,C,a,O · ua,O,d,H), and the angle between the acceptor and donor atom vectors (ua,C,a,O · ud,N,d,H). The distance and acceptor atom angle parameters are motivated by the orientation-dependent hydrogen bonding potential described in . The following parameters were set based on idealized hydrogen bonding between β residues, with standard deviation values set such that two standard deviations approximate the cut-off in true hydrogen bonds. The ideal distance from hydrogen atom to accepting oxygen is Ihb-dist = 1.9 Ǻ, with standard deviation σhb-dist = 0.5 Ǻ. The ideal angle at the acceptor atom is 0°, so the ideal (ua,C,a,O · ua,O,d,H) is Iacc-dp = 1.0, with standard deviation σacc-dp = 0.11. The ideal angle between the acceptor and donor atom vectors is 180°, so the ideal (ua,C,a,O · ud,N,d,H) is Iacc-don-dp = -1.0, with standard deviation σacc-dp = 0.15. The parameters for pseudo-bonded residues are as follows: the ideal distance for ra,O,d,H is Ips-hb-dist = 7.9 Ǻ, Ips-acc-dp = -1.0, and Ips-acc-don-dp = -1.0. The standard deviations from the corresponding hydrogen bonding parameters above are used in Φps(a→d).
The penalty for the observed value (x) increases up to 6 standard deviations from the ideal value (μ).
All-Atom Energy Term Details
The all-atom energy terms depend on atom-atom interactions when all heavy atoms are included in the model. In the all-atoms energy equations x and y refer to atoms in the model and the residue positions are not referenced. The van der Waals radii and well-depths (εx, used in ELEN-JONES) come from the CHARMM19 parameter set . The side-chain hydrogen bonding term, ESC-HB, is described in detail here because it is unique to SELECTpro. The details of ELEN-JONES, ESOLVATION, and EELECTRO are provided in the Appendix.
ESC-HB: side-chain hydrogen bonding
ESC-HB penalizes unsatisfied hydrogen bond donor and acceptor atoms that are at least partially buried. There is no penalty for fully exposed donor or acceptor atoms. Exposure percent ( %) is calculated as . The definitions of and are provided in the description of ESOLVATION in the Appendix. Atoms at least 75% exposed are considered fully exposed and atoms less than 25% exposed are considered fully buried. For 25% <% < 75% the penalty weight is reduced linearly from 1.0 at 25% to 0 at 75%. The ideal distance from the acceptor atom to donor atom is Ihb-da-dist = 2.9 Ǻ. In the equations below donors is the set of all side-chain hydrogen donor atoms and acceptors is the set of all side-chain hydrogen acceptor atoms.
In the interest of completeness and reproducibility we include the details of the energy terms that are adapted from previous work.
Reduced Representation Energy Term Details
EBB-REP: backbone repulsion
This term penalizes steric clashes between non-bonded atoms explicitly represented in the reduced representation. The penalty for overlapping atoms is the overlap distance squared as defined here:
ECT-REP: centroid repulsion
A centroid-centroid repulsive term is used to reduce the overcrowding of side-chains in the reduced representation. The minimum distance between two centroids in the calculation is the minimum observed for each pair of residue types – DCT-min(aai,aaj) – in pdb_select25. The penalty for centroid-centroid overlaps is defined as the overlap distance squared:
ESTAT-ENV: residue environment potential
The motivation for this term is to model the hydrophobic effect. The level of burial for each residue in the model is estimated by the number of other Cβ atoms within 10 Ǻ (the contact number Ni) . The values in the table Ωstat-env reflect the likelihood of observing a particular Ni for each residue type. For model residues near both termini the contact number is artificially increased to account for the missing neighbors along the chain.
ESTAT-PW-CI: context independent pair-wise interactions
This context independent pair-wise potential comes from Equation 6 of . The potential considers the likelihood of observing the pair of centroids in a given distance bin relative to the background, with distance bins of < 5, 5–7, 7–10, 10–12, and > 12 Å. The advantage of a context independent pair-wise potential is that it is less vulnerable to over-fitting by a conformational search because of its generality.
ESTAT-PW-CD: context dependent pair-wise potential
This context specific pair-wise potential is from . This pair-wise potential depends on the local structure and relative orientation of both amino acids in the interaction. The statistics are calculated independently for each combination of local structures and relative orientations. At each position the local structure is considered either compact or open and the relative orientation is determined by the dot product of the Cα to Cβ unit vectors of each residue and divided into three classes: parallel, anti-parallel, and intermediate.
The radius of gyration is a simple measure of the global compactness of a domain. EROG penalizes models that are less compact than expected according to . If the radius of gyration of the model (λ) is less than the expected value (2.2N.38), there is no penalty. If it is greater, then the penalty is the squared difference between observed and expected. In the equation below ri,mean is the distance between the Cα of residue i and the mean of all Cαs in the model.
All-Atom Energy Term Details
ELEN-JONES: van der Waals forces
A fundamental characteristic of native globular protein structures is their efficient steric packing of atoms in the protein core. A Lennard-Jones 12-6 potential with damped repulsion (ELEN-JONES) is used to measure the quality of steric packing. ELEN-JONES is the sum of local energy calculations Elen-jones(x,y) performed on all pairs of non-bonded atoms. Since the repulsive portion of the standard Lennard-Jones 12-6 potential will overwhelm the entire energy function with a single significant atom-atom clash – repulsion is handled by a linear ramp from 0 to 10 as shown in the equation below . Since Elen-jones = 0 when (vdwx,y/rx,y) = independent of atom types, the switch to a linear ramp occurs when (vdwx,y/rx,y) > .
ESOLVATION: solvation effects
Solvation energy is calculated using the implicit solvation model described in  with the following adjustment: for overlapping atoms, the sum of their van der Waals radii is used in the calculation in place of the observed atom-atom distance in the model. This restricts the amount a single atom can contribute to the burial of another atom. Without this adjustment overlapping atoms will bias the calculation to indicate an atom is more buried than it would be otherwise. In the solvation model is the observed solvation free energy of atom x in the model, calculated as the free energy of the fully exposed atom () minus the reduction in solvation caused by the surrounding atoms. was determined empirically by setting it equal to and increasing its magnitude until of deeply buried atoms became zero. λx is the correlation length of atom x. Vy is the volume neighboring atom y. The values of these parameters come from , with the exception of . The equation for below is the combination of Equations 5,6, and 7 of , with the atom overlap adjustment.
Electrostatic interactions between charged atoms are treated by simple repulsion and attraction according to inverse distance squared. The use of distance squared rather than linear distance encourages the formation of salt bridges in the models. There is a correction for atom-atom distance below the minimum realistic value. The ideal distance between oppositely charged atoms is Ihb-da-dist = 2.75 Ǻ. In the equations below pos is the set of all positively charged atoms and neg is the set of all negatively charged atoms.
Availability and requirements
• Operating system: linux for stand alone version, server is platform independent
• Programming language: C++ and Perl
• Software requirements: Perl
• Disk space requirements: 1.6 Gb for full version, 13 Mb without feature predictors
AR and PB designed the novel energy terms. AR implemented the methods and carried out the experiments. AR and PB authored the manuscript. Both authors approved the manuscript.
Work supported by NIH grant LM-07443-01, NSF grants EIA-0321390 and IIS-0513376, and a Microsoft Faculty Research Award to PFB.
Methods Mol Biol 2000, 143:97-129. PubMed Abstract
Felts A, Gallicchio E, Wallqvist A, Levy R: Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model.
Oldziej S, Czaplewski C, Liwo A, Chinchio M, Nanias M, Vila JA, Khalili M, Arnautova YA, Jagielska A, Makowski M, Schafroth HD, Kazmierkiewicz R, Ripoll DR, Pillardy J, Saunders JA, Kang YK, Gibson KD, Scheraga HA: Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: Assessment in two blind tests.
Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D: Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins.
Zhang 2007 Decoy Sets [http://zhang.bioinformatics.ku.edu/I-TASSER/decoys/] webcite
Proteins 1999, 37(Suppl 3):22-29. Publisher Full Text
J Mach Learn Res 2003, 4:575-602. Publisher Full Text
J Chem Phys 1996, 105:1902-1921. Publisher Full Text