Abstract
Background
Successful protein structure prediction requires accurate lowresolution scoring functions so that protein main chain conformations that are close to the native can be identified. Once that is accomplished, a more detailed and timeconsuming treatment to produce allatom models can be undertaken. The earliest lowresolution scoring used simple distancebased "contact potentials," but more recently, the relative orientations of interacting amino acids have been taken into account to improve performance.
Results
We developed a new knowledgebased scoring function, LoCo, that locates the interaction partners of each individual residue within a local coordinate system based only on the position of its main chain N, C_{α }and C atoms. LoCo was trained on a large set of experimentally determined structures and optimized using standard sets of modeled structures, or "decoys." No structure used to train or optimize the function was included among those used to test it. When tested against 29 other published main chain functions on a group of 77 commonly used decoy sets, our function outperformed all others in C_{α }RMSD rank of the bestscoring decoy, with statistically significant pvalues < 0.05 for 26 out of the 29 other functions considered. LoCo is fast, requiring on average less than 6 microseconds per residue for interaction and scoring on commonlyused computer hardware.
Conclusions
Our function demonstrates an unmatched combination of accuracy, speed, and simplicity and shows excellent promise for protein structure prediction. Broader applications may include proteinprotein interactions and protein design.
Background
Protein structure prediction is a difficult problem for several reasons. The forces that determine structure are not fully understood, at least quantitatively [1]. While there is a good qualitative understanding of these forces, there is still no accurate way to calculate the free energy gained by the burial of hydrophobic atoms away from solvent. Nor can we accurately model the highly variable dielectric constant in the interior of a protein. In addition, to correctly predict the conformation of a protein we must first represent it in a computer, and any computational representation of that protein must significantly simplify its components and their interactions. Any change to one part of a protein we are trying to model may affect many other parts of that model.
The first descriptions of protein structure at atomic detail were given by Pauling, Corey and Branson in 1951 [210]. Only secondary structures were described, however, and not all of them have been observed in nature. Nevertheless, it was an extraordinary achievement. The first threedimensional protein fold described was the structure of myoglobin, solved by Kendrew, et al., in 1958 [11].
Difficulties using these descriptions to predict protein structure soon became apparent. In the late 1960s, it was noted that the number of possible conformations of a typical polypeptide chain is so large that it must have a pathway in the course of protein folding, the socalled "Levinthal Paradox [12,13]."
Protein structure can be viewed in a hierarchical manner, where oligomers are made up of polypeptide chains, which are made up of amino acids, which are made up of atoms. It may be considered conceptually to be determined hierarchically as well, with primary structure (the sequence) determining secondary and tertiary structure. Since each possible main chain conformation can have an astronomically large number of possible combinations of amino acid side chain arrangements, one approach to tackle the problem in a hierarchical way is by modeling a manageable number of the most likely main chain conformations before addressing the problem of amino acid side chains. In the initial stages, coarsegrained functions using highly simplified representations of amino acids are employed to quickly evaluate a large number of proposed main chain conformations [14,15]. Only a small fraction of these structures are selected for more detailed assessment. If the lowresolution functions used to select that small fraction are unable to discriminate nearnative main chains from incorrect ones, then a successful prediction is effectively impossible using this approach.
It is therefore necessary to sample all possible main chain conformations in such a way as to ensure that nearnative structures will be among those evaluated. As a practical consideration, it is also important to model the smallest number of main chain conformations needed to ensure that conformations good enough to be considered successful predictions (or able to lead to successful predictions) are among those sampled. Just as important, one must be able to evaluate the sampled conformations in reasonable computing time.
In this work, we address the problem of rapid and accurate evaluation of sampled conformations. To do this we use sets of "decoys"non and nearnative conformations of a given protein sequence that have been proposed in the course of protein structure prediction or generated by making alterations to a native structure. The goal is to be able to discriminate the native and nearnative conformations from the nonnative ones. Further, we focus on the problem of quickly and accurately assessing proposed main chain conformations, ignoring side chains.
Types of functions
There are two categories of functions that are applied to protein structures to evaluate their likelihood of being correct: physicsbased functions [16] and knowledgebased functions [17]. Physicsbased functions attempt to model the actual physics that determine the behavior of proteins. Knowledgebased functions are derived from statistical profiles taken from sets of known protein structures. To create these profiles, some representation of a protein or its constituent parts is chosen, then the known structures in the set are categorized according to the chosen representation. Functions derived from this profile allow any protein conformation to be evaluated according to how closely it corresponds to the profile.
When examining main chains only, no individual amino acid can be considered to be in any particular side chain conformation. Since this undetermined state does not correspond to any physical entity, knowledgebased functions must be used to evaluate it. These functions take a number of forms. One common approach is to measure the separations between all pairs of residues and apply the function to all of them that fall below a given cutoff distance [1838]. These separations are typically between C_{α }atoms, C_{β }atoms or presumed centers of mass for each residue. These socalled "contact" potentials depend on the identities of both residues. They typically make use of a pairwise matrix of interaction values that may or may not be adjusted for the distance between residues.
Since the early development of coarsegrained contact potentials, progress has been steady. While the interaction representations have remained similar, the discrimination power of the matrices has been improved. Some innovations have included quasichemical treatments [24,29,32], hydrophobic energies [21,29,39] and more sophisticated statistical treatments [28,33]. Still, even developers of these potentials have acknowledged their insufficiency for protein structure prediction by themselves [30,35]. More recent work has demonstrated further difficulties with statistical potentials based on preferential interactions [40,41].
Amino acid interaction potentials have begun to include the relative orientations of pairs of residues as well. Buchete, Straub and Thirumalai calculated a fivedimensional potential with a local coordinate system generated around the main chain C_{α }and side chain C_{β }and C_{γ }atoms [42]. Mukherjeee, Bhimalapuram and Bagchi developed their potential around a single ellipsoidal representation of the side chain [43]. Makino and Itoh achieved excellent discrimination of native structures from decoys with a sixterm orientationdependent potential [44]. Rykunov and Fiser made use of a "shuffled reference state" to improve the performance of their orientationdependent potential [45].
We continue this trend of using additional geometric information in the consideration of residueresidue interactions and present a new coarsegrained function for evaluating protein main chain conformations by scoring interactions between amino acids within a single polypeptide chain, using only the positions of main chain N, C_{α }and C atoms. All pairwise residueresidue interactions are actually considered to be two interactions: one from the perspective of each residue. All other residues within a specified cutoff distance are considered to be potential interaction partners, although we do exclude from scoring some number of immediate neighbors in the chain. We use a large precalculated database of interaction potentials for quick scoring.
Scoring is carried out by locating all interaction partners for any given residue within a local Cartesian coordinate system defined by that residue's main chain N, C_{α }and C atoms. This local coordinate system is divided into cubic 1Å bins, and every interaction partner is assigned to a bin. The score for any interaction is based on the likelihood of observing a particular residue at those locallydefined coordinates, given the type of the residue for which the coordinate system is constructed and the type of the interaction partner observed. This scoring function we have named LoCo (for Local Coordinates). It yields state of the art performance with a speed and simplicity that is unmatched by any other function at its level.
Methods
Overview
The fundamental idea behind LoCo scoring is that characteristic shapes of amino acids lead to characteristic geometric relationships between interacting residues in a native structure. The interior of a properly folded protein is tightly packed. Main chain atoms typically form a rigid planar structure between C_{α }atoms, and steric considerations generally confine the side chain atom positions into one of a number of rotamers. These restrictions on the overall shapes that amino acids generally indicate that there are a limited number of ways they will typically fit well together, both spatially and energetically.
The relationships we exploit are relative positions in threedimensional space. Most coarsegrained potentials have relied simply on distances between C_{α }atoms, C_{β }atoms or centers of mass [35]. By using additional dimensions to characterize residueresidue interactions, our method is more specific about which interactions are favorable and which are not. Since it has more dimensions, it requires a considerably larger and more detailed set of parameter tables than have generally been used, which is not a limitation it once was due to everincreasing storage and memory.
LoCo Methodology
LoCo scoring takes place within a local coordinate system defined by the main chain N, C_{α }and C atoms of the residue being scored (Figure 1) for any given amino acid. The C_{α }is at the origin of this coordinate system. The N coordinates define the yaxis, and position of the main chain C atom defines the x and zaxes. A different coordinate system is generated for each residue. We refer to the amino acid at the origin as the "observing" residue and all nearby residues eligible to interact with the observing residue as "partner" residues.
Figure 1. The "LoCo" local coordinate system. Shown is a single valine residue and its local coordinate system, seen looking down the positive xaxis. The center of the C_{α }atom is defined to be the origin. The positive yaxis passes through the main chain N coordinates. The positive X axis is placed such that the main chain C coordinates fall within the xy plane in the positive X direction. The coordinate system is righthanded. Both stick and sphere representations are presented for clarity.
To score an interaction using LoCo, the C_{α }atom of each partner residue is located within a particular 1Å cubic bin of the coordinate system of the observing residue (Figure 2). The partner residue is then assigned a score based on the likelihood of its being observed in that bin, given the types of both residues. The total score for any given observing residue is the sum of all the scores for its partners, and the score for the protein is the sum of all residue scores when every residue has been treated as an observing residue.
Figure 2. A single LoCo interaction. An interaction within a LoCo coordinate system is illustrated. Bins shown are approximately 3× actual size for clarity. Bin boundaries are counted from the origin. A single valine is placed as shown in Figure 1. The C_{α }atom of an interacting residue is displayed separately in green within bin +3, +5, +4.
The individual interaction scores are derived from statistics that have been obtained from a large set of nonhomologous protein domains. Here is the formula:
where S is the total score for all pairwise residueresidue interactions, and i and j are the observing and interacting residues, respectively. This is equivalent to the Boltzmann equation [46], where the quantity j_{obs}(xyz) is the number of times a residue of type i has observed a residue of type j in the training set at bin coordinates x, y and z in its local coordinate system. The reference state, j_{exp}(xyz) is the number of times a residue of type j would be expected to be observed at those coordinates. N_{1 }represents all amino acids in the polypeptide chain; N_{2 }represents only those residues that are eligible to be interaction partners for i.
We define the reference state j_{exp}(xyz) to be the mean number of observations of all residue types at bin coordinates xyz, which is the total number of observations at those coordinates for any residue types i and j divided by the total number of possible ij combinations, totaling 400 (since there are 20 amino acid types). This mean includes zerocount cases. Since we cannot take the logarithm of zero, bins with no observed counts are assigned a penalty equal to a some multiple of the worstscoring bin for any observed ij interaction.
The value of this zerocount penalty affects the accuracy of scoring, as do eligibility criteria for partner residues. Varying the cutoff distance for eligible partners affects performance. Since the C_{α} C_{α }separations across peptide bonds are effectively fixed and the number of wellpopulated Φ and Ψ angles fairly restricted, we did not score immediate sequence neighbors of the observing residue.
The values of these three parametersthe interaction cutoff distance, the number of neighboring residues to exclude from scoring and the size of the zerocount penaltywere chosen using a training group of decoy structures before the final version was evaluated on an independent group of decoy sets.
Function Training
Training of the LoCo function took place in two separate stages: generation of the scoring database and optimization of its parameters. Each database was generated by assigning a probabilitybased score to every possible state of the system using a large set of known protein structures that are held to be representative of correct structures. We presume that this set, although not complete, captures enough information about residueresidue interactions to be of predictive value. Parameter optimization involved finding the version of the function with the bestperforming set of values (from among those tested) for the three interaction parameters described above.
All observed counts for the generation of all versions of the LoCo scoring databases were taken from the ASTRAL 1.73 set [47] of 9527 nonhomologous protein domains. As noted above, we also used a "training group" of 154 decoy sets to find an optimal set of function parameters. All structures in both the training group and the testing group that were part of the ASTRAL set were removed before the potentials used to score each group were generated.
When optimizing our function parameters, we followed a process of tenfold crossvalidation to ensure that even within the training group no function was evaluated on a group of decoy sets that had been used to select it. Interatomic cutoff distances of 8Å 20Å were tested in 2Å increments. From 1 to 4 chain neighbors on both the N and Cterminal sides of the observing residue were not scored. We established a baseline zerocount penalty equal to the worst score calculated for each pair of residue types, then tested penalties equal to 1, 2 or 3 times the baseline. This gave us a total of 84 different versions of the LoCo scoring function.
The training group was divided into ten randomly selected subsetssix containing 15 decoy sets and four containing 16. Ten different groupings of nine of these subsets were scored using all 84 versions of the LoCo function, and the average C_{α }RMSD between the native and the bestscoring nonnative structure was calculated for each version. The version of LoCo with the lowest average C_{α }RMSD across all nine subsets was used to score the remaining subset. The LoCo version selected was the one with the lowest overall average C_{α }RMSD among all ten remaining subsets.
This tenfold crossvalidation procedure was carried out ten separate times to ensure that the outcome was not dependent on a particular random selection of the subsets. In every case the best performance was achieved with a cutoff distance of 14Å, with only a single residue on either side of each residue excluded from scoring and with a zerocount penalty equal to 3 times the worst observed score for each particular combination of amino acid types. This version of the LoCo scoring database was used for our final performance testing.
Decoy sets for evaluation of scoring functions
The purpose of protein main chain scoring functions is to discriminate nearnative from nonnative conformations. "Decoy" structures representing a mix of near and nonnative conformations for a particular amino acid sequence, commonly generated in the course of protein structure prediction, are often used to evaluate them. Such sets typically include the native structure.
We decided to follow the model of Makino and Itoh [44], to optimize parameters before we could test the scoring performance of LoCo. We used the same 231 decoy sets from the "Decoys R Us" database http://dd.compbio.washington.edu/ webcite[48], the 62protein "Rosetta" set from David Baker's group http://depts.washington.edu/bakerpg/decoys/rosetta_decoys_62proteins.tgz webcite, and the "moulder" set ftp://salilab.org/decoys/ webcite[49,50] from Andrej Sali's group. These are among the most widely used decoy sets in the field. We divided these 231 decoy sets into the same two groups, a 154 set group for function optimization (the "training group") and a 77 set group for performance evaluation (the "testing group").
Since we are pursuing main chain structure discrimination only, all side chain atoms except C_{β}s were removed from the decoys. Although C_{β }atoms are not part of the main chain, their positions do not change (at least ideally) as side chain conformations do, so they can be included in an initial search for main chain conformations. C_{β }atoms are not used in LoCo scoring, but are used by some of the other functions in our comparisons.
Function comparisons
Performance of the LoCo potential was tested against a total of 29 other published functions for main chain evaluation. Twentysix of these functions are from the Jernigan Lab's Knowledgebased Potential Server: http://gor.bb.iastate.edu/potential/ webcite, representing some of the widely used contact potentials of the last 30 years. Also among the 26 are 3 more recentlydeveloped functions from the Jernigan Groupthe Fourbody and Generalfourbody [38], and the Shortrange [27]that are not simple contact potentials.
The remaining 23 contact potentials are identified here with the same codes used on the Jernigan server: Qa, Qm, Qp [37], HLPL, MJPL [25], SKOa, SKOb, SJKG [29,34], MJ1, MJ2h, MJ3, MJ3h [20,24,32], TS [18], BT [31], BFKV [36], TD [26], TEl, TEs [35], RO [19], MS [23], GKS [22], VD [30], BL [21], and MSBM [28,33].
Three more modern potentials are considered as well. The program ProSa 2003 is from the group of Manfred Sippl [46,51,52] and is available from the Center of Applied Molecular Engineering: http://www.came.sbg.ac.at/ webcite. Two recently developed functions that explicitly take the relative orientations of interacting residues into account are DFMAC, by Makino and Itoh [44], and RF_CB_SRS_OD, by Rykunov and Fiser [45]. Executables of both are available from their authors.
The functions from the Jernigan Group server encompass a wide variety of approaches: the oldest (TS) was published in 1976, and the newest (the Fourbody and Generalfourbody) in 2007. Some are simple contact potentials that assign a score to all pairs or residues found within a given cutoff distance of one another. Other functions in the set assign distancedependent scores to pairs within the cutoff distance. Not all functions are purely knowledgebased: several use techniques such as quasichemical approximation or attempt to calculate hydrophobic energies. Some of the publications represented note the insufficiency of contact potentials alone for protein structure prediction.
ProSa 2003 generates three scores for every residue: a pair score, a surface score and a combined score. Scores used for comparison are the sum of all individual residue combined scores, which outperformed both the individual pair and surface terms. The potentials used were the "prosa2003.paircb" and "prosa2003.surfcb" included with the distribution.
The DFMAC function is a linear combination of six separate weighted pseudoenergy potentials involving pairwise C_{α }separations, relative orientations of pseudo C_{α}→ C_{β }vectors, main chaintomain chain pseudohydrogen bonding, Φ/Ψ angle pairings between residues, individual residue ω angles, and the number of other C_{α }atoms surrounding each C_{α }atom. These six potentials have sixteen independent parameters that were "tuned" on the same group of 154 decoy sets that we used for our parameter training. Once the most favorable set of those sixteen parameters was selected for that training group, the weights of all six components of the function were similarly optimized before the function was tested on the same 77 decoy set testing group we have used here.
The RF_CB_SRS_OD function groups residueresidue interactions into three categories: residues facing in the same direction, residue facing toward each other and residues facing away from each other. "Facing" in this context refers to the direction of each amino acid's C_{α}→ C_{β }vector. A "shuffled" reference state is created by randomizing the sequence position of all residues in the protein.
Performance Measures
We use five performance measures for native structure recognition: Rank_{nat}, RMSD_{best}, Z_{nat}, CC_{nat }and FE_{nat}. Eight measuresR_{B1}, R_{B10}, RMSD_{decoy}, Z_{decoy}, CC_{decoy}, FE_{decoy}, log(P_{B1}) and log(P_{B10})are used for decoy discrimination. Rank_{nat }is the score rank of the native structure among all decoys. RMSD_{best }is the C_{α }RMSD of the bestscoring structure, including the native. Z_{nat }is the Zscore of the score of the native structure relative to all other scores (native included) in that decoy set. CC_{nat }is the Pearson's correlation coefficient between score and C_{α }RMSD for all structures in the set, including the native. FE_{nat }is the fraction enrichment among all decoys (native included) after scoring. The fraction enrichment is defined as the fraction of the top 10% of our structures by C_{α }RMSD that are found among the top 10% by score. We express the fraction enrichment as a percentage for clarity.
R_{B1 }is the C_{α }RMSD ranking among decoys only (native excluded) of our bestscoring structure. R_{B10 }is the lowest C_{α }RMSD rank among the 10 bestscoring structures from the decoy set (not including the native). RMSD_{decoy }is the Cα RMSD of the bestscoring structure, excluding the native. Z_{decoy }is the Zscore of the score of the lowestRMSD decoy relative to all other scores (not including the native) in that decoy set. CC_{decoy }is the correlation coefficient between score and C_{α }RMSD for all structures in the set, excluding the native. FE_{decoy }is the fraction enrichment among all decoys (native excluded) after scoring. The measures log(P_{B1}) and log(P_{B10}) are the common logarithms of the probabilities of selecting the R_{B1 }and R_{B10 }structures. These probabilities are simply the values of R_{B1 }and R_{B10 }divided by the total number of decoy structures in the set (excluding the native).
Results
Native recognition vs. decoy discrimination
The performance measures we use fall into two categories: native recognition and decoy discrimination. Native recognition is the ability to recognize the native structure from among all decoys in the set. Decoy discrimination is the ability to pick out one or more nearnative structures within the set. A good scoring function should be able to pick out the native, at a minimum. However, the likelihood of reproducing a completely correct structure in the course of sampling different conformations is quite low. For practical use, a good scoring function must be able to distinguish nearnatives from nonnative structures.
Training and testing group comparison
We used separate groups of decoy sets for optimizing the variable parameters of LoCo and for testing its performance against other functions. A comparison of LoCo scores achieved with training and testing groups is in Tables 1 and 2. Table 1 shows the differences between these groups in native structure recognition. Table 2 shows these differences for decoy discrimination. Roughly comparable results were achieved with both groups, though the test group did yield somewhat better results across the board.
Table 1. Training vs. testing groups: native recognition test
Table 2. Training vs. testing groups: decoy discrimination test
We consider decoys that are less than 5Å C_{α }RMSD from the native to be "near native" structures and decoys that are less than 2Å to be "very near native." We include the numbers of near native and very near native structures found with our "native recognition" measure in Table 1. For the training group, the bestscoring structure in each set was very near native for 112 of 154 decoy sets (72.7% of the time) and near native for 136 sets (88.3% of the time). For the test group, the bestscoring structures were very near native in 60 of 77 cases (77.9%) and near native for 70 of 77 sets (90.9% of the time). The average C_{α }RMSD (from the native) of the bestscoring structures from all of the training group was 2.10Å. For the test group it was 1.62Å, a difference of less than 0.5Å. All performance measures, with the exception of numbers of near native and very near native structures, are explained in Performance measures at the end of Methods.
Differences between training and testing groups were smaller for decoy discrimination. The difference between the average C_{α }RMSD (from the native) for the bestscoring nonnative structure was less than 0.25Å between the groups. It is perhaps not surprising that these measures were so close, since that was the metric for which the training group was optimized. Again, test group measures were somewhat better but not largely so, with the exception of R_{B10}, indicating that LoCo was significantly more able to place one of the ten nearestnative structures among its ten topscoring decoys.
Main chain function performance
Native recognition performance is demonstrated in Table 3. The performance of the top four functions, LoCo, DFMAC, RF_CB_SRS_OD and ProSa 2003, is superior to that of the remaining potentials. LoCo outperforms every function except DFMAC. However, the relatively larger differences between LoCo and DFMAC in Rank_{nat }and Z_{nat }may partly be due to the inclusion of an ωangle component in DFMAC, which is of limited practical utility (see Omega angles, in Discussion).
Table 3. Function comparison: native recognition
Decoy discrimination is shown in Table 4. Again, LoCo and DFMAC were the top two functions in most measures. LoCo had the best R_{B10}, RMSD_{decoy }and log(P_{B1}). It was slightly lower than DFMAC in Z_{decoy}, CC_{decoy }and FE_{decoy}, and it was slightly higher than ProSa 2003 in log(P_{B10}).
Table 4. Function comparison: decoy discrimination
Allatom function comparison
To get a sense of how our main chainonly function compares to available allatom functions, we tested four widelyused potentials that work with all heavy atom coordinates on the same final testing group of decoys we have used throughout. The potentials chosen were RAPDF [53], dDFIRE [54,55], DOPE [50] and RF_HA_SRS [45]. These functions require that all side chain atoms be included and their positions determined in every structure to be scored.
LoCo performance compared to these four potentials for native structure recognition is shown in Table 5, while performance for decoy discrimination is shown in Table 6. The performance of LoCo was quite comparable to these higherresolution functions. LoCo outperformed all four in Rank_{nat}, R_{B10 }and log(P_{B10}). It placed no worse than third (of five) in every performance metric except R_{B1}.
Table 5. LoCo vs. allatom potentials: native recognition
Table 6. LoCo vs. allatom potentials: decoy discrimination
Speed
LoCo is extremely fast, particularly compared to other functions that are based on explicit distance calculations and table lookups. Scoring for LoCo was carried out on an Apple iMac with a 2.4 GHz Intel Core 2 Duo processor with 4 GB of memory. The function was written in C++ and compiled using GNU g++ 4.2.
The average total processing time for a single structure in the final testing group was 2.6 milliseconds. This time includes reading the structure from the hard disk drive, loading it into the program, determining all relevant interactions, scoring the structure and clearing it from memory. The average time for interaction determination and scoring only was 0.47 ms. The numbers of residues per structure in the final testing group varied from 31 to 274, so the standard deviations for total processing time and interaction determination and scoring time were relatively large: 1.4 ms (54%) and 0.32 ms (68%), respectively.
On a "per residue" basis the times were more consistent. The average total processing time per residue was 0.032 ms with a standard deviation of 0.0037 ms (12%). The average interaction determination and scoring time per residue was 0.0054 ms with a standard deviation of 0.0011 ms (20%).
The time taken by LoCo to score the entire final testing set of 39,611 structures, including reading the scoring databases, input structures, and writing the output score files is ~4 minutes. We were unable to determine the amount of time needed by DFMAC or any of the server functions to score the entire final testing group, but ProSa 2003 takes ~121 minutes and RF_CB_SRS_OD takes 11 minutes.
In contrast an all atom scoring function, RAPDF [53], pioneered by our group takes several seconds on average to score a structure from scratch as described above, and about one second for interaction determination and scoring only. The backbone only version of this function is about ten fold faster but still takes about 100 ms per structure. Thus a very rough comparison indicates that LoCo is approximately two orders of magnitude faster compared to traditional distance bin based potentials of mean force.
Statistical significance
To assess the statistical significance of differences between potentials in the distribution of ranks, we performed pairwise onetailed Wilcoxon tests on all tested functions. We used R_{B1}, the C_{α }RMSD rank (among decoys only) of the bestscoring decoy structure as our tested distribution. We felt that this was the closest suitable metric to RMSD_{decoy}, the one on which the LoCo potential was parameterized. We also believe that it best encompasses our primary goal of picking out the nearest native decoy structures. Results of this test are in Figure 3.
Figure 3. Statistical significance of differences in rank distributions. C_{α }RMSD rank distributions for the bestscoring nonnative structures for all functions are compared. Pvalues show the likelihood that better rank distributions for the function on the left are the result of chance. Pvalues less than 0.05 have been colored in red, showing statistically significant differences in these distributions. These ranks are among decoy structures only. The null hypothesis of this onetailed Wilcoxon test is that neither distribution is lower than the other. The alternative hypothesis is that functions on the left achieved lower ranks for their bestscoring decoys than functions along the top.
Our null hypothesis was that neither function performed better in the the distributions of these ranks, and our alternative hypothesis was that the function in the leftmost column of Figure 3 had a distribution of ranks that were lower than that of the function in the column across the top, showing that the functions in the left column performed better. The Wilcoxon test was used because the rank distributions being compared are far from normal.
The large number of red values in the top four rows (pvalue < 0.05) show that LoCo, DFMAC, RF_CB_SRS_OD and ProSa 2003 have statistically significant differences in rank distribution from most of the other 26 functions, based on the hypothesis that their distributions are lower. These pvalues represent the likelihood that the better rankings for the functions on the left could have come about by chance. The ranks for LoCo were better than all other functions, since all pvalues were < 0.5. However, these rank distributions vs. DFMAC, RF_CB_SRS_OD and ProSa 2003 were not below the statistical significance threshold of 0.05.
Discussion
Relative importance of performance measures
The primary goal of a main chainonly scoring function is to identify proposed main chain conformations that are reasonably likely to be close enough to the native structure to be kept for more detailed evaluation. A large number of possible main chains are typically tried, and the likelihood that any of them will be exactly the same as the native is very small. For this reason, we believe that good performance in decoy discrimination is more important than good performance in native structure recognition.
We also consider R_{B1}, R_{B10}, RMSD_{decoy}, log(P_{B1}) and log(P_{B10}) to be more important to the goal of selecting a relatively small number of near native decoys than Z_{decoy}, CC_{decoy }and FE_{decoy}. R_{B1 }and R_{B10 }inform whether or not the very bestscoring decoys are among the very closest to the native. RMSD_{decoy }tells how close to correct the bestscoring decoy is. Log(P_{B1}) and log(P_{B10}) gives us measures of how meaningful the R_{B1 }and R_{B10 }values are.
Other metrics, while still valuable, are less directly related to the goal of finding near native structures. Z_{decoy }measures how far from the mean score our best decoy is, but what matters most is whether we can identify it. CC_{decoy }reveals the correspondence between score and RMSD across the entire set, but this correspondence is of little importance for poor decoys that will be rejected. FE_{decoy }assesses performance with the top 10% of decoys, but at the initial main chain evaluation stage, we are likely to be keeping far fewer than 10% of the main chain conformations we examine.
LoCo vs. DFMAC
LoCo outperformed all other functions in R_{B10}, RMSD_{decoy}, and log(P_{B1}), three of the five measures most important for finding near native decoys. It was only slightly higher than ProSa 2003 at log(P_{B10}). Its R_{B1 }was higher than many of the other functions, but since any initial main chain search will keep more than one structure for further evaluation, LoCo's lowest R_{B10 }should be considered more relevant. At native structure recognition, LoCo's performance was just behind that of DFMAC in all categories, although it was still substantially better than the remaining 28 functions.
While we consider the performance of LoCo and DFMAC to be roughly comparable, we believe that LoCo has clear practical advantages over DFMAC. DFMAC is a weighted composite of six separate functions that require the creation of pseudoN, O, H and  C_{β }atoms for every residue as well as the calculation of at least five angles between vectors for every residueresidue interaction and three dihedral angles for every residue. These angle calculations are computationally expensive and must be repeated for every new main chain conformation.
LoCo, on the other hand, was designed to be extremely fast. Every residueresidue interaction requires only a single lookup from the potential database. The initial C_{α}→ C_{α }vector between any two residues being scored undergoes a single matrix rotation into the local coordinate system of the observing residue, where it is then binned and the score for the interaction is looked up. The initial generation of the rotation matrix that defines the local coordinate system does require several computationally expensive square root and trigonometric operations per residue, but all translations and rotations of the main chain after that require only simple arithmetic floatingpoint operations, including rotating the coordinate system.
DFMAC was also finely tuned to its training set, with sixteen independent parameters and five weights optimized to give the best possible performance. These training procedures were carried out with rigor to ensure that no structure was scored using parameters that had been trained on it, but all decoy sets used for training had been generated using the same methods employed to create the decoy sets in the testing group. 15 of the 77 decoy sets in the final testing group had as their native structures proteins that appeared in the training group as part of decoy sets generated by alternate methods. In 12 of those 15 sets the native was correctly identified by DFMAC. It is unclear to us that the values of those parameters and weights used by DFMAC will be optimal for the prediction of protein structures more generally.
LoCo, on the other hand, is largely insensitive to changes in its parameters. We compared the best, worst and average values for each individual performance measure across all 84 LoCo parameter sets with the performance of DFMAC, ProSa 2003, and RF_CB_SRS_OD. We also compared them with the best, worst and average values for all 26 functions from the Jernigan Lab server.
Tables 7 and 8 indicate that the differences in performance between LoCo parameter sets were not large. For native recognition (Table 7), the average value for LoCo across all 84 parameter sets in any of the five performance measures were still better than for any potential other than DFMAC. The worst LoCo value was better than the best value for any of the Jernigan server potentials in 4 out of 5 cases, and the worst LoCo CC_{nat }of 0.403 was only 0.007 lower than the best Jernigan server CC_{nat }of 0.410.
For decoy discrimination (Table 8), the best value for LoCo across all parameter sets was better than any other function for all performance measures, with the exception of Z_{decoy }and CC_{decoy }for DFMAC. The average value for LoCo across all sets was better than the best values from the Jernigan server potentials for 6 out of 8 measures. It was also better than RF_CB_SRS_OD for 7 of 8 measures, with a slightly worse Z_{decoy}.
Omega angles
The DFMAC function includes an ω angle term. The ω is the main chain dihedral angle between the C_{α}→C vector of one residue and the C_{α}→N vector of the following residue. In an experimentally determined structure these angles are typically clustered around 180° because of the partially doublebonded character of most C_{α}→C→N→ C_{α }groups. There are usually a few places within any main chain where the planarity of this system is broken to make energetically favorable interactions elsewhere, but the great majority of native ω angles are within 15° to either side of a planar 180° separation.
It is unlikely that any initial main chain conformational search would include variations of the ω angle, since that would introduce unnecessary degrees of freedom to achieve only slight differences in the overall structure. An ω angle function can, however, be quite effective at distinguishing native main chain geometry from that of computergenerated decoys. This is because these variations are often more characteristic of the method used to generate the decoys than of structural correctness.
To demonstrate this point, we created a very simple ω angle discrimination function. It calculates the standard deviation of all individual ω angles for any main chain that are within 15° of 180° apart. The score for each main chain is the magnitude of the difference (in degrees) between its own standard deviation and the mean of all the standard deviations in the decoy set.
For purposes of illustration only, we have included this function in Tables 9 and 10 and have compared it to the performance of LoCo and of DFMAC both with and without the ω angle score component. For native recognition (Table 9), our ωonly function is able to recognize native structures (Rank_{nat}) very nearly as well as DFMAC without an ω angle component. The Z_{nat }of the ωonly function is more than twice as great as either version of DFMAC. Its RMSD_{best }is better than every function tested except LoCo, DFMAC and ProSa 2003, and it is within 0.01Å of ProSa 2003.
For DFMAC, Z_{nat }improves noticeably and Rank_{nat }improves significantly with the inclusion of the ω angle component while RMSD_{best }and FE_{nat }decline slightly. This mirrors the very good scores of the ωonly function for Rank_{nat }and Z_{nat }and its relatively poor performance at FE_{nat}. The slight decline in RMSD_{best }for DFMAC when the ω angle component is included must be considered an artifact of the tenfold crossvalidation used when weighting the various DFMAC components. This is because that performance measure was the one being optimized and because the ω angle component was assigned a positive weight.
With native structures removed (Table 10), the decoys selected by our ωonly function are effectively random. DFMAC performance improves slightly across the board without the ω component. This suggests that using ω angles improves some performance measures of native structure recognition but degrades decoy discrimination.
LoCo Applications
LoCo potentials combine speed, accuracy and ease of implementation. They should be of use in a variety of structure prediction tasks, including both template based (homology) and template free (ab initio) modeling. We anticipate that they will be accurate enough to allow for improved main chainonly refinement of template based models before they are treated at the allatom level.
We expect that our potentials will be useful for protein design applications as well. Currently successful sequence search algorithms must evaluate structures at an allatom level [56,57]. This means that they cannot fully sample the sequence space but must rely on more restricted search techniques, such as a Monte Carlo method [58]. A sufficiently accurate main chainonly potential function should allow the entire sequence space to be searched, treating design as a combinatorial optimization problem, much like choosing side chain conformations.
With its speed and accuracy, LoCo is a good candidate for such an application. The stablest possible sequence for a given main chain is the global minimum energy conformation (GMEC). A lowresolution function like LoCo would be unlikely to arrive at the GMEC, but it would not need to. The LoCodesigned sequence would only need to be stable enough for the desired application. Even if the LoCodesigned sequence was not stable enough to be used, it should provide a good starting point for further refinement using allatom methods.
Future directions
While these potentials have been developed for and with complete polypeptide chains, there may well be value in developing individual potentials for secondary structure elements and loops. Such potentials may be able to aid in the recognition of helices and sheets within sequences for which no homolog is known, and loopspecific functions may aid in faster and more accurate modeling of the most challenging aspect of protein structure prediction. As noted above, we hope that LoCo will allow for a broader search of protein sequence space in design applications.
The idea behind LoCo scoring should also work for lowresolution screening of docked proteinprotein complexes. Currently, initialstage docking programs are dominated by gridbased algorithms [59] that rely on fastfouriertransforms (FFTs) to provide the speed necessary to sample all possible docked conformations in a reasonable amount of time, which may be improved by a LoCo type potential for docking.
Conclusions
We present a novel scoring function, "LoCo," for evaluating protein main chain conformations. Our method considers relative positioning in all three dimensions and examines every interaction from the perspective of both partners, in contrast with every other function it was tested against. A number of recentlydeveloped potentials have achieved improved performance over more traditional contact potentials by considering the relative orientations of two interacting residues.
LoCo provides an unprecedented combination of speed and accuracy. Once an interaction has been characterized by the identities of the participating residues and their relative positions, a single lookup gives the score for that particular interaction. This function has many potential uses in the field of protein structure prediction, and since a local coordinate system can be generated for any chiral group of atoms, there are many possible ways the fundamental concept could be applied.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SM conceived the function, wrote the software, carried out all function training and decoy testing and drafted the manuscript. RS supervised the research, edited the manuscript, and provided intellectual mentorship. All authors read and approved the final manuscript.
Acknowledgements
The authors wish to thank the School of Dentistry and the Department of Oral Biology at the University of Washington for their support. SM was supported by and this work was carried out under NIH grant T32DE07132. RS was supported by NIH grant 5DP1OD6779 and CAREER grant IIS0448502 (20052010).
References

Dill KA: Dominant forces in protein folding.
Biochemistry 1990, 29(31):71337155. PubMed Abstract  Publisher Full Text

Pauling L, Corey RB, Branson HR: The structure of proteins; two hydrogenbonded helical configurations of the polypeptide chain.
Proc Natl Acad Sci USA 1951, 37(4):205211. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: Atomic coordinates and structure factors for two helical configurations of polypeptide chains.
Proc Natl Acad Sci USA 1951, 37(5):235240. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: The structure of synthetic polypeptides.
Proc Natl Acad Sci USA 1951, 37(5):241250. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: The pleated sheet, a new layer configuration of polypeptide chains.
Proc Natl Acad Sci USA 1951, 37(5):251256. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: The structure of feather rachis keratin.
Proc Natl Acad Sci USA 1951, 37(5):256261. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: The structure of hair, muscle, and related proteins.
Proc Natl Acad Sci USA 1951, 37(5):261271. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: The structure of fibrous proteins of the collagengelatin group.
Proc Natl Acad Sci USA 1951, 37(5):272281. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: The polypeptidechain configuration in hemoglobin and other globular proteins.
Proc Natl Acad Sci USA 1951, 37(5):282285. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pauling L, Corey RB: Configurations of Polypeptide Chains With Favored Orientations Around Single Bonds: Two New Pleated Sheets.
Proc Natl Acad Sci USA 1951, 37(11):729740. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC: A threedimensional model of the myoglobin molecule obtained by xray analysis.
Nature 1958, 181(4610):662666. PubMed Abstract  Publisher Full Text

Levinthal C: Are there pathways for protein folding?
Journal de Chimie Physique et de PhysicoChimie Biologique 1968, 65(1):4445.

Levinthal C: How to Fold Graciously. In Mossbauer Spectroscopy in Biological Systems: 1969; Allerton House, Monticello, IL. University of Illinois Press, Urbana, IL; 2224.

HeadGordon T, Brown S: Minimalist models for protein folding and design.
Curr Opin Struct Biol 2003, 13(2):160167. PubMed Abstract  Publisher Full Text

Tozzini V: Coarsegrained models for proteins.
Curr Opin Struct Biol 2005, 15(2):144150. PubMed Abstract  Publisher Full Text

Boas FE, Harbury PB: Potential energy functions for protein design.
Curr Opin Struct Biol 2007, 17(2):199204. PubMed Abstract  Publisher Full Text

Poole AM, Ranganathan R: Knowledgebased potentials in protein design.
Curr Opin Struct Biol 2006, 16(4):508513. PubMed Abstract  Publisher Full Text

Tanaka S, Scheraga HA: Medium and LongRange Interaction Parameters between Amino Acids for Predicting ThreeDimensional Structures of Proteins.
Macromolecules 1976, 9(6):945950. PubMed Abstract  Publisher Full Text

Robson B, Osguthorpe DJ: Refined models for computer simulation of protein folding. Applications to the study of conserved secondary structure and flexible hinge points during the folding of pancreatic trypsin inhibitor.
J Mol Biol 1979, 132(1):1951. PubMed Abstract  Publisher Full Text

Miyazawa S, Jernigan RL: Estimation of effective interresidue contact energies from protein crystal structures: quasichemical approximation.
Macromolecules 1985, 18(3):534552. Publisher Full Text

Bryant SH, Lawrence CE: An empirical energy function for threading protein sequence through the folding motif.
Proteins 1993, 16(1):92112. PubMed Abstract  Publisher Full Text

Godzik A, Kolinski A, Skolnick J: Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets.
Protein Sci 1995, 4(10):21072117. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mirny LA, Shakhnovich EI: How to derive a protein folding potential? A new approach to an old problem.
J Mol Biol 1996, 264(5):11641179. PubMed Abstract  Publisher Full Text

Miyazawa S, Jernigan RL: Residueresidue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading.
J Mol Biol 1996, 256(3):623644. PubMed Abstract  Publisher Full Text

Park B, Levitt M: Energy functions that discriminate Xray and near native folds from wellconstructed decoys.
J Mol Biol 1996, 258(2):367392. PubMed Abstract  Publisher Full Text

Thomas PD, Dill KA: An iterative method for extracting energylike quantities from protein structures.
Proc Natl Acad Sci USA 1996, 93(21):1162811633. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bahar I, Kaplan M, Jernigan RL: Shortrange conformational energies, secondary structure propensities, and recognition of correct sequencestructure matches.
Proteins 1997, 29(3):292308. PubMed Abstract  Publisher Full Text

Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions.
J Mol Biol 1997, 268(1):209225. PubMed Abstract  Publisher Full Text

Skolnick J, Jaroszewski L, Kolinski A, Godzik A: Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct?
Protein Sci 1997, 6(3):676688. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Vendruscolo M, Domany E: Pairwise contact potentials are unsuitable for protein folding.
The Journal of Chemical Physics 1998, 109(24):1110111108. Publisher Full Text

Betancourt MR, Thirumalai D: Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes.
Protein Sci 1999, 8(2):361369. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Miyazawa S, Jernigan RL: Selfconsistent estimation of interresidue protein contact energies based on an equilibrium mixture approximation of residues.
Proteins 1999, 34(1):4968. PubMed Abstract  Publisher Full Text

Simons KT, Ruczinski I, Kooperberg C, Fox BA, Bystroff C, Baker D: Improved recognition of nativelike protein structures using a combination of sequencedependent and sequenceindependent features of proteins.
Proteins 1999, 34(1):8295. PubMed Abstract  Publisher Full Text

Skolnick J, Kolinski A, Ortiz A: Derivation of proteinspecific pair potentials based on weak sequence fragment similarity.
Proteins 2000, 38(1):316. PubMed Abstract  Publisher Full Text

Tobi D, Shafran G, Linial N, Elber R: On the design and analysis of protein folding potentials.
Proteins 2000, 40(1):7185. PubMed Abstract  Publisher Full Text

Bastolla U, Farwer J, Knapp EW, Vendruscolo M: How to guarantee optimal stability for most representative structures in the Protein Data Bank.
Proteins 2001, 44(2):7996. PubMed Abstract  Publisher Full Text

Boniecki M, Rotkiewicz P, Skolnick J, Kolinski A: Protein fragment reconstruction using various modeling techniques.
J Comput Aided Mol Des 2003, 17(11):725738. PubMed Abstract  Publisher Full Text

Feng Y, Kloczkowski A, Jernigan RL: Fourbody contact potentials derived from two protein datasets to discriminate native structures from decoys.
Proteins 2007, 68(1):5766. PubMed Abstract  Publisher Full Text

Casari G, Sippl MJ: Structurederived hydrophobic potential. Hydrophobic potential derived from Xray structures of globular proteins is able to identify native folds.
J Mol Biol 1992, 224(3):725732. PubMed Abstract  Publisher Full Text

Mittal A, Jayaram B: Backbones of folded proteins reveal novel invariant amino acid neighborhoods.
J Biomol Struct Dyn 2011, 28(4):443454. PubMed Abstract  Publisher Full Text

Mittal A, Jayaram B, Shenoy S, Bawa TS: A stoichiometry driven universal spatial organization of backbones of folded proteins: are there Chargaff's rules for protein folding?
J Biomol Struct Dyn 2010, 28(2):133142. PubMed Abstract  Publisher Full Text

Buchete NV, Straub JE, Thirumalai D: Anisotropic coarsegrained statistical potentials improve the ability to identify nativelike protein structures.
The Journal of Chemical Physics 2003, 118(16):76587671. Publisher Full Text

Mukherjee A, Bhimalapuram P, Bagchi B: Orientationdependent potential of mean force for protein folding.
J Chem Phys 2005, 123(1):014901. PubMed Abstract  Publisher Full Text

Makino Y, Itoh N: A knowledgebased structurediscriminating function that requires only mainchain atom coordinates.
BMC Struct Biol 2008, 8:46. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Rykunov D, Fiser A: New statistical potential for quality assessment of protein models and a survey of energy functions.
BMC Bioinformatics 2010, 11:128. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sippl MJ: Calculation of conformational ensembles from potentials of mean force. An approach to the knowledgebased prediction of local structures in globular proteins.
J Mol Biol 1990, 213(4):859883. PubMed Abstract  Publisher Full Text

Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004.

Samudrala R, Levitt M: Decoys 'R' Us: a database of incorrect conformations to improve protein structure prediction.
Protein Sci 2000, 9(7):13991401. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

John B, Sali A: Comparative protein structure modeling by iterative alignment, model building and model assessment.
Nucleic Acids Res 2003, 31(14):39823992. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shen MY, Sali A: Statistical potential for assessment and prediction of protein structures.
Protein Sci 2006, 15(11):25072524. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sippl MJ: Boltzmann's principle, knowledgebased mean fields and protein folding. An approach to the computational determination of protein structures.
J Comput Aided Mol Des 1993, 7(4):473501. PubMed Abstract  Publisher Full Text

Samudrala R, Moult J: An allatom distancedependent conditional probability discriminatory function for protein structure prediction.
J Mol Biol 1998, 275(5):895916. PubMed Abstract  Publisher Full Text

Yang Y, Zhou Y: Specific interactions for ab initio folding of protein terminal regions with secondary structures.
Proteins 2008, 72(2):793803. PubMed Abstract  Publisher Full Text

Zhou H, Zhou Y: Distancescaled, finite idealgas reference state improves structurederived potentials of mean force for structure selection and stability prediction.
Protein Sci 2002, 11(11):27142726. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Safi M, Lilien RH: Restricted deadend elimination: protein redesign with a bounded number of residue mutations.
J Comput Chem 2010, 31(6):12071215. PubMed Abstract  Publisher Full Text

Georgiev I, Donald BR: Deadend elimination with backbone flexibility.
Bioinformatics 2007, 23(13):i185194. PubMed Abstract  Publisher Full Text

Das R, Baker D: Macromolecular modeling with rosetta.
Annu Rev Biochem 2008, 77:363382. PubMed Abstract  Publisher Full Text

Vajda S, Kozakov D: Convergence and combination of methods in proteinprotein docking.
Curr Opin Struct Biol 2009, 19(2):164170. PubMed Abstract  Publisher Full Text  PubMed Central Full Text