Abstract
Background
Empirical scoring functions have proven useful in protein structure modeling. Most such scoring functions depend on protein side chain conformations. However, backboneonly scoring functions do not require computationally intensive structure optimization and so are well suited to protein design, which requires fast score evaluation. Furthermore, scoring functions that account for the distinctive relative position and orientation preferences of residue pairs are expected to be more accurate than those that depend only on the separation distance.
Results
Residue pair scoring functions for fixed backbone protein design were derived using only backbone geometry. Unlike previous studies that used spherical harmonics to fit 2D angular distributions, Gaussian Mixture Models were used to fit the full 3D (position only) and 6D (position and orientation) distributions of residue pairs. The performance of the 1D (residue separation only), 3D, and 6D scoring functions were compared by their ability to identify correct threading solutions for a nonredundant benchmark set of protein backbone structures. The threading accuracy was found to steadily increase with increasing dimension, with the 6D scoring function achieving the highest accuracy. Furthermore, the 3D and 6D scoring functions were shown to outperform side chaindependent empirical potentials from three other studies. Next, two computational methods that take advantage of the speed and pairwise form of these new backboneonly scoring functions were investigated. The first is a procedure that exploits available sequence data by averaging scores over threading solutions for homologs. This was evaluated by applying it to the challenging problem of identifying interacting transmembrane alphahelices and found to further improve prediction accuracy. The second is a protein design method for determining the optimal sequence for a backbone structure by applying Belief Propagation optimization using the 6D scoring functions. The sensitivity of this method to backbone structure perturbations was compared with that of fixedbackbone allatom modeling by determining the similarities between optimal sequences for two different backbone structures within the same protein family. The results showed that the design method using 6D scoring functions was more robust to small variations in backbone structure than the allatom design method.
Conclusions
Backboneonly residue pair scoring functions that account for all six relative degrees of freedom are the most accurate and including the scores of homologs further improves the accuracy in threading applications. The 6D scoring function outperformed several side chaindependent potentials while avoiding timeconsuming and error prone side chain structure prediction. These scoring functions are particularly useful as an initial filter in protein design problems before applying allatom modeling.
Background
Empirical scoring functions, or knowledgebased potentials, have been successfully applied to a wide variety of problems in biomolecular modeling. These scoring functions are derived from the statistics of residue or atomic interactions as observed in experimental protein structures. Most such scoring functions depend on amino acid side chain conformations, either through the definition of residue contacts [13] or through the dependence of the potentials on the distances between side chain atoms or residue centroids [46].
Another class of scoring functions, which includes the ones considered in this study, depend only on the protein backbone conformation. These scoring functions have two significant advantages: (1) they can be directly applied to proteins structures with unknown or uncertain side chain conformations and (2) they can be rapidly calculated. Because they are independent of the side chain conformations, these scoring functions can be applied to proteins for which the correct side chain conformations are not available from experimental structures. In such cases, computationally expensive and potentially inaccurate side chain optimization would be necessary in order to evaluate a side chaindependent scoring function for the protein structure. Comparative modeling results have shown that although successful methods usually agree on the core backbone structure, they have considerably more difficulty in predicting correct side chain conformations [7]. Also, because backboneonly scoring functions do not require side chain optimization they can also be used to evaluate protein structures more quickly. Protein design applications require such a fast but approximate evaluation of protein stability for a large number of sequences. Because of the highdimensional search space for side chain geometry optimization, empirical scoring functions that depend only on backbone geometry offer a computationally efficient solution to these tasks. Finally, even modeling problems in which a complete atomic structure of a protein is required can be efficiently accomplished using a multiscale approach in which potential structures are first filtered using the fast backboneonly scoring function and the remaining structures then subjected to more timeconsuming allatom refinement and rescoring.
The nonuniform angular distributions of side chains about residues of different types observed in protein structures [8] suggests that residue pair scores that depend on the relative position and orientation of the residues will discriminate native from nonnative protein structures better than scores that depend only on the residue separation distance. Some previously developed scoring functions incorporate this orientation dependence through the orientation of residue side chains [911]. The scoring functions derived in this study also incorporate orientation dependence but, unlike those scoring functions, they depend only on the protein backbone conformation. Two previous studies have examined orientationdependent empirical potentials for backbone structures. The study of Onizuka et al. [12] defined local coordinate systems centered on C_{α }atoms and fit residue pair potentials that were either functions of their relative position (3D) or their relative position and orientation (6D). Potentials were fit to data for different sequence separations using a truncated expansion in terms of spherical harmonics. The loworder terms included in the expansion yield a smoothed potential function that presumably eliminates sampling noise. A comparison between 1D (separation distance dependent), 3D, and 6D potentials with different expansion cutoffs showed that the 3D potential performed the best in ranking the native structure from among decoys. A later study by Miyazawa and Jernigan [13] also used similar spherical harmonic expansions in order to derive orientationdependent potentials that were a function of backbone coordinates. Unlike Onizuka et al., these were residue contact potentials and so did not have any explicit dependence on the separation distance. More importantly, the residue contact potentials in Miyazawa and Jernigan are not strictly backboneonly potentials since they implicitly depend on side chain conformations via a smooth contact cutoff that is a function of the distances between the geometrical centers of the two residue side chains. Also, unlike Onizuka et al., Miyazawa and Jernigan concluded that accounting for the relative orientation of residue pairs through Euler angles using a 5D residue contact potential improved fold recognition performance over a 2D potential that accounts only for the relative positions of the contacting residue pairs. Whereas the scoring functions in those studies were tested using folding decoys, the scoring functions considered here are designed and tested for the different modeling problem of fixed backbone protein design. The appropriate decoy structures in this case are incorrect fixed backbone threading solutions with the correct backbone structure but incorrect amino acid sequences. The scoring functions described here are optimized to detect the correct (native) sequence from among many incorrect (decoy) sequences for a common backbone structure.
We have developed position and orientationdependent residue pair scoring functions that depend only on the protein backbone structure, i.e. are completely independent of side chain conformations. As discussed above, these scoring functions are expected to be useful for protein design as well as fold recognition. 6D scoring functions that depend on both the relative position and orientation of the residue pairs were compared with 3D scoring functions, which depend only on the relative position, and with 1D scoring functions, which depend only on the separation distance. Also, the challenging task of estimating the highdimensional residue pair distributions was solved using a different method, Gaussian Mixture Models (GMMs), than the spherical harmonic expansions employed in the two previous studies. One advantage of GMMs is that they are able to directly estimate the 3D and 6D residue pair distributions, including offdiagonal contributions from multiple orientational variables. Another advantage of the GMM formulation used in this study is that the optimal model complexity, or number of parameters, is estimated from the data using the Bayes Information Criterion (BIC) statistic [14].
The relative effectiveness of the 1D, 3D, and 6D scoring functions was evaluated by their ability to identify the correct threading solution from among a set of decoy threading solutions generated either by shuffling the native sequences or by crossthreading for a large nonredundant set of proteins. The new scoring functions were also compared with side chaindependent empirical potentials from other studies using the same decoy set. Also, in order to illustrate the utility of the new backboneonly scoring functions two computational methods that exploit their advantages over traditional side chaindependent potentials were studied. The first method involves averaging the residue pair scores over threading solutions for homologs. Because homologous proteins with significant sequence similarity usually adopt the same backbone structure such a procedure is expected to improve accuracy by combining scores for these distinct but structurally similar proteins. This method was applied to the difficult problem of predicting interactions between alpha helices in membrane proteins using fixed backbone threading and found to further improve accuracy. The second application of the new scoring functions involves using a fast approximate optimization method, called Belief Propagation, to estimate the optimal sequence for a given protein backbone structure. This design method was compared with one using allatom modeling on a fixed protein backbone structure and found to be more robust to inevitable small backbone variations between the template and native protein structures.
Results and Discussion
Overview of the backboneonly residue pair scoring functions
The new backboneonly logodds residue pair scoring functions were derived as the logarithm of the ratio between the backbone orientational distribution of residue pairs in correct native structures and the corresponding distribution for incorrect decoy structures, which corresponds to randomly shuffled amino acid sequences for the same native backbone structures. The latter background distribution is the same as the distribution of all residue pairs in protein structures, regardless of their type. Such a background decoy distribution is appropriate for the desired modeling application of fixed backbone protein design and threading but is nonoptimal for other applications, such as ab initio protein folding. Also only residue pairs that are at least 6 residues apart in the protein sequence and with C_{β }separations ≤ 10 Å were included in the scores. This eliminated residue pairs that were too close in the linear sequence so that their relative position and orientation are largely determined by the local secondary structure as well as residue pairs that were too far apart and so have little orientational preference because they only interact weakly. The 1D, 3D, and 6D scoring functions were derived based on the distributions of separation distances, relative positions, and both relative positions and orientations of particular residue type pairs, respectively. While the 1D scoring functions were fit using kernel density estimation (KDE), the difficult highdimensional density estimation for the 3D and 6D scoring functions could not be accomplished by KDE so that another technique, Gaussian Mixture Models (GMMs), was employed.
Prediction accuracy for individual residue type pairs
The performance of the scoring function was first evaluated for each of the 210 residue type pairs individually in order to find out which individual residue pairs make the largest contributions to the overall protein threading accuracy. The performance of each residue type score was assessed by the area under the Receiver Operating Characteristic (ROC) curve (AUC) for 10fold crossvalidation results. The ROC curve displays the tradeoff between sensitivity and specificity as the score cutoff is varied. Higher AUC values, near 1.0, indicate better prediction performance.
The crossvalidation sets were created by randomly dividing the residue pair data for the nonredundant set of proteins into 10 approximately equalsized sets such that all data for a particular protein is contained within only a single set. The predictions were then made on each set using logodds scoring functions fit to data in the other 9 sets. This procedure then gives an estimate of the performance on novel data since the predictions are made for residue pair data from proteins unrelated to those used to fit the scoring functions.
The residue pairs whose 6D scores have the highest accuracy are CysCys and all paired combinations of Iso, Leu, and Val. The high accuracy of the CysCys score is due to the fact that a large percentage of such pairs (44%) form conformationally constraining disulfide bonds thus making their distinctive relative orientation easier to predict. Iso, Leu, and Val are some of the most commonly occurring residues, particularly in the protein core where they can form more contacts than residues on the surface. In fact, there is a significant correlation (Spearman's rank correlation coefficient ρ = 0.60, significance < 2.2 × 10^{16}) between the prediction accuracy, as measured by the AUC, and the number of training examples for each residue type pair. This means that generally the accuracy is better for common residue pairs than for rare ones, presumably because a large quantity of independent training examples are necessary in order to fit the highdimensional probability densities. The median AUC over all 210 residue pairs was 0.60.
For comparison, the same analysis was also performed for the distancedependent 1D scores. The results showed that some of the same residue pairs had the highest accuracy as for the 6D score, namely CysCys as well as all combinations of Iso, Leu, and Val. However, in addition the samecharge residue pairs GluGlu and AspGlu were among the most accurately predicted residue pairs. This is likely due to the unfavorable electrostatic energy for close separations, which is reflected in negative scores at close separations for these residue pairs. Figure 1 shows a plot of the 1D residue pair scores for CysCys and GluGlu and illustrates these trends as a function of interresidue separation. The median AUC for the 1D scores was 0.53, which is lower than that for the 6D scores. As will be seen in the next section, the higher accuracy of the individual 6D residue scores translates into higher accuracy for total protein threading scores.
Figure 1. 1D logodds scores as a function of C_{β }separation for AlaAla, CysCys, and GluGlu residue pairs. The CysCys function has a peak near the typical C_{β }separation for disulfide bonds, in the range of 3.54.0 Å and is negative for large separations. On the contrary, the score for the samecharge GluGlu pairs is negative for small separations and positive for large separations, reflecting the electrostatic energy penalty for close proximity. Both the CysCys and GluGlu scores are among the most accurate because of these physical constraints on their separations. The AlaAla score, shown for comparison, manifests an oscillatory behavior with a peak near that of the CysCys score.
It is somewhat surprising that the accuracy of the individual residue pair scores was significantly correlated with the amount of training data even for the 1D scores (ρ = 0.49, significance = 5 × 10^{14}). This implies that there is still room for improvement in both the 1D and 6D scores and that their accuracy will likely increase as new experimental structures of proteins become available and provide additional independent training data.
Threading prediction accuracy on a benchmark set
The accuracies of the residue pair scores were assessed in two ways: (1) by their ability to identify the native sequence of a protein from among a set of 5000 distinct decoy sequences and (2) crossthreading of all sequences in the benchmark set against a subset of the structures. In order to reliably estimate the prediction accuracy for novel data and to account for potential overfitting, 10fold crossvalidation was used for both test methods, thus insuring that the scoring functions were fit to a different set of protein structures than the set of proteins used to assess their accuracy. In the first test method, 4999 decoy sequences were generated for each protein in the benchmark set by shuffling the native sequence. Crossvalidation predictions were then made for all 5000 sequences (native + decoy) and the rank of the native sequence determined. The results for this test are shown in Table 1. In the second test method, allagainstall gapless crossthreading was performed for the 333 structures in the first cross validation set using residue pair scoring functions fit to the remaining 2995 structures in the benchmark set. Only sequences with at least as many residues as the template structure were threaded and all possible gapless alignments were considered. This resulted in a total of over 52 million threading solutions for all protein structures. The number of possible threading solutions varied dramatically between structures and ranged from 43, for the largest template structure, to 1472929 for the smallest template structure. Because of this large variation, the fractional rank of the native sequence, or ratio of the rank of the native sequence to the number of threading solutions for the particular template structure, was reported, rather than the absolute rank of the correct (native) sequence. These results are shown in Table 2.
Table 1. A comparison of the accuracy of the new scoring functions with other residue pair scoring functions from the literature in detecting correct threading solutions
Table 2. Prediction results for allagainstall gapless crossthreading of 333 proteins
Score accuracy improves with increasing dimension
Both the results for shuffled sequences, in Table 1, and the results for crossthreading, in Table 2, show that the prediction accuracy steadily improves as the dimension of the scoring function increases from 1D to 3D to 6D. Although the largest incremental improvement occurred in going from 1D to 3D potentials, the highest dimensional 6D potentials still had significantly higher accuracy than the 3D potentials in both tests. This conclusion contrasts with that of Onizuka et al. [12], who found that their 3D potentials achieved higher accuracy than their 6D potentials. However, this result agrees with the general conclusion of Miyazawa and Jernigan [13], although as mentioned above, the contact potentials in that study differ from the backboneonly scoring functions considered here.
Comparison with other empirical potentials
The accuracy of the backboneonly scoring functions was also compared with that of one other backboneonly potential, that of Rajgaria, McAllister, and Floudas 2006 [15] (RMF2006), as well as with several conventional side chaindependent empirical potentials. This was accomplished by calculating the total energies for the same threading benchmark set used to evaluate the backboneonly scoring functions. The results, as evaluated by the ranks of the native sequences, are given in Table 1. Side chaindependent empirical potentials from the following studies were evaluated: Miyazawa and Jernigan 1999 [16] (MJ1999), Tobi and Elber 2000 (TE2000), and Rajgaria, McAllister, and Floudas 2008 [17] (RMF2008). These particular potentials were chosen because they all have available published parameters. The MJ1999 potential was the only residue contact potential while the other side chaindependent potentials were a function of the side chain centroid separation distances. Side chain conformations were generated for the template backbone structures using the ROSETTA 3 program [18].
The results in Table 1 show that the MJ1999 contact potential was the most accurate among the five other empirical potentials. They also indicate that this potential achieves better performance than our 1D scoring function but worse performance than our 3D and 6D scoring functions. The other four empirical potentials all had lower accuracy on the benchmark set than even the 1D scoring function. This may be due to the fact that these four potentials were originally designed for different protein modeling tasks than fixed backbone design. The TE2000 potential was optimized to discriminate correct structures from decoys generated from a combination of gapless threading and allatom structure prediction while the other three potentials (RMF2006, RMF2008/6bin, and RMF2008/7bin) were optimized to discriminate correct structures from among decoys generated from allatom structure prediction. The background distribution for decoys generated by allatom structure prediction is expected to be different from that for decoys generated by fixedbackbone threading, which is most relevant for this test. However, these results demonstrate that the 3D and 6D backboneonly scoring functions achieve accuracy that is competitive with side chaindependent empirical potentials without requiring timeconsuming prediction of side chain conformations prior to score evaluation.
Examination of the difficult threading cases
The protein structures for which the most sensitive scoring function still gave poor threading results were examined in detail in order to determine common factors that may contribute to the difficulty of recognizing the correct solutions using the 6D score. For this purpose, all structures for which the 6D score assigned the native structure a rank greater than 1000 (20% percentile rank) were examined. A total of 22 structures satisfied this criterion. Five properties were overrepresented in these structures as compared with the other structures in the benchmark set. These distinctive characteristics of the structures were (1) solved by experimental methods other than Xray diffraction (mostly solution NMR), (2) small protein domains, (3) large disordered regions, (4) subunit in a complex, and (5) few residueresidue contacts. These properties are not independent since NMR can only be used to solve relatively small proteins and also can be used to obtain structures of proteins with disordered regions, which by definition have few residueresidue contacts. A total of 18 of the structures were solved by NMR and one was a structure fit to electron microscopy data. This proportion of nonXray structures was significantly higher (p = 5.3 × 10^{8}, Fisher exact test) than in the set of other benchmark set structures, which included 973/3306 nonXray structures. A total of 15 protein structures contained significant disordered regions, which also had multiple conformations in the NMR structure ensembles. In addition, five of the proteins were subunits in a structure of a protein complex and so may be unstable as monomers. Finally, because many structures had little regular secondary structure (αhelix or βstrand), they also had relatively few residueresidue contacts compared to more structured, compact proteins. These properties are interdependent so that it is unclear whether the poor threading performance is because NMR structures may contain significant errors [19], or because they have few nonlocal residueresidue contacts that contribute to the total score, or because the disordered regions cannot be reliably recognized by the score.
Combining the residuepair scores of homologs improves threading accuracy
Proteins with significant sequence similarity usually have similar backbone structures [2023]. This suggests an approach to improving threading accuracy by combining a potential threading solution with the corresponding solutions for homologous proteins. This can be accomplished by calculating the average over the score for the original threading solution and the scores for the homolog threading solutions, each obtained by replacing the native sequence segment with the aligned segment of a homolog sequence. This method is well suited to the backboneonly scoring functions because the scores can be quickly evaluated for the multiple homolog threading solutions. It should be noted that this procedure is applicable only to protein threading applications and not to protein design because it requires protein sequences that are evolutionarily related to the query sequence.
This procedure was tested using allagainstall crossthreading of a nonredundant set of contacting pairs of transmembrane αhelices in membrane proteins. This is a particularly challenging task for threading because of the limited number of interhelical residue contacts, compared with the number of residue contacts in a typical complete protein structure. Crossthreading was performed for a nonredundant set of 78 contacting TM helices, which significantly differ both in sequence and structure. The effects of three different factors on the threading accuracy were investigated by calculating the percentile rank of the correct solution among all crossthreading solutions for each TM helix pair. First, the scores calculated by averaging the threading scores over homologs were compared with the usual scores calculated for just the native sequences. Second, the scores were calculated both with and without a membrane depthdependent scoring function that reflects the propensity of each residue type to occur at a particular depth in the membrane. Third, the effect of the size of the TM helix pair interface was studied by calculating prediction statistics for only the 27 TM helix pairs that have at least 15 interhelix residue contacts (defined by atomic separation < 4 Å). The results for all 78 TM helix pairs are shown in Table 3 and the results for the large interface helix pairs are in Table 4.
Table 3. Comparison of crossthreading results for 78 contacting transmembrane helix pairs using the 6D potential and optionally including homolog sequences or a membrane depthdependent residue potential
Table 4. Results for the same quantities shown in Table 3, except only for the 27 transmembrane helix pair structures that have at least 15 interhelix residue contacts
The results in Tables 3 and 4 show that each of the three factors have clear effects on the threading accuracy. Most importantly, combining threading scores for homologs significantly improves the threading accuracy, as can be seen by the lower median percentile rank of the correct solution when homologs are included. As expected, including the depthdependent scoring function also improves the accuracy because it provides information on the suitability of the threading solution that is independent of the residue pair score. Finally, the accuracy is considerably higher for the subset of helix pairs with more residue contacts. This can be explained by the fact that additional residue contacts provide more constraints on potential threading solutions and thus lead to better discrimination of correct solutions. A more extreme example of this trend can be seen by comparing the crossthreading results for the TM helix pairs with those for complete proteins shown in Table 2. The much lower percentile rank of the correct threading solutions, indicating higher accuracy, for complete proteins is likely due in large part to the much greater number of residue contacts in the larger complete proteins.
Calculating the optimal sequence using Belief Propagation
Next, we demonstrate how the new backbone scoring functions can be applied to solving problems in protein design without computationally costly conformational sampling. One question that arises in the context of protein design and the study of protein folding is what amino acid sequences yield the most stable proteins with a given backbone structure. Based on the Boltzmann hypothesis of empirical threading potentials, which implies that the most commonly observed interactions in protein structures also have the lowest energy, this problem can be translated into finding amino acid sequences that maximize the total score for the backbone structure of interest. In this study we consider the contributions of single residues and residue pair interactions but not any higher order (≥ 3) residue interactions. The total score, S_{total}, is then
in which T_{i }is the type of residue i, P_{i }are its properties (here the solvent accessibility and secondary structure type), and Ω_{ij }are the relative position and orientation between residues i and j. The same amino acid sequence that maximizes S_{total }for a fixed backbone also maximizes its exponential
Because the objective function, , is to be optimized with respect to the amino acid sequence for a fixed backbone, its implicit dependence on the backbone geometry, through the P_{i }and Ω_{ij }parameters, is not shown.
The function in Eq. (2) has the same functional form as a pairwise Markov Random Field (MRF) model, up to a multiplicative constant 1/Z, where Z is the partition function. The partition function is independent of the amino acid sequence. Since only the sequence that maximizes Eq. (2) is of interest, and not its maximum value, this constant may be ignored.
The main advantage of interpreting the objective function in Eq. (2) as a MRF is that an efficient method, called Belief Propagation (BP) [24,25], can be used to solve it. There are two formulations of BP, the standard algorithm to calculate marginal distributions for a MRF and the maxproduct algorithm to calculate the maximum a posteriori (MAP) solution. Because we would like to find the sequence that maximizes Eq. (2), the latter maxproduct algorithm will be used. Eq. (2) can also be solved as an Integer Linear Programming problem [26], however BP offers a simple and fast alternative that generally yields accurate approximate solutions [27]. The MAP problem was solved by first converting the pairwise MRF corresponding to Eq. (2) into an equivalent factor graph and then using the maxproduct algorithm implemented in the libDAI C++ library [28] to solve it.
Comparison of optimal amino acid sequences with native sequences
Most protein design methods, including the one described here, assume a fixed backbone structure because accounting for backbone flexibility would dramatically increase the number of degrees of freedom to be sampled. However, Xray crystallographic studies have shown that proteins accommodate mutations in core residues by adjustments in the backbone structure [29]. This raises the question of how much bias the fixed backbone approximation introduces into protein design methods and whether this bias can be partially corrected by the backbone flexibility implicit in the smooth 6D residue pair scoring functions introduced here. This was studied by comparing the sequence identity between optimal sequences for backbone structures from two homologous proteins within the same family. A test set of protein structure pairs was compiled for this purpose by randomly selecting pairs of protein structures from the same HOMSTRAD [30] family. Each pair was selected from a different family and the final set contained a total of 407 protein pairs. HOMSTRAD families contain proteins with significant sequence or structure similarity so that each pair of proteins are presumably homologous.
We also studied the related question of how similar the optimal sequence is to the native sequence and how this similarity is affected by the protein design method. A previous study by Kuhlman and Baker [31] investigated this question using a fixed backbone design protocol that involved Monte Carlo sampling of both side chain conformations and residue types using allatom energy based scoring with ROSETTA. The conclusion was that the sequence similarity between the optimal and native sequences is high, with identical residues for 51% of the buried core residues and 27% of all residues in the set of 108 backbone structures considered. However, the fixed backbone structure may lead to higher apparent sequence similarity because it is unable to accommodate as many mutations, primarily because of steric clashes in the protein core. In addition, many proteins share significant structure similarity without having any recognizable sequence similarity [32,33]. In these cases, the similarity between the optimal and native sequences must be low.
Both of these questions were examined by comparing the results using the protein design method described in the previous section, involving BP optimization using the 6D residue pair scores, with the results obtained using the ROSETTA program [18]. The ROSETTA optimal sequences were obtained using the default fixed backbone design algorithm implemented in Rosetta version 3.0. The two questions were investigated by comparing (1) the similarities between the optimal sequences for each pair of proteins in the same HOMSTRAD family (interstructure similarity) and (2) the similarities between the optimal and native sequences for a given backbone structure (intrastructure similarity), respectively. Sequence similarities were calculated both for all residues and for only core residues, which were defined as in Ref. [31] as those residues with > 20 C_{β }atoms within 10 Å. Separate sequence similarities were also calculated separately for highresolution (< 2.5 Å) and lowresolution (≥ 2.5 Å) protein structures. The results are shown in Table 5. First, both the intrastructure and interstructure similarities were higher for the core residues. This trend agrees with that observed in Ref. [31] and can be explained by the fewer allowed residue substitutions in the protein core due to stricter constraints resulting from van der Waals, hydrophobic, and to a lesser degree hydrogen bonding interactions [34]. Another clear trend that is apparent from these results is that the intrastructure similarity is higher for ROSETTA than the BP method. However, in contrast, the median interstructure similarity for BP derived optimal sequences are almost identical to that obtained by ROSETTA. If these latter results are further broken down by the resolution of the protein structures then it is seen that ROSETTA designed sequences have higher interstructure similarities for lowresolution structures whereas the BP method has higher interstructure similarities for highresolution structures. One interpretation of these results is that the scoring functions have a smoother dependence on the backbone structure and thus yield more consistent optimal sequences for the similar backbones of homologous proteins. This interpretation is also supported by the lower dependence of the BP method on the resolution for both the intrastructure and interstructure similarities. Furthermore, the lower interstructure similarity for the BP method may be partly due to the fact that a fold is actually more permissive of different sequences once backbone flexibility is accounted for, rather than simply the expected lower accuracy of the residue pair scoring functions compared with the allatom modeling of ROSETTA. In conclusion, even though both design methods rely on a fixed backbone approximation, the protein design method using the residue pair scoring functions yields optimal sequences that are more robust to the small variations in backbone structure that occur within protein families.
Table 5. Similarities between the optimal and native sequences (intrastructure) and similarities between the optimal sequences for a pair of proteins in the same HOMSTRAD family (interstructure)
Conclusions
The performance of 1D, 3D, and 6D residue pair scoring functions that depend only on the protein backbone structure were compared by their ability to identify correct threading solutions in a large benchmark set of proteins. The accuracy was found to improve as the dimension increased, with 6D scoring functions the most accurate. The orientationdependent scoring functions were also found to achieve higher accuracy than several side chaindependent empirical potentials on this benchmark set. This is remarkable since the backboneonly scoring functions do not require prediction of the side chain conformations, which is computationally demanding, particularly for many protein design applications. Interestingly, a previous study [35] also found that a backboneonly statistical potential outperformed a side chaindependent potential, although the potential was derived and evaluated using decoy sets appropriate for folding rather than for fixed backbone design. We also found that averaging the threading scores for the query sequence and aligned segments of homologous sequences further improved the accuracy for protein threading applications. Incidentally, because those results were obtained for membrane proteins, they also demonstrated that the scoring functions can be successfully applied to the TM regions of membrane proteins even though they were derived from a data set containing predominantly nonmembrane proteins. The applicability of the scoring functions to both membrane and nonmembrane proteins is likely due to the fact that most of the residue pairs involve buried residues, which are not strongly affected by the distinct physiochemical environments of these two classes of proteins. However, direct comparison between the performance of the scoring functions on membrane and nonmembrane proteins is needed to conclusively confirm this hypothesis. In addition, comparisons of predicted optimal amino acid sequences for pairs of similar backbone structures within the same protein families revealed that the 6D scoring functions were more robust to the small backbone rearrangements observed between homologous proteins than fixedbackbone allatom modeling. Thus the 6D scoring functions should perform well in actual protein design applications in which the backbone structures of the designed and template proteins are expected to differ slightly.
Several possible extensions of this work are possible. First, as expected by the geometrical dependence of their physical interactions, accounting for the relative position and orientation of residue side chains should improve the accuracy of empirical scoring functions that depend on side chain conformations. Several previous studies have investigated such scoring functions, but with fewer relative degrees of freedom [911]. The same method used here for fitting 3D and 6D backboneonly scoring functions with GMMs could also be applied to such side chain conformationdependent scoring functions. One difficulty is deciding the best choice of reference coordinate systems for larger flexible side chains. Second, the same methodology described here for deriving scoring functions for protein threading could be directly applied to derive similar residue pair scoring functions for other modeling tasks, such as ab initio protein structure prediction or proteinprotein docking, by using the appropriate set of decoy structures. Finally, as mentioned earlier, the residue pair scoring functions can be employed in a fixedbackbone protein design procedure to generate initial protein sequences that are then subjected to allatom optimization and scoring in order to select the optimal solutions.
Methods
Protein structures for training and evaluation
A set of Protein Data Bank (PDB) protein structures for which no two proteins share more than 25% sequence identity was obtained from the PDBSELECT database [36]. Protein structures with less than 50 residue contacts (defined below) were observed to be predominantly isolated alpha helices and so were removed from the set resulting in a total of 3328 structures. These structures of diverse proteins were then divided into ten approximately equal size sets for crossvalidation.
Residue pair relative position and orientation variables
Local coordinates at each residue are defined by axis unit vectors, , , and , which are centered at the respective C_{β }atoms, as:
so that the zaxis is along the C_{β}C_{α }bond. A virtual C_{β }atom was added to all glycine residues by assuming ideal tetrahedral bond central angles of about C_{α }and a typical C_{β}C_{α }bond length of 1.53 Å. Six variables, {r, θ, ϕ, α, β, γ}, completely specify the relative position and orientation of the two residue backbones. The residue pair distance, r, is defined by the C_{β }separation. The angles θ and ϕ are the polar angles of the residue #2 C_{β }atom position in the residue #1 coordinate system. Finally, the relative orientation of the two coordinate systems fixed at each residue is described by three Euler angles, α, β, and γ. These Euler angles were calculated from the rotation matrix M between coordinate system #1 and coordinate system #2 with elements M_{ij }= (u_{2})_{i}·(u_{1})_{j}, with , as
in which the step function is defined by Θ (×) = 1 for ×>0 and 0 otherwise. Defining the vector connecting the origin of coordinate system #1 to the origin of coordinate system #2 as v_{12 }then the remaining variables describing the relative position of residue #2 with respect to residue #1 are calculated as
Empirical scoring function as a Naïve Bayes classifier
The goal of the scoring function is to discriminate between two classes of protein structures: nearnative (correct) structures and nonnative (incorrect) structures. The input data for the query protein structure consists of the set of residue pair types and their relative orientation, D ≡ {(T_{i}, T_{j}, Ω_{ij}), i = 1, ..., N_{residues}, j = i + 1, ..., N_{residues}}. The data for each residue pair is (T_{i}, T_{j}, Ω_{ij}) with T_{i }the amino acid type of residue number i and (i, j) are the residue numbers of neighboring pairs within a fixed cutoff separation of r_{max }= 10 Å. The relative position and orientation of residue pair (i, j) are specified by either all 6 variables described above or a subset thereof:
Using the Bayes theorem, the identity
expresses the ratio of class posterior probabilities as the product of the class conditional probabilities of the data and the ratio of prior class probabilities. According to the maximum a posteriori (MAP) criterion, the protein structure is then classified as correct if the posterior ratio is greater than 1 or as incorrect if it is less than 1.
In a Naïve Bayes classifier, the class conditional probability of the data is approximated as a product of the probabilities for individual variables, e.g. the variables are considered independent, so that
in which S_{pair }(T_{i}, T_{j}, Ω_{ij}) is defined as the logodds residue pair score for residues i and j, which have types T_{i }and T_{j }and relative coordinates Ω_{ij}. The prior probability ratio is defined as
Classification by MAP then implies that the structure is classified as native if S_{pair }>log(R_{prior}) and nonnative if S_{pair }<log(R_{prior}).
Logodds scores are often converted into physical energy values using the Boltzmann distribution [37]. In spite of problems with their physical interpretation [38], such energy values may be useful when they are to be combined with physical energies from, for example, experimental data or force field calculations. However, one advantage of retaining logodds scores is that the score cutoff has a useful interpretation in terms of Bayesian prior probabilities, as shown above. In any case, because the Boltzmann factor, exp(E/kT), is monotonic, these two interpretations are operationally equivalent in terms of discriminating correct from incorrect solutions using a score cutoff.
The remaining task is to estimate the logodds scores using training data. S_{pair }(T_{i}, T_{j}, Ω_{ij}) using training data. It should be noted that because the volume element depends on the relative angular coordinates for the 3D and 6D cases, a Jacobian factor is required. However, because the conditional probabilities in the numerator and denominator of Eq. (8) are evaluated at the same relative coordinates these factors cancel. The above relations are quite general and the choice of nonnative data, and hence background probability distribution, will depend on the discrimination task. In this study, we are interested in identifying native, or stable, protein structures from among other decoy structures with the same backbone geometry but different residues. The resulting scores are then appropriate for both fold prediction by threading and protein design. The logodds scores were calculated by estimating the class conditional probability distributions, P (T_{i}, T_{j}, Ω_{ij} native) and P (T_{i}, T_{j}, Ω_{ij} nonnative), using two different density estimation methods applied to the native and nonnative training data, respectively. The native distribution for a particular residue type pair was fit using the relative position and orientation of those residue types in native protein structures while the nonnative distribution was fit to the relative position and orientation for all residue pairs in native backbone structures, regardless of their types. Thus the nonnative distributions are actually independent of residue types and so equal to a common distribution, or P (T_{i}, T_{j}, Ω_{ij} nonnative) = P_{nn }(Ω_{ij}). Kernel Density Estimation was used to fit the 1D scores and Gaussian Mixture Models were used to fit the 3D and 6D scores.
Deriving 1D distributions using Kernel Density Estimation
Histograms of the residue pair separation distances have been commonly used to derive distancedependent residue pair scores. Kernel Density Estimation (KDE) is a technique that has the advantages over histograms of yielding a smooth density estimate as well as a faster asymptotic rate of convergence of O(n^{4/5}) as compared with O(n^{2/3}) for histograms as the number of data points, n, is increased [39].
KDE fits the probability density function (pdf) P(r) as a sum of kernel functions centered about each data point
In particular, Gaussian functions of uniform bandwidth, , were used in this study in order to estimate the distribution of residue separation distances. KDE was calculated in R [40] with the default bandwidth parameter λ calculated by the method of Ref. [41]. The resulting density was then approximated using linear interpolation of values calculated on a regular grid for computational efficiency.
Deriving 3D and 6D distributions using Gaussian Mixture Models for density estimation
The "curse of dimensionality" [42] makes estimating the probability densities in the full 3 and 6dimensional relative coordinate spaces challenging. Histograms are infeasible because the number of bins increases exponentially with the number of dimensions, assuming equal length in each bin dimension, and so the quantity of training data is inadequate for 3 or 6dimensional fitting. Kernel density estimation is also challenging for two reasons. First, the computational resources, both CPU time and memory, are prohibitive for a straightforward implementation. Second, the standard methods for estimating the Gaussian kernel bandwidth matrix, which contains D(D + 1)/2 parameters in the general Ddimensional case, are not feasible in higher dimensions. Instead we have chosen to use Gaussian mixture models (GMMs) because they can accurately and efficiently fit the highdimensional densities required for the residue pair scores.
A Gaussian mixture model with N_{comp }components estimates the pdf as
in which
is a multidimensional Gaussian function and the mixture proportions satisfy . The model parameters are usually estimated by maximum likelihood using the ExpectationMaximization (EM) algorithm [43,44].
A general problem with statistical models is choosing the optimal complexity of the model, i.e. number of parameters in the model. The accuracy of the model, as reflected in the GMM loglikelihood, steadily improves as the number of parameters is increased, even though the accuracy on novel data, not used for fitting the model, is optimal at a finite number of parameters. In order to fit the 3 and 6dimensional densities, Gaussian kernels with fullrank covariance matrices Σ_{k }were used and the optimal number of components was chosen by minimizing the Bayes Information Criterion (BIC) [14]
As the number of model parameters is increased, the first term in the BIC decreases while the second term increases so that a minimum is achieved at an intermediate value. The R package MCLUST [45,46] was used to search for the best GMMs with up to 100 components. In all except a few cases the optimal number of components was less than the maximum.
Single residue scoring function
In addition to the pairwise residue scoring function described above, a single residue scoring function was included in the BP prediction of an optimal amino acid sequence (see Eq. (2)). The single residue score depends on two residue properties, solvent accessibility and secondary structure, and, like the pairwise case, is a logodds score. Each residue was assigned to two solvent accessibility classes, buried or surface, depending on whether its relative solvent accessible surface area (SASA) is < 0.01 or ≥ 0.01, respectively. The relative SASA was calculated as ratio of the SASA to the maximum SASA as defined in [47]. The secondary structure was classified as an αhelix (H), βstrand (E), or other (C). The score was then calculated as
in which the solvent accessibility class of residue i is Acc_{i }∊{ buried, surface} and the secondary structure of residue i is SS_{i }∊{ H, E, C}. The probabilities P(Acc_{i}, SS_{i } native) and P(Acc_{i}, SS_{i } nonnative) were estimated from the benchmark set of protein structures with the native sequences and shuffled sequences, respectively. The single residue properties are assumed to be independent of the pairwise geometrical properties so that, when both scores are used, the total score is their sum.
Nonredundant set of contacting transmembrane helix structures
A set of highresolution structures of interacting transmembrane (TM) helix pairs were extracted from PDB membrane protein structures in order to test whether including homologs improves threading accuracy, as described in the Results section. TM segment boundaries were determined by the PDBTM database [48,49]. The included TM helix pairs were nonredundant both with respect to their amino acid sequences, by selecting sequences with < 30% sequence identity, and structure, by clustering them by structural similarity and choosing a single representative structure from each cluster. Clustering was performed by calculating all pairwise similarities between different helix pairs, as measured by the C_{α }RMSDs of their aligned segments, and then clustering them based on these RMSD values. The aligned segments were required to be at least 10 residues long and contain all contacting residues. All possible segments in corresponding helices that satisfied these criteria were superimposed using the Kabsch algorithm [50]. Also both possible correspondences of helices in each helix pair were tried, i.e. (A→ A', B→ B') and (A→ B', B→ A') for helix pairs (A, B) and (A', B'). Once the similarity matrix was calculated, the helix pairs were then clustered using the robust Partioning Around Medoids (PAM) algorithm [51]. In addition to the cluster assignments, the PAM algorithm outputs a medoid for each cluster. A medoid is the member of each cluster that has the minimum average dissimilarity to all other cluster members. Thus, in a sense, it can be considered the central object in each cluster. Because of this desirable property, the medoid helix pairs in each cluster were chosen as the representative structures that comprised the set. The number of clusters, 78, was chosen so that the maximum RMSD between a medoid and any other helix pair in the same cluster was 2.5 Å.
Homologous sequences for TM helices
Aligned sequences of homologs were then collected for each TM helix in the nonredundant set of 78 helix pairs. First BLAST [52] was used to search the NCBI nr protein sequence database using fulllength protein sequences containing the helices (Evalue cutoff = 10^{2}). Highly similar sequences were then removed using CDHIT [53] with a 90% sequence identity cutoff and the remaining sequences aligned with MUSCLE [54]. Only gapless segments aligned to the TM helices were included in the final set of homologous sequences used for threading.
Membrane depthdependent scoring function
A previous study [55] derived a logodds depthdependent empirical potential for inserting different amino acid side chains into the membrane. The potential had a good correlation with experimental transfer free energy values and was able to reproduce the correct tilt angle of TM αhelices with respect to the membrane. We derived a similar logodds scoring function from the propensities of different residue types in αhelical integral membrane proteins to occur at different depths in the membrane. The scoring function was added to the 6D residue pair scoring function for the prediction of TM αhelix associations described in the Results section.
First all αhelical TM segments were downloaded from the PDBTM database and a nonredundant set extracted using the CDHIT program [53] with a 30% sequence identity cutoff. Residue depths were defined by the absolute distance, z, from the PDBTM predicted central membrane plane. The logodds depthdependent score for each amino acid type i, S_{i}(z), was then calculated as
in which P(zaa_{i}) is the pdf of amino acid type i appearing at membrane depth z and P(z) is the pdf of any residue appearing at depth z. Both pdfs are defined to be symmetric, i.e. P(zaa_{i}) = P(zaa_{i}) and P(z) = P(z). Unlike the previous study, which fit fixed functional forms to histograms of amino propensities as a function of depth, KDE was used to fit the pdfs, just as for the 1D (distancedependent) residue pair scoring functions.
Authors' contributions
AJB designed and performed the study, analyzed the results, and wrote the manuscript.
Acknowledgements
This work was funded by the Mayo Clinic.
References

Miyazawa S, Jernigan RL: Estimation of effective interresidue contact energies from protein crystal structures: Quasichemical approximation.
Macromolecules 1985, 18(3):534552. Publisher Full Text

Miyazawa S, Jernigan RL: Residueresidue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading.
J Mol Biol 1996, 256(3):623644. PubMed Abstract  Publisher Full Text

Mirny LA, Shakhnovich EI: How to derive a protein folding potential? A new approach to an old problem.
J Mol Biol 1996, 264(5):11641179. PubMed Abstract  Publisher Full Text

Kocher JP, Rooman MJ, Wodak SJ: Factors influencing the ability of knowledgebased potentials to identify native sequencestructure matches.
J Mol Biol 1994, 235(5):15981613. PubMed Abstract  Publisher Full Text

Tobi D, Elber R: Distancedependent, pair potential for protein folding: Results from linear optimization.
Proteins 2000, 41(1):4046. PubMed Abstract  Publisher Full Text

Samudrala R, Moult J: An allatom distancedependent conditional probability discriminatory function for protein structure prediction.
J Mol Biol 1998, 275(5):895916. PubMed Abstract  Publisher Full Text

Kopp J, Bordoli L, Battey JN, Kiefer F, Schwede T: Assessment of CASP7 predictions for templatebased modeling targets.
Proteins 2007, 69(Suppl 8):3856. PubMed Abstract  Publisher Full Text

Bahar I, Jernigan RL: Coordination geometry of nonbonded residues in globular proteins.
Fold Des 1996, 1(5):357370. PubMed Abstract  Publisher Full Text

Buchete NV, Straub JE, Thirumalai D: Orientational potentials extracted from protein structures improve native fold recognition.
Protein Sci 2004, 13(4):862874. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Buchete NV, Straub JE, Thirumalai D: Continuous anisotropic representation of coarsegrained potentials for proteins by spherical harmonics synthesis.
J Mol Graph Model 2004, 22(5):441450. PubMed Abstract  Publisher Full Text

Lu M, Dousis AD, Ma J: OPUSPSP: an orientationdependent statistical allatom potential derived from sidechain packing.
J Mol Biol 2008, 376(1):288301. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Onizuka K, Noguchi T, Akiyama Y, Matsuda H: Using data compression for multidimensional distribution analysis.
IEEE Intelligent Systems 2002, 17(3):4854. Publisher Full Text

Miyazawa S, Jernigan RL: How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins?
J Chem Phys 2005, 122(2):024901. PubMed Abstract  Publisher Full Text

Schwartz G: Estimating the dimension of a model.
Annals of Statistics 1978, 6(2):461464. Publisher Full Text

Rajgaria R, McAllister SR, Floudas CA: A novel high resolution CalphaCalpha distance dependent force field based on a high quality decoy set.
Proteins 2006, 65(3):726741. PubMed Abstract  Publisher Full Text

Miyazawa S, Jernigan RL: Selfconsistent estimation of interresidue protein contact energies based on an equilibrium mixture approximation of residues.
Proteins 1999, 34(1):4968. PubMed Abstract  Publisher Full Text

Rajgaria R, McAllister SR, Floudas CA: Distance dependent centroid to centroid force fields using high resolution decoys.
Proteins 2008, 70(3):950970. PubMed Abstract  Publisher Full Text

Rohl CA, Strauss CE, Chivian D, Baker D: Modeling structurally variable regions in homologous proteins with rosetta.
Proteins 2004, 55(3):656677. PubMed Abstract  Publisher Full Text

Nabuurs SB, Spronk CA, Vuister GW, Vriend G: Traditional biomolecular structure determination by NMR spectroscopy allows for major errors.
PLoS Comput Biol 2006, 2(2):e9. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins.
EMBO J 1986, 5(4):823826. PubMed Abstract  PubMed Central Full Text

Sander C, Schneider R: Database of homologyderived protein structures and the structural meaning of sequence alignment.
Proteins 1991, 9(1):5668. PubMed Abstract  Publisher Full Text

Rost B: Twilight zone of protein sequence alignments.
Protein Eng 1999, 12(2):8594. PubMed Abstract  Publisher Full Text

Abagyan RA, Batalov S: Do aligned sequences share the same fold?
J Mol Biol 1997, 273(1):355368. PubMed Abstract  Publisher Full Text

Pearl J: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 2nd edition. San Francisco, CA: Kaufmann; 2005.

Yedidia JS, Freeman WT, Weiss Y: Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millenium. Edited by GLa B. San Diego, CA: Morgan Kaufmann; 2003:239269.

Kleinberg J, Tardos E: Aproximation algorithms for classification problems with pariwise relationships: Metric labeling and Markov Random Fields. In 40th Annual Symposium on Foundations of Computer Science: 1999. IEEE; 1999:1423.

Weiss Y, Freeman WT: On the optimality of solutions of the maxproduct beliefpropagation algorithm in arbitrary graphs.
IEEE transactions in information theory 2001, 47(2):736744. Publisher Full Text

LibDAI  A free/open source C++ library for Discrete Approximate Inference methods [http://people.kyb.tuebingen.mpg.de/jorism/libDAI/] webcite

Baldwin EP, Matthews BW: Corepacking constraints, hydrophobicity and protein design.
Curr Opin Biotechnol 1994, 5(4):396402. PubMed Abstract  Publisher Full Text

Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families.
Protein Sci 1998, 7(11):24692471. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuhlman B, Baker D: Native protein sequences are close to optimal for their structures.
Proc Natl Acad Sci USA 2000, 97(19):1038310388. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison.
Curr Opin Struct Biol 1996, 6(3):377385. PubMed Abstract  Publisher Full Text

Rost B: Protein structures sustain evolutionary drift.
Fold Des 1997, 2(3):S1924. PubMed Abstract  Publisher Full Text

Bowie JU, ReidhaarOlson JF, Lim WA, Sauer RT: Deciphering the message in protein sequences: tolerance to amino acid substitutions.
Science 1990, 247(4948):13061310. PubMed Abstract  Publisher Full Text

Fitzgerald JE, Jha AK, Colubri A, Sosnick TR, Freed KF: Reduced C(beta) statistical potentials can outperform allatom potentials in decoy identification.
Protein Sci 2007, 16(10):21232139. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hobohm U, Sander C: Enlarged representative set of protein structures.
Protein Sci 1994, 3(3):522524. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sippl MJ: Calculation of conformational ensembles from potentials of mean force. An approach to the knowledgebased prediction of local structures in globular proteins.
J Mol Biol 1990, 213(4):859883. PubMed Abstract  Publisher Full Text

Thomas PD, Dill KA: Statistical potentials extracted from protein structures: how accurate are they?
J Mol Biol 1996, 257(2):457469. PubMed Abstract  Publisher Full Text

Scott DW: On optimal and databased histograms.
Biometrika 1979, 66(3):605610. Publisher Full Text

R Development Core Team: R: A language and environment for statistical computing.

Silverman BW: Density Estimation for Statistics and Data Analysis. London: Chapman and Hall/CRC; 1986.

Bellman R: Adaptive Control Processes: A Guided Tour. Princeton, NJ: Princeton University Press; 1961.

Dempster AP, Laird NM, Rubin DB: Maximum likelihood for incomplete data via the EM algorithm (with discussion).
Journal of the Royal Statistical Society, Series B 1977, 39(1):138.

McLachlan GJ, Krishnan T: The EM Algorithm and Extensions. New York: Wiley; 1997.

Fraley C, Raftery AE: Modelbased clustering, discriminant analysis and density estimation.

Fraley C, Raftery AE: MCLUST Version 3 for R: Normal mixture modeling and modelbased clustering. University of Washington: University of Washington; 2006.

Rost B, Sander C: Conservation and prediction of solvent accessibility in protein families.
Proteins 1994, 20(3):216226. PubMed Abstract  Publisher Full Text

Tusnady GE, Dosztanyi Z, Simon I: Transmembrane proteins in the Protein Data Bank: identification and classification.
Bioinformatics 2004, 20(17):29642972. PubMed Abstract  Publisher Full Text

Tusnady GE, Dosztanyi Z, Simon I: PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank.
Nucleic Acids Res 2005, (33 Database):D275278. PubMed Abstract  PubMed Central Full Text

Kabsh W: A discussion of the solution for the best rotation to relate two sets of vectors.
Acta Crystal A 1973, 34:827828. Publisher Full Text

Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. New York, NY: John Wiley & Sons; 1990.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25(17):33893402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Li W, Godzik A: Cdhit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Bioinformatics 2006, 22(13):16581659. PubMed Abstract  Publisher Full Text

Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res 2004, 32(5):17921797. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Senes A, Chadi DC, Law PB, Walters RF, Nanda V, Degrado WF: E(z), a depthdependent potential for assessing the energies of insertion of amino acid sidechains into membranes: derivation and applications to determining the orientation of transmembrane and interfacial helices.
J Mol Biol 2007, 366(2):436448. PubMed Abstract  Publisher Full Text