Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA

Department of Biological Sciences, Markey Center for Structural Biology, Purdue Cancer Center, and Bindley Bioscience Center, Purdue University, West Lafayette, IN 47907, USA

Abstract

Background

Fold recognition techniques take advantage of the limited number of overall structural organizations, and have become increasingly effective at identifying the fold of a given target sequence. However, in the absence of sufficient sequence identity, it remains difficult for fold recognition methods to always select the correct model. While a native-like model is often among a pool of highly ranked models, it is not necessarily the highest-ranked one, and the model rankings depend sensitively on the scoring function used.

Results

This paper presents an integrated computational-experimental method to determine the fold of a target protein by probing it with a set of planned disulfide cross-links. We start with predicted structural models obtained by standard fold recognition techniques. In a first stage, we characterize the fold-level differences between the models in terms of topological (contact) patterns of secondary structure elements (SSEs), and select a small set of SSE pairs that differentiate the folds. In a second stage, we determine a set of residue-level cross-links to probe the selected SSE pairs. Each stage employs an information-theoretic planning algorithm to maximize information gain while minimizing experimental complexity, along with a Bayes error plan assessment framework to characterize the probability of making a correct decision once data for the plan are collected. By focusing on overall topological differences and planning cross-linking experiments to probe them, our

Conclusions

Fold determination can overcome scoring limitations in purely computational fold recognition methods, while requiring less experimental effort than traditional protein structure determination approaches.

Introduction

Despite significant efforts in structural genomics, the vast majority (> 90%

**Protein fold determination by disulfide cross-linking.** The example shows two models, but the method readily handles tens or even hundreds of models. (left) Two models, TS125_3 (green) and TS194_2 (magenta), for CASP target T0351, are of reasonable quality but have rather different topologies. (middle) The three-dimensional structures are compiled into graphs on the secondary structure elements (SSEs), representing the topology in terms of contacting SSE pairs. A topological fingerprint is selected based on differences in SSE contacts (e.g., 1-2, 2-4, 3-5, etc.) that together distinguish the models. (right) For each SSE pair in the topological fingerprint, a set of residue pairs is selected for disulfide cross-linking, in order to robustly determine whether or not the SSE pair is actually in contact. The figure shows the selected cross-links (yellow) to test for SSE pair (1, 2). Residues selected for cross-linking are colored red.

Seeking to close the gap between computational structure prediction and experimental structural determination, we

While earlier methods have focused on probing geometry and selecting a model, we target here a more defined characterization of protein structure, ascertaining the overall protein fold. We call this approach

The method presented here strikes a balance between very limited cross-linking (e.g., six disulfide pairs in our earlier work

Methods

We are given a set

Topological fingerprint selection

In order to compare SSE topologies, we need a common set of SSEs across the models. Since secondary structure prediction techniques are fairly stable

Given the SSE identities, we form for each model _{i}_{SSE}_{,}_{i}_{i}_{i}^{β}^{β}

Our goal then is to find a minimum subset

Probabilistic model

First we develop a probabilistic model in order to evaluate the information content in a possible experiment plan. Let us treat each edge as being a binary random variable c representing whether or not the SSE pair is in contact, with Pr(_{i}

where the summed variables range over {0, 1} and the indicator function 1 tests for membership of c in set _{i}

The approach readily extends to be less conservative and to allow different weights for different SSE pairs, e.g., according to cross-link planning (discussed in the next section).

We can likewise compute a joint probability Pr(

where again the sums are over {0, 1} and the indicator function is as described above.

Then we can evaluate the

Experiment planning

The minimum-redundancy maximum-relevance (mRMR) approach seeks to minimize the total mutual information (redundancy) and maximize the total entropy (relevance). In this paper, we define the objective function as the difference of the two terms:

To optimize this objective function, we employ a first-order incremental search _{*} that maximizes:

The search algorithm stops when the score for _{*} drops below a threshold (we use 0.01 for the results shown below).
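The incremental search can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the probabilities are assumed to be plain empirical frequencies over equally weighted models (simpler than the probabilistic model above), and the score is assumed to be the entropy of a candidate SSE pair minus its average mutual information with the pairs already selected.

```python
import math
from itertools import product

def entropy(p):
    """Entropy in bits of a binary variable with Pr(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def mutual_information(joint):
    """Mutual information in bits from a 2x2 joint table joint[a][b]."""
    pa = [sum(joint[a]) for a in (0, 1)]
    pb = [joint[0][b] + joint[1][b] for b in (0, 1)]
    mi = 0.0
    for a, b in product((0, 1), repeat=2):
        if joint[a][b] > 0:
            mi += joint[a][b] * math.log2(joint[a][b] / (pa[a] * pb[b]))
    return mi

def greedy_mrmr(contact_vectors, threshold=0.01):
    """Greedy first-order selection of SSE pairs (hypothetical helper).

    contact_vectors: one dict per model mapping an SSE-pair label to 0 or 1.
    Assumption: models are equally weighted and probabilities are frequencies.
    """
    n = len(contact_vectors)
    remaining = set(contact_vectors[0])
    selected = []

    def marginal(e):
        return sum(v[e] for v in contact_vectors) / n

    def joint(e, f):
        table = [[0.0, 0.0], [0.0, 0.0]]
        for v in contact_vectors:
            table[v[e]][v[f]] += 1.0 / n
        return table

    def score(e):
        relevance = entropy(marginal(e))
        if not selected:
            return relevance
        redundancy = sum(mutual_information(joint(e, s)) for s in selected)
        return relevance - redundancy / len(selected)

    while remaining:
        best = max(sorted(remaining), key=score)
        if score(best) < threshold:  # stop once the best score is too low
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy input: pairs "1-2" and "2-4" carry identical information, "3-5" none.
models = [
    {"1-2": 1, "2-4": 1, "3-5": 0},
    {"1-2": 0, "2-4": 0, "3-5": 0},
    {"1-2": 1, "2-4": 1, "3-5": 0},
    {"1-2": 0, "2-4": 0, "3-5": 0},
]
fingerprint = greedy_mrmr(models)
```

In this toy input the second informative pair is fully redundant with the first, so its score falls below the threshold and the search stops after one selection.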

The original mRMR formulation with first-order incremental search was proved to be equivalent to max-dependency (i.e., to provide the most information about the target classification)

Data interpretation

In the next section, we will describe the planning of disulfide cross-linking experiments to evaluate a given fingerprint. For now, let us assume that the form of experimental data

where we use the subscript to get the ^{th} element of the set. The naive conditional independence assumption here is reasonable, since the elements of _{i}

Plan evaluation

In the experiment planning phase, we do not yet have the experimental data. However, we can evaluate the potential for making a wrong decision using a given plan by computing the

where Pr(

In the case of fold determination, there may not be a single best model—a number of models may in fact have the same fold, and thus be equally consistent with the experimental data. Thus in the data interpretation phase we would not want to declare a single winner, but instead would return a set of the tied-for-optimal models. In the experiment planning phase, we develop a complementary metric to the Bayes error, which we call the

The formula mirrors that for

Finally, the topological fingerprint approach allows us to handle the “none-of-the-above” scenario, when we decide that no model is sufficiently good; i.e., the correct fold is not represented by a predicted model. While in other contexts that would be done by comparing the likelihood to some threshold (is the selected model “good enough”?), here we can explicitly evaluate the chance that the correct fold is not among those considered. Note that since a fingerprint typically has a small number of SSE pairs, we can enumerate the space _{i}

Moving from data interpretation to experiment planning, we can again evaluate a plan for the probability of deciding none of the above. If we think of Bayes error as the false positive rate, then we want something more like a false negative rate. We call this metric

Thus

Cross-link selection

Once a topological fingerprint

Different models may place an SSE at somewhat different residues, so when planning cross-links to probe that SSE’s contacts, it is advantageous to focus on residues common to many models (and thus able to provide information about cross-linkability in those models). We define for each SSE a set of common residues that may be used in a disulfide plan. Our current implementation includes all residues that appear in at least half of the models that have that SSE. In the following, let
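The "at least half of the models" rule for common residues can be sketched in a few lines of Python. The function name and the input layout (one set of residue indices per model that contains the SSE) are assumptions made for illustration only.

```python
from collections import Counter

def common_residues(sse_residues_by_model):
    """Residues placed in the SSE by at least half of the models that have it.

    sse_residues_by_model: list of sets of residue indices, one per model
    containing the SSE (hypothetical input layout).
    """
    counts = Counter(r for residues in sse_residues_by_model for r in set(residues))
    needed = len(sse_residues_by_model) / 2
    return sorted(r for r, c in counts.items() if c >= needed)

# Four models place the same helix at slightly shifted positions.
helix = [{10, 11, 12, 13}, {11, 12, 13, 14}, {12, 13, 14, 15}, {13, 14}]
shared = common_residues(helix)
```

Residues 10 and 15 each appear in only one of the four models and are therefore excluded from the candidate set.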

For each model _{i}_{xlink}_{,}_{i}_{i}_{i}^{β}^{β}

Probabilistic model

We must define a probabilistic model in order to evaluate the information content provided by a set of cross-links. We treat each possible cross-link (pair of residues) as a binary random variable indicating whether or not the cross-link forms. We start with the model of our earlier work, in which the prior probability of a cross-link with respect to a model is 0.95 for distances ≤ 9 Å, 0.5 for distances between 9 and 19 Å, and 0.05 for those > 19 Å
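The distance-dependent prior stated above can be written as a small step function. The following Python sketch uses hypothetical function names and assumes that the distance is measured between two 3D coordinates in Å; which atoms are used is left to the caller.

```python
import math

def pair_distance(coord_a, coord_b):
    """Euclidean distance between two 3D coordinates (in Å)."""
    return math.dist(coord_a, coord_b)

def crosslink_prior(distance):
    """Prior probability that a cross-link forms, given the distance in a model."""
    if distance <= 9.0:
        return 0.95
    if distance <= 19.0:
        return 0.5
    return 0.05

# Two residues 5 Å apart in a model are very likely to cross-link.
p = crosslink_prior(pair_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```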

**Noise factors in cross-link planning.** Noise factors include misalignment (left) and flexibility (right). Blue dots represent residues and yellow lines their contacts. Regions in dashed lines are the modeled SSEs and those in solid lines are the SSEs as measured by cross-linking experiments.

We place a distribution Pr(

These two factors result in dependence among possible cross-links: if an SSE is misaligned or has moved relative to the original model, all its cross-links will be affected. However, the cross-links are conditionally independent given the particular value of misalignment or backbone choice. Thus we have for any two cross-links

and similarly for backbone flexibility. Furthermore, misalignment and flexibility are independent.
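To illustrate how a shared noise factor couples two cross-links, the following Python sketch marginalizes over a discrete misalignment offset. The offset distribution and the conditional probabilities are made-up numbers chosen for illustration, not values from the paper, and the function names are hypothetical. The same pattern would apply to backbone flexibility.

```python
def marginal_prob(offset_prior, p_given_offset):
    """Pr(cross-link forms), summing over the misalignment offsets."""
    return sum(offset_prior[o] * p_given_offset[o] for o in offset_prior)

def joint_prob(offset_prior, p1_given_offset, p2_given_offset):
    """Pr(both cross-links form): independent only once the offset is fixed."""
    return sum(
        offset_prior[o] * p1_given_offset[o] * p2_given_offset[o]
        for o in offset_prior
    )

# Hypothetical numbers: offsets of -1, 0, +1 residues along the SSE.
offset_prior = {-1: 0.25, 0: 0.5, 1: 0.25}
p1 = {-1: 0.5, 0: 0.95, 1: 0.5}
p2 = {-1: 0.05, 0: 0.95, 1: 0.5}

joint = joint_prob(offset_prior, p1, p2)
independent = marginal_prob(offset_prior, p1) * marginal_prob(offset_prior, p2)
```

Here the joint probability exceeds the product of the marginals, so the two cross-links are positively correlated even though they are independent for any fixed offset.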

Experiment planning

Our goal is to select a “good” set of residue pairs

and we incrementally select cross-links to maximize the difference in relevance regarding contact and average redundancy with already-selected cross-links.

Data interpretation

Once we have experimentally assessed cross-link formation for each selected residue pair, we can evaluate the probability of the SSE pair being in contact. Let

Here _{0} is the minimum number of positive cross-links for us to start believing c is in contact. For example, for _{0} = 3 and the likelihoods of

Plan evaluation

Finally, in order to assess an experiment plan’s robustness, we develop a Bayes error criterion to evaluate the probability of making a wrong decision regarding SSE contact:

As in the previous section, we sum over the possible outcomes (here, in contact or not) and the possible experimental results (
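A Bayes error of this general kind can be computed by brute-force enumeration when the number of cross-links is small. The Python sketch below is a simplification under stated assumptions: it treats the cross-links as conditionally independent given the contact state (ignoring the misalignment and flexibility coupling described earlier), it uses a plain maximum a posteriori decision rather than the thresholded rule of the data interpretation step, and all names are hypothetical.

```python
from itertools import product

def bayes_error(prior_contact, p_link_if_contact, p_link_if_no_contact):
    """Probability of a wrong contact / no-contact call under a MAP decision.

    p_link_if_contact[k] and p_link_if_no_contact[k] are the probabilities
    that cross-link k forms in each state (assumed conditionally independent).
    """
    n = len(p_link_if_contact)
    error = 0.0
    for outcome in product((0, 1), repeat=n):  # every possible data set
        joint_contact = prior_contact
        joint_no_contact = 1.0 - prior_contact
        for x, p1, p0 in zip(outcome, p_link_if_contact, p_link_if_no_contact):
            joint_contact *= p1 if x else 1.0 - p1
            joint_no_contact *= p0 if x else 1.0 - p0
        # The MAP rule picks the larger joint; the smaller one is the error mass.
        error += min(joint_contact, joint_no_contact)
    return error

one_link = bayes_error(0.5, [0.95], [0.05])
three_links = bayes_error(0.5, [0.95] * 3, [0.05] * 3)
```

With these illustrative probabilities, probing the SSE pair with three cross-links instead of one lowers the error from 0.05 to below 0.01.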

Results and discussion

We demonstrate the effectiveness of our approach with a representative set of 9 different CASP targets (Tab.

Test data sets (from CASP7)

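| CASP ID | PDB ID | 2° | AAs | Models | Av. RMSD |
|---|---|---|---|---|---|
| T0283_D1 | 2hh6 | 5 | 97 | 162 | 17.26 |
| T0289_D2 | 2gu2 | 5 | 74 | 34 | 13.45 |
| T0299_D1 | 2hiy | 3 | 91 | 30 | 15.23 |
| T0304_D1 | 2h28 | 2 | 101 | 26 | 15.76 |
| T0306 | 2hd3 | 7 | 95 | 45 | 14.22 |
| T0312_D1 | 2h6l | 2 | 132 | 55 | 16.13 |
| T0351 | 2hq7 | 5 | 117 | 65 | 15.42 |
| T0382_D1 | 2i9c | 6 | 119 | 196 | 12.79 |
| T0383 | 2hnq | 2 | 127 | 59 | 11.61 |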

Topological fingerprint selection

Fig.

**Evaluations of fingerprints for case study targets.** The plots show Bayes error (

On the other hand, we observe that the

The fingerprint evaluation incorporates a parameter in the

**Sensitivity analysis.** Differences in plans for target T0383, as evaluated by

End-to-end simulation study

Once we have selected a topological fingerprint, we next design a disulfide cross-linking plan to determine the contact state of the selected SSE pairs. To validate the overall process (fingerprint + disulfides), we perform a simulation study. Given a selected set of residue pairs for cross-linking, we use the crystal structure (PDB entry in Tab. 1) to determine whether or not they should form disulfides (C^{β}^{β}

To compare the decision based on simulated cross-linking data with that based on fold analysis, we performed a receiver operating characteristic (ROC) analysis. The area under the ROC curve (AUC) measures the probability that our experiment plan will rank a randomly chosen positive instance higher than a randomly chosen negative one. The larger the AUC, the greater the power of our algorithm to detect the right fold. Fig.
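The probabilistic reading of the AUC given above can be computed directly by comparing every positive instance with every negative one. In this Python sketch the scores are made-up numbers, the function name is hypothetical, and counting ties as half a win is a common convention assumed here rather than taken from the paper.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive ranks higher."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores for models with the right fold (positives) and without.
value = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

Five of the six positive-negative pairs are ordered correctly in this toy example, giving an AUC of 5/6.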

**Simulation studies.** ROC curves for eight simulation studies, at different SSE contact fraction thresholds

Robustness

One of the merits of the fold determination approach is that it is robust to errors in models, and can even account for the case when none of the models is correct. The selected targets provide examples requiring such robustness; we summarize here just a couple. ^{–3}), compared to that (≈ 0.66) of the uncovered but correct fold, which is found by enumeration.

Conclusions

This paper presents a computational-experimental mechanism to rapidly determine the overall organization of secondary structure elements of a target protein by probing it with a planned set of disulfide cross-links. By casting the experiment planning process as two stages of feature selection—SSE pairs characterizing overall fold and residue pairs characterizing SSE pair contact states—we are able to develop efficient information-theoretic planning algorithms and rigorous Bayes error plan assessment frameworks. Focusing on fold-level analysis results in a novel approach to elucidating three-dimensional protein structure, robust to common forms of noise and uncertainty. At the same time, the approach remains experimentally viable by finding a greatly reduced set of residue pairs (tens to around a hundred, out of hundreds to thousands) that provide sufficient information to determine fold.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FX, AMF, and CBK developed the approach; FX and CBK designed the algorithms; FX implemented the algorithms and collected the results; FX, AMF, and CBK analyzed the results and wrote the paper. All authors read and approved the final manuscript.

Acknowledgements

This work was inspired by conversations with and related work done by Michal Gajda and Janusz Bujnicki, International Institute of Molecular and Cell Biology, Poland. It was supported in part by US NSF grant CCF-0915388 to CBK.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 12, 2011: Selected articles from the 9th International Workshop on Data Mining in Bioinformatics (BIOKDD). The full contents of the supplement are available online at