School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Abstract
Background
Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in an ad hoc fashion. This paper presents Sub-Loop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences.
Results
Our simulation experiments demonstrate that the method computes high-scoring conformations of large loops (
Conclusion
Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ion binding demonstrates its potential as a tool for loop ensemble generation and missing structure completion.
Background
Sampling conformations of kinematic chains (rigid objects connected by articulated joints) is a fundamental problem in protein structure prediction, the geometry of folding linkages, and robot motion planning. Sampling poses a challenging computational problem when chains are large and must satisfy a variety of constraints and statistical preferences. Conformations may be required to satisfy hard feasibility constraints, such as loop closure and collision avoidance, while also obeying soft preference constraints, such as low energy and high structural likelihood. Particularly around folded protein structures, the subset of feasible and favorable conformations comprises a minuscule fraction of the conformation space, and due to the "curse of dimensionality" this fraction shrinks dramatically with the dimensionality of the state space. Because interesting biological macromolecules have large numbers of degrees of freedom, ranging up to hundreds or thousands, new techniques are needed to sample severely constrained conformations efficiently.
Protein loops are flexible structures that often deform during binding, and are extremely important for understanding protein functioning
For each of these methods, the sampling
Our new method overcomes many of the weaknesses of prior methods (see Table
Characteristics of loop generation techniques
Technique | Loop closure | Prior distribution/energy function | Global search | Scalability
Optimization | Exact | Y | N | +
Inverse kinematics sampling | Exact | N | Y | ++
Discrete search | Inexact | Y | Finite subset |
Standard Monte Carlo | No | Y | Y, requires mixing | +
SLIKMC | Exact | Y | Y, requires mixing | ++
Methods
SLIKMC is a Markov chain Monte Carlo (MCMC) method that takes as input an experimental conformation scoring function Φ, a protein structure from the Protein Data Bank (PDB), the beginning and ending residues of the loop, and outputs a sequence of perturbed loop conformations such that the sequence asymptotically approaches a probability distribution proportional to Φ. If the structure is missing, a rough initial structure is sampled using existing inverse kinematics loop closure techniques. To generate a subsequent conformation, it performs the following operations:
For each 4-residue subloop, repeat the following steps:
1. Sample a new subloop conformation that satisfies kinematic constraints.
2. Compute the Metropolis-Hastings importance ratio
3. Accept or reject the new subloop conformation with probability
The method terminates when a fixed number of conformations has been generated or a desired time cutoff is reached. The novel contributions of this paper include an exact derivation of the importance ratio
As an MCMC method, SLIKMC samples from a complex joint probability distribution by constructing a Markov chain whose equilibrium distribution is equal to the desired distribution. It is a hybrid MCMC algorithm that combines blocked Gibbs sampling and Metropolis-Hastings (MH) sampling. MH permits the use of non-normalized probability distributions, which is important because it is relatively simple to define a useful scoring function but virtually impossible to ensure that it integrates to one. The blocked Gibbs sampling method samples a small subloop at each step, which helps SLIKMC scale better to large chains, because acceptance rates decrease roughly exponentially in the number of variables sampled at once. This section first reviews classical MCMC methods and then describes the new approach.
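The hybrid scheme described above can be illustrated with a minimal, generic Metropolis-within-Gibbs sketch. It is not the SLIKMC implementation: the score `phi`, the overlapping index blocks, and the Gaussian random-walk proposal are all assumptions chosen for illustration, and no loop-closure constraint is involved. Because the proposal is symmetric, the acceptance ratio reduces to a ratio of unnormalized scores, so the normalization constant is never needed.

```python
import math
import random


def phi(x):
    """Hypothetical unnormalized score: a product of unit Gaussians
    scaled by an arbitrary constant (so it does not integrate to one)."""
    return 5.0 * math.exp(-0.5 * sum(v * v for v in x))


def mh_within_gibbs(phi, x0, blocks, step, n_sweeps, rng):
    """Sweep over blocks of variables; perturb one block at a time and
    accept or reject the move with the Metropolis-Hastings rule."""
    x = list(x0)
    score = phi(x)
    samples = []
    for _ in range(n_sweeps):
        for block in blocks:
            y = list(x)
            for i in block:
                # Symmetric random-walk proposal on the block only.
                y[i] += rng.gauss(0.0, step)
            new_score = phi(y)
            # Ratio of unnormalized scores; the constant 5.0 cancels.
            if rng.random() < min(1.0, new_score / score):
                x, score = y, new_score
        samples.append(list(x))
    return samples


rng = random.Random(0)
blocks = [[0, 1], [1, 2], [2, 3]]  # overlapping blocks covering all variables
samples = mh_within_gibbs(phi, [0.0] * 4, blocks, 1.0, 10000, rng)
means = [sum(s[i] for s in samples) / len(samples) for i in range(4)]
variances = [
    sum((s[i] - means[i]) ** 2 for s in samples) / len(samples)
    for i in range(4)
]
```

Updating small blocks keeps each proposal low-dimensional, which is the reason given above for the better acceptance rates on large chains.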
Markov chain Monte Carlo framework
Let x = (
where the
where
The Metropolis-Hastings (MH) algorithm addresses the problem that it is hard to sample directly from an unnormalized distribution Φ in part due to the difficulty of evaluating the normalization term
This is the so-called
is called the
The key question for MH is how to choose a proposal distribution that we can both sample from and evaluate. The acceptance strategy must evaluate the terms in (3) exactly so that the MH algorithm respects detailed balance. One of our key contributions is a technique for evaluating
Note that it is challenging to choose
and keeping the remaining variables fixed. The variable is updated and the index
Our method combines Gibbs sampling with MH sampling to generate a new sample from (5). To do so, simply consider all other variables fixed, sample
Sparse factored models
Due to the locality of interactions in most scoring functions of interest, it is possible to represent Φ in a
where each
Probabilistic graphical models like Bayesian networks and Markov random fields are inherently factored: the domain of each factor consists only of a vertex and its neighbors in the graph. A graphical model is
• Ramachandran plots
• Steric clashes
• B-factors defined as Gaussians
• Sidechain rotamer distributions, as described in the Sidechain sampling section.
Each factor can be evaluated quickly, but over thousands or millions of evaluations they accumulate significant computational cost. Significant savings can be achieved in sparse models, because when a few variables are changed, the change in Φ can be calculated quickly by only evaluating those factors involved, rather than recomputing Φ from scratch. Although steric clashes are theoretically considered as
In future work we are interested in including additional statistical potentials and/or all-atom energy function terms in scoring. With a naive implementation, each atom is involved in
Kinematic chain modeling
Consider a jointed kinematic chain with reference frames
Although it is standard practice and beneficial for certain algorithms to define the system state with a minimal set of coordinates, e.g., x = (
Our method represents an expanded state that incorporates all spatial variables along with the conformation variables: x = (
Probabilistic graphical models of kinematic chains. Left: sparse graphical model relating
where each
where
With (7) encoded so that factors contain few variables in their domain, the model becomes sparse. However, we have added the complication of maintaining a valid kinematic structure, because the set of x for which Φ is nonzero lies on a lower-dimensional manifold. Technically speaking, the probability density must be considered with respect to a base measure that assigns finite, nonzero density to the manifold. For 3D chains, the state space has dimensionality 7
Block sampling and selection
A block is a subset of variables that are simultaneously sampled. The number of variables in a block must be sufficiently large to give at least one continuous degree of freedom of movement. The Metropolis-Hastings criterion is used to accept or reject a move because it is unrealistic to sample directly from the block's conditional density. This key subroutine, SampleBlockMH, takes as input the previous sample x^{(k)} and a block
SampleBlockMH(x^{(k)},
1. Using SampleBlock as described below, sample a candidate conformation
2. Compute the MH acceptance probability
3. Accept the move
Here the subscript
How many variables should be included in a block? Standard Gibbs sampling (i.e.,
Setting
Parameterization of subloops via independent subchains. Left: a 5-angle block for a planar chain with fixed end frames
SampleBlock
1. Sample values for the independent subchain at random.
2. Attempt to close the chain by calculating an analytical IK solution for the dependent subchain. We use the method of
3. If more than one IK solution exists, one is picked at random, and if no solution exists, the process terminates with failure.
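The three SampleBlock steps can be sketched on a deliberately simplified planar example. The assumptions here are not from the paper: the block has three links rooted at the origin, the first joint angle plays the role of the independent subchain, the last two joints form the dependent subchain, and closure means only that the chain tip reaches a fixed target point (end-frame orientation is ignored). The names `sample_block` and `forward` are hypothetical. The sketch also omits the sampling density needed for the MH importance ratio.

```python
import math
import random


def forward(lengths, angles):
    """Tip position of a planar chain with relative joint angles,
    rooted at the origin."""
    x = y = heading = 0.0
    for length, angle in zip(lengths, angles):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y


def sample_block(lengths, target, rng):
    """Return three joint angles whose tip reaches `target`, or None."""
    l1, l2, l3 = lengths
    # Step 1: sample the independent subchain (first joint) at random.
    t1 = rng.uniform(-math.pi, math.pi)
    ex, ey = l1 * math.cos(t1), l1 * math.sin(t1)
    # Step 2: close the chain with the analytical two-link IK solution.
    dx, dy = target[0] - ex, target[1] - ey
    d = math.hypot(dx, dy)
    if d > l2 + l3 or d < abs(l2 - l3) or d == 0.0:
        return None  # no IK solution: terminate with failure
    c = (d * d - l2 * l2 - l3 * l3) / (2.0 * l2 * l3)
    q = math.acos(max(-1.0, min(1.0, c)))
    # Step 3: two mirror-image solutions exist; pick one at random.
    t3 = q if rng.random() < 0.5 else -q
    heading2 = math.atan2(dy, dx) - math.atan2(
        l3 * math.sin(t3), l2 + l3 * math.cos(t3)
    )
    return [t1, heading2 - t1, t3]


rng = random.Random(1)
lengths = (1.0, 1.0, 1.0)
target = (1.5, 0.5)
solution = None
while solution is None:  # retry on IK failure
    solution = sample_block(lengths, target, rng)
tip = forward(lengths, solution)
```

In this toy setting a failed closure is simply retried; how failures are handled inside the Markov chain is a separate design decision that the sketch does not address.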
It is recommended that
Block selection. A 7-residue chain is shown with each residue drawn in a distinct color. SLIKMC incrementally samples blocks of 4 consecutive residues (8 torsional angles), with the first 3 residues overlapping with the preceding block.
Calculation of subloop sampling densities
To calculate the MH importance ratio, we must calculate
Fix the endpoints of the block, and let
Sampling distributions on manifold charts. Top: abstract illustration of how analytical IK implicitly decomposes a 1-parameter manifold
where
where
A remaining issue is that it is often difficult to explicitly compute the Jacobian of the IK function involved in
We have the constraint equation:
Taking the derivative of both sides of (11) with respect to y we get:
and hence
holds as long as
Finally, since
we obtain the Jacobian
in which
Beyond computing the proper sampling density, it is also important to design the algorithm to efficiently compute the MH acceptance probability. Since clash detection takes 60 times more computation time than calculating the rest of the terms in Φ, we check collisions
Extension to other topologies
Although the core method applies to linear closed kinematic chains, it can be extended to handle other molecular topologies, such as free-endpoint chains and sidechains. In theory, polycyclic compounds may also be handled. Each new topological structure requires specialized block selection and sampling routines. For example, free-endpoint chains need separate sampling subroutines for the start and end blocks, for which standard MC methods are used.
Sidechain deformations are important for shaping binding cavities, and SLIKMC can be adapted to generate sidechain conformations in the same graphical modeling framework. It is known that the sidechain conformation depends on the backbone dihedral angle of the corresponding residue
Sidechain sampling
For sidechain conformation priors we use the 2010 Backbone-dependent Rotamer Library
Treating the remainder of the protein as fixed, we model the target distribution of a sidechain x
where
Extending block sampling to include sidechains requires justifying the importance ratio carefully to ensure unbiased sampling. An efficient sampling procedure is as follows: first compute a closed-loop backbone subchain from the blocked Gibbs sampling step and compute its acceptance probability as usual. If accepted, sample each sidechain along the block according to its backbone-dependent rotameric distribution. Because it is a Gaussian mixture, we can sample from
To justify this procedure, we show that its acceptance probability is equal to the MH acceptance probability for the entire block including sidechains. Let the block be x
by conditioning on x
Since the first term is simply the importance ratio of the backbone and
Multiply-closed kinematic loops
It may be possible to extend SLIKMC to handle multiply-closed loops such as those that occur in polycyclic compounds. This requires special care to divide the structure into blocks that can be split into dependent and independent subchains, such that a conformation of the independent subset completely determines the dependent subset, up to some finite multiplicity. In other words, the independent subchains form a chart of the space of closed-chain conformations of the whole block. The union of all blocks must also cover all state variables.
We illustrate the principle on planar kinematic chains, which require blocks of size at least 4. Assume each cycle contains at least 3 joints. We define a topological ordering by selecting a linear main chain and considering branches off of the main chain. Non-branching linear blocks, free-endpoint blocks, and sidechains (open-ended branches) are handled as described above. Each 3-joint branch off of a branching block is then considered as part of a dependent subchain (see Figure
Extending to nonlinear topologies via branching blocks. Several branching structures may be treated as blocks. Independent chains (shaded) must be chosen to parameterize the manifold of configurations satisfying closed chain constraints (open circles).
To sample a branching block, we first sample values for the independent subchain at random and then close the loops for each branch according to their topological order. To ensure unbiased sampling, we must also calculate the metric tensor in (9) for the entire branched block. This in turn requires computing the Jacobian of the chart, which requires computing the Jacobian of the implicit form for the multiple loop-closure constraints (11). Due to the tree structure the Jacobian is sparse, and the matrix inversion in the implicit chart Jacobian (13) can also be computed efficiently. We have implemented this approach on 2D chains with closed rings (see Figure
Results on a planar multi-loop structure. Fluctuations of a 2D chain with a closed ring constrained on the three ends (open circles). Left: initial conformation. The angular prior for each link is modeled as a normal distribution with 20° standard deviation. Right: 20 samples with skip length 100.
Mixing and autocorrelation
In any MCMC method it is important to empirically examine the mixing rate of the Markov chain. Firstly, it can potentially take many iterations to "forget" the effects of a poor initialization. For protein sampling, this is not a significant problem because we initialize the chain with the native structure from the PDB, which is typically quite good.
Secondly, subsequent samples are highly autocorrelated, and many conformations must be skipped to obtain a sequence with low autocorrelation. This is a serious concern because autocorrelation grows stronger as more variables are included in the conformation (see Figure
Mixing of SLIKMC samples. Sampling conformations of a planar 20-link chain, anchored at the endpoints, with a uniform prior. Left: starting from a deliberately bad initial conformation. Middle: the sequence mixes relatively quickly, but the first 40 samples are biased by the initial conformation and autocorrelate strongly. Right: a sequence that takes every 40th sample does not significantly autocorrelate.
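The effect of skipping samples can be checked numerically with a small sketch. The chain below is a hypothetical AR(1) series standing in for a scalar summary of consecutive MCMC samples (for example, one joint angle); it is not SLIKMC output. The skip length of 40 mirrors the thinning used in the mixing figure, and the function names are illustrative only.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a scalar series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return cov / var


def thin(series, skip):
    """Keep every `skip`-th sample and discard the rest."""
    return series[::skip]


# Hypothetical strongly autocorrelated chain: x[k+1] = 0.9 * x[k] + noise.
rng = random.Random(0)
chain = [0.0]
for _ in range(20000):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

raw_rho = autocorrelation(chain, 1)                # consecutive samples
thinned_rho = autocorrelation(thin(chain, 40), 1)  # every 40th sample
```

Comparing `raw_rho` with `thinned_rho` gives a quick empirical check of whether a chosen skip length yields approximately independent samples. A suitable skip length depends on the chain and has to be estimated for each problem.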
Results and discussion
The SLIKMC algorithm implements a scalable framework for Monte Carlo sampling of kinematic chains. The technique uses a blocked Gibbs sampler that proposes movements of small subchains of conformation angles at once, along with a Metropolis-Hastings technique that guarantees an unbiased sampling of the loop-closure submanifold for that block. Due to the small block size, each energy function is local and adjustments are fast, ranging from microseconds to milliseconds. The method is mathematically proven to generate a statistically unbiased sample in the large sample limit. It is particularly well-suited for closed loops (see Figure
Closed-chain sampling. Three sampling methods for a 20-link closed-loop chain. At left, the prior gives preference to joint angles with small magnitude. At right, the prior gives preference to joint positions in a triangle-shaped distribution (circle centers: means, shaded circles: 3
Free-endpoint chain sampling with heterogeneous priors. Comparing SLIKMC against a standard Metropolis-Hastings (MH) sampler on a free-endpoint chain with a heterogeneous prior distribution over joint positions (crosses: means, shaded circles: 3
SLIKMC is implemented as an add-on to the software package LoopTK
Loop sampling with prior distributions
We consider the 10-residue closed loop 1AMP 181-190, which is a representative segment for testing loop reconstruction algorithms
Ramachandran plot of SLIKMC samples. Left: the Ramachandran plot of generic residues from a database that includes 500 high-resolution proteins
We compared our method with the discrete-search loop construction software RAMP
Running time comparison between RAMP and SLIKMC. Time required for the discrete search method RAMP and SLIKMC to obtain one sample for loops of varying size. The time required for RAMP increases exponentially while our method runs in approximately constant time.
We also compare SLIKMC with a sample-then-select inverse kinematics method that first samples a set of clash-free, loop-closing conformations and then extracts the top-scoring ones. The LoopTK configuration sampling method
Comparing SLIKMC against sample-then-select. Left: samples generated by SLIKMC with a skip length of 100. Right: samples generated by post-selecting the top 20 scoring samples generated from the LoopTK IK sampler. Transparent balls depict the 3
RMSD distributions from SLIKMC against IK. Histogram of RMSD to the native structure for samples from SLIKMC and the LoopTK sampler on 1AMP 181-190. With SLIKMC the use of prior information allows fine-grained control over the sampling distribution.
Missing loop completion
We now consider an application to completion of missing loops. Given the starting position and ending position of a missing segment, we first generate an arbitrary loop-closing configuration, then run SLIKMC to perturb it to a high-probability conformation. As a test case, we select a helix structure (residues 40-51) from an apo protein, 1B8C. We generate an arbitrary loop-closing configuration by running the LoopTK configuration sampling method
Helix recovery. Left: starting from a highly perturbed conformation, SLIKMC recovers a helix using only clash and Ramachandran plot information. Every 20th sample is drawn. The final displayed conformation has RMSD 0.2704 to the PDB structure. Right: by comparison, an IK technique attains a minimum RMSD of 4.0655 out of 13,000 samples (90 minutes running time).
Scalability tests on free-endpoint chains
To further study scalability, we apply SLIKMC to subchains of chain A in a calcium-binding protein, 1B8C. Samples for a 30-residue subchain are generated in 1 s (Figure
Samples of a 30-residue chain. 12 samples of a 30-residue subchain of protein 1B8C selected from the first 300 consecutive samples with skip length 25. Transparent balls depict the 3
Samples of a 108-residue chain. 17 samples of 1B8C chain A (108 residues) selected from 170 consecutive samples with skip length 10. Each conformation is drawn in a distinct color.
We compare SLIKMC against a standard Metropolis-Hastings algorithm that samples backbone angles according to a Gaussian proposal distribution with 1° standard deviation. The target distribution for both methods includes steric clashes, Ramachandran plots, and B-factors. Note that standard MH has probability zero of sampling a conformation that satisfies terminal endpoint constraints exactly, so it is not applicable to closed loops. These tests therefore ignore the loop closure constraint altogether.
Figure
Running time comparison between SLIKMC and standard Metropolis-Hastings. Time required to obtain one quasi-independent sample on open-ended subchains of 1B8C with a variety of lengths. Standard MH did not generate even one sample for chain lengths above 30 after 30 minutes.
Simultaneous backbone and sidechain sampling
We demonstrate backbone and sidechain sampling using a 15-residue helix structure, 1AMP 120-134. As priors we use backbone-dependent rotamer distributions, Ramachandran plot priors, and B-factors for the backbone, and we test for self-collision and collision against the non-loop portion of the chain. Given a 20 min cutoff time, 1,623 samples are generated. Figure
Sidechain distribution of residue ARG. Left: Gaussian mixture distribution of sidechain torsion angles for the native structure of residue 130 (arginine) in protein 1AMP. Right: histograms of sidechain angles from samples generated by SLIKMC. The distributions of
Conclusion
We propose SLIKMC, a Markov chain Monte Carlo method for sampling closed chains according to a specified probability distribution. A probabilistic graphical model (PGM) is proposed to specify the structural preferences. A novel method for sampling subloops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints, and the mathematical conditions necessary for unbiased sampling are derived. Simulation experiments show that SLIKMC completes large loops (
SLIKMC is demonstrated to be applicable to various tasks such as conformation ensemble generation and missing structure construction. For future work we intend to integrate SLIKMC with more complex energy functions, statistical potentials, and machine-learning-based structural function predictors. A limitation of the technique is that, due to the locality of each block adjustment, large-magnitude global motions may take a huge number of iterations to sample, particularly when the motion must cross low-scoring chasms in conformation space. We intend to investigate annealing-like or random restart techniques for overcoming these difficulties, as well as different block choices that allow the algorithm to take larger steps. Finally, we are interested in extending our method to study simultaneous backbone and sidechain flexibility in protein-ligand and protein-protein binding.
Appendix
This appendix presents a fundamental statement about probability densities under a transformation of variables.
From change of variables we have:
where
We now use the fact that the
Note that this can be expressed more compactly as det(
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YZ implemented the algorithm and conducted the numerical experiments. KH contributed to the study design and developed the mathematical foundations. All authors contributed to drafting the manuscript and approved the final manuscript.
Acknowledgements
The authors thank Predrag Radivojac for valuable discussions that inspired us to start this project and helped clarify our understanding of protein structure and function. This research is partially supported by NSF Grant No. 1218534.
Declarations
The publication costs for this article were funded by Dr. Kris Hauser.
This article has been published as part of