Abstract
Background
Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in adhoc fashion. This paper presents SubLoop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences.
Results
Our simulation experiments demonstrate that the method computes highscoring conformations of large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques. Two new developments contribute to the scalability of the new method. First, structural preferences are specified via a probabilistic graphical model (PGM) that links conformation variables, spatial variables (e.g., atom positions), constraints and prior information in a unified framework. The method uses a sparse PGM that exploits locality of interactions between atoms and residues. Second, a novel method for sampling subloops is developed to generate statistically unbiased samples of probability densities restricted by loopclosure constraints.
Conclusion
Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ionbinding demonstrate its potential as a tool for loop ensemble generation and missing structure completion.
Keywords:
Conformation sampling; Monte Carlo methods; protein loops; ensemble generation; graphical modelsBackground
Sampling conformations of kinematic chains  rigid objects connected by articulated joints  is a fundamental problem in protein structure prediction, the geometry of folding linkages, and robot motion planning. Sampling poses a challenging computational problem when chains are large and must satisfy a variety of constraints and statistical preferences. Conformations may be required to satisfy hard feasibility constraints, such as loop closure and collision avoidance, while also obeying soft preference constraints, such as low energy and high structural likelihood. Particularly around folded protein structures, the subset of feasible and favorable conformations comprises a miniscule fraction of the conformation space, and due to the "curse of dimensionality" this fraction shrinks dramatically with the dimensionality of the state space. Because interesting biological macromolecules have large numbers of degrees of freedom, ranging up to hundreds or thousands, new techniques are needed to sample severely constrained conformations efficiently.
Protein loops are flexible structures that often deform during binding, and are extremely important for understanding protein functioning [1]. Loop sampling has been used in missing fragment reconstruction, generating fluctuations in equilibrium conformations, and generating decoy sets for function prediction. Such applications typically require methods for sampling energeticallylikely and diverse configuration ensembles rather than optimizing a single point estimate. The loop closure constraint, which requires the terminal atoms of a loop to lie at fixed positions dictated by the surrounding structured regions, poses a major challenge in sampling. Existing loop sampling methods include discrete search [2], optimization [1], and inverse kinematics (IK) methods [36]. Fiser et al's [1] approach takes an energy function that encodes spatial restraints and preferences on dihedral angles, and then runs a numerical optimization to minimize that energy. Optimization is relatively computationally expensive, which is usually worthwhile for single structure prediction but less so for generating conformation ensembles. Discrete search methods are able to explore a wider space of conformations by incrementally building a tree of clashfree subchain conformations starting from one end of the loop and progressing toward the terminal end [2,7,8]. Although effective for small loops, these methods do not scale well to large loops due to the problem of combinatorial explosion. As a result, discrete search is intractable with chains containing 78 or more residues. They also introduce discretization artifacts, and are not able to close the gap between a terminal atom and its desired position. Inverse kinematics (IK) techniques from the robotics field have been adopted to sample conformations with exact loop closure [35,9]. However, these methods prioritize the loop closure constraint and do not take energies into account during sampling. Hence, some authors employ a secondary energy optimization step to generate more plausible conformations [3,6,10].
For each of these methods, the sampling distribution is throughly entangled with the sampling procedure. To achieve a desired distribution, the sampling procedure must be tuned in an opaque, nontrivial manner, and it is unclear that a desired distribution can even be achieved. Instead, extensive posthoc empirical testing to assess the quality of the resulting distribution and to argue that a method samples well. Monte Carlo (MC) techniques represent a more principled class of approaches that sample directly from a desired destribution. They have a long history of use in computational biology because they can quickly explore multiple energy minima and transition pathways, while molecular dynamics and optimization techniques often get stuck in single local minima [1114]. They are also wellsuited for generating conformation ensembles. The general MetropolisHastings approach generates a sequence of incrementally perturbed configurations via a random walk, with a carefullydesigned acceptance criterion (the detailed balance condition) that ensures that the sampling distribution approaches the desired one as more samples are drawn. However, there is a tradeoff in choosing the perturbation size: small perturbations raise the fraction of accepted moves but lower the speed of conformation space exploration. Moreover, standard MC techniques cannot be directly applied to protein loops due to the loop closure constraint, which causes each step to be accepted with probability 0.
Our new method overcomes many of the weaknesses of prior methods (see Table 1); it simultaneously scales to long loops (e.g., >10 residues) and produces unbiased ensembles of conformations. It uses a unified probabilistic graphical modeling (PGMs) framework for modeling a desired distribution from experimentally available statistical priors. PGMs such as Bayesian networks and Markov random fields are powerful tools for inference in large domains with heterogeneous sources of information, and have been applied in a limited sense to protein structure prediction. They have been used to predict sidechain rotamer conformations [15] and conformations of macromolecular assemblies from electron density maps [16]. Their use is reasonably well understood in the discrete case, but continuous variables often prove challenging. Our work derives a new Gibbs sampling inference method for continuous PGMs with loopclosure constraints, which restrict the feasible domain to a nonlinear implicit manifold. In particular we derive the mathematical relationship between an inverse kinematic sampling distribution and the manifold's metric tensor, which is necessary to compute the detailed balance condition in the MetropolisHastings algorithm. The resulting sampling sequence is unbiased in the sense that its distribution approaches the target distribution in the largesample limit.
Table 1. Characteristics of loop generation techniques
Methods
SLIKMC is a Markov chain Monte Carlo (MCMC) method that takes as input an experimental conformation scoring function Φ, a protein structure from the Protein Data Bank (PDB), the beginning and ending residues of the loop, and outputs a sequence of perturbed loop conformations such that the sequence asymptotically approaches a probability distribution proportional to Φ. If the structure is missing, a rough initial structure is sampled using existing inverse kinematics loop closure techniques. To generate a subsequent conformation, it performs the following operations:
For each 4residue subloop, repeat the following steps:
1. Sample a new subloop conformation that satisfies kinematic constraints.
2. Compute the MetropolisHastings importance ratio α of the new conformation against the previous conformation.
3. Accept or reject the new subloop conformation with probability α.
The method terminates when a fixed number of conformations are generated or until a desired time cutoff is reached. The novel contributions of this paper include an exact derivation of the importance ratio α for the inverse kinematics sampler of step 1 and the use of sparse PGMs to evaluate the importance ratio quickly persubloop. We also describe extensions that handle flexibility in sidechains and molecules with multiple branches or loops (e.g., polycyclic compounds).
As a MCMC method, SLIKMC samples from a complex joint probability distribution by constructing a Markov chain whose equilibrium distribution is equal to the desired distribution. It is a hybrid MCMC algorithm that combines blocked Gibbs sampling and MetropolisHastings (MH) sampling. MH permits the use of nonnormalized probability distributions, which is important because it is relatively simple to define a useful scoring function but virtually impossible to ensure that it integrates to one. The blocked Gibbs sampling method samples a small subloop at each step, which helps SLIKMC scale better to large chains, because acceptance rates decrease roughly exponentially in the number of variables sampled at once. This section will first review classical MCMC methods and then describe the new approach.
Markov chain Monte Carlo framework
Let x = (x_{1}, ..., x_{n}) denote the state variables of the system of interest. Experimental conditions including hard constraints and soft preferences are encoded into a nonnegative scoring function Φ(x_{1}, ..., x_{n}). The score is required to have finite integral, is zero at states that violate hard constraints, and higher values indicate more desirable states. Φ is considered as an unnormalized probability density, and our goal is to generate samples with probability proportional to Φ. In other words, the goal is to sample from the normalized density P defined as:
where the proportionality constant Z that ensures that P integrates to 1. Φ is closely related to energy functions E(x) through the Gibbs measure
where T is the system temperature.
The MetropolisHastings (MH) algorithm addresses the problem that it is hard to sample directly from an unnormalized distribution Φ in part due to the difficulty of evaluating the normalization term Z [17]. On step k, MH first samples a candidate move from x^{(k) }to x′ from a proposal distribution Q(x′; x^{(k)}), and accepts the move x^{(k+1) }← x′ with probability
This is the socalled detailed balance condition. The Z terms in the numerator and denominator cancel out, so we use Φ directly instead of P. If the move is rejected, then the current state is maintained: x^{(k+1) }← x^{(k)}. The term
is called the importance ratio. If the ratio is greater than 1, then the new sample is accepted; otherwise it is accepted with probability equal to the ratio. With the detailed balance construction, P (x) is indeed the stationary distribution of the Markov chain generated by successive samples.
The key question for MH is how to choose a proposal distribution that we can sample from and evaluate. The acceptance strategy must evaluate the terms in (3) exactly so that the MH algorithm respects the detailed balance. One of our key contributions is a technique for evaluating Q exactly when sampling from closed chain submanifolds, which enables our method to generate an unbiased sampling sequence.
Note that it is challenging to choose Q to approximate P closely, and hence in practice the probability of accepting a sample drops sharply in the dimensionality of the space. Gibbs sampling is commonly used to address this issue. Moreover it is convenient when sampling from a conditional density is easier than sampling from the entire joint density, which is the case in the sparse Bayesian models that we employ in this work. Given the current sample at time k, Gibbs sampling generates the next state x^{(k+1) }by sampling a single variable from the conditional density
and keeping the remaining variables fixed. The variable is updated and the index i is incremented in looping fashion. If the dependencies between the variables are sparse (e.g., every variable x_{i }only depends on a handful of variables rather than the remaining n  1 variables), then Gibbs sampling can be efficient even for very large problems. This is exploited in sparse PGMs. Blocked Gibbs sampling is a variation of Gibbs sampling that groups multiple variables as a block and samples the block from the joint distribution conditioned on all other variables.
Our method combines Gibbs sampling with MH sampling to generate a new sample from (5). To do so, simply consider all other variables fixed, sample from a conditional proposal distribution , and then apply the importance ratio test as usual to determine whether to accept the step or keep .
Sparse factored models
Due to the locality of interactions in most scoring functions of interest, it is possible to represent Φ in a factored form:
where each ϕ_{i }is known as a factor and each S_{i }is a subset of {x_{1}, ..., x_{n}} known as the domain of the factor ϕ_{i}. For example, in protein structure prediction factors may include Ramachandran plots relating each pair of dihedral angles (φ, ψ), steric clashes, energy functions defined over atom positions, and prior knowledge from Bfactors or electron density maps.
Probabilistic graphical models like Bayesian networks and Markov random fields are inherently factored: the domain of each factor consists only of a vertex and its neighbors in the graph. A graphical model is sparse if each variable x_{i }is involved in only a handful of factors (i.e., bounded by a constant unrelated to n), and hence only interacts directly with a few other variables. An important consequence in the discrete case is that probabilistic inference is computationally tractable in sparse models (polynomial in n), whereas inference is intractable in dense models (in general, exponential in n). A key step in our method converts the representation of a kinematic chain from dense to sparse form, as described below. The implementation described in this paper currently supports:
• Ramachandran plots ϕ_{RP(r)}(φ, ψ) which vary by residue r.
• Steric clashes ϕ_{SC(j, k)}(p_{j}, p_{k}) which are 0 if atom j collides with atom k and 1 otherwise.
• Bfactors defined as Gaussians where µ_{j }is the predicted atom position and B_{j }is the Bfactor value in the protein's PDB file. A constant of proportionality c can be set by the user according to his/her confidence in the quality of the Bfactor estimates.
• Sidechain rotamer distributions, as described the Side Chain Sampling section.
Each factor can be evaluated quickly, but over thousands or millions of evaluations they accumulate significant computational cost. Significant savings can be achieved in sparse models, because when a few variables are changed, the change in Φ can be calculated quickly by only evaluating those factors involved, rather than recomputing Φ from scratch. Although steric clashes are theoretically considered as O(n^{2}) pairwise factors, in practice we use a gridbased hashing data structure that only checks nearby atoms for collision. As a result, each Gibbs sampling step can be performed in O(1) time.
In future work we are interested in including additional statistical potentials and/or allatom energy function terms in scoring. With a naive implementation, each atom is involved in O(n) pairwise interactions, but we expect to exploit the weakness of distant interactions to reduce the number of factors included in the computation.
Kinematic chain modeling
Consider a jointed kinematic chain with reference frames T_{0}, T_{1}, ..., T_{N}, connected with relative rotational angles q_{1}, ..., q_{N}. For a protein backbone, there is a onetoone correspondence between frames and backbone atom positions p_{1}, ..., p_{M}, and the rotational variables are simply the backbone dihedral angles φ_{1}, ψ_{1}, ..., φ_{N/2}, ψ_{N/2}.
Although it is standard practice and beneficial for certain algorithms to define the system state with a minimal set of coordinates, e.g., x = (T_{0}, q_{1}, ..., q_{N}), a key step of our method is to consider an expanded state. Minimal coordinates use the fact that each subsequent frame T_{1}, ..., T_{N }can be determined from x through straightforward forward kinematics, leading to a lower dimensional representation. However, this approach eliminates sparsity in the probabilistic model because a factor defined on T_{N }will depend on all variables, a factor defined over T_{N  1 }will depend on all variables except q_{n}, and so on. Moreover, if a sampler is asked to generate certain variables from a density defined over T_{1}, ..., T_{N }(for example, atom positions), the generated distribution may be biased unless it computes the determinant of an N × N metric tensor for each evaluation of Φ. As described below, this is a consequence of nonlinear transformations of distributions (see Appendix). On the other hand, computing determinants takes O(N^{3}) time, which scales poorly with large N.
Our method represents an expanded state that incorporates all spatial variables along with the conformation variables: x = (q_{1}, ..., q_{N}, T_{0}, ..., T_{N}). The joint probability density is then defined over angles and reference frames of all links along the chain (see Figure 1):
Figure 1. Probabilistic graphical models of kinematic chains. Left: sparse graphical model relating N joint angles and link transformations via local factors. Right: instantiation of the model for an nresidue protein backbone, with an additional layer accounting for atom positions.
where each S_{i }is now a subset of {q_{1}, ..., q_{n}, T_{0}, ..., T_{n}}, and where ϕ_{kinematic }is the forward kinematic transform that defines the frame T_{j }in terms of the prior frame T_{j1 }and the relative angle q_{j}. Because the transform is deterministic, ϕ_{kinematic }should be thought of as an indicator function. Taking the convention that each frame's origin lies on its joint's axis:
where is the relative transformation of frame j relative to frame j  1 and R(a, q) is the rotation of angle q about axis a. Fixedendpoint constraints can also be encoded with indicator factors ϕ_{closure}(T_{0}) and ϕ_{closure}(T_{n}) that are zero everywhere except at the fixed frames.
With (7) encoded so that factors contain few variables in their domain, the model becomes sparse. However, we have added the complication of maintaining a valid kinematic structure, because the set of x for which Φ is nonzero lies on a lowerdimensional manifold. Technically speaking, the probability density must be considered with respect to a base measure that assigns finite, nonzero density to the manifold. For 3D chains, the state space has dimensionality 7N but the manifold has dimensionality 6 + N for freeendpoint chains or N  6 for fixedendpoint chains. The next section will describe how we handle these submanifolds in detail.
Block sampling and selection
A block is a subset of variables that are simultaneously sampled. The number of variables in a block must be sufficiently large to give at least one continuous degree of freedom of movement. The MetropolisHastings criterion is used to accept or reject a move because it is unrealistic to sample directly from the block's conditional density. This key subroutine, SampleBlockMH, takes as input the previous sample x^{(k) }and a block B of b consecutive joint angles and their intervening frames. It then samples a candidate move, and accepts it according to the MH criterion. Pseudocode is as follows:
SampleBlockMH(x^{(k)}, B):
1. Using SampleBlock as described below, sample a candidate conformation of B at random, keeping the rest of the chain fixed.
2. Compute the MH acceptance probability
3. Accept the move with probability α.
Here the subscript B denotes the subset of variables in the block, while the subscript C denotes the complement of the block. The score Φ_{B }calculates the product of factors ϕ_{i }whose domains S_{i }overlap with B, which is more efficient than recomputing Φ from scratch. The remaining details of the method  the block size, the block sampling procedure, and calculating the sampling probability Q_{B } are described in detail in the remainder of this section. To generate a new conformation x^{(k+1) }of the entire chain, SampleBlockMH is called several times with overlapping blocks incremented sequentially down the chain. Block ordering (e.g. forward, backward, or random order) has no effect on the asymptotic distribution and experiments suggest virtually no noticeable effect apart from the first handful of samples. Thanks to sparsity, each pass is performed in O(N) time, which takes a fraction of a second for chains with hundreds of variables.
How many variables should be included in a block? Standard Gibbs sampling (i.e., b = 1) does not work because loop closure constraints constrain the conditional density of any variable given the rest (5) to a Dirac. Hence, the state would never change. In fact, no mixing occurs for b ≤ 5, except possibly at singular conformations, which occupy a set of measure zero in conformation space and are therefore unlikely to occur naturally. For 6 angles, analytical inverse kinematics (IK) techniques are available to compute solutions for a pair of fixed end frames [9]. In fact, any number from 0 to 16 solutions may exist for a given 6angle problem. Nevertheless, b = 6 is not suitable because it restricts the random walk to only a finite set of conformations.
Setting b = 7 angles allows sufficient freedom to sample from a 1dimensional manifold of solutions. In general, a block of b ≥ 6 angles admits a b  6 dimensional solution manifold. Denote the block B = {q_{i}, ..., q_{i+b1}, T_{i}, ..., T_{i+b2}}, and let us call the first b  6 angles of the block q_{i}, ..., q_{i+b7 }the independent subchain. Call the remaining 6 angles the dependent subchain. This is illustrated for a planar chain in Figure 2. A sampling procedure is as follows:
Figure 2. Parameterization of subloops via independent subchains. Left: a 5angle block for a planar chain with fixed end frames T_{i1 }and T_{i+4}. Right: a second IK solution for the dependent subchain.
SampleBlock
1. Sample values for the independent subchain at random.
2. Attempt to close the chain by calculating an analytical IK solution for the dependent subchain. We use the method of [9].
3. If more than one IK solution exists, one is picked at random, and if no solution exists, the process terminates with failure.
It is recommended that b ≥ 7 be chosen as low as possible, because as b grows, the probability of sampling an independent subchain that admits closure drops off dramatically as b grows, particularly for "stretched out" conformations. In our implementation, 4 consecutive residues are considered as a block that contains b = 8 angles since even numbers align better with the (φ, ψ) angles priors of each residue (see Figure 3). (Throughout this discussion we have assumed a 3D chain but the method works equally well in 2D. For planar chains, at least 4 angles are needed, and the manifold of solutions is (b  3)dimensional)
Figure 3. Block selection. A 7residue chain is shown with each residue drawn in a distinct color. SLIKMC incrementally samples block of 4 consecutive residues (8 torsional angles) with the first 3 residues overlapping with the preceding block.
Calculation of subloop sampling densities
To calculate the MH importance ratio, we must calculate sampling density for the sampling procedure SampleBlock. Several concepts from differential geometry are required in order to derive this density .
Fix the endpoints of the block, and let M denote the (b  6)dimensional manifold of loopclosing conformations. Let us call the (b  6) angles of the independent subchain y, which are sampled w.r.t. the density P (y). Observe that the candidate sample is distributed according to a nonlinear transformation of P(y) onto M. In fact, at nonsingular conformations the independent subchain forms a local chart of M, which is a local bijection between to M centered at (see Figure 4). Since there is a local bijection f between y and the point on the manifold x_{B}, the sampling density over x_{B }is given by:
Figure 4. Sampling distributions on manifold charts. Top: abstract illustration of how analytical IK implicitly decomposes a 1parameter manifold M into a set of local bijections (charts). Bottom: the Jacobian of a chart must be taken into account when calculating the sampling distribution Q over M.
where s is the number of IK solutions at y and G is the metric tensor of the chart x_{B }= f(y) (see Appendix). The inclusion of the metric tensor is a natural consequence of transformation of variables. For example, for the case b = 7, the metric tensor is the squared arc length of the 1dimensional parametrization of M (Figure 4, bottom). In general, G is given by
where is the Jacobian of the function f. Here we have also introduced a positive semidefinite weighting matrix W for the purpose of weighting the relative importance of matching the prior along certain axes. In the standard case, W is an identity matrix, but it can also be useful to choose a nonuniform diagonal matrix to account for heterogeneous units (e.g., angle vs. position variables).
A remaining issue is that it is often difficult to explicitly compute the Jacobian of the IK function involved in f. In other words, with z ≡ z(y) denoting the 6 angles in the dependent chain, it is difficult to evaluate ∂z/∂y. So, we compute an implicit chart Jacobian by considering the implicit form of the constraints C(x_{B}) = 0. These vectorvalued constraints state that the difference between the terminal frame of the subchain and the desired frame is zero.
We have the constraint equation:
Taking the derivative of both sides of (11) with respect to y we get:
and hence
holds as long as is invertible, which is true everywhere except at singular conformations. Each derivative of C in the above expression is a submatrix of the Jacobian and can be computed using standard techniques.
Finally, since
we obtain the Jacobian
in which I is the identity matrix and all frame derivatives are calculated using the chain rule . These partial derivatives are calculated using standard techniques.
Beyond computing the proper sampling density, it is also important to design the algorithm to efficiently compute the MH acceptance probability. Since clash detection takes 60 times more computation time than calculating the rest of the terms in Φ, we check collisions after determining whether a move will be accepted. Compared to the naive method, this method achieves an order of magnitude speedup.
Extension to other topologies
Although the core method applies to linear closed kinematic chains, it can be extended to handle other molecular topologies, such as freeendpoint chains and sidechains. In theory, polycyclic compounds may also be handled as well. Each new topological structure requires specialized block selection and sampling routines. For example, freeendpoint chains need separate sampling subroutines for the start and end blocks. Standard MC methods are used to do so.
Sidechain deformations are important for shaping binding cavities, and SLIKMC can be adapted to generate sidechain conformations in the same graphical modeling framework. It is known that the sidechain conformation depends on the backbone dihedral angle of the corresponding residue [18]. This requires sampling sidechains after the backbone conformation is sampled. Furthermore, since the distribution of sidechain torsional angles are limited to small number of typical conformations (rotamers) for most residues [19], we sample sidechains according to experimentallydetermined distributions.
Sidechain sampling
For sidechain conformation priors we use the 2010 Backbonedependent Rotamer Library [20]. In this library, each rotameric residue is associated with a list of rotamers which representing the high probability regions for sidechain torsion angles. The probability of a rotamer conformation χ is modeled as a continuous distribution given the backbone dihedral angle pairs. The dihedral angle (φ, ψ) space of each rotameric backbone residue r is discretized into a grid and each cell [a, b] × [c, d] contains its experimentally observed probabilities P (χ  r, a ≤ φ ≤ b, c ≤ ψ ≤ d). Each distribution over χ is specified as a Gaussian mixture model. For nonrotameric residues the terminal χ angles are handled specially due to the asymmetry in their distributions.
Treating the remainder of the protein as fixed, we model the target distribution of a sidechain x_{s}, conditional on backbone dihedral angles x_{b}, as follows:
where ϕ_{SC }indicates stericclashes and ϕ_{R(r) }indicates the sidechain conformation prior for residue r. Sidechain Bfactors are typically not included since we want to give enough freedom to explore the conformation space, and our experiments indicate that the flexibility of the protein chain will reduce greatly when we specify Bfactors as prior to both backbone and sidechain atoms.
Extending block sampling to include sidechains requires justifying the importance ratio carefully to ensure unbiased sampling. An efficient sampling procedure is as follows: first compute a closedloop backbone subchain from the blocked Gibbs sampling step and compute its acceptance probability as usual. If accepted, sample each side chain along the block according to its backbonedependent rotameric distribution. Because it is a Gaussian mixture, we can sample from ϕ_{R(r) }directly: pick a Gaussian from the mixture according to its weight and then sample from the Gaussian. Finally, reject the sample if the side chains collide.
To justify this procedure, we show that its acceptance probability is equal to the MH acceptance probability for the entire block including sidechains. Let the block be x_{B }= (x_{b}, x_{s}) and a candidate block sample , with b, s denoting backbone and sidechain variables respectively. The MH importance ratio is
by conditioning on x_{b}. Since we sample the sidechain according to Q(x_{s}x_{b}) = ϕ_{R(r)}(x_{s } x_{b}) and the prior sample is clashfree, we cancel terms in the numerator and denominator to get:
Since the first term is simply the importance ratio of the backbone and ϕ_{SC }is binary, we conclude that the block acceptance probability is either the backbone importance ratio if clashfree or zero if clashing. Hence the sidechain sampling procedure is sound.
Multiplyclosed kinematic loops
It may be possible to extend SLIKMC to handle multiplyclosed loops such as those that occur in polycyclic compounds. This requires special care to divide the structure into blocks that can be split into dependent and independent subchains, such that a conformation of the independent subset completely determines the dependent subset, up to some finite multiplicity. In other words, the independent subchains form a chart of the space of closedchain conformations of the whole block. The union of all blocks must also cover all state variables.
We illustrate the principle on planar kinematic chains, which require blocks of size at least 4. Assume each cycle contains at least 3 joints. We define a topological ordering by selecting a linear main chain and considering branches off of the main chain. Nonbranching linear blocks, freeendpoint blocks, and sidechains (openended branches) are handled as described above. Each 3joint branch off of a branching block is then considered as part of a dependent subchain (see Figure 5). Branches off the dependent subchain are also added to the block in a recursive manner, leading to a block with a tree topology.
Figure 5. Extending to nonlinear topologies via branching blocks. Several branching structures may be treated as blocks. Independent chains (shaded) must be chosen to parameterize the manifold of configurations satisfying closed chain constraints (open circles).
To sample a branching block, we first sample values for the independent subchain at random and then close the loops for each branch according to their topological order. To ensure unbiased sampling, we must also calculate the metric tensor in (9) for the entire branched block. This in turn requires computing the Jacobian of the chart, which requires computing the Jacobian of the implicit form for the multiple loopclosure constraints (11). Due to the tree structure the Jacobian is sparse, and the matrix inversion in the implicit chart Jacobian (13) can also be computed efficiently. We have implemented this approach on 2D chains with closed rings (see Figure 6), and extending it to 3D chains remains a problem for future work.
Figure 6. Results on a planar multiloop structure. Fluctuations of a 2D chain with a closed ring constrained on the three ends (open circles). Left: initial conformation. The angular prior for each link is modeled as a normal distribution with 20° standard deviation. Right: 20 samples with skip length 100.
Mixing and autocorrelation
In any MCMC method it is important to empirically examine the mixing rate of the Markov Chain. Firstly, it can potentially take many iterations to "forget" the effects of a poor initialization. For protein sampling, this is not a significant problem because we initialize the chain with the native structure in PDB, which is typically quite good.
Secondly, subsequent samples are highly autocorrelated, and many conformations must be skipped to obtain a sequence with low autocorrelation. This is a serious concern because autocorrelation grows stronger as more variables are included in the conformation (see Figure 7). In practice, one must determine the skip length empirically in order to obtain a quasiindependent sampling sequence, which is defined as a sequence with autocorrelation below some given threshold (0.2 is used in our experiments).
Figure 7. Mixing of SLIKMC samples. Sampling conformations of a planar 20link chain, anchored at the endpoints, with a uniform prior. Left: starting from a deliberately bad initial conformation. Middle: the sequence mixes relatively quickly, but the first 40 samples are biased by the initial conformation and autocorrelate strongly. Right: a sequence that takes every 40'th sample does not significantly autocorrelate.
Result and discussion
The SLIKMC algorithm implements a scalable framework for Monte Carlo sampling of kinematic chains. The technique uses a blocked Gibbs sampler that proposes movements of small subchains of conformation angles at once, along with a MetropolisHastings technique that guarantees an unbiased sampling of the loopclosure submanifold for that block. Due to the small block size, each energy function is local and adjustments are fast, ranging from microseconds to milliseconds. The method is mathematically proven to generate a statistically unbiased sample in the large sample limit. It is particularly wellsuited for closed loops (see Figure 8) but can also be advantageous for chains with free endpoints as well (see Figure 9).
Figure 8. Closedchain sampling. Three sampling methods for a 20link closedloop chain. At left, the prior gives preference to joint angles with small magnitude. At right, the prior gives preference to joint positions in a triangle shaped distribution (circle centers: means, shaded circles: 3σ spreads). Top: sampling joint angles followed by numerical loop closure, best 20/20,000 samples. Middle: sampling with RLG [5], best 20/20,000 samples. Bottom: SLIKMC, displayed every 40'th sample. These sample sets are generated by our method approximately as fast as RLG and an order of magnitude faster than numerical loop closure.
Figure 9. Freeendpoint chain sampling with heterogeneous priors. Comparing SLIKMC against a standard MetropolisHastings (MH) sampler on a freeendpoint chain with heterogeneous prior distribution over joint positions (crosses: means, shaded circles: 3σ spreads). For each method, 400 iterations are run and every 20'th sample is retained, taking approximately 10 s time on a standard PC. Left: MH takes steps that are too large and only generates 4 unique samples. Middle: MH with step size reduced by 10 has a higher success rate but slower convergence. Note the lack of variance in the leftmost point. Right: our method.
SLIKMC is implemented as an addon to the software package LoopTK [21][22] for protein loop sampling and is available at http://www.iu.edu/~motion/slikmc/ webcite. All experiments are run on a Intel i7 2.7 GHz computer with 4 GB RAM. The library currently supports sampling with prior information from Ramachandran plots, steric clashes, and Bfactors, and supports integration with the BackboneDependent Rotamer Library for sidechain sampling. Numerical experiments suggest that SLIKMC generates higher quality samples for large loops with lower computational cost than standard Monte Carlo techniques for openended chains and the RAMP loop completion package [23].
Loop sampling with prior distributions
We consider the 10residue closed loop 1AMP181190, which is a representative segment for testing loop reconstruction algorithms [24]. SLIKMC is applied to sample 2000 conformations from a joint probability that includes steric clashes, Ramachandran plots, and Bfactors. The Ramachandran plot (see Figure 10) shows that the distribution of dihedral angles is contained within high probability regions but explores relatively widely, with an average angular deviation of 37°.
Figure 10. Ramachandran plot of SLIKMC samples. Left: the Ramachandran plot of generic residues from a database that includes 500 highresolution proteins [26] used as a prior. Right: the Ramachandran plot for the generic residues in our 10residue test protein (1AMP 181190) generated from 2,000 consecutive samples. Each color represents one residue. (This figure is best viewed in color.)
We compared our method with the discretesearch loop construction software RAMP [8,23,25]. The latest version 0.7b was used in these experiments. We test the methods on loops of 1AMP with different lengths starting from residue 181. RAMP is tested by calling loop closure function with 0.5 Å distance tolerance. SLIKMC samples from perturbed segments using Ramachandran plots as prior with clashfree constraint. Figure 11 plots the average time for both methods to obtain one closed conformation. Due to combinational explosion, the time required for RAMP increases exponentially and is impractical for >6 residues.
Figure 11. Running time comparison between RAMP and SLIKMC. Time required for the discrete search method RAMP and SLIKMC to obtain one sample for loops of varying size. The time required for RAMP increases exponentially while our method runs in approximately constant time.
We also compare SLIKMC with a samplethenselect inverse kinematics method that first samples a set of clashfree, loopclosing conformations and then extracts the top scoring ones. The LoopTK configuration sampling method [22] was used here. Given 300 s cutoff time, LoopTK generates 888 conformations, while our method generates 705. Figure 12 shows how the top 20 samples of LoopTK compare with every 100th sample of our method. The upper figures are generated using the original Bfactors in the PDB file. SLIKMC matches the prior information more closely and obviates the need for postprocessing using numerical optimization. The bottom figures correspond to enlarged Bfactors, which indicate less confidence in the values prescribed by the PDB file. The distribution of SLIKMC samples has approximately three times the variance of the original, which closely matches theoretical predictions. Figure 13 shows the distribution of RMSDs of C_{α }atoms compared to the native structure for both LoopTK and SLIKMC with varying scales in Bfactors. This suggests that SLIKMC better supports variations in the experimenter's relative confidence in heterogeneous sources of information.
Figure 12. Comparing SLIKMC against samplethenselect. Left: samples generated by SLIKMC with a skip length of 100. Right: samples generated by postselecting the top 20 scoring samples generated from the LoopTK IK sampler. Transparent balls depict the 3σ spread of the atom position prior derived from its Bfactor. The top row uses the original Bfactors, while the bottom row enlarges Bfactors by 10.
Figure 13. RMSD distributions from SLIKMC against IK. Histogram of RMSD to the native structure for samples from SLIKMC and the LoopTK sampler on 1AMP 181190. With SLIKMC the use of prior information allows finegrained control over the sampling distribution.
Missing loop completion
We now consider an application to completion of missing loops. Given the starting position and ending position of a missing segment, we first generate an arbitrary loopclosing configuration, then run SLIKMC to perturb it to a highprobability conformation. As a test case, we select a helix structure (residue from 4051) from an APO protein 1B8C. We generate an arbitrary loopclosing configuration by running the LoopTK configuration sampling method [22] to perturb the original conformation. Starting from the highly disordered conformation, SLIKMC is run for 2 minutes using priors including enlarged Bfactors (scaled by a factor of 10) from original chain segment and Ramachandran plots with steric clashfree constraints. SLIKMC generates approximately 300 samples within the time limit, and the closest sample to the original structure is with RMSD 0.2704 calculated from backbone atoms (see Figure 14). In contrast, the LoopTK configuration sampling method did poorly in constructing a favorable missing loop.
Figure 14. Helix recovery. Left: Starting from a highly perturbed conformation, SLIKMC recovers a helix using only clash and Ramachandran plots information. Every 20 samples are drawn. The final displayed conformation has RMSD 0.2704 to the PDB structure. Right: by comparison, an IK technique attains a minimum RMSD of 4.0655 out of 13,000 samples (90 minutes running time).
Scalability tests on freeendpoint chains
To further study scalability, we apply SLIKMC to subchains of chain A in a calciumbinding protein 1B8C. Samples for a 30residue subchain are generated in 1 s (Figure 15) and samples for the entire 108residue 1B8C protein are generated in approximately 4 s (Figure 16).
Figure 15. Samples of a 30residue chain. 12 samples of a 30residue subchain of protein 1B8C selected from the first 300 consecutive samples with skip length 25. Transparent balls depict the 3σ spread of the atom position prior derived from its Bfactor. Atoms with low Bfactors near the end of the chain increase the difficulty for a standard MC method to explore the conformation space.
Figure 16. Samples of a 108residue chain. 17 samples of 1B8C chain A (108 residues) selected from 170 consecutive samples with skip length 10. Each conformation is drawn in a distinct color.
We compare SLIKMC against a standard MetropolisHastings algorithm that samples backbone angles according to a Gaussian proposal distribution with 1° standard deviation. The target distribution for both methods includes steric clashes, Ramachandran plots, and Bfactors. Note that standard MH has probability zero of sampling a conformation that satisfies terminal endpoint constraints exactly, and is not applicable to closed loops. So, these tests ignore the loop closure constraint altogether.
Figure 17 displays the average time needed to obtain one quasiindependent sample over ten 30minute runs for different chain lengths. The skip lengths are determined empirically for each run. This data suggests that our method achieves a cost per quasiindependent sample that is nearly linear to the length of the chain. In contrast, the likelihood that standard MH accepts a sample drops dramatically as the number of residues increases, leading to exponentially growing cost per sample.
Figure 17. Running time comparison between SLIKMC and standard MetropolisHastings. Time required to obtain one quasiindependent sample on openended subchains of 1B8C with a variety of lengths. Standard MH did not generate even one sample for chain lengths above 30 after 30 minutes.
Simultaneous backbone and sidechain sampling
We demonstrate backbone and sidechain sampling using a 15residue helix structure 1AMP 120134. As priors we use backbonedependent rotamer distributions, Ramachandran plot priors, Bfactors for the backbone, and testing selfcollision and collision against the nonloop portion of the chain. Given 20 min cutoff time, 1,623 samples are generated. Figure 18 illustrates that in residue 130 (arginine), the distributions of torsional angles χ_{3 }and χ_{4 }are limited due to steric clashes, while χ_{1 }and χ_{2 }match well with the priors.
Figure 18. Sidechain distribution of residue ARG. Left: Gaussian mixture distribution of sidechain torsion angles for the native structure of residue 130 (arginine) in protein 1AMP. Right: histograms of sidechain angles from samples generated by SLIKMC. The distributions of χ_{1}, χ_{2 }match well with the sampling distributions while the distributions of χ_{3 }and χ_{4 }are limited due to steric clashes.
Conclusion
We propose SLIKMC  a Markov chain Monte Carlo method for sampling closed chains according to specified probability distribution. A probabilistic graphical model (PGM) is proposed to specify the structure preferences. A novel method for sampling subloops is developed to generate statistically unbiased samples of probability densities restricted by loopclosure constraints and mathematical conditions necessary for unbiased sampling is derived. Simulation experiments show that SLIKMC completes large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques.
SLIKMC is demonstrated to be applicable to various tasks such as conformation ensemble generation, missing structure construction. For future work we intend to integrate SLIKMC with more complex energy functions, statistical potentials, and machinelearningbased structural function predictors. Another limitation of the technique is that due to the locality of each block adjustment, largemagnitude global motions may take a huge number of iterations to sample, particularly when the motion must cross lowscoring chasms in conformation space. We intend to investigate annealinglike or random restart techniques for overcoming these difficulties, as well as different block choices that allow the algorithm to take larger steps. Finally, we are interested in extending our method to study simultaneous backbone and sidechain flexibility in proteinligand and proteinprotein binding.
Appendix
This appendix presents a fundamental statement about probability densities under a transformation of variables.
Suppose and are multivariate random variables related by v = f (u), where f is differentiable and injective. Denote the image of as . If g_{u }is a density with support over A, then the corresponding density over M, with respect to the mvolume measure, is
where G(u) is the metric tensor:
More precisely, g_{v }as defined above satisfies:
for any subset U ⊆ A, where dµ is the mvolume element of M.
From change of variables we have:
where X(u) is the mvolume of the parallelotope spanned by the axes of the coordinate chart f centered at u: .
We now use the fact that the mvolume V of the parallelotope spanned by m vectors is given by the determinant:
Note that this can be expressed more compactly as det(A^{T }A) where A is the matrix with v_{1}, ..., v_{m }as its columns. Hence, . Finally, substituting g_{u }in the r.h.s. of (22) gives the desired result.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
YZ implemented the algorithm and conducted the numerical experiments. KH contributed to the study design and developed the mathematical foundations. All authors contributed to drafting the manuscript and approved the final manuscript.
Acknowledgements
The authors thank Predrag Radivojac for valuable discussions that inspired us to start this project and helped clarify our understanding of protein structure and function. This research is partially supported by NSF Grant No. 1218534.
Declarations
The publication costs for this article were funded by Dr. Kris Hauser.
This article has been published as part of BMC Structural Biology Volume 13 Supplement 1, 2013: Selected articles from the Computational Structural Bioinformatics Workshop 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcstructbiol/supplements/13/S1.
References

Fiser A, Do RKG, Šali A: Modeling of loops in protein structures. [http://dx.doi.org/10.1110/ps.9.9.1753] webcite
Protein Science 2000, 9(9):17531773. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

DePristo MA, Wde Bakker PI, Lovell SC, Blundell TL: Ab Initio Construction of Polypeptide Fragments: Efficient Generation of Accurate, Representative Ensembles.
PROTEINS: Structure, Function, and Genetics 2003, 51:4155. Publisher Full Text

Mandell D, Coutsias E, Kortemme T: Subangstrom accuracy in protein loop reconstruction by roboticsinspired conformational sampling.
Nature Methods 2009, 6:551552. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Canutescu A, Dunbrack R Jr: Cyclic coordinate descent: A robotics algorithm for protein loop closure.
Protein Science 2003, 12:963972. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cortés J, Siméon T, RemaudSiméon M, Tran V: Geometric algorithms for the conformational analysis of long protein loops. [http://dx.doi.org/10.1002/jcc.20021] webcite
Journal of Computational Chemistry 2004, 25(7):956967. PubMed Abstract  Publisher Full Text

van den Bedem H, Lotan I, Latombe JC, Deacon A: Realspace proteinmodel completion: an inversekinematics approach.

Tosatto SC, Blindewald E, Hesser J, Männer R: A divide and conquer approach to fast loop modeling.
Protein Engineering 2002, 15(4):279286. PubMed Abstract  Publisher Full Text

Samudrala R, Moult J: An Allatom Distancedependent Conditional Probability Discriminatory Function for Protein Structure Prediction.

Coutsias E, Soek C, Jacobson M, Dill K: A kinematic view of loop closure.
J Computational Chemistry 2004, 25:510528. Publisher Full Text

Shehu A, Clementi C, Kavraki L: Modeling Protein Conformational Ensembles: From Missing Loops to Equilibrium Fluctuations.
Proteins: Structure, Function, and Bioinformatics 2006, 65:164179. Publisher Full Text

Rathore N, de Pablo JJ: Monte Carlo simulation of proteins through a random walk in energy space.

Li Z, Scheraga HA: Monte Carlominimization approach to the multipleminima problem in protein folding. [http://www.pnas.org/content/84/19/6611.abstract] webcite
Proceedings of the National Academy of Sciences 1987, 84(19):66116615. Publisher Full Text

Hansmann UH, Okamoto Y: New Monte Carlo algorithms for protein folding. [http://www.sciencedirect.com/science/article/pii/S0959440X99800256] webcite
Current Opinion in Structural Biology 1999, 9(2):177183. PubMed Abstract  Publisher Full Text

Bouzida D, Kumar S, Swendsen RH: Efficient Monte Carlo methods for the computer simulation of biological molecules. [http://link.aps.org/doi/10.1103/PhysRevA.45.8894] webcite
Phys Rev A 1992, 45(12):88948901. PubMed Abstract  Publisher Full Text

Yanover C, SchuelerFurman O, Weiss Y: Minimizing and Learning Energy Functions for SideChain Prediction.
Journal of Computational Biology 2008, 15(7):899911. PubMed Abstract  Publisher Full Text

Lasker K, Topf M, Sali A, Wolfson HJ: Inferential Optimization for Simultaneous Fitting of Multiple Components into a CryoEM Map of Their Assembly.
Journal of Molecular Biology 2009, 388:180194. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hastings W: Monte Carlo Sampling Methods Using Markov Chains and Their Applications.
Biometrika 1970, 57:97109. Publisher Full Text

Dunbrack R Jr, Karplus M: Backbonedependent Rotamer Library for Proteins. Application to Sidechain Prediction.
J Mol Biol 1987, 193:775791. PubMed Abstract  Publisher Full Text

Ponder J, Richards F: Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes.
J Mol Biol 1993, 230:543574. PubMed Abstract  Publisher Full Text

Shapovalov M, Dunbrack R Jr: A smoothed backbonedependent rotamer library for proteins derived from adaptive kernel density estimates and regressions.
Structure 2011, 19:844858. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dhanik A, Kou C, Marz N, Yao P, Propper R: LoopTK: Protein Loop Kinematic Toolkit. [https://simtk.org/home/looptk] webcite
2007.

Yao P, Dhanik A, Marz N, Propper R, Kou C, Liu G, van den Bedem H, Latombe JC, HalperinLandsberg I, Altman R: Efficient Algorithms to Explore Conformation Spaces of Flexible Protein Loops.
Computational Biology and Bioinformatics, IEEE/ACM Transactions on 2008, 5(4):534545.

Hung LH, Ngan SC, Liu T, Samudrala R: PROTINFO: new algorithms for enhanced protein structure predictions.
Nucleic Acids Research 2005, 33:W77W80. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jamroz M, Kolinski A: Modeling of loops in proteins: a multimethod approach.
BMC Structural Biology 2010, 10:5. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Samudrala R, Levitt M: A comprehensive analysis of 40 blind protein structure predictions.
BMC Structural Biology 2002, 2:3. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Lovell SC, Davis I, Arendall W III, de Bakker P, Word J, Prisant M, Richardson J, Richardson D: Structure validation by Calpha geometry: phi, psi and Cbeta deviation.
Proteins: Structure, Function, and Bioinformatics 2003, 50(Issue 3):437450.