Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Department of Biochemistry, University of Washington, Seattle, WA 98195, USA

Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

We introduce three algorithms for learning generative models of molecular structures from molecular dynamics simulations. The first algorithm learns a Bayesian-optimal undirected probabilistic model over user-specified covariates (e.g., fluctuations, distances, angles, etc.). L1 regularization is used to ensure sparse models and thus reduce the risk of over-fitting the data. The topology of the resulting model reveals important couplings between different parts of the protein, thus aiding in the analysis of molecular motions. The generative nature of the model makes it well-suited to making predictions about the global effects of local structural changes (e.g., the binding of an allosteric regulator). Additionally, the model can be used to sample new conformations. The second algorithm learns a time-varying graphical model where the topology and parameters change smoothly along the trajectory, revealing the conformational sub-states. The last algorithm learns a Markov Chain over undirected graphical models which can be used to study and simulate kinetics. We demonstrate our algorithms on multiple molecular dynamics trajectories.

Introduction

The three dimensional structures of proteins and other molecules vary in time according to the laws of thermodynamics. Each molecule visits an ensemble of states which can be partitioned into distinct conformational sub-states.

Molecular dynamics (MD) simulations are often used to characterize conformational dynamics on the nanosecond (10^{-9} sec.) timescale. Recent advances in hardware and software make it possible to perform simulations on the microsecond (10^{-6} sec.) and millisecond (10^{-3} sec.) time-scales. Such long simulations are especially well-suited to identifying and studying the conformational sub-states relevant to biological function. Unfortunately, the corresponding trajectories are often difficult to analyze and interpret due to their size and complexity. Thus, there is a need for algorithms for analyzing such long-timescale trajectories. The primary goal of this paper is to introduce new algorithms to do so.

Our approach to analyzing MD data is to learn generative models known as Markov Random Fields (MRF). This is the first time MRFs have been used to model MD data. An MRF is an undirected probabilistic graphical model that encodes the joint probability distribution over a set of user-specified variables. In this paper those variables correspond to the positional fluctuations of the atoms, but the technique can be easily extended to other quantities, such as pairwise distances or angles. The generative nature of the model means that new conformations can be sampled and, perhaps more importantly, that users can make structural alterations to one part of the model (e.g., modeling the binding of a ligand) and then perform inference to predict how the rest of the system will respond.

We present three closely related algorithms. The first algorithm learns a single model from the data. Both the topology and the parameters of the model are learned. The topology of the learnt graph reveals which variables are directly coupled and which correlations are indirect. Alternative methods, such as constructing a covariance matrix, cannot distinguish between direct and indirect correlations. Our algorithm is guaranteed to produce an optimal model, and regularization is used to reduce the risk of over-fitting the data. The second algorithm learns a time-varying model where the topology and parameters of the MRF change smoothly over time. Time-varying models reveal the different conformational sub-states visited by the molecule and the features of the energy barriers that separate them. The final algorithm learns a Markov Chain over MRFs which can be used to generate new trajectories and study kinetics.

Background

Molecular dynamics simulation

Molecular Dynamics simulations involve integrating Newton's laws of motion for a set of atoms. Briefly, given initial coordinates and velocities and a model of the inter-atomic forces, the simulation advances the system in discrete time steps. Each time step is small (on the order of 10^{-15} sec), meaning that a 1 microsecond simulation requires one billion integration steps. In most circumstances, every 1000th to 10000th conformation is written to disc, producing an ordered series of frames.

Traditional methods for analyzing MD data either monitor the dynamics of global statistics (e.g., the radius of gyration, total energy, etc.), or else identify sub-states by clustering the frames of the trajectory.

More recently, Lange and Grubmüller introduced full correlation analysis, which uses mutual information to detect both linear and non-linear correlations between atomic motions.

Markov Random Fields

A Markov Random Field (MRF) is defined by an undirected graph over a set of random variables **X **= {X_1, ..., X_n} and a set of potential functions Θ defined over the nodes and edges of the graph; together, these encode the joint probability distribution P(**X**). The topology of the graph determines the set of conditional independencies among the variables: if X_i and X_j are not connected by an edge in the graph, then any correlation between them is indirect. By 'indirect' we mean that the correlation between X_i and X_j (if any) can be explained in terms of a pathway of correlations (e.g., X_i → X_k → ··· → X_j). Conversely, if X_i and X_j are connected by an edge, then the correlation is direct. Our algorithm automatically detects these conditional independencies and learns the sparsest possible model, subject to fitting the data.

Gaussian Graphical Models

A Gaussian Graphical Model (GGM) is an MRF in which the joint distribution over the variables is a multivariate Gaussian. The model is parameterized by a precision matrix, Σ^{-1}, which is an n × n matrix whose non-zero off-diagonal elements reveal the edges in the MRF. The inverse of the precision matrix, denoted by Σ, is the covariance matrix for a multivariate Gaussian distribution with mean μ.
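To make the correspondence between precision-matrix zeros and graph edges concrete, here is a small numpy sketch. The 4-variable chain and its numerical values are invented purely for illustration:

```python
import numpy as np

# Hypothetical 4-variable chain model X1 - X2 - X3 - X4.
# Off-diagonal zeros in the precision matrix encode missing edges
# (conditional independencies).
precision = np.array([
    [ 2.0, -0.8,  0.0,  0.0],
    [-0.8,  2.0, -0.8,  0.0],
    [ 0.0, -0.8,  2.0, -0.8],
    [ 0.0,  0.0, -0.8,  2.0],
])

# The edges of the MRF are the non-zero off-diagonal entries.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if precision[i, j] != 0.0]
print(edges)  # chain edges only: [(0, 1), (1, 2), (2, 3)]

# The covariance Sigma = inverse(precision) is dense: X1 and X4 are
# marginally correlated even though they are conditionally independent.
cov = np.linalg.inv(precision)
```

The dense covariance is exactly why the graph topology, not the covariance, exposes which couplings are direct.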

Gaussian distributions have a number of desirable properties including the availability of analytic expressions for a variety of quantities. For example, the probability of observing a particular configuration **x**:

P(**x**) = (2π)^{-n/2} |Σ|^{-1/2} exp( -(1/2)(**x** - μ)^T Σ^{-1} (**x** - μ) ),   (1)

where |Σ| denotes the determinant of Σ. Analytic expressions also exist for the differential entropy of the model:

H(P) = (1/2) ln( (2πe)^n |Σ| ),   (2)

or the KL-divergence between two different models P_0 = N(μ_0, Σ_0) and P_1 = N(μ_1, Σ_1):

KL(P_0 || P_1) = (1/2) [ tr(Σ_1^{-1} Σ_0) + (μ_1 - μ_0)^T Σ_1^{-1} (μ_1 - μ_0) - n + ln(|Σ_1| / |Σ_0|) ].   (3)

A GGM can also be used to manipulate a subset of the variables and then compute the marginal densities of the remaining variables. For example, let **V **⊂ **X **be an arbitrary subset of the variables and let **W **be the complement set. We can condition the model by setting the variables **V **to some particular value, **v**, and then compute the distribution of **W **given **v**:

μ_{W|v} = μ_W + Σ_{WV} Σ_{VV}^{-1} (**v** - μ_V),   (4)

Σ_{W|v} = Σ_{WW} - Σ_{WV} Σ_{VV}^{-1} Σ_{VW}.   (5)

Here, Σ_{WV} denotes the sub-matrix of Σ whose rows are restricted to **W **and whose columns are restricted to **V**, and μ_V and μ_W denote the corresponding sub-vectors of the mean μ.
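As a concrete sketch, the conditioning operation just described is standard multivariate-Gaussian conditioning. The helper below (our own naming) computes the conditional mean and covariance of the remaining variables given observed values:

```python
import numpy as np

def condition_gaussian(mu, Sigma, v_idx, v_val):
    """Condition N(mu, Sigma) on X[v_idx] = v_val; return the mean and
    covariance of the remaining variables (standard Gaussian conditioning)."""
    n = len(mu)
    w_idx = [i for i in range(n) if i not in v_idx]
    S_vv = Sigma[np.ix_(v_idx, v_idx)]
    S_wv = Sigma[np.ix_(w_idx, v_idx)]
    S_ww = Sigma[np.ix_(w_idx, w_idx)]
    K = S_wv @ np.linalg.inv(S_vv)          # regression coefficients
    mu_w = mu[w_idx] + K @ (v_val - mu[v_idx])
    Sigma_w = S_ww - K @ S_wv.T
    return mu_w, Sigma_w
```

For a 2-variable model with unit variances and covariance 0.5, conditioning X0 = 2 shifts the mean of X1 to 1.0 and shrinks its variance to 0.75.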

Algorithms

We now present three algorithms for learning various kinds of generative models from MD data.

**Input **The input to all three algorithms is a time-series of vectors, D = (**x**_1, ..., **x**_m), where each vector **x**_t encodes the conformation of the molecule at frame t (e.g., the positional fluctuations of selected atoms).

Algorithm 1

**Output **The first algorithm produces a Gaussian Graphical Model consisting of the sample mean μ and a sparse, regularized precision matrix Σ^{-1} (see below). Finally, the non-zero elements of Σ^{-1} define the edges of the graph over the variables.

The algorithm produces the sparsest precision matrix that still fits the data (see below). It also guarantees that Σ^{-1} is positive-definite, which means it can be inverted to produce the regularized covariance matrix (as opposed to the sample covariance, which is trivial to compute). This is important because Eqs 1-3 require the covariance matrix, Σ. We further note that a sparse precision matrix does not imply that the corresponding covariance matrix is sparse, nor does a sparse covariance imply that the corresponding precision matrix is sparse. That is, our algorithm is not equivalent to simply thresholding the sample covariance matrix and then inverting it.
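A two-line numpy check makes the last point concrete: a sparse (tridiagonal) covariance matrix can have a dense inverse, so sparsity in one matrix says nothing about sparsity in the other. The values are invented for illustration:

```python
import numpy as np

# A sparse (tridiagonal) covariance matrix; values chosen only to
# illustrate the point made in the text.
Sigma = np.array([
    [ 2.0, -0.8,  0.0,  0.0],
    [-0.8,  2.0, -0.8,  0.0],
    [ 0.0, -0.8,  2.0, -0.8],
    [ 0.0,  0.0, -0.8,  2.0],
])

# Its inverse (the precision matrix) is dense: even the (0, 3) entry,
# which is exactly zero in Sigma, is far from zero in Sigma^{-1}.
Theta = np.linalg.inv(Sigma)
```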

Learning regularized precision matrices

A straight-forward way of learning a GGM is to find the parameters (μ, Σ^{-1}) that maximize the likelihood of the data. Unfortunately, such maximum-likelihood estimates are prone to over-fitting because a model over n variables has O(n^2) parameters, each of which must be estimated from the data. This is relevant because the number of frames in an MD trajectory is often small relative to the number of parameters.

Our algorithm addresses the problem of over-fitting by maximizing the following objective function:

max_{Σ^{-1} ≻ 0}  log L(D; Σ^{-1}) - λ ||Σ^{-1}||_1.   (6)

Here, log L(D; Σ^{-1}) is the log likelihood of the data and ||Σ^{-1}||_1 is the L1 norm of the precision matrix. The L1 norm is defined as the sum of the absolute values of the matrix elements. It can be interpreted as a measure of the complexity of the model. In particular, each non-zero element of Σ^{-1} corresponds to a parameter in the model and must be estimated from the data. Thus, Eq. 6 establishes a tradeoff between the log likelihood of the data (the first term) and the complexity of the model (the second term). The scalar value λ > 0 controls this tradeoff such that higher values produce sparser precision matrices. This is our algorithm's only parameter, and its value can be computed analytically.

Algorithmically, our algorithm maximizes Eq. 6 in an indirect fashion, by defining and then solving a convex optimization problem. Using the functional form of the multivariate Gaussian (Eq. 1), the log-likelihood of the data given Σ^{-1} can be rewritten as:

log L(D; Σ^{-1}) ∝ log |Σ^{-1}| - (1/m) Σ_{t=1}^{m} (**x**_t - μ)^T Σ^{-1} (**x**_t - μ).

Noting that tr(**ABC**) = tr(**CAB**), the log-likelihood of Σ^{-1} can then be rewritten as:

log L(D; Σ^{-1}) ∝ log |Σ^{-1}| - tr(**S** Σ^{-1}).

Next, using the definition of the sample covariance matrix,

**S** = (1/m) Σ_{t=1}^{m} (**x**_t - μ)(**x**_t - μ)^T,

we can define the matrix Σ^{-1} that maximizes Eq. 6 as the solution to the following optimization problem:

Σ^{-1} = arg max_{X ≻ 0}  log |X| - tr(**S**X) - λ ||X||_1.   (7)

We note that L1 regularization is equivalent to maximizing the likelihood under a Laplace prior, and so the solution to Eq. 7 is a maximum a posteriori (MAP) estimate of the precision matrix. L1 regularization also confers additional desirable properties, including consistency of the estimate.

We now show that the optimization problem defined in Eq. 7 is smooth and convex and can therefore be solved optimally. First, we consider the dual form of the objective. To obtain the dual, we first rewrite the L1-norm as:

λ ||X||_1 = max_{||**U**||_∞ ≤ λ} tr(X**U**),

where ||**U**||_∞ denotes the maximum absolute value element of the matrix **U**. Given this change of formulation, the primal form of the optimization problem can be rewritten as:

Σ^{-1} = arg max_{X ≻ 0}  min_{||**U**||_∞ ≤ λ}  log |X| - tr(X(**S** + **U**)).   (8)

That is, the optimal Σ^{-1} is the one that maximizes the worst-case log likelihood over all additive perturbations of the sample covariance matrix.

Next, we exchange the max and the min (justified by strong duality) and solve the inner maximization over X analytically, leaving:

**U*** = arg min_{||**U**||_∞ ≤ λ}  - log |**S** + **U**| - n,

such that Σ^{-1} = (**S **+ **U***)^{-1}.

After one last change of variables, **W **= **S **+ **U**, the dual form of Eq. 7 can now be defined as:

Σ = arg max_{**W**} { log |**W**| : ||**W** - **S**||_∞ ≤ λ }.   (9)

Eq. 9 is smooth and convex, and for small problems it can be solved with standard interior-point methods. For larger problems, we use the Block Coordinate Descent procedure described next.

Block Coordinate Descent

Given a matrix **A**, let **A**_{\j\j} denote the matrix produced by removing row j and column j of **A**, and let **A**_j denote column j of **A **with the diagonal element **A**_{jj} removed. The Block Coordinate Descent algorithm optimizes one column (and the corresponding row) of **W **at a time. The algorithm iteratively optimizes all columns until a convergence criterion is met. The iterates **W **produced in each sweep are strictly positive definite, and so the regularized covariance matrix Σ = **W **can always be inverted.

**Algorithm 1 **Block Coordinate Descent

**Require**: Tolerance parameter ε, sample covariance **S**, and regularization parameter λ.

Initialize **W**^{(0)}:= **S **+ λ**I **where **I **is the identity matrix.

**repeat**

**for **j = 1, ..., n **do**

{**W**^{(j-1)} denotes the current iterate.}

Set **W**^{(j)} to **W**^{(j-1)} with column **W**_j replaced by y* = arg min_y { y^T (**W**^{(j-1)}_{\j\j})^{-1} y : ||y - **S**_j||_∞ ≤ λ }, and with row j updated symmetrically.

**end for**

Set **W**^{(0) }= **W**^{(n)}

**until **tr((**W**^{(0)})^{-1}**S**) - n + λ||(**W**^{(0)})^{-1}||_1 ≤ ε.

**return W**^{(0)}

The time complexity of this algorithm is O(n^{4.5}/ε), where ε is the desired accuracy of the solution.

In summary, the algorithm produces a time-averaged model of the data by computing the sample mean and then constructing the optimal regularized Σ by solving Eq. 9 using Block Coordinate Descent. The regularized covariance matrix Σ is guaranteed to be invertible, which means we can always compute the precision matrix, Σ^{-1}. The precision matrix can be interpreted as a graph over the variables, revealing which correlations are direct and which are indirect.
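The procedure above can be sketched as a short, self-contained program. This is a minimal numpy/scipy illustration of the dual block-coordinate scheme, not the authors' Matlab implementation: a generic box-constrained solver (L-BFGS-B) stands in for a specialized QP solver in each column update, and the function name is our own:

```python
import numpy as np
from scipy.optimize import minimize

def block_coordinate_descent(S, lam, max_sweeps=50, tol=1e-5):
    """Approximately solve max { log det W : ||W - S||_inf <= lam },
    one column/row of W at a time (sketch of the dual block scheme)."""
    n = S.shape[0]
    W = S + lam * np.eye(n)                  # feasible, positive-definite start
    for _ in range(max_sweeps):
        W_prev = W.copy()
        for j in range(n):
            idx = [i for i in range(n) if i != j]
            A_inv = np.linalg.inv(W[np.ix_(idx, idx)])   # (W_{\j\j})^{-1}
            s_j = S[idx, j]
            res = minimize(
                lambda y: y @ A_inv @ y,                 # minimize y' A^{-1} y
                x0=np.clip(W[idx, j], s_j - lam, s_j + lam),
                jac=lambda y: 2.0 * (A_inv @ y),
                bounds=list(zip(s_j - lam, s_j + lam)),  # ||y - S_j||_inf <= lam
                method="L-BFGS-B",
            )
            W[idx, j] = res.x                # update column j ...
            W[j, idx] = res.x                # ... and row j symmetrically
        if np.max(np.abs(W - W_prev)) < tol:
            break
    return W                                 # the regularized covariance Sigma
```

The estimated precision matrix is then `np.linalg.inv(W)`; entries the L1 penalty drives toward zero appear as numerically small values in this sketch.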

Algorithm 2

The second algorithm is a straight-forward extension of the first. Instead of producing a single time-averaged model, it produces a time-varying model:

Let **D**^{(τ)} ⊆ **D **denote the subset of frames in the MD trajectory that correspond to the τth window, 1 ≤ τ ≤ T. A smoothed sample covariance matrix is computed for each window:

**S**(τ) = Σ_k w_k **S**_k / Σ_k w_k,

where **S**_k is the sample covariance matrix of window k and the weights w_k are defined by a nonnegative kernel function. The choice of kernel function is specified by the user. In our experiments the kernel mixed the current window and the previous window, with the current window having twice the weight of the previous one. The time-varying model is then constructed by solving Eq. 9 separately for each **S**(τ).

Algorithm 3

The final algorithm builds on the second. Recall that the second algorithm learns T models, one per window. We compute the symmetric KL-divergence, KL_sym(P, Q) = (1/2)[KL(P || Q) + KL(Q || P)], between each pair of models and then cluster the models so that each cluster corresponds to a (possibly recurring) conformational sub-state.

Let c_τ denote the cluster assignment of the τth model. A Markov chain over the clusters is estimated by counting the transitions between consecutive assignments, c_τ and c_{τ+1}, and normalizing each row of the resulting count matrix. The chain can then be used to generate new trajectories over the sub-states and to study kinetics.

Experiments

We applied our algorithms to several molecular dynamics simulation trajectories. In this section, we illustrate some of the results obtained through this analysis. The algorithms were implemented in Matlab and run on a dual-core Intel T9600 processor running at 2.8 GHz. The wall-clock runtimes for all the experiments were on the order of seconds to about 10 minutes, depending on the size of the data set and the parameter settings.

Algorithm 1: application to the early events of HIV entry

We applied the first algorithm to simulations of a complex between gp120 and CD4 (Figure), in both drug-free and drug-bound (Ibalizumab) forms.

(Left) gp120 (blue) bound to CD4 (green)

(Left) gp120 (blue) bound to CD4 (green). (Right) The same complex bound to Ibalizumab (yellow and purple), a monoclonal antibody HIV entry inhibitor. Notice that Ibalizumab does not bind to gp120.

Ibalizumab's mechanism of action is poorly understood. As can be seen in the figure, Ibalizumab binds to CD4 but does not contact gp120 directly, suggesting that its effect on viral entry is indirect.

Correlation networks

The figure below compares the correlation networks learned by Algorithm 1 from the drug-free and drug-bound simulations.

gp120-CD4 correlation networks learned with Algorithm 1

**gp120-CD4 correlation networks learned with Algorithm 1.** (Left) Edges learned by the algorithm for the drug-free simulation. (Right) Edges learned by the algorithm for the drug-bound simulation.

The probabilistic nature of the model means that it is possible to compute the likelihood of each data set under both models. The table below lists these log-likelihoods.

Log-likelihood

Data       log P(Data | Unbound Model)   log P(Data | Drug-Bound Model)
Unbound    -0.03                         -0.19
Bound      -0.04                         -0.29

The figure below shows the network learned from the simulation of the full gp120-CD4-Ibalizumab complex.

gp120-CD4-Ibalizumab correlation networks learned with Algorithm 1

**gp120-CD4-Ibalizumab correlation networks learned with Algorithm 1.** Edges learned by the algorithm for the drug-bound simulation. Here, all three molecules are shown.

Comparison to sub-optimal models

Our method is guaranteed to return an optimal model. Here we compare the models returned by our algorithm to those obtained by a reasonable, but nevertheless sub-optimal, algorithm for generating sparse networks. For comparison, we inverted the sample covariance matrix and then thresholded the resulting precision matrix so that it contained the same number of edges as the model learned by our algorithm. The L1 penalty of the thresholded model is much larger in each case (0.86 vs 15.1 for unbound; 0.75 vs 12.9 for bound). The difference in L1 penalties is due to the radically different choices of edges each method makes. Only 41% (resp. 31%) of the unbound (resp. bound) edges match the ones identified by our algorithm. Moreover, the thresholded sample precision matrices (Figure) lack the kind of structure seen in the models produced by our algorithm.

Thresholded precision matrix models

**Thresholded precision matrix models.** (Left) Edges produced by thresholding the inverse of the sample covariance matrix for the drug-free simulation. (Right) Edges produced by thresholding the inverse of the sample covariance matrix for the drug-bound simulation. Notice that the edges lack the kind of structure seen in the figures above.

Perturbation analysis

Next, we demonstrate the use of inference to quantify the sensitivity of gp120 to structural perturbations in the drug. We conditioned the model learned from the trajectory containing gp120, CD4, and Ibalizumab on the structure of the drug and then performed inference (Eq. 4) to compute the most likely configuration of the remaining variables (i.e., those corresponding to gp120 and CD4). This was repeated for each frame in the trajectory. The residues with the highest average displacement are illustrated as red spheres in the figure below.

Sensitivity to perturbations

**Sensitivity to perturbations.** The red spheres mark the residues that are most sensitive to perturbations in the drug.

Algorithm 2: application to a 1 microsecond simulation of the engrailed homeodomain

We applied the second algorithm to a simulation of the engrailed homeodomain, a small, fast-folding protein (shown in the figure below).

Engrailed homeodomain

**Engrailed homeodomain.**

We performed three 50-microsecond simulations of the protein at 300, 330, and 350 Kelvin. These simulations were performed on Anton, a special-purpose supercomputer for molecular dynamics simulations.

The figure below plots the differential entropy of each of the 500 models learned from the engrailed trajectory, along with the correlation networks of the lowest- and highest-entropy models.

(A) Differential entropy of the 500 models learned from the engrailed trajectory

(A) Differential entropy of the 500 models learned from the engrailed trajectory. (B) Correlation network of the model with the smallest differential entropy (model 42). (C) Correlation network of the model with the largest differential entropy (model 342).

Figure

(A) Average log-likelihood of the frames from the

(A) Average log-likelihood of the frames from the

The figure below shows the KL-divergence between sequential models and between all pairs of models.

(A) KL-divergence between sequential models

(A) KL-divergence between sequential models. (B) Pairwise KL-divergences between models.

Algorithm 3: application to a 1 microsecond simulation of the engrailed homeodomain

Using the 500 models learned in the previous section, we computed the symmetric KL-divergence between all pairs of models. Recall that the KL-divergence (Eq. 3) is a measure of the difference between two distributions; the resulting pairwise divergences are shown in the figure above.
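The closed-form KL-divergence between two Gaussian models, and its symmetrized version, can be computed directly from the means and covariances (a standard formula; the function names are ours):

```python
import numpy as np

def kl_gaussian(mu0, S0, mu1, S1):
    """KL(N0 || N1) for multivariate Gaussians, in closed form."""
    n = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d
                  - n + logdet1 - logdet0)

def kl_symmetric(mu0, S0, mu1, S1):
    """Symmetrized KL-divergence: the average of the two directions."""
    return 0.5 * (kl_gaussian(mu0, S0, mu1, S1)
                  + kl_gaussian(mu1, S1, mu0, S0))
```

Applying `kl_symmetric` to every pair of window models yields the symmetric divergence matrix used for clustering below.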

We then applied complete linkage clustering to the KL-divergence matrix. Complete linkage clustering minimizes the maximum distance between elements when merging clusters. We selected a total of 7 clusters based on the assumption that the number of sub-states visited over the course of the trajectory is small.
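Complete-linkage clustering of models from a precomputed divergence matrix can be sketched with scipy's hierarchical-clustering routines (assuming the symmetric KL matrix has a zero diagonal; the function name is ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_models(D, n_clusters=7):
    """Complete-linkage clustering of models from a symmetric divergence
    matrix D (zero diagonal). Returns one cluster label per model."""
    condensed = squareform(D, checks=False)   # upper triangle, flattened
    Z = linkage(condensed, method="complete") # merge by max inter-cluster dist
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Note that the symmetric KL-divergence is not a true metric (it violates the triangle inequality), but hierarchical clustering only requires a dissimilarity matrix.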

Representative structures for states 4 (green) and 6 (magenta)

**Representative structures for states 4 (green) and 6 (magenta).**

Finally, we estimated the parameters of a Markov chain over the 7 clusters by counting the number of times a model from the ith cluster was followed by a model from the jth cluster, and then normalizing each row of the count matrix.
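Estimating the Markov chain from the sequence of cluster labels reduces to counting consecutive-label transitions and row-normalizing; a minimal sketch (our own function name):

```python
import numpy as np

def transition_matrix(labels, n_states):
    """Row-normalized count matrix: entry (i, j) estimates the probability
    that a model in state i is immediately followed by a model in state j."""
    C = np.zeros((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):
        C[a, b] += 1.0
    rows = C.sum(axis=1, keepdims=True)
    # Leave all-zero rows (never-visited states) as zeros.
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)
```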

State-transition matrix

**State-transition matrix.** The color indicates the log of the number of times state i was followed by state j.

Discussion

Many existing techniques for analyzing MD data are closely related to, or are direct applications of, Principal Components Analysis (PCA). Quasi-Harmonic Analysis (QHA), for example, applies PCA to the covariance matrix of the atomic fluctuations.

PCA-based methods generally project the data onto a low-dimensional subspace spanned by the eigenvectors corresponding to the largest eigenvalues. This is done to simplify the data and because lower dimensional models tend to be more robust (i.e., less likely to over-fit the data). Our methods, in contrast, use regularization when estimating the parameters of the model to achieve the same goals.

The eigenvectors produced by PCA-based methods contain useful information about how different regions of the system move in a coordinated fashion. In particular, the components of each vector quantify the degree of coupling between the covariates in that mode. However, the eigenvectors make no distinction between direct and indirect couplings. Moreover, eigenvectors are an inherently global description of dynamics. Our methods, in contrast, do not perform a change of basis and instead model the data in terms of a network of correlations. The resulting model, therefore, reveals which correlations are direct and which are indirect. Pathways in these networks may provide mechanistic insights into important phenomena, such as allosteric regulation. Our models can also be used to investigate motions that are localized to specific regions of the system.

Finally, we note that because our first algorithm produces a regularized estimate of the true covariance matrix, Σ, it could potentially be used as a pre-processing step for PCA-based methods, which normally take as input the sample covariance matrix.
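As a sketch of that pre-processing idea, PCA modes can be extracted directly from a (regularized) covariance matrix by eigendecomposition; the helper below is illustrative, not part of the original method:

```python
import numpy as np

def pca_modes(Sigma, k=2):
    """Top-k principal modes of a (regularized) covariance matrix:
    the eigenvectors with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]        # indices of the k largest
    return vals[order], vecs[:, order]
```

Feeding the regularized Σ from Algorithm 1 into such a routine, in place of the sample covariance, is the substitution suggested in the text.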

Conclusions and future work

We have introduced three novel methods for analyzing Molecular Dynamics simulation data. Our algorithms learn regularized graphical models of the data which can then be used to: (i) investigate the networks of correlations in the data; (ii) sample novel configurations; or (iii) perform inference to predict the effects of structural perturbations.

There are a number of important areas for future research. Gaussian Graphical Models have a number of limitations, most notably that they encode uni-modal distributions and are thus best suited to modeling harmonic motions. Boltzmann distributions, in contrast, are usually multi-modal. Our third algorithm partially addresses this problem by creating a Markov chain over GGMs, but the motions within each state are still harmonic. Discrete distributions could be used to model anharmonic motions (e.g., by adapting existing structure-learning algorithms for discrete graphical models).

List of abbreviations used

GGM: Gaussian Graphical Model; KL: Kullback Leibler; MAP: maximum a posteriori; MD: Molecular dynamics; MRF: Markov Random Field; MSM: Markov State Model; PCA: Principal Components Analysis; QHA: Quasi-Harmonic Analysis.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All three authors contributed to the creation and implementation of the algorithms and writing the manuscript. N.S.R. and C.J.L. performed the experiments and analysis.

Acknowledgements

This work is supported in part by US NSF grant IIS-0905193. Use of the Anton machine was provided through an allocation from National Resource for Biomedical Supercomputing at the Pittsburgh Supercomputing Center via US NIH RC2GM093307.

This article has been published as part of