Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA

Department of Chemistry, University of Washington, Seattle, WA 98195, USA

Department of Chemistry, Stanford University, Stanford, CA 94305, USA

Department of Mathematics, Stanford University, Stanford, CA 94305, USA

Department of Computer Science, Stanford University, Stanford, CA 94305, USA

Abstract

Background

Markov state models have been widely used to study conformational changes of biological macromolecules. These models are built from short timescale simulations and then propagated to extract long timescale dynamics. However, the solvent information in molecular simulations is often ignored in current methods, because of the large number of solvent molecules in a system and the indistinguishability of solvent molecules upon their exchange.

Methods

We present a solvent signature that compactly summarizes the solvent distribution in the high-dimensional data, and then define a distance metric between different configurations using this signature. We next incorporate the solvent information into the construction of Markov state models and present a fast geometric clustering algorithm which combines both the solute-based and solvent-based distances.

Results

We have tested our method on several different molecular dynamical systems, including alanine dipeptide, carbon nanotube, and benzene rings. With the new solvent-based signatures, we are able to identify different solvent distributions near the solute. Furthermore, when the solute has a concave shape, we can also capture the water number inside the solute structure. Finally, we have compared the performance of different Markov state models. The experimental results show that our approach improves on existing methods in both computational running time and metastability.

Conclusions

In this paper we have initiated a study to build Markov state models for molecular dynamical systems with solvent degrees of freedom. The methods we described should also be broadly applicable to a wide range of biomolecular simulation analyses.

Background

The simulation of biological processes at the molecular scale has the potential to give insight into a wide range of properties and phenomena that are important to science, engineering, and medicine, with protein folding (or misfolding) being perhaps the most famous example.

There is an increasing need to mine such massive data sets in order to gain insight into the fundamental phenomena under study. From these data sets, the goal is to understand at some more macroscopic level the structure of the paths taken during the simulation. The key challenge facing dynamical simulation on the molecular scale is to overcome the gap between the timescales where interesting biologically relevant conformational changes occur (typically microseconds or even longer) and those we can simulate at atomic resolution (typically nanoseconds). The length of atomic simulations is limited by the need to take small time steps, whose size is determined by the high-frequency motions.

Markov state models

To meet such a challenge, a lot of recent effort has been devoted to constructing stochastic kinetic models, often in the form of

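In an MSM, the time evolution of a vector representing the population of each state can be calculated as $P(n\tau) = [T(\tau)]^{n} P(0)$, where $P(n\tau)$ is the vector of state populations at time $n\tau$, $T(\tau)$ is the transition probability matrix, and $\tau$ is the lag time.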
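The propagation of state populations can be illustrated with a short sketch. This is only a minimal example under stated assumptions: the function name `propagate`, the two-state matrix, and the column-stochastic convention (each column of the transition matrix sums to 1, so populations are updated by a matrix-vector product) are chosen for illustration and are not taken from this work.

```python
def propagate(T, p0, n):
    """Return the state populations after n lag times.

    Assumption: T is column-stochastic, i.e. T[i][j] is the probability of
    moving from state j to state i within one lag time, so the update is
    p <- T p.
    """
    p = list(p0)
    for _ in range(n):
        p = [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]
    return p


# Hypothetical two-state system (values chosen only for illustration).
T = [[0.9, 0.2],
     [0.1, 0.8]]
p0 = [1.0, 0.0]

p_one = propagate(T, p0, 1)     # populations after one lag time
p_long = propagate(T, p0, 200)  # populations after many lag times
```

After many lag times the populations approach the stationary distribution of the matrix, which is how long timescale behavior is extracted from a model built on short simulations.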

To build such dynamical models, it is necessary to map out the dominant long-lived, kinetically metastable states and then determine the rates for transitioning between these states. A few different approaches have been developed to generate good state decompositions. If the low-dimensional manifold containing all the slow degrees of freedom is known a priori, then the configuration space can be partitioned into free energy basins to define these metastable states, such as by examination of the potential of mean force

In

Solvent degrees of freedom

Since the dynamics of biological macromolecules are usually coupled with the surrounding solvent, many molecular simulations involve both a solute and a solvent (typically water). Some previous works have shown the necessity of accounting for the solvent structure to accurately characterize the dynamics and free energy landscape of biological macromolecular systems, such as the RNA hairpin-loop motif

Although people have recognized that solvent coordinates may be critical in some phenomena

In this paper, we propose to generalize the current methods to include the solvent degrees of freedom. We first present a new distance metric which encodes the solvent information in molecular configurations, and then incorporate it into the construction of MSMs. Finally we apply our method to several biological model systems and assess its performance.

Methods

Many of the dynamical systems which occur in biochemistry take place in very high dimensional spaces. Our main goal is to develop techniques to obtain the simplest kind of qualitative information about high-dimensional molecular dynamical systems. Perhaps the most significant piece of information one has about the data set is the distance metric which specifies the distances between pairs of points (molecular configurations). For macromolecules, a commonly used metric for estimating the distance between two molecules is the

Distance functions

In molecular simulations, a system consists of both a solute (macromolecule) and a solvent (water). Suppose the solute structure contains $m$ atoms $x_{1}, x_{2}, \ldots, x_{m}$ and the solvent contains $n$ molecules $y_{1}, y_{2}, \ldots, y_{n}$.

We first point out two properties when comparing different configurations {

•

• {_{1}, y_{2}_{n}

To address the indistinguishability of solvent molecules upon their exchange, one may consider methods that compute the optimal matching between the solvent molecules, such as minimum cost flow. However, such methods take $O(n^{3})$ time for $n$ solvent molecules, which is slow for systems with thousands of solvent molecules. The computational cost can be reduced if we only focus on solvent molecules around the solute, such as its

We present a new distance function that measures the geometric similarity between different configurations. The idea is to compute some signatures/descriptors

1.

2. _{1}, _{2}, ..., _{n}

3. _{i}

To meet these properties, we define the signature _{i}_{i}_{1}, _{2}, _{m}, Y

Intuitively, the signature vector
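To make the idea of a solvent signature concrete, the following is a minimal sketch rather than the exact definition used in this work. It assumes a Gaussian kernel (the signature is described as Gaussian-based in the Conclusions), a hypothetical bandwidth parameter `sigma`, and a Euclidean distance between signature vectors; the function names are illustrative only. Each solute atom receives one value that sums the Gaussian-weighted contributions of all solvent molecules, so exchanging solvent molecules leaves the signature unchanged.

```python
import math


def solvent_signature(solute, solvent, sigma=1.0):
    """Return one signature value per solute atom.

    Assumption: each value is a sum of Gaussian kernels over the distances
    from the solute atom to every solvent molecule. sigma is a hypothetical
    bandwidth, not a value taken from the paper.
    """
    signature = []
    for x in solute:
        total = 0.0
        for y in solvent:
            d2 = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-d2 / (2.0 * sigma ** 2))
        signature.append(total)
    return signature


def signature_distance(sig_a, sig_b):
    """Euclidean distance between two signature vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


# Hypothetical coordinates: two solute atoms and three water positions.
solute = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
waters = [(0.0, 0.0, 1.0), (3.0, 0.0, 0.0), (10.0, 10.0, 10.0)]

sig = solvent_signature(solute, waters)
sig_permuted = solvent_signature(solute, waters[::-1])
```

Because the signature has one entry per solute atom, its length does not grow with the number of solvent molecules, and distant molecules contribute almost nothing to the sum.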

Constructing Markov state models

In this section, we integrate the solvent information into the construction of MSMs. We will follow and extend the methods described in

Splitting

Modern computer simulations can easily generate data sets with millions of configurations, making analysis of these massive data sets computationally challenging. An important method for shrinking the data sets is to apply a clustering algorithm to obtain a family of clusters (microstates) of much smaller size than the original data set. Here each cluster should be small enough to ensure that the intra-state transitions between configurations in the same cluster are fast.

In the split step, all $N$ configurations ($N \sim 10^{4}$ to $10^{7}$) are grouped into $K$ microstates ($K \sim 10^{2}$ to $10^{4}$) based on their structural similarity. Due to the large size of the data set, it is more practical to apply a fast geometric clustering algorithm, such as the

Suppose we want to build a model with

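More generally, we can generate $K_{1}$ solute clusters and $K_{2}$ solvent clusters (with $K_{1} K_{2} \geq K$, where $K$ is the desired number of microstates). The solute-based model is a special case where $K_{2} = 1$, and the solvent-based model is a special case where $K_{1} = 1$. Note that in this case, the running time for geometric clustering becomes $O((K_{1} + K_{2})N)$, where $N$ is the number of configurations.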
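One way to picture the combined split step is sketched below. This is an illustration under assumptions, not the implementation used in this work: farthest-first k-centers is used as one example of a fast geometric clustering algorithm, and the combined microstate is taken to be the pair of solute and solvent cluster labels. The names `k_centers`, `combined_microstates`, `k1`, and `k2` are hypothetical. Each clustering pass scans all configurations once per center, so two independent runs cost a number of distance evaluations proportional to the sum of the two cluster counts times the number of configurations.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_centers(points, k, dist):
    """Farthest-first k-centers clustering; returns one label per point."""
    nearest = [dist(p, points[0]) for p in points]  # first point is center 0
    labels = [0] * len(points)
    for c in range(1, k):
        # The next center is the point farthest from all current centers.
        far = max(range(len(points)), key=lambda i: nearest[i])
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                labels[i] = c
    return labels


def combined_microstates(solute_feats, solvent_feats, k1, k2, dist=euclidean):
    """Pair solute and solvent cluster labels into one microstate index."""
    solute_labels = k_centers(solute_feats, k1, dist)
    solvent_labels = k_centers(solvent_feats, k2, dist)
    return [a * k2 + b for a, b in zip(solute_labels, solvent_labels)]


# Hypothetical one-dimensional features for four configurations.
solute_feats = [(0.0,), (0.1,), (5.0,), (5.1,)]
solvent_feats = [(0.0,), (3.0,), (0.1,), (3.1,)]
micro = combined_microstates(solute_feats, solvent_feats, k1=2, k2=2)
```

In this toy example the solute features separate the first two configurations from the last two, the solvent features separate even from odd positions, and the paired labels give each configuration its own microstate.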

Lumping

Because the clustering algorithms do not produce clusters of any particular uniform shape or size, we have lost the original metric information after the split step. What one retains, however, is the computation of probabilities for transitioning from one microstate to another. This means that we retain a coarse version of the dynamics. In the next step, these microstates are lumped into macrostates based on their kinetic transitions in the trajectories. Since this step does not consider solute/solvent information about configurations, we simply follow the same approach described in
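The coarse dynamics retained after the split step can be sketched as a transition count between microstates. The following is a minimal example under assumptions: the function name `transition_matrix`, the lag parameter (counted in frames, at least 1), and the row-normalized convention are illustrative choices and not taken from this work.

```python
def transition_matrix(labels, n_states, lag=1):
    """Estimate microstate transition counts and probabilities.

    labels is the sequence of microstate indices along one trajectory.
    counts[i][j] is the number of observed transitions from state i to
    state j after `lag` frames (lag must be >= 1). T is row-normalized,
    so T[i][j] estimates the probability of going from i to j; it is the
    transpose of a column-stochastic matrix.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(labels[:-lag], labels[lag:]):
        counts[a][b] += 1
    T = []
    for row in counts:
        total = sum(row)
        T.append([c / total if total else 0.0 for c in row])
    return counts, T


# Hypothetical trajectory over two microstates.
traj = [0, 0, 1, 1, 0]
counts, T = transition_matrix(traj, n_states=2, lag=1)
```

Only the sequence of microstate labels is needed at this point, which is why the lump step can proceed without the original solute or solvent coordinates.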

In the lump step, the ^{2}) so as to maximize the

In the original approach, a simulated annealing algorithm

Results and discussion

The method we described here would be generally applicable to a wide range of biomolecular simulation analyses. In this section, we pick several examples and test the performance of our method in these different models.

Solvent-based clusters

We first apply our method to a small alanine dipeptide system, which has been used as an example in the MSMBuilder

In this model, the solute structure Ace-Ala-Nme consists of 22 atoms and the solvent contains 885 H$_{2}$O molecules. For each configuration, we extract 10 solute atoms $\{x_{1}, x_{2}, \ldots, x_{10}\}$ consisting of all heavy atoms on the backbone chain (see the Alanine dipeptide figure) and 885 points $\{y_{1}, y_{2}, \ldots, y_{885}\}$ representing the water molecules. We next reduce the dimensionality of this point set {

**Alanine dipeptide**. (a) Solute structure. (b) Top 3 PCA directions for signature

Intuitively, the signature vector _{i}, Y

In protein backbone geometry, it is known that the torsion angles

In the above alanine dipeptide example, the solute structure is small and may in some sense be considered as a convex object, because the water molecules rarely enter the region inside the solute structure. We next turn to another example of carbon nanotube in water, whose solute atoms form a very concave structure. Because this model simulates water molecules going in and out of a carbon nanotube, it is a good test of whether the solvent distribution inside the solute structure can be captured by our method.

We have a 10-nanosecond trajectory of a carbon nanotube in water, with a frame rate of 1 picosecond. The solute _{2}O. In _{i}_{i}_{V}K_{i}_{i }
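As a simple illustration of a water number inside the solute, the sketch below counts the water positions that fall within a fixed region. It is an assumption-laden example and not the computation used in this work: the region is modeled as a cylinder whose axis is parallel to the z axis, and the function name, center, radius, and half-length are all hypothetical.

```python
def water_number_in_cylinder(waters, center, radius, half_length):
    """Count water positions inside a cylinder parallel to the z axis.

    Assumption: the interior of the nanotube is approximated by a cylinder
    with the given center, radius, and half-length (all hypothetical).
    """
    cx, cy, cz = center
    count = 0
    for x, y, z in waters:
        inside_radially = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
        inside_axially = abs(z - cz) <= half_length
        if inside_radially and inside_axially:
            count += 1
    return count


# Hypothetical water positions (arbitrary length units).
waters = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.5), (1.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
n_inside = water_number_in_cylinder(waters, center=(0.0, 0.0, 0.0),
                                    radius=0.3, half_length=1.0)
```

Such a count requires choosing the region in advance, which is only possible when the relevant part of the solute is already known.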

**Carbon nanotube**. (a) Empty and full metastable states. (b) Water number inside the nanotube. (c) Pairwise distance matrix between solvent signatures (

However, the above computation of water number relies on the fact that the system dynamics depends on the distribution of water molecules inside the nanotube. In general, we have no prior knowledge about how to choose a proper region _{1}, _{2}, _{m}, Y

Figure

Comparing different models

We have defined three types of models in the construction of MSMs: (1) a solute-based model using RMSD distances, (2) a solvent-based model using solvent signatures, and (3) a combination model integrating both of them. In this section, we compare the performance of these different models. In particular, we use the metastability as a measure, which is also the objective function that we optimized in building MSMs.
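For reference, the sketch below computes metastability under the definition commonly used for MSMs, namely the sum over macrostates of the probability of remaining in the same macrostate after one lag time. Whether this matches the exact definition used here is an assumption, and the function name, the count matrix, and the lumping map are hypothetical.

```python
def metastability(micro_counts, lumping, n_macro):
    """Sum of macrostate self-transition probabilities.

    micro_counts[i][j] is the number of transitions from microstate i to
    microstate j, and lumping[i] is the macrostate assigned to microstate i.
    Assumption: metastability is the trace of the macrostate transition
    matrix normalized by the starting state.
    """
    macro = [[0.0] * n_macro for _ in range(n_macro)]
    for i, row in enumerate(micro_counts):
        for j, c in enumerate(row):
            macro[lumping[i]][lumping[j]] += c
    q = 0.0
    for a in range(n_macro):
        total = sum(macro[a])
        if total > 0:
            q += macro[a][a] / total
    return q


# Hypothetical counts for three microstates lumped into two macrostates.
counts = [[90, 10, 0],
          [10, 85, 5],
          [0, 5, 95]]
q = metastability(counts, lumping=[0, 0, 1], n_macro=2)
```

Under this definition the value ranges from 0 to the number of macrostates, and larger values mean that trajectories tend to stay within a macrostate for longer.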

Figure

**Metastability of MSMs**. (a) Alanine dipeptide. (b) Benzene rings.

For splitting, the

For lumping, we first split all configurations into

We have also verified this result on another data set for the collapse of benzene rings (see Figure _{2}O. The system is simulated for 100 nanoseconds, with a frame rate of 2 picoseconds. The experiment results for this benzene rings model are shown in Figure

**Snapshots of different configurations in the benzene rings system**.

Conclusions

In this paper we have initiated a study to build Markov state models for molecular dynamical systems with solvent degrees of freedom. We have introduced a Gaussian-based signature to compactly represent the solvent distribution in the configuration space, and incorporated this information into the construction of MSMs to identify metastable solvent clusters. We have also tested our method on several different biological data sets and found that our approach improves on existing methods in both computational running time and metastability. We believe that the methods we described would be more generally applicable to a wide range of biomolecular simulations.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CG, HWC and LM executed this study and wrote the draft of this manuscript. VSP, GEC and LJG supervised this project.

Declarations

The publication costs for this article were funded by NSF grant DMS 0900700.

This article has been published as part of

Acknowledgements

We would like to thank Xuhui Huang for providing us with the simulation data for our experiments. We also wish to acknowledge the support of NSF grants DMS 0900700, DMS 0905823, IIS 0914833 and CCF 1011228, Air Force Office of Scientific Research grants FA9550-09-0-1-0531 and FA9550-09-1-0643, Office of Naval Research grant N00014-08-1-0931, as well as a research award from Google, Inc.