Department of Computer Science, San Francisco State University, San Francisco, CA 94132, USA

Abstract

Background

Discerning the similarity between molecules is a challenging problem in drug discovery as well as in molecular biology. The importance of this problem is due to the fact that the biochemical characteristics of a molecule are closely related to its structure. Therefore molecular similarity is a key notion in investigations targeting exploration of molecular structural space, query-retrieval in molecular databases, and structure-activity modelling. Determining molecular similarity is related to the choice of molecular representation. Currently, representations with high descriptive power and physical relevance like 3D surface-based descriptors are available. Information from such representations is both surface-based and volumetric. However, most techniques for determining molecular similarity tend to focus on idealized 2D graph-based descriptors due to the complexity that accompanies reasoning with more elaborate representations.

Results

This paper addresses the problem of determining similarity when molecules are described using complex surface-based representations. It proposes an intrinsic, spherical representation that systematically maps points on a molecular surface to points on a standard coordinate system (a sphere). Molecular surface properties such as shape, field strengths, and effects due to field super-positioning can then be captured as distributions on the surface of the sphere. Surface-based molecular similarity is subsequently determined by computing the similarity of the surface-property distributions using a novel formulation of histogram-intersection. The similarity formulation is not only sensitive to the 3D distribution of the surface properties, but is also highly efficient to compute.

Conclusion

The proposed method obviates the computationally expensive step of molecular pose-optimisation, can incorporate conformational variations, and facilitates highly efficient determination of similarity by directly comparing molecular surfaces and surface-based properties. Retrieval performance, applications in structure-activity modeling of complex biological properties, and comparisons with existing research and commercial methods demonstrate the validity and effectiveness of the approach.

Background

Across all biological and pharmaceutical investigations, the discovery (or development) of molecules with desired biological activity is an important goal. Efforts to attain this goal are strongly driven by the notion of molecular similarity because in general similar molecules tend to behave similarly

Introduction to molecular representations and descriptors

In their simplest form, molecules can be represented using chemical formulae. However, different structures may yield the same formula even though they possess dissimilar physical or biochemical properties (e.g. in the case of isomers). Therefore, commonly employed representation frameworks tend to emphasize a more explicit characterization of the molecular structure and include (see Figure


**Molecular Representations**. Different molecular representations shown with the Benzene molecule as an example: (a) chemical (graphical) representation, (b) 2D graph and graph-traversal based string representations, (c) 3D graph-based representation, (d) surface-based representation. The molecular surface is obtained by rolling a probe-atom over a molecule as shown in (e). The complexity of surface-based representations can be discerned from (f), where the molecule Aspirin is shown on the left and the molecule Capsaicin on the right.

Molecular descriptors are computationally determinable characteristics of a molecule that describe specific molecular properties. Examples include physical-chemical descriptors such as the number of rotatable bonds, polar surface area, electronegativity, descriptors of molecular connectivity such as the Wiener number

Formulations for molecular query-retrieval and analysis of prior research

The problem of molecular query-retrieval can be approached from two primary and interrelated perspectives:

• Query formulation

Two main forms of formulating the query can be distinguished: (1)

• Molecular representation

Molecular representations have varying capabilities in terms of modelling biochemical characteristics of molecules. As noted earlier, surface-based representations/descriptors are more faithful to the actual physics of molecules than molecular-graph-based approaches

Early attempts at determining molecular similarity, like

The use of fixed-size representation vectors has led to practical solutions for querying large molecular repositories. However, such approaches have several severe drawbacks: (1) They are limited to 2D information and incapable of being used for complex bio-chemically relevant representations/descriptors. (2) They are incapable of representing

Problem characteristics and challenges

The problem of determining the similarity of molecules when they are represented using complex 3D surface-based descriptors presents some unique challenges which include:

1.

2.

3.

Results

Three different types of experiments were conducted to study the efficacy of the proposed method: (1) Investigation of the method's accuracy in query-retrieval settings, (2) Evaluation of its performance (speed), and (3) Validation through applications in structure-activity modelling problems. Each experiment incorporated two stages: The first stage involved a direct application of the method on a data set with subsequent analysis of the results. In the second stage, a comparative study was performed by applying a state-of-the-art research or commercial technique on the same data set. Subsequently the results were analysed to evaluate the proposed approach.

Accuracy in query-retrieval settings

The method was tested in a query-retrieval setting on a subset of 5000 molecules randomly selected from the MDDR collection

Summary of results from the query-retrieval experiment.

| Method | Data Size | Number of Conformations | Accuracy |
|---|---|---|---|
| ISIS | 5000 | none | 100% |
| Proposed | 5000 | 20/20 | 100% |
| Proposed | 5000 | 20/20* | 98.2% |

Performance evaluation

The computational performance of the proposed approach was tested with respect to the Molecular Hashkeys algorithm

In our experiment, 30 molecules from the MDDR collection were compared against each other, with 20 conformers for the model and one for the query. Both the systems reported a 100% recognition rate on this subset of molecules. However, the time requirements were significantly different. A graph plotting the time required for the similarity computation with the proposed technique is shown in Figure


**Comparison of computational performance**. Computational performance of the proposed method (left) and comparison with the Molecular Hashkeys method [13], which is based on the Compass algorithm (right).

For a given molecular property and its corresponding property-histogram having

Given the size of molecular repositories, a key technical problem is the design of indexing techniques. This is due to the fact that even highly efficient matching techniques, such as the one presented, require distance comparisons which grow linearly with the number of molecules in the database. Indexing techniques can be broadly classified as (1)

Validation through application in structure-activity models

A structure-property model captures the relationship between the bio-chemical properties of a molecule and its physicochemical description _{i }as the function of its "chemical constitution":

_{i})

The basic elements needed for the development of a structure-property model are: (1) Assay results describing the bio-chemical property of interest, (2) a set of parameters describing the molecular structure and its physicochemical attributes, and (3) the learning formulation along with a statistical or machine learning technique.

As part of the validation experiments, similarity information derived using the proposed technique was used to model absorption through an in-vitro cell line. The data set consisted of 30 compounds that were tested using the Caco-2 assay. The Caco-2 (human colon adenocarcinoma cell line) provides a close approximation of

Two measures were used for evaluation of the results. The first is a ratio-scale measure called cross-validated ^{2 }and shows how well the model predicts data that was

Here, _{i }is the experimentally determined property of the molecule _{i}, _{i }is its predicted property, and

Kendall's

The assay values for twenty of the thirty compounds were made available for model construction and constituted the learning phase for the neural network. As part of the model construction step, the complete cross-correlation matrix of the descriptors was computed and the top eight least correlated descriptors were used to learn the (empirical) mapping between the molecules and their permeability values. Learning was stopped when the cross-validated error became lower than a predefined threshold.
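The descriptor-selection step can be illustrated with a minimal Python sketch. The text does not state how "least correlated" was ranked, so the sketch assumes a ranking by mean absolute Pearson correlation with all other descriptors. The function names `pearson` and `least_correlated` and the dictionary layout are hypothetical and are not taken from the original work.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long value lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def least_correlated(descriptors, k):
    """Return the k descriptor names with the lowest mean absolute correlation.

    descriptors: dict mapping a descriptor name to its values, one per compound.
    ASSUMPTION: the ranking criterion (mean |r| against all other descriptors)
    is an illustrative choice; the source only says "least correlated".
    """
    names = sorted(descriptors)
    score = {}
    for a in names:
        others = [b for b in names if b != a]
        score[a] = sum(abs(pearson(descriptors[a], descriptors[b]))
                       for b in others) / len(others)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(names, key=lambda a: (score[a], a))[:k]


example = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with d1
    "d3": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with d1 and d2
}
selected = least_correlated(example, 2)
```

In the setting described above, `k` would be 8 and each list would hold one value per training compound.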

We begin by presenting the analysis of the method's performance in a leave-one-out cross-validated setting on the training set. In this setting, one compound was randomly excluded from the training set and the remaining compounds used to learn a model that predicted the permeability for the excluded compound. The results are shown in Figures ^{2 }equalled 0.97 and the value for Kendall's ^{2 }equalled 0.64 and Kendall's ^{2}) occurred because the original data had compounds showing no absorption. The models that were derived typically assigned very low (albeit non-zero) absorption values to these molecules, thus leading to lower values for Kendall's
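The interplay of the two evaluation measures can be illustrated with a short Python sketch. The defining equations are not reproduced here, so the sketch assumes the standard definitions: cross-validated q² as one minus the ratio of the squared prediction error to the variance about the measured mean, and Kendall's tau in its tau-a form, where tied pairs count as neither concordant nor discordant. The function names and the toy data are hypothetical.

```python
def q_squared(measured, predicted):
    """1 - (sum of squared prediction errors) / (sum of squares about the measured mean)."""
    mean = sum(measured) / len(measured)
    press = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    total = sum((y - mean) ** 2 for y in measured)
    return 1.0 - press / total


def kendall_tau(measured, predicted):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    ASSUMPTION: the tau-a variant is used; the source may use another variant.
    """
    n = len(measured)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (measured[i] - measured[j]) * (predicted[i] - predicted[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy data: three compounds with no measured absorption receive small,
# non-zero predictions, so they are tied in the measurements only.
measured = [0.0, 0.0, 0.0, 1.0, 2.0]
predicted = [0.01, 0.02, 0.03, 1.0, 2.0]
q2 = q_squared(measured, predicted)
tau = kendall_tau(measured, predicted)  # 7 concordant pairs out of 10 -> 0.7
```

On this toy data q² stays close to one while the rank measure drops to 0.7, because the three tied pairs contribute nothing to the numerator. This is consistent with the explanation given above for compounds showing no absorption.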


**Performance in structure-property modelling**. Performance, comparison, and analysis of the proposed method in structure-property modelling. In (a) – (c), permeation of each compound is depicted by two adjacent bars with predicted values represented by light-blue bars on the left and measured values represented by the dark maroon bars on the right. The numbers on the X-axis identify each molecule used in the experiment and the Y-axis corresponds to the permeation values, measured in terms of flux-units. (a) Prediction results on the training set in a leave-one-out setting with the proposed method, (b) Prediction results on the training set in a leave-one-out setting with the similarity algorithm [13], (c) Performance of the proposed method on the test set. Figures 3(d) and 3(e) present leave-n-out cross-validated results demonstrating the robustness of the predictive model obtained using the proposed method. The correctness of the assignment of the molecules to the three classes "low permeability", "medium permeability", and "high permeability" is shown in (d), while the distribution of the prediction results is shown in (e).

In Figure

Conclusion

In this paper, we considered the problem of defining similarity between molecules based on complex surface-based representations. Such representations capture the physics of the molecules better than commonly used molecular-graph-based approaches and can therefore have significant relevance in molecular query-retrieval, similarity-based exploration of structural space, and structure-activity modelling. We have presented a novel approach for defining a standard coordinate system for describing complex surface-based molecular descriptions. For computing the similarity of molecules, we propose a novel formulation of histogram intersection which can take into account the distribution of surface properties in 3D space. Experimental results indicate that the similarity formulation can be used for highly-accurate query-retrieval and outperforms, in terms of computational speed, both existing research and commercially available solutions. The proposed approach was also validated by applying it in building structure-activity models for complex bio-chemical properties. The efficacy and computational efficiency of the proposed approach underline the important role it can play in querying and exploration of large molecular repositories.

Methods

We begin this section by describing how the molecular surfaces are derived and how at each point of the surface, donor and acceptor fields are defined. Next, the concept of a standard coordinate system for describing molecular surfaces is introduced. In this subsection we discuss the Gauss map and its derivatives: the Extended Gaussian Image and the Spherical Attribute Image. We subsequently describe how a sphere encapsulating the molecule is deformed to map the molecular surface to a standard spherical coordinate system. In the final sub-section, the histogram-intersection based surface matching algorithm is described and illustrated using a simple example.

Computing the molecular surface and surface properties

Starting from the atomic coordinates, the molecular surface (Connolly surface) is obtained by using the program MSRoll

The measurement of the donor field is done using the following three step procedure:

Step 1

The Hydrogen-bond donor atoms in the molecule are identified. Typically these are Nitrogen or Oxygen atoms with hydrogen on them. Other ways of identification like the PATTY-rule

Step 2

The donor field is defined as an isotropic Gaussian distribution and the field at point P_{j }due to an atom at position X_{i }having van der Waals radius _{i }is defined as

In Eq. (4)

Step 3

At a given surface point P_{j}, first, the field strength for each donor atom is computed. The direction of each field is given by a unit vector obtained by joining the corresponding atom to P_{j}. The resultant donor field at P_{j }is subsequently defined as the vector sum of all donor field vectors at this point.

The acceptor field is analogously determined. Typically Nitrogen or Oxygen atoms with a lone pair of electrons are considered as acceptors.
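The three-step procedure can be sketched in Python as follows. The exact Gaussian of Eq. (4) is not reproduced here, so the sketch assumes a field strength of exp(-d² / (2r²)), with d the distance from the atom to the surface point and r the van der Waals radius of the atom. This functional form, the function name `resultant_field`, and the input layout are assumptions made for illustration only.

```python
import math


def resultant_field(point, atoms):
    """Vector sum of isotropic Gaussian fields at a surface point.

    point: (x, y, z) coordinates of the surface point P_j.
    atoms: list of ((x, y, z), radius) for the donor (or acceptor) atoms.
    ASSUMPTION: strength = exp(-d^2 / (2 r^2)); Eq. (4) may differ.
    """
    total = [0.0, 0.0, 0.0]
    for center, radius in atoms:
        # Direction: unit vector from the atom to the surface point.
        diff = [p - c for p, c in zip(point, center)]
        dist = math.sqrt(sum(d * d for d in diff))
        if dist == 0.0:
            continue  # direction is undefined at the atom centre
        strength = math.exp(-dist ** 2 / (2.0 * radius ** 2))
        for k in range(3):
            total[k] += strength * diff[k] / dist
    return tuple(total)


# One donor at the origin, evaluated one unit away along the x-axis.
single = resultant_field((1.0, 0.0, 0.0), [((0.0, 0.0, 0.0), 1.0)])

# Two identical donors placed symmetrically about the point cancel out,
# which is the super-positioning effect captured by the vector sum.
cancelled = resultant_field(
    (0.0, 0.0, 0.0),
    [((1.0, 0.0, 0.0), 1.5), ((-1.0, 0.0, 0.0), 1.5)],
)
```

The same function serves for the acceptor field by passing the acceptor atoms instead of the donor atoms.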

A standard coordinate system for surface-based molecular representations

A pre-requisite for comparing molecules described using surface-based representations is the capability to map points on the curved molecular surface to points on a standard coordinate system. Such a mapping was derived by Gauss

Definition 1

Let ^{3 }be an oriented surface in Euclidean space. Further, let


**Illustration of the principal concepts in the proposed molecular representation and matching**. (a) The Gauss Map, (b) Embedding of a molecule in the tessellated sphere, (c) Intuition behind the surface matching approach: The three distributions contain an identical number of black and grey squares and cannot be disambiguated by a property (colour)-based histogram. However, a histogram of pair-wise distances between similarly coloured squares, which captures their spatial distribution, can distinguish the third distribution from the first two. Such a characterization has the added advantage of being invariant to Euclidean transformations of the distribution.

The

The properties of the EGI, especially the Minkowski theorem, provide the foundations for representing and comparing surface-based descriptions of objects. However, an inherent problem of EGI-type mappings is their dependence on the Gauss map, which is non-unique for non-convex shapes. Because of this, two or more points on an object surface may be mapped to the same point on the Gaussian sphere. Unfortunately, many molecules in their stable conformations induce surfaces that are non-convex, and therefore the direct application of techniques from the EGI family is precluded for their representation and matching. To address this problem, we utilize the idea of the

Comparing surface-based molecular representations

We seek to define the similarity of two molecules in terms of the similarity of their surface-property distributions, described using histograms. The technique of

In the case of molecules, it is critical to account not just for the similarity of property distributions, but also for the similarity of the spatial distribution of these properties on the molecular surface. Hence, a direct application of histogram intersection to compare the property distributions is, by itself, insufficient. This issue is illustrated in Figure

Our approach uses the distribution of the pair-wise distances between points having similar property values to characterize the spatial distribution of the corresponding molecular property. Furthermore, we use histogram intersection to compute the similarity of the property distributions as well as the similarity of the spatial distributions. In addition to efficient computability and invariance to translations and rotations of the molecule, a significant advantage of this approach is its ability to characterize (and compare) the relative spatial distribution of surface properties, which act as pharmacophores. The main steps of the method are:

Step 1

For each specific property of the molecule, such as shape, donor field, or acceptor field, the property values across all the points on the surface of the tessellated sphere are determined. The range of values is then uniformly divided into a predefined number of property-bins. Let P_{1}...P_{K} denote the K properties and H_{L} the property-histogram corresponding to the property P_{L}. For each property-bin of H_{L}, steps 2–4 are repeated.

Step 2

The points contained in property-bin

Step 3

The geodesic distance between all pairs of centroids is computed. We note that Steps 2–3 constitute a computationally cheaper alternative to computing the distances between all pairs of points in property-bin

Step 4

These distances are quantized in distance-bins which are defined in increments of one Angstrom in the range [0, _{L}. The content of a distance-bin denotes the number of points on the surface of the sphere that lie within a specific distance (equal to the range of the distance-bin) of each other and have values for the property P_{L }that fall within the range of property-bin _{L}.

Step 5

Consider two molecules _{1 }and _{2}, a property P_{L }along with the corresponding property histograms _{1 }and _{2 }respectively. The similarity _{m}, of the spatial distribution of points lying in property-bin _{1 }and _{2 }is defined as the histogram intersection of

In Eq. (5), the average of the two histogram intersections is taken to ensure symmetry. Further (denoting the indexing of the distance-bins by

Step 6

The similarity of two molecules _{1 }and _{2}, in context of the property P_{L }is denoted by _{L}(_{1}, _{2}) and is defined as the histogram intersection of the corresponding property-histograms _{m}. Formally:

Where (indexing the bins of the property-histogram H_{L }by the variable _{a }and _{b }is defined as:

Step 7

The similarity between two molecules _{1 }and _{2 }given _{1}...P_{K }is defined as the average similarity computed over all the _{full}(_{1}, _{2}).

Step 8

The overall similarity between the molecules is computed by taking into account molecular conformations; it is defined as the maximum value of _{full}(_{1}, _{2}) over the set of conformations each of the molecules can attain (see Eq. (9)). The conformations can be generated using a package such as CONCORD

Where _{i }and _{j }denote specific conformers of the molecules _{1 }and _{2 }respectively. Further, the sets _{1 }and _{2}.
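Steps 1 to 4 above, which build the histograms, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation: surface points are given as unit vectors on the tessellated sphere, every point is treated as its own centroid (the clustering of Step 2 is not reproduced), the geodesic distance is the great-circle distance on a sphere whose radius is given in Angstrom, and the upper end of the distance range is taken to be half the circumference. All function names are hypothetical.

```python
import math


def property_bins(values, n_bins):
    """Step 1: assign each value to one of n_bins uniform bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0.0:
        return [0] * len(values)
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def geodesic(u, v, radius):
    """Step 3: great-circle distance between unit vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return radius * math.acos(dot)


def build_histograms(points, values, n_bins, radius, bin_width=1.0):
    """Return (property_histogram, distance_histograms) for one property.

    ASSUMPTION: each point acts as its own centroid, and the distance range
    is [0, pi * radius], quantized in steps of bin_width (one Angstrom).
    """
    labels = property_bins(values, n_bins)
    n_dist_bins = int(math.pi * radius / bin_width) + 1
    distance_hists = [[0] * n_dist_bins for _ in range(n_bins)]
    for m in range(n_bins):
        members = [p for p, b in zip(points, labels) if b == m]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                d = geodesic(members[i], members[j], radius)
                # Step 4: quantize the distance into a distance-bin.
                distance_hists[m][min(int(d / bin_width), n_dist_bins - 1)] += 1
    property_hist = [labels.count(m) for m in range(n_bins)]
    return property_hist, distance_hists


points = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
values = [0.0, 0.0, 1.0, 1.0]
prop_hist, dist_hists = build_histograms(points, values, n_bins=2, radius=2.0)
```

Because only pair-wise geodesic distances enter the distance histograms, rotating all points together leaves them unchanged, which is the invariance property noted earlier.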

Illustrative example

We use the point distributions shown in Figure _{m }of the spatial distribution of points lying in each of the bins of _{1 }= 0.5; _{2 }= 0. In Step-6, the similarity of the first and third distributions is therefore: _{L}(_{1}, _{3}) = (4 × 0.5 + 2 × 0)/6 = 0.33. The reader may trivially verify that _{L}(_{1}, _{3}) = 1.0.
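The arithmetic of this example, together with Steps 5 to 8, can be checked with a short Python sketch. The defining equations are not reproduced here, so the forms below are assumptions: the basic intersection is normalised by the size of the second histogram, Step 5 averages the two normalisations, and Step 6 weights the bin-wise intersection of the property-histograms by the per-bin spatial similarity and divides by the larger histogram total. The last form is inferred from the expression (4 × 0.5 + 2 × 0)/6 above. All function names are hypothetical.

```python
def intersection(h_a, h_b):
    """Histogram intersection of h_a with h_b, normalised by the size of h_b."""
    total = sum(h_b)
    if total == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(h_a, h_b)) / total


def spatial_similarity(d1, d2):
    """Step 5 (assumed form): average of both normalisations, for symmetry."""
    return 0.5 * (intersection(d1, d2) + intersection(d2, d1))


def property_similarity(h1, h2, spatial):
    """Step 6 (assumed form): intersection weighted by per-bin spatial similarity."""
    norm = max(sum(h1), sum(h2))
    if norm == 0:
        return 0.0
    return sum(min(a, b) * s for a, b, s in zip(h1, h2, spatial)) / norm


def full_similarity(per_property):
    """Step 7: average similarity over all properties."""
    return sum(per_property) / len(per_property)


def conformer_similarity(conformers_1, conformers_2, similarity):
    """Step 8: maximum similarity over all pairs of conformers."""
    return max(similarity(a, b) for a in conformers_1 for b in conformers_2)


# Property-histograms with four points in one bin and two in the other,
# and the per-bin spatial similarities quoted in the example (0.5 and 0).
first_vs_third = property_similarity([4, 2], [4, 2], [0.5, 0.0])

# Identical spatial distributions in every bin give a similarity of one.
identical = property_similarity([4, 2], [4, 2], [1.0, 1.0])
```

With these inputs `first_vs_third` evaluates to one third, matching the value of 0.33 quoted above, and `identical` evaluates to 1.0.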

Competing interests

The author declares that they have no competing interests.

Acknowledgements

The author thanks the anonymous reviewer(s) for their comments. These have led to significant improvements in the presentation of the material. This research was partially funded by the National Science Foundation grant IIS-0644418.

This article has been published as part of