Department of Pharmacology, University of California San Diego, La Jolla, CA 92093, USA

San Diego Supercomputer Center, University of California San Diego, La Jolla, CA 92093, USA

Abstract

Background

That the structure determines the function of proteins is a central paradigm in biology. However, protein functions are more directly related to cooperative effects at the residue and multi-residue scales. As such, current representations based on atomic coordinates can be considered inadequate. Bridging the gap between atomic-level structure and overall protein-level functionality requires parameterizations of the protein structure (and other physicochemical properties) in a quasi-continuous range, from a simple collection of unrelated amino acid coordinates to the highly synergistic organization of the whole protein entity, from a microscopic view in which each atom is completely resolved to a "macroscopic" description such as the one encoded in the three-dimensional protein shape.

Results

Here we propose such a parameterization and study its relationship to the standard Euclidean description based on amino acid representative coordinates. The representation uses multipoles associated with residue C

Conclusion

The results described here demonstrate how a granular description of the protein structure can be achieved using multipolar coefficients. The description has the additional advantage of being immediately generalizable to any residue-specific property, therefore providing a unitary framework for the study and comparison of the spatial profile of various protein properties.

Background

The functions of a protein are determined by its three-dimensional structure. It is believed that the functional space of all proteins can be spanned by combining a rather small number of structural units, termed folds. The number of different folds is small compared to the total number of proteins, of the order of 1000 for globular, water-soluble proteins

The comparison of structures is a central component of many research objectives. For formulations of the problem and a review of various methods of comparison see for example

A disadvantage of the

A fold is characterized by structural features at the multi-residue level. Even though these features are easily recognizable visually in many cases, there is no obvious quantitative way to relate them to the underlying atomic coordinates that exactly describe the structure of the protein. That is, atom coordinates only offer a local description of the structure, while the features defining the fold represent global, shape related properties. Starting with the very rich description given by the coordinates of the atoms that make the protein, one would then need a way to distill from this set of coordinates only the information that is directly associated with these general traits. We do not know of the existence of any systematic approach for the elimination of the redundant information, starting from the initial set of coordinates. Here, we adopt an approach that starts from the other end of the spectrum. Instead of starting from the atom description and discarding non-relevant features, we start from the global level with a very coarse description and refine it by adding descriptors for more and more detailed features.

The need for methods that use global descriptors for the comparison of protein structures has been recognized before. One such method

Our approach can be combined with other important parameters, for example, mass of the residue instead of the C

The notion of multipoles originates in physics

The organization of the paper is as follows. In the Results and Discussion section we first define the multipoles and present a qualitative motivation for their use as an alternative parameterization of the shape of the protein. Then we show how their tensorial properties can be used to define a distance function in the conformational space. Since the multipoles depend on the location and orientation of the system of axes, the following subsection defines a "canonical" reference frame to be used for the purpose of comparison. The concluding subsection is devoted to testing the method. We show that, given the biologically relevant portion of the structure, the method successfully discriminates between the families in a test set of proteins from the protein kinase-like superfamily

Results and discussion

Multipolar representation

The notion of multipoles comes from physics where they are used to describe the field generated by the spatial distribution of a scalar quantity such as mass (gravitational field) or charge (electrostatic field) density. The potential of the field created by such a quantity, at a given distance outside the region it occupies in space, can be conveniently expressed in the form of a multipole expansion

Quantitatively, the multipoles associated with the space distribution of a scalar property (density of mass in our example) form a sequence of tensors over the three dimensional position space. The multipole of rank zero, or the monopole, is just the space integral of the scalar property. When the scalar property is the mass density then the monopole is the total mass of the set of atoms.

The multipole of rank one is proportional to the position of the center of mass. We use it to set the origin of the coordinate system with respect to which all multipoles are calculated. Therefore, in our calculations, the multipole of rank one (the dipole, as it is commonly known) is always a null vector. For completeness, we should mention that for the multipoles of a charge distribution this cannot always be done. If the total charge is zero, there may be a non-zero dipole moment that cannot be made to vanish by a translation of coordinates. This is, however, a technical problem and it has been addressed before
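The first two ranks described above can be illustrated with a short sketch. It assumes a discrete set of points with one mass each; the function name `center_on_center_of_mass` and the toy coordinates are illustrative and are not taken from the paper.

```python
import numpy as np

def center_on_center_of_mass(coords, masses):
    """Return the monopole (total mass), the center of mass, and the
    coordinates shifted so that the center of mass is the origin."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    total_mass = masses.sum()                                   # rank 0: monopole
    com = (masses[:, None] * coords).sum(axis=0) / total_mass   # center of mass
    return total_mass, com, coords - com

# Toy data (hypothetical): three points with masses 1, 1 and 2.
coords = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
masses = np.array([1.0, 1.0, 2.0])
total_mass, com, centered = center_on_center_of_mass(coords, masses)

# Rank 1 (dipole) computed in the centered frame; it vanishes by construction.
dipole = (masses[:, None] * centered).sum(axis=0)
```

With the origin placed at the center of mass, the mass-weighted sum of positions is zero, which is why the rank-one term carries no information in this representation.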

The multipole of rank two, or the quadrupole, has nine Cartesian components. For our discrete distribution, the components are given by the following expression:

Here, δ_{i,j} = 1 for i = j and 0 otherwise, r_{α,i} is the component i of the position vector of atom α, and r_{α} represents the length of that vector.
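As a rough illustration of the rank-two term, the sketch below builds the nine Cartesian components for a set of point masses. The expression itself is not reproduced in this text, so the code assumes the common traceless convention Q_ij = Σ_α m_α (3 r_{α,i} r_{α,j} − r_α² δ_{i,j}); the prefactor and the name `quadrupole_tensor` are assumptions.

```python
import numpy as np

def quadrupole_tensor(coords, masses):
    """3x3 quadrupole of point masses, assuming the traceless convention
    Q_ij = sum_a m_a * (3 r_ai r_aj - r_a^2 delta_ij)."""
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    r2 = (coords ** 2).sum(axis=1)                        # squared lengths r_a^2
    outer = np.einsum("a,ai,aj->ij", masses, coords, coords)
    return 3.0 * outer - (masses * r2).sum() * np.eye(3)

# Two unit masses on the z axis, placed symmetrically about the origin.
Q = quadrupole_tensor([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]], [1.0, 1.0])
```

Under this convention the tensor is symmetric and has zero trace, so only five of its nine components are independent. This is one instance of the redundancy among Cartesian components that motivates a different set of components for higher ranks.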

For higher order multipoles, enumerating the Cartesian components in closed form is not a simple task. Moreover, there is a large number of symmetry properties obeyed by the Cartesian components and therefore keeping all components of a given multipole would be redundant. Instead, more commonly, the

For a discrete set of

where _{i}, _{i}, _{i }represent the spherical coordinates of atom _{lm }denotes a spherical harmonic function and the * denotes complex conjugation. For the definition and summary properties of these functions, see for example

The rank ^{2}. As

As a last remark, we will note that the set of multipoles that we use here is only a subset of a larger set which, in its entirety, uniquely describes the potential field surrounding the distribution of charge (for example)
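A minimal sketch of the spherical components discussed in this subsection follows. It assumes the standard definition q_lm = Σ_i w_i r_i^l Y*_lm(θ_i, φ_i), with θ the polar angle, φ the azimuth, and orthonormal spherical harmonics that include the Condon-Shortley phase; the paper's own normalization may differ. The function names are illustrative, and the harmonics are evaluated with a plain recurrence so that only NumPy is needed.

```python
import numpy as np
from math import factorial, pi, sqrt

def assoc_legendre(l, m, x):
    """Associated Legendre function P_l^m(x), 0 <= m <= l (Condon-Shortley phase)."""
    pmm = np.ones_like(x)
    if m > 0:
        somx2 = np.sqrt(np.clip(1.0 - x * x, 0.0, None))
        fact = 1.0
        for _ in range(m):
            pmm = -pmm * fact * somx2
            fact += 2.0
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    for ll in range(m + 2, l + 1):                 # upward recurrence in l
        pll = ((2 * ll - 1) * x * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1

def sph_harm_lm(l, m, theta, phi):
    """Orthonormal Y_lm; theta is the polar angle, phi the azimuthal angle."""
    am = abs(m)
    norm = sqrt((2 * l + 1) / (4 * pi) * factorial(l - am) / factorial(l + am))
    y = norm * assoc_legendre(l, am, np.cos(theta)) * np.exp(1j * am * phi)
    return (-1) ** am * np.conj(y) if m < 0 else y

def multipoles(coords, weights, l_max):
    """List [q_0, ..., q_lmax]; q_l holds the 2l+1 complex components m = -l..l."""
    coords = np.asarray(coords, dtype=float)
    weights = np.asarray(weights, dtype=float)
    r = np.linalg.norm(coords, axis=1)
    safe_r = np.where(r > 0, r, 1.0)               # avoid 0/0 for a point at the origin
    theta = np.arccos(np.clip(coords[:, 2] / safe_r, -1.0, 1.0))
    phi = np.arctan2(coords[:, 1], coords[:, 0])
    return [
        np.array([np.sum(weights * r ** l * np.conj(sph_harm_lm(l, m, theta, phi)))
                  for m in range(-l, l + 1)])
        for l in range(l_max + 1)
    ]

# A single unit weight on the z axis: only the m = 0 components are non-zero.
q = multipoles([[0.0, 0.0, 2.0]], [1.0], l_max=2)
```

Each rank contributes 2l + 1 components, so retaining ranks up to l_max gives (l_max + 1)² numbers in total. The weights can be masses or any other residue-specific scalar.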

Constructing a distance function in the protein conformation space

What makes the set of multipoles defined in Eq. (2) a good set of descriptors for comparison purposes is that they form a series of quantities with remarkable symmetry properties. Specifically, for any given rank _{lm}, **q**_{1 }the set of all components

where the last part of the equation follows from the definition of the multipoles and well known symmetry properties of the spherical harmonic functions **q**_{1},

d_{l}(**q**_{1}, **q**_{1}**'**) = ||**q**_{1} - **q**_{1}**'**||. (4)

There is no a priori prescription for combining distances in subspaces of different ranks

d_{l}(**q**_{1}, **q**_{1}**'**) = ||**q**_{1} - **q**_{1}**'**||^{1/l},

Note that, once the multipole components have been calculated, any reference to the original Cartesian coordinates disappears from the representation. As a consequence, unlike the

The notation we use in this formula for the "size" dependent factors (|_{0}|, |

This function will be used as a dissimilarity measure for proteins in our study. The upper limit in the summation is the maximum rank of the multipoles retained in the representation and determines the dimensionality of the representation and, implicitly, its level of detail.
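The way per-rank distances can be combined is sketched below under simplifying assumptions. The code takes the Euclidean norm of the difference between the component vectors of each rank, raises it to the power 1/l as in the scaled form above, and sums over ranks 1 to l_max. The size-dependent factors of Eq. (7) are not reproduced in this text and are omitted, so this is not the paper's exact metric; the function names are hypothetical.

```python
import numpy as np

def rank_distance(q_l, q_l_prime, l):
    """Scaled distance in the rank-l subspace: ||q_l - q_l'|| ** (1/l)."""
    diff = np.asarray(q_l) - np.asarray(q_l_prime)
    return np.linalg.norm(diff) ** (1.0 / l)

def multipole_distance(q, q_prime, l_max):
    """Simplified dissimilarity: plain sum of scaled rank distances, l = 1..l_max.
    The size-dependent factors of the full metric are deliberately left out."""
    return sum(rank_distance(q[l], q_prime[l], l) for l in range(1, l_max + 1))

# Two toy multipole sets that differ only in one rank-2 component.
q_a = [np.array([1.0]), np.zeros(3), np.array([0.0, 0.0, 4.0, 0.0, 0.0])]
q_b = [np.array([1.0]), np.zeros(3), np.zeros(5)]
d_ab = multipole_distance(q_a, q_b, l_max=2)
```

Since a rank-l multipole scales as the l-th power of a length, the exponent 1/l brings every term back to the dimension of a length before the terms are added.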

The interest for reduced representations is manifest in the literature. From a shape perspective, similar to ours, such representations emerge in approaches such as that described in

Defining a canonical reference frame

The multipoles behave like vectorial quantities and the numerical values of their components depend on the location and orientation of the reference frame. For the comparison of structures to be meaningful, we need to either minimize the distance in Eq. (7) over all rigid transformations (translations and rotations) of one of the molecules, or to choose a standard for the reference frame with respect to which the multipolar configurations of the molecules are calculated

The problem of choosing such a standard arises in many research areas where 3D systems are involved

The first vector reduces to the relative position of the last amino acid with respect to the first, while the second one is a more complex quantity that is sensitive to the details of the path of the protein backbone. Except for special cases (for example when the two vectors in Eqs. (8, 9) are not well-defined, or they become parallel), these vectors are independent. Then, they can be orthonormalized and the resulting unit vectors will serve as the first two versors of our

The "canonical" reference frame defined by Eqs. (8, 9) is unique by construction. However, other unique definitions can be developed
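The orthonormalization step can be sketched as follows. Eqs. (8, 9) are not reproduced in this text, so the two defining vectors are simply inputs here. The third axis is taken as the cross product of the first two, which is an assumption, and the names `canonical_frame` and `to_frame` are illustrative.

```python
import numpy as np

def canonical_frame(v1, v2, tol=1e-8):
    """Orthonormal right-handed frame (rows e1, e2, e3) from two independent
    vectors, by Gram-Schmidt followed by a cross product."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    n1 = np.linalg.norm(v1)
    if n1 < tol:
        raise ValueError("first vector is not well-defined (near zero length)")
    e1 = v1 / n1
    u2 = v2 - np.dot(v2, e1) * e1          # remove the component along e1
    n2 = np.linalg.norm(u2)
    if n2 < tol * max(np.linalg.norm(v2), 1.0):
        raise ValueError("the two vectors are (nearly) parallel")
    e2 = u2 / n2
    e3 = np.cross(e1, e2)                  # assumed choice for the third axis
    return np.vstack([e1, e2, e3])

def to_frame(coords, frame):
    """Express centered coordinates in the frame whose axes are the rows of `frame`."""
    return np.asarray(coords, dtype=float) @ frame.T

# Stand-in vectors (hypothetical values, not derived from a real chain).
frame = canonical_frame([2.0, 0.0, 0.0], [1.0, 1.0, 0.0])
```

The two error branches correspond to the special cases mentioned in the text, where a vector is not well-defined or the two vectors become parallel.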

Testing the multipole representation

To test our representation of protein structure, we performed a number of calculations with the goal of assessing both its discriminatory power and, where meaningful, its correlation with the Cartesian description.

Comparing biologically relevant molecules

As already stated, the use of multipoles opens the possibility of protein shape comparisons without the need for a pre-existing amino acid alignment. However, while our representation technically allows for the comparison of arbitrary collections of atoms, not every comparison makes sense in biological applications such as protein classification: we need to restrict the comparison to those portions of the proteins which are relevant to the problem, for example, the functional regions. In principle, the multipoles can be used to identify corresponding domains in structures. At the moment, however, we do not have fully functional tools to do that. Therefore, as a benchmark for testing the method, we use a manual alignment of the catalytic cores from the protein kinase-like superfamily

We performed an all-against-all comparison of the proteins in the benchmark set, using as input coordinates of C


**Matrix of the distances between the biologically relevant units of the proteins in the set**. Multipole-based distance matrix calculated from C_{max }= 4 are retained in Eq. (7). Here, and in all other distance matrix representations, darker colors map to larger distances. The upper six rows and left six columns represent inter-family distances while the rest of the matrix contains only distances between the kinase family members.

Even at the subfamily level with relatively little shape discrimination, the distance matrix retains some of its discriminatory power. A close examination reveals distinct patterns along the diagonal corresponding to the various groups of kinases in the test set (Figure

The discriminatory power at the family level is limited by unresolved portions of some structures. The lack of coordinates for parts of the polypeptide chain affects the calculated distances both directly (a missing piece of chain is seen as a difference in shape) and indirectly (a missing piece of chain leads to a different canonical reference frame). To reduce these perturbations, we chose to ignore in our calculations any portion of the alignment corresponding to missing parts in at least one of the proteins in the set. Most unresolved portions are relatively short (approximately 20 amino acids) and do not affect the shape dramatically.
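The filtering of alignment positions described above might look like the following sketch. It assumes the aligned coordinates are stored in one array of shape (proteins, alignment columns, 3), with NaN marking unresolved residues; this storage layout and the function name are assumptions made for illustration.

```python
import numpy as np

def common_resolved_columns(aligned_coords):
    """Keep only the alignment columns resolved in every protein.
    aligned_coords: array (n_proteins, n_columns, 3), NaN where unresolved."""
    aligned_coords = np.asarray(aligned_coords, dtype=float)
    resolved = ~np.isnan(aligned_coords).any(axis=2)   # (n_proteins, n_columns)
    keep = resolved.all(axis=0)                        # resolved in all proteins
    return aligned_coords[:, keep, :], keep

# Two toy "proteins" with three aligned positions; one residue is unresolved.
nan = np.nan
aligned = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [2.0, 1.0, 0.0]],
])
filtered, keep = common_resolved_columns(aligned)
```

Dropping a column for all proteins whenever it is missing in any one of them keeps the compared point sets the same size. It also prevents a gap in one structure from shifting its center of mass or its reference frame relative to the others.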

Correlation with the Cartesian representation

The multipolar description offers a hierarchical approach to characterizing the shape of a molecule. At the coarsest level there is no information about the shape, except that defined by the length of the chain. At the most refined level of detail (when the number of multipole components is of the same order of magnitude as the number of original Cartesian coordinates) the description is as rich as the original amino acid coordinate set. At this end of the spatial spectrum we would expect a good correlation with results provided by the Cartesian coordinates. To test this empirically, we need to devise experiments in which both representations can be applied and then compare the results.

An obvious choice is the comparison of aligned proteins, alignment being necessary for the

Case 1

In the first case, the

The two vectors in Eq. (10) denote the coordinates of aligned residues. The multipoles of the aligned portions of the proteins were calculated with the coordinates expressed in the canonical reference frame defined by Eqs. (8, 9).
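For reference, a plain root-mean-square deviation over aligned residues can be sketched as below. Eq. (10) is not reproduced in this text, so the code assumes the usual form, the square root of the mean squared distance between aligned positions, with both coordinate sets already expressed in a common frame. No superposition step is performed here.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long sets of aligned
    coordinates, assumed to be expressed in the same reference frame."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("aligned coordinate sets must have the same shape")
    return np.sqrt(((a - b) ** 2).sum(axis=1).mean())

# Two aligned residues; only the second one is displaced (by 2 along z).
value = rmsd([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
             [[0.0, 0.0, 0.0], [1.0, 0.0, 2.0]])
```

Unlike the multipole distance, this measure needs a one-to-one pairing of residues, which is why an alignment is required before it can be evaluated.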

The multipoles of each protein in a given pair were computed from the C

The correlation calculations are shown in Figure


**Correlation coefficient between multipole and rmsd distances**. Correlation coefficient of the multipole and _{max }retained in the description). All different pairs of distances from the 31 protein set are included in the comparison. The
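One way to compute such a correlation between two distance matrices is sketched below. It assumes a Pearson coefficient taken over all distinct pairs, that is, the entries above the diagonal; the text does not state which coefficient was used, so this choice is an assumption.

```python
import numpy as np

def distance_matrix_correlation(d_a, d_b):
    """Pearson correlation between two symmetric distance matrices,
    using each distinct pair (upper triangle, diagonal excluded) once."""
    d_a = np.asarray(d_a, dtype=float)
    d_b = np.asarray(d_b, dtype=float)
    if d_a.shape != d_b.shape:
        raise ValueError("distance matrices must have the same shape")
    iu = np.triu_indices_from(d_a, k=1)
    return np.corrcoef(d_a[iu], d_b[iu])[0, 1]

# Toy matrices for three structures; the second is a rescaled copy of the first.
d_multipole = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]])
d_rmsd = 2.0 * d_multipole
corr = distance_matrix_correlation(d_multipole, d_rmsd)
```

Repeating this calculation for multipole distance matrices built with increasing maximum rank gives a curve of correlation against level of detail.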

In Figure _{max }= 4, the point of saturation in Figure


**Comparison between multipole and rmsd distances (Case 1)**. The distance matrices between aligned portions of proteins in the multipolar (a) and Cartesian coordinate (b) representations. Multipoles up to order four are retained in (a). The

For a better understanding of how the two descriptions correlate, we need to analyze more carefully the distances in the two representations. To make the discussion more quantitative, in Figure


**Distances in a subset of representative proteins**. The distance between all pairs of a subset of representatives: 1bo1, 1ia9, 1e8x, 1cja, 1nw1, 1j7u, 1cdk, 1csn and 1ir3. The upper dashed curve is the _{max }= 12.


**Correlation coefficient between multipole and rmsd distances for a subset of pairs closer in rmsd**. Correlation coefficient as a function of the level of detail (highest multipole rank _{max }in the representation). In these calculations only six structures from the smaller set, those occupying the right side in Figure 4, are included: 1cja, 1nw1, 1j7u, 1cdk, 1csn and 1ir3. The distances between pairs of structures in this set are on average smaller than for the whole set in Figure 4.

Case 2

In a second set of calculations, both the multipole and


**Comparison between multipole and rmsd distances in the "canonical" frame (Case 2)**. Multipole (a) and


**Comparison between multipole and rmsd distances with superposition (Case 2)**. Multipole (a) and

It is clear that, while the multipole description differs in some intrafamily details from the typical

Conclusion

In this paper we propose a new parameterization of protein structure which provides a new form of characterization and comparison. The approach uses components of the multipoles of consecutive ranks associated with C

We have shown:

• Once an approximate "superposition" has been calculated using our canonical reference frame, the multipole distance function is capable of discriminating between protein families.

• The multipole description allows for the adjustment of the level of detail of the comparison and, implicitly, it provides a systematic method for deriving reduced representations of the protein configuration space.

From a biological perspective, our tests show that the comparison based on multipoles is more robust with respect to intrafamily details and the results are more meaningful biologically. From the comparison tests with the Cartesian description, its robustness appears to be related in part to the use of a "canonical" reference frame for the comparison rather than the spatial superposition of the structures. Also, the visible relationship between the distance matrix in Figure

For illustration of the multipole method, we used the mass of the C

The use of alternative residue-specific quantities would provide a powerful tool for the comparison of proteins since the residue specific quantities allow an easier discrimination between structures with similar spatial location of the C

Our plans for further development and extension of the method include:

a) Rigorous definition of the notion of "canonical" reference frame. Our choice, based on features rigidly tied to the set of atoms, is inspired by the body reference frames used in the physical and engineering sciences and is intuitive. However, the problem of comparing structures is different, and criteria are needed for identifying "good" reference frames and for assessing how they affect the protein comparison.

b) Algorithms for fast superposition by minimization of the multipolar distance would be needed as an alternative to the use of a "canonical" reference frame.

c) The definition of a global metric (Eq. 7) contains coefficients controlling the combination of multipoles of various orders. Further optimization of these coefficients for the purpose of protein comparison can lead to biologically more meaningful metrics.

As a final remark, our representation allows an estimation of the number of degrees of freedom necessary to describe a given class of properties. The saturation of the correlation with the Cartesian representation marks the maximum number of degrees of freedom necessary to "macroscopically" distinguish structures within that class. Since the structure determines the whole biology of the proteins, one can infer from here that the same number of degrees of freedom describes the whole functional space of that class of proteins. The number obtained from such a correlation curve can be used to adjust the dimensionality of the representations used in protein comparisons.

Methods

The atomic coordinates of the selected members of the protein kinase-like superfamily were obtained from the ASTRAL database _{max }= 8 takes of the order of 100 ms.

Authors' contributions

AG developed the formalism, did the calculations and drafted the paper. PB provided the biological context and directed the application and testing of the formalism. Both authors read and approved the final manuscript.

Acknowledgements

We are grateful to Marian Anghel and Eric Scheeff for very useful discussions and to Yuting Jia for providing the programs for the spatial superposition of aligned structures. We are grateful for financial support from grant NIGMS GM63208.