The fast-growing Protein Data Bank today contains the three-dimensional description of more than 45,000 protein and nucleic-acid structures. The large majority of the data in the PDB were measured by X-ray crystallography, by thousands of researchers in millions of work-hours. Unfortunately, numerous structural errors, bad labels, missing atoms, and falsely identified chains and groups make the automated processing of this treasury of structural biological data difficult.
After performing a rigorous re-structuring of the whole PDB on a graph-theoretical basis, we created the RS-PDB (Rich-Structure PDB) database. Using this cleaned and repaired database, we defined simplicial complexes on the heavy atoms of the PDB and analyzed the geometric properties of the tetrahedra.
We have found surprisingly characteristic differences between simplices with atomic vertices of different types, and between the atomic neighborhoods – described also by simplices – of different ligand atoms in proteins.
The information stored in the Protein Data Bank would make fully automated in silico studies possible if mislabeled chemical groups, broken protein and nucleic-acid chains, and other errors were corrected. Even today, newly submitted data are verified "by hand" by human experts. In an earlier work, we applied a rigorous cleaning and re-structuring procedure to the entries of the Protein Data Bank and created the RS-PDB database. We made use of non-trivial mathematical algorithms, mainly graph algorithms: computing the InChI™ code [3,4] required graph-isomorphism testing; transforming aromatic notation to Kekulé notation used a non-bipartite graph-matching algorithm; breadth-first-search graph traversals were used throughout the work; depth-first search was used in building the ligand molecules and identifying ring structures; kd-trees were applied for computing covalent bonds; and hashing was utilized for the fast generation of protein-sequence IDs.
The resulting RS-PDB database is capable of serving intricate structural queries on all the three-dimensional protein structures known to mankind.
It is of basic importance to map the physico-chemical properties of protein-ligand binding sites, most importantly the Coulomb and Van der Waals forces, in order to predict protein-ligand binding, to design ligands for a given binding site on the surface of a protein, or to design inhibitors or activators of enzymatic mechanisms. The exact description of the forces in question is a deep quantum-chemical problem. The atomic environment of a binding site clearly has a strong effect on these forces; consequently, examining the atomic environments of the ligands in the crystallographically verified protein-ligand complexes in the PDB would yield insight into binding mechanisms and the design of biologically active molecules. The first step in this direction needs to be the analysis of the simplicial structures of the atoms forming the protein structures themselves. The second step is the analysis of the simplicial neighborhoods of the ligand atoms.
In the present work we define a simplicial decomposition on the heavy atoms of the protein structures in the PDB and analyze some geometrical properties of tetrahedra of different atomic composition. In this way we succeeded, for the first time in the literature, in defining a structure capable of answering topological questions concerning the distribution of the volume and shape of heavy protein atoms in the whole PDB. One of our main results is the identification of the volume-shape relation of tetrahedra of distinct atomic composition.
Even the refined, cleaned RS-PDB database lacks important features, such as easy support for queries like the following: What atoms surround a certain (ligand or protein) atom in the structure? Which atoms neighbour the atom or amino acid X in the protein? How many ligand atoms are surrounded by a tetrahedron with exactly C-C-C-O atoms in its vertices? How frequent are the tetrahedra with vertices C-C-O-N? Are there differences in the shape of tetrahedra of different composition?
Note that such queries cannot be answered from the amino-acid sequence of the protein, since they intrinsically depend on the tertiary structure of the protein. Consequently, one needs to use some cleaned version of the PDB as the initial data.
We chose the Delaunay decomposition to discretize the dataset in the RS-PDB database because the tetrahedra of this "tessellation" are close to regular, and because it is a natural, well-defined notion with a well-known algorithm for generating the tessellation.
Definition 1 Given a finite set of points A ⊆ ℝ³ and a subset H ⊆ A such that the points of H lie on the surface of a sphere and the sphere does not contain any further points of A, the convex hull of H is called a Delaunay region.
Delaunay regions define a partition of the convex hull of A. If the points of A are in general position (i.e., no five of the points are on the surface of a sphere), then all regions are tetrahedra.
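As an illustration of Definition 1, the following minimal Python sketch enumerates the Delaunay tetrahedra of a small point set directly from the empty-sphere condition. It is a brute-force check intended only for a handful of points, it assumes the points are in general position, and it is not the procedure used to build the database in this work. The function names (`det3`, `circumsphere`, `delaunay_tetrahedra`) and the tolerance `eps` are hypothetical choices made for this example.

```python
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def circumsphere(p0, p1, p2, p3):
    """Return (center, squared_radius) of the sphere through four points,
    or None if the points are (nearly) coplanar."""
    # |c - pi|^2 = |c - p0|^2 gives the linear system
    # 2 (pi - p0) . c = |pi|^2 - |p0|^2 for i = 1, 2, 3.
    rows = [[2.0 * (p[k] - p0[k]) for k in range(3)] for p in (p1, p2, p3)]
    rhs = [sum(p[k] ** 2 - p0[k] ** 2 for k in range(3)) for p in (p1, p2, p3)]
    det = det3(rows)
    if abs(det) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / det)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return center, r2


def delaunay_tetrahedra(points, eps=1e-9):
    """Index 4-tuples whose circumsphere has every other point strictly outside.
    Assumes general position (no five points on a common sphere)."""
    result = []
    for idx in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in idx))
        if sphere is None:
            continue
        center, r2 = sphere
        others = (points[j] for j in range(len(points)) if j not in idx)
        if all(sum((q[k] - center[k]) ** 2 for k in range(3)) > r2 + eps
               for q in others):
            result.append(idx)
    return result


points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
tetrahedra = delaunay_tetrahedra(points)
```

For these five points the convex hull splits into two tetrahedra that share the face spanned by the second, third and fourth point. The running time grows like the fifth power of the number of points, so a dedicated algorithm is needed for real structures.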
Singh, Tropsha and Vaisman applied Delaunay decomposition to protein structures as follows: they selected A to be the set of Cα atoms of the protein and analyzed the relationship between the volume and "tetrahedrality" of the Delaunay regions and the amino acid order, in order to predict secondary protein structure.
They gave the following definition:
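Definition 2 The tetrahedrality of the tetrahedron with edge-lengths ℓ1, ℓ2, ℓ3, ℓ4, ℓ5, ℓ6 is defined as
$$T = \sum_{i<j} \frac{(\ell_i - \ell_j)^2}{15\,\bar{\ell}^{\,2}},$$
where ℓi is the length of edge i and $\bar{\ell}$ is the mean of the six edge lengths.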
Note that the tetrahedrality of the regular tetrahedron is 0.
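The two quantities analyzed below, volume and tetrahedrality, can be computed from the four vertex coordinates. The following Python sketch assumes the commonly used form of tetrahedrality, T = Σ_{i<j} (ℓi − ℓj)² / (15 ℓ̄²), with ℓ̄ the mean of the six edge lengths; the function names are hypothetical and the code is an illustration, not the implementation used in this work.

```python
import math
from itertools import combinations


def edge_lengths(vertices):
    """The six edge lengths of a tetrahedron given as four (x, y, z) points."""
    return [math.dist(a, b) for a, b in combinations(vertices, 2)]


def tetrahedrality(vertices):
    """Sum of squared edge-length differences over all 15 edge pairs,
    normalized by 15 times the squared mean edge length (assumed form)."""
    lengths = edge_lengths(vertices)
    mean = sum(lengths) / len(lengths)
    spread = sum((a - b) ** 2 for a, b in combinations(lengths, 2))
    return spread / (15.0 * mean ** 2)


def volume(vertices):
    """Volume from the scalar triple product of three edge vectors."""
    p0, p1, p2, p3 = vertices
    a = [p1[k] - p0[k] for k in range(3)]
    b = [p2[k] - p0[k] for k in range(3)]
    c = [p3[k] - p0[k] for k in range(3)]
    triple = (a[0] * (b[1] * c[2] - b[2] * c[1])
              - a[1] * (b[0] * c[2] - b[2] * c[0])
              + a[2] * (b[0] * c[1] - b[1] * c[0]))
    return abs(triple) / 6.0


regular = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
corner = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
```

The regular tetrahedron `regular` has tetrahedrality 0, in agreement with the remark above, while the less regular `corner` tetrahedron has a positive value.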
Results and discussion
In what follows, A ⊆ ℝ³ is always a subset of the atoms of a protein, preferably the heavy atoms (i.e., non-hydrogen atoms) or just the Cα atoms.
Our complete test set was selected from the RS-PDB by the following criteria: the entry needs to contain at least one protein with no missing atoms, and the resolution of the structure has to be at least 2.2 Å. We found 5,757 such entries in the RS-PDB database.
In contrast with Singh, Tropsha and Vaisman, we have taken A to be the set of heavy atoms of the 5,757 proteins. Note that in this case we cannot assume that the points are in general position: for example, in a (perfect) benzene ring at least six carbon atoms lie on a common sphere. However, we have found that all regions are tetrahedra, probably due to both the imprecision of the data in the PDB and minor perturbations in atomic positions. Instead of examining the distributions of the volume and the tetrahedrality of the regions separately, we created density maps in both variables at the same time. The triple logarithmic plot can be seen in Figure 2. It is quite straightforward to see that at the boundary of the protein the tetrahedra tend to be more irregular and of larger volume, while inside the protein the tetrahedra are small, compact, and regular (see Figure 1). However, the more intricate analysis depicted in Figure 2 shows a distinctly characteristic distribution. One of our main results is the identification of regions of the plot of Figure 2 that are strictly characteristic of the vertex composition of the tetrahedra involved.
Figure 2. The triple logarithmic plot of the density of Delaunay regions. A point with coordinates (x, y) on the plot corresponds to all Delaunay regions whose volume is 10^(x ± 0.01) and whose tetrahedrality is 10^(y ± 0.01); the color of the point corresponds to log(z + 1), where z is the number of such regions. The white barplot at the bottom of the image shows the same for volume only.
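The binning described in the caption of Figure 2 can be sketched as follows. Each region is placed in a cell of width 0.02 in both log10(volume) and log10(tetrahedrality), and each occupied cell receives the value log(z + 1). The base of this last logarithm is not stated in the caption; base 10 is an assumption made here, and the function name `density_map` and its parameters are hypothetical.

```python
import math
from collections import Counter


def density_map(regions, half_width=0.01):
    """Bin (volume, tetrahedrality) pairs on a log10-log10 grid.

    Returns a dict mapping integer cell indices (i, j) to log10(z + 1),
    where z is the number of regions in the cell. The cell with indices
    (i, j) is centered at x = 2 * half_width * i, y = 2 * half_width * j.
    """
    cell = 2.0 * half_width
    counts = Counter()
    for vol, tet in regions:
        if vol <= 0.0 or tet <= 0.0:
            continue  # the logarithm is undefined for these regions
        i = round(math.log10(vol) / cell)
        j = round(math.log10(tet) / cell)
        counts[(i, j)] += 1
    return {key: math.log10(z + 1) for key, z in counts.items()}


sample = [(10.0, 0.1), (10.2, 0.1), (100.0, 0.01)]
density = density_map(sample)
```

In this small example the first two regions fall into the same cell, so that cell holds the value log10(3), while the third region is alone in its cell. Regions of exactly zero tetrahedrality (perfectly regular tetrahedra) cannot be placed on a logarithmic axis and are skipped in this sketch.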
Labeling the vertices of the tetrahedra
Next we examined the tetrahedra grouped according to the atoms in their vertices. Every tetrahedron was assigned a label formed by merging the symbols of the four elements in its corners in alphabetic order. (For example, a tetrahedron spanned by a nitrogen, two carbon atoms and an oxygen is assigned the label C_C_N_O_.) Grouped by these labels, the counts of the tetrahedra are listed in Table 1.
Table 1. The counts of different types of Delaunay tetrahedra in the test set of 5,757 PDB entries. Tetrahedron C_C_N_O_ (containing the peptide bond of amino acids) turns out to be the most frequent, with 19,463,268 occurrences in our test set. The frequencies of the other labels decrease exponentially.
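The labeling and counting step can be sketched in a few lines of Python. The sketch joins the alphabetically sorted element symbols with underscores and omits the trailing underscore that appears in some labels in the text; that choice, the function names, and the sample data are assumptions made for illustration only.

```python
from collections import Counter


def tetra_label(elements):
    """Label of a tetrahedron: its four element symbols in alphabetic order."""
    return "_".join(sorted(elements))


def count_labels(tetrahedra):
    """Count tetrahedra by label; each tetrahedron is a sequence of four symbols."""
    return Counter(tetra_label(t) for t in tetrahedra)


sample = [
    ("N", "C", "C", "O"),
    ("C", "O", "N", "C"),
    ("S", "C", "O", "C"),
]
counts = count_labels(sample)
```

Because the symbols are sorted before joining, the first two sample tetrahedra receive the same label regardless of the order in which their vertices are listed.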
Volume-shape distribution of different types of tetrahedra
We observed that splitting the density plot according to the vertex composition of the Delaunay tetrahedra shows different patterns for different labels. This is one of our main results, depicted in Figure 3.
Figure 3. Separate density maps for different tetrahedra. We give here density maps similar to that of Figure 2, but now drawn separately for tetrahedra with vertices C_C_N_O (inset A), C_C_O_S (inset B), C_N_O_S (inset C) and N_N_O_O (inset D). It is clear that different vertex compositions imply different shape/volume distributions.
Ligand atoms in tetrahedra from proteins
Here we analyze the atomic environments of ligand atoms bound to proteins. The atomic environment of each ligand atom is identified with the vertices of the tetrahedron that contains the ligand atom in a tetrahedral decomposition of the heavy atoms of the protein.
By this approach we can describe the environment of ligand atoms in proteins uniformly and in a discrete manner. The classification is given by describing the tetrahedra according to the atoms in their vertices and according to the ligand atoms that the convex hulls of these tetrahedra contain (Figure 4). One of our main results is the statistical analysis of the frequencies of the separate ligand atoms in the different types of tetrahedra formed from protein atoms, given in Table 2 and Table 3.
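Finding the tetrahedron that contains a given ligand atom reduces to a point-in-tetrahedron test. The following Python sketch uses the signs of barycentric coordinates, computed from signed volumes, and scans the tetrahedra linearly. It illustrates the idea and does not describe the implementation used in this work; the function names, the data layout, and the tolerance `eps` are hypothetical.

```python
def signed_volume6(a, b, c, d):
    """Six times the signed volume of tetrahedron (a, b, c, d)."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def contains(tet, p, eps=1e-12):
    """True if point p lies inside or on the boundary of the tetrahedron."""
    a, b, c, d = tet
    total = signed_volume6(a, b, c, d)
    if abs(total) < eps:
        return False  # degenerate (flat) tetrahedron
    # Replacing one vertex by p gives that vertex's barycentric coordinate
    # times `total`; p is inside when no coordinate is negative.
    parts = (signed_volume6(p, b, c, d), signed_volume6(a, p, c, d),
             signed_volume6(a, b, p, d), signed_volume6(a, b, c, p))
    return all(part * total >= -eps for part in parts)


def ligand_environment(protein_atoms, tetrahedra, ligand_xyz):
    """Label of the first tetrahedron containing the ligand atom, or None.

    protein_atoms: list of (element_symbol, (x, y, z)) for heavy atoms.
    tetrahedra: list of 4-tuples of indices into protein_atoms.
    """
    for tet in tetrahedra:
        corners = [protein_atoms[i][1] for i in tet]
        if contains(corners, ligand_xyz):
            return "_".join(sorted(protein_atoms[i][0] for i in tet))
    return None


atoms = [("C", (0, 0, 0)), ("N", (1, 0, 0)), ("O", (0, 1, 0)), ("C", (0, 0, 1))]
inside = ligand_environment(atoms, [(0, 1, 2, 3)], (0.1, 0.1, 0.1))
outside = ligand_environment(atoms, [(0, 1, 2, 3)], (1.0, 1.0, 1.0))
```

A ligand atom lying outside the convex hull of the protein atoms is contained in no tetrahedron, and the sketch returns `None` for it.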
Table 2. The classification of the tetrahedra around metal ligand atoms. Tetrahedron types not listed contain no metal atoms.
Table 3. The classification of the tetrahedra around frequent non-metallic ligand atoms. An atom is called frequent if it appears in at least 100 entries in our data set.
We use a previously described ligand-identification technique together with published classifications of monomer IDs. Concisely, we double-checked whether a ligand, even one with more than one monomer ID, is a single molecule or not by comparing the bond tables from mmCIF with the atomic distances. A ligand was discarded if it was recognized as a crystallization artifact, a covalently bound (but non-protein) molecule, or a junk molecule.
In this work we prepared the simplicial decomposition of 5,757 protein structures, chosen from the Protein Data Bank by quality criteria: every atom has coordinates (i.e., there are no missing atoms) and the resolution of the structure is at least 2.2 Å. The heavy atoms (that is, non-hydrogen atoms) of the structures were decomposed into Delaunay regions using the qhull algorithm. Next we depicted the tetrahedrality/volume relation in a triple logarithmic plot (Figure 2) and counted the tetrahedra of different vertex sets in Table 1. We found that tetrahedra with different atoms in their vertices populate different areas of the plot of Figure 2: Figure 3 shows that the data points corresponding to tetrahedra of a given atomic composition occupy well-characterizable positions in Figure 2. This result shows the spatial preferences of tetrahedra of distinct composition in protein structures. Further exploration of this avenue may lead to methods that help in silico protein folding studies. We also used the RS-PDB database to find crystallographically verified ligands in our test set of 5,757 proteins. The tetrahedra containing the atoms of these ligands were then collected and are given in Tables 2 and 3. We believe that these large-scale data will help in the in silico identification of ligand-binding preferences in inhibitor design and in ligand binding prediction.
The authors declare that they have no competing interests.
Rafael Ördög designed and prepared the simplicial database, analyzed it with the triple logarithmic plots of Figure 2 and Figure 3, and analyzed the data of tetrahedra of different atomic types and ligands. Zoltán Szabadka designed and prepared the RS-PDB database, including the cleaning methods, and helped with the discretization. Vince Grolmusz initiated the simplicial decomposition of the protein spatial data, led the work and wrote the paper.
This research was partially supported by the European Commission FP6 program "scrIN-SILICO" and by the Hungarian OTKA agency, under grant Nos. T046234 and NK67867. Parts of this work were done in cooperation with Uratim Ltd. and Math-for-Health LLC.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 1, 2008: Asia Pacific Bioinformatics Network (APBioNet) Sixth International Conference on Bioinformatics (InCoB2007). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S1.
[Annals of Discrete Mathematics, 29]
Communications of the ACM 1975, 18(9):509-517.
Journal of Computational Biology 1996, 3(2):213-222.
Barber CB, Dobkin DP, Huhdanpaa H: The Quickhull Algorithm for Convex Hulls. ACM Transactions on Mathematical Software 1996, 22(4):469-483. [http://citeseer.ist.psu.edu/article/barber95quickhull.html]