Keck Graduate Institute of Applied Life Sciences, Claremont, CA, USA

Abstract

Background

A structure alignment method based on a local geometric property is presented and its performance is tested in pairwise and multiple structure alignments. In this approach, the writhing number, a quantity originating from integral formulas of Vassiliev knot invariants, is used as a local geometric measure. This measure is used in a sliding window to calculate the local writhe down the length of the protein chain. By encoding the distribution of writhing numbers across all the structures in the Protein Data Bank (PDB), protein geometries are represented in a 20-letter alphabet. This encoding transforms the structure alignment problem into a sequence alignment problem and allows the well-established algorithms of sequence alignment to be employed. Such geometric alignments offer distinct advantages over structural alignments in Cartesian coordinates, as they better handle structural subtleties associated with slight twists and bends that distort one structure relative to another.

Results

Programs for pairwise local alignment (TLOCAL) and multiple alignment (TCLUSTALW) were readily adapted from existing code for Smith-Waterman pairwise alignment and for multiple sequence alignment using CLUSTALW. The alignment algorithms employed a block scoring matrix (TBLOSUM) generated using the frequency of changes in the geometric alphabet of a block of protein structures. TLOCAL was tested on a set of 10 difficult protein pairs and found to give high quality alignments that compare favorably to those generated by existing pairwise alignment programs. A set of protein comparisons involving hinged structures was also analyzed, and TLOCAL was again seen to compare favorably to other alignment methods. TCLUSTALW was tested on a family of protein kinases and revealed conserved regions similar to those previously identified by a hand alignment.

Conclusion

These results show that the encoding of the writhing number as a geometric measure allows high quality structure alignments to be generated using standard algorithms of sequence alignment. This approach provides computationally efficient algorithms that allow fast database searching and multiple structure alignment. Because the geometric measure can employ different window sizes, the method allows the exploration of alignments on different, well-defined length scales.

Background

As the number of protein structures continues to grow, structure comparison techniques have become an increasingly crucial bioinformatics tool. Because protein structures evolve more slowly than protein sequences, structure comparison can be used to assess distant evolutionary relationships and common functions for pairs that do not have high sequence similarity (cf.

Rigid body superpositions with distance metrics are less than optimal because subtle twists or bends in a protein structure can have a profound influence on the scoring of the alignment. These are often corrected by considering local alignments or introducing gap penalties. Recently, a number of new algorithms have been developed that allow for the flexible alignment of local fragments

The use of such measures is similar in spirit to earlier work on the differential geometry of proteins (see

In the present work, we extend the consideration of non-distance related metrics to develop algorithms for structure alignment. The writhing number is used as a local geometric measure that describes the curvature of the protein backbone formed from short connected segments of α-carbon atoms. Originally defined to describe the topology of closed circular DNA, the definition of the writhing number has recently been extended to consider open polygonal chains. Using a sliding window, the writhing number is calculated along successive regions of the chain. This calculation provides a local geometric profile of each protein. The regions considered in this work encompass 4, 5, 6 and 10 α-carbons. The values for the writhing number at each different length scale are separately encoded into a 20-letter alphabet by partitioning the histogram of all segment values obtained from the RCSB Protein Data Bank (PDB) into bins and assigning each bin a letter in the alphabet. This procedure allows standard sequence alignment algorithms to be used to compare the geometric profiles. Using this approach, we have successfully "re-sequenced" all 52,087 proteins available in the PDB at the time of this work and have stored them in our own database for quick access. Using a block alignment approach identical to that used in calculating the BLOSUM substitution matrix, a scoring matrix for substitutions in the geometric alphabet was determined. Using this matrix (referred to as TBLOSUM) and our resequenced structure data bank, standard sequence alignment methods were used to perform structure alignments. To validate this approach, the local Smith-Waterman alignment (TLOCAL) and CLUSTALW (referred to as TCLUSTALW) were used to perform pairwise alignment and multiple structure alignment, respectively. Their performance compares favorably with existing methods.

Results

Pairwise alignment of "difficult" structures

Using a database of sequences encoded from writhing numbers and a block scoring matrix (see Methods), several test proteins were selected to optimize the performance of TLOCAL and compare it to other methods. Alignments of ten "difficult" pairs of structure

Table 1 shows the performance of the TLOCAL algorithm for different size windows for the ten "difficult" pairs [see

Comparison of topological alignments for different window sizes for a group of "difficult" proteins.

In Table 2, the quality of the alignments as measured by AFPRMSD is compared for TLOCAL (window size of 5), CE and FATCAT [see ... $L^{1/3}$ ... $R_g$ also scales as: $R_g \propto L^{1/3}$, where $L$ is the protein length

Comparison of different alignment methods for "difficult" proteins.

and for simplicity we set $R_0$ to 1 Å. This quantity is now used in Tables 1-3 to compare all alignments of different lengths. The reduced AFPRMSD shows that TLOCAL outperforms FATCAT in all cases and CE in 6 out of 10 cases [see Additional file

Comparison of the effect of window size (TLOCAL) and alignment method on alignment scores for hinged proteins.

Pairwise alignment of hinged proteins

As an additional assessment of the performance of the local topological alignment algorithm, the performance on the alignment of structures with flexible or hinged regions was determined. The difficulty in aligning proteins with hinged regions motivated the development of new structural alignment programs; FATCAT

Multiple sequence alignment on a kinase superfamily

In addition to pairwise alignments performed by TLOCAL, the performance of the multiple structure alignment program (TCLUSTALW) was also examined. To evaluate the performance of TCLUSTALW, a family of protein kinases was aligned and the identified conserved regions were compared with those determined previously by a hand alignment. These 25 sequences include serine/threonine and tyrosine kinases provided by Scheeff and Bourne

The hand alignments and those resulting from TCLUSTALW are shown side-by-side in Table 4 [see

Multiple sequence alignment of the kinase superfamily - Comparison of topological MSA with hand MSA

Discussion

In this work, a geometric profile of an individual protein is created by calculating the writhing number of consecutive segments (sliding window) along the protein chain. The profile is then encoded into a geometric alphabet by associating a range of numerical values with different letters of the alphabet. This alphabet is determined by observing the histogram of the frequency of writhing values in all segments of all the proteins observed in the PDB. This histogram is partitioned into bins and a letter from the geometric alphabet is associated with each bin. The numerical range of the bins is adjusted so that each bin contains the same number of segments. Thus, if a segment is chosen at random, it has an equal chance of falling into any one bin. Consequently, each letter in the "geometric alphabet" has an equal chance of occurring in a protein structure. The motivation for partitioning the histogram in this fashion is to maximize the information content of the alphabet. Other ways of encoding the writhing number could conceivably be more effective. For instance, some geometric features may be more relevant or distinctive than others, and it might be important to carefully delineate the values of the writhing numbers associated with these features. Such a level of detail has not been investigated to date and, lacking such information, the maximum information entropy approach is taken as a good first approach to encoding the local topological information in the protein profile.
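As a brief worked illustration of why equal-population bins maximize the information content, the Shannon entropy of the alphabet can be written out explicitly. The symbols $H$, $p_i$ and $K$ are introduced here for illustration only, with $p_i$ the probability that a randomly chosen segment falls into bin $i$ and $K$ the number of letters:

```latex
H = -\sum_{i=1}^{K} p_i \log_2 p_i , \qquad \sum_{i=1}^{K} p_i = 1
\quad\Longrightarrow\quad
H \le \log_2 K , \ \text{with equality when } p_i = \tfrac{1}{K} \text{ for all } i .

K = 20 : \qquad H_{\max} = \log_2 20 \approx 4.32 \ \text{bits per letter}
```

Any unequal partition of the histogram gives a lower entropy than this bound, which is the sense in which the equal-population choice is a maximum entropy encoding.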

A second important issue is the size of the alphabet used to encode the writhing number, a continuous variable. In principle, the smaller the bin range, the greater the information content. In the limit of the bin size approaching the inherent error in the writhing number, no further information is captured by decreasing the bin size. This error limit could be obtained by the propagation of the experimental error of the α-carbon atom positions used in the calculation of the writhing number. However, in mapping the structural alignment problem into a sequence alignment problem, not only is an accurate encoding required but an accurate scoring system must also be obtained. As the alphabet is expanded, more data are needed to accurately determine the values of the substitution matrix. Additionally, the programs calculating alignments become increasingly computationally intensive. There is therefore a trade-off between increasing the resolution of the bins of the histogram and the concomitant loss of scoring accuracy and increase in computational cost. Again, these issues have not yet been explored in depth. Our strategy has been to adopt the twenty-letter alphabet common to existing protein sequence alignment and to investigate the performance of the topological alphabet and scoring system under these familiar conditions. In keeping with these conditions, the gap penalties are treated as adjustable parameters and are generally in the range of values used for sequence alignment.

Given these conditions, the structure alignment matches local geometric propensities between different proteins and aligns the topological sequence to optimize the score from these propensities. As such, no Cartesian spatial associations can be directly assigned to these alignments. This topological association rather than a direct physical association is at the heart of the method and allows the alignment to avoid the difficulties with spatial alignment of rigid bodies as exemplified by the problem with hinged proteins. While the geometric alignment method does not allow for the familiar three-dimensional viewing employed in most existing structural alignment algorithms, this approach directly addresses the deeper issue of comparing similar structural regions that are offset by intervening differences. The problem of properly assigning alignments on either side of a hinge region is then approachable by this method.

Difficulties such as those presented by hinged proteins call into question the very nature of the structure alignment problem. Several authors have suggested that the alignment problem as commonly posed is not a well-defined problem and may not have an optimal solution (cf.

To allow comparison with methods that use distance metrics as a measure of alignment quality, we employed the device of identifying AFPs from the topological alignment and using these segments as rigid bodies for a local structural alignment. The RMSD could then be calculated from the sum of all these local alignments. Using this measure, we observe that the Smith-Waterman topological alignment, TLOCAL, compares favorably with CE and with FATCAT for both "difficult" protein pairs and for hinged proteins. This demonstrates the versatility of the method in handling situations that have traditionally been problematic for structure alignment methods. Despite the good performance with the AFPRMSD distance metric, one must bear in mind that such metrics are not optimized by the topological alignment and that this method is distinctly different from distance-based alignment methods.

In addition to the versatility of handling pairwise alignments, the topological alignment method can easily be extended to areas of structural bioinformatics that have traditionally been very difficult because of their computational intensity. Two of these include fast database searching and multiple structure alignment. Our results using TCLUSTALW are particularly encouraging with the example of the alignment of TPK family members. Members of the TPK family all contain a Universal Core Domain consisting of a small, mostly β-sheeted N-terminal subdomain and a larger mostly α-helical C-terminal subdomain

Conclusion

This work shows initial encouraging results for developing a suite of structure alignment software tools based on a geometric encoding of protein structures. With a limited exploration of the parameters of the method, competitive performance of pairwise alignment has been demonstrated. Additionally, a computationally efficient and accurate multiple structure alignment has been achieved. The advantage of this method over other approaches is that it performs alignments on a well-defined length scale as dictated by the sliding window employed in generating the geometric alphabet. Current work is extending the method to rapid database searching using SBLAST, the structural equivalent of BLAST. Additional work will also focus on developing a range of substitution matrices based on different block and evolutionary models. Also, a more systematic exploration of alphabet size and segment size is currently underway. Thus, there is significant opportunity to further optimize this unique set of structural alignment software tools.

Methods

Calculating the writhing number

The writhing number can be calculated for chains of arbitrary length **r**_{13}, **r**_{14}, **r**_{23}, and **r**_{24}, as seen in Figure **r**_{12 }and **r**_{34 }is right or left handed, the writhing number is positive or negative.

**Definition of vectors for a polygonal curve**. Definition of vectors used in the computation of the writhe number of two segments of a polygonal curve. Points 1 and 2 define the first segment, and 3 and 4 the second. The vectors **r**_{13}, **r**_{14}, **r**_{23}, and **r**_{24}, are translated so that they originate at the center of a unit sphere. The area

To handle polygonal curves with more than four points, the writhing numbers for all the distinct pairs of vectors are added together. Thus, the writhing number

where

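$\Omega_{i,j} = \left(\arcsin(\mathbf{a}_{i,j}\cdot\mathbf{b}_{i,j}) + \arcsin(\mathbf{b}_{i,j}\cdot\mathbf{c}_{i,j}) + \arcsin(\mathbf{c}_{i,j}\cdot\mathbf{d}_{i,j}) + \arcsin(\mathbf{d}_{i,j}\cdot\mathbf{a}_{i,j})\right)\,\operatorname{sign}\left((\mathbf{r}_{j,j+1}\times\mathbf{r}_{i,i+1})\cdot\mathbf{r}_{i,j+1}\right)$ (3)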

and

with **r**_{i,i+1 }representing the vector between points **r**_{j,j+1 }× **r**_{i,i+1})·**r**_{i,j+1}). Larger positive or negative values of
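The calculation described in this section can be sketched in Python as follows. This is a minimal illustration rather than the code used in this work. It assumes that the unit vectors **a**, **b**, **c** and **d** of Equation 3 are the normalized cross products of the vectors **r**_{13}, **r**_{14}, **r**_{24} and **r**_{23} (a common construction for the solid angle of two segments), that pairs of adjacent segments contribute nothing, and that the sum over segment pairs is normalized by 2π. The function names `segment_pair_omega`, `writhe` and `local_writhe` are hypothetical.

```python
import math
import numpy as np

def _unit(v):
    """Return v scaled to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def segment_pair_omega(p1, p2, p3, p4):
    """Signed solid angle for segment (p1, p2) and segment (p3, p4)."""
    r12, r34 = p2 - p1, p4 - p3
    r13, r14, r23, r24 = p3 - p1, p4 - p1, p3 - p2, p4 - p2
    # Assumed definition of the unit normals a, b, c, d used in Equation 3.
    a = _unit(np.cross(r13, r14))
    b = _unit(np.cross(r14, r24))
    c = _unit(np.cross(r24, r23))
    d = _unit(np.cross(r23, r13))
    dots = np.clip([a @ b, b @ c, c @ d, d @ a], -1.0, 1.0)
    omega = float(np.arcsin(dots).sum())
    # Handedness of the crossing: sign((r34 x r12) . r14), as in Equation 3.
    return omega * float(np.sign(np.cross(r34, r12) @ r14))

def writhe(points):
    """Writhing number of an open polygonal chain (assumed 1/(2*pi) normalization)."""
    pts = np.asarray(points, dtype=float)
    n_seg = len(pts) - 1
    total = 0.0
    for i in range(n_seg):
        for j in range(i + 2, n_seg):  # skip adjacent segments
            total += segment_pair_omega(pts[i], pts[i + 1], pts[j], pts[j + 1])
    return total / (2.0 * math.pi)

def local_writhe(ca_coords, window=5):
    """Writhe of every window of `window` consecutive alpha-carbon positions."""
    pts = np.asarray(ca_coords, dtype=float)
    return [writhe(pts[k:k + window]) for k in range(len(pts) - window + 1)]
```

For a chain of N α-carbons and a window of w residues, `local_writhe` returns N - w + 1 values, one per window position. A planar chain gives zero for every window, and the mirror image of a chain gives the same profile with opposite sign.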

Defining an alphabet for the geometric measure

Using Equation 2, the writhing number for each window was calculated for all PDB proteins available from the RCSB Data Bank. The frequency of occurrence of writhing numbers calculated using a sliding window of 4, 5, 6 and 10 residues is shown in Figure

**Distribution of writhing numbers across protein structures**. The distribution of writhing numbers from segments of all proteins in the PDB using window sizes of 4, 5, 6 and 10. The histogram was divided into twenty regions of constant population (area under the curve). These 20 regions were used to define a topological alphabet. Notice that the range of the writhing number increases with segment size.

the entropy function is maximized when $p_i = p_j$, where $p_i$ is the probability that an arbitrary writhing number is assigned to the $i$-th letter of the alphabet. Using the writhing number bins and their corresponding letters, all PDB proteins were encoded into the "geometric alphabet". The encoding of writhing numbers into a geometric alphabet ignores the identity of the amino acids themselves.
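The equal-population binning and encoding described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the bin edges are taken as empirical quantiles of the pooled writhing numbers, and the 20 letters are assumed to be the one-letter amino acid codes in alphabetical order, since the actual letter assignment is not given here. The names `ALPHABET`, `equal_population_edges` and `encode` are hypothetical.

```python
import numpy as np

# Assumption: reuse the 20 amino acid letters; the real bin-to-letter map is not specified.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def equal_population_edges(values, n_letters=20):
    """Interior bin edges chosen so that each bin holds about the same number of values."""
    qs = np.linspace(0.0, 1.0, n_letters + 1)[1:-1]
    return np.quantile(np.asarray(values, dtype=float), qs)

def encode(values, edges, alphabet=ALPHABET):
    """Map each writhing number to the letter of the bin it falls into."""
    idx = np.searchsorted(edges, np.asarray(values, dtype=float), side="right")
    return "".join(alphabet[i] for i in idx)
```

In use, the edges would be computed once from the writhing numbers of all segments in the structure library (separately for each window size), and `encode` would then be applied to the local writhe profile of each protein to produce its geometric sequence.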

Calculating a block substitution matrix

A substitution matrix was calculated to score alphabet substitutions when comparing protein structures encoded by the geometric alphabet. This matrix is referred to as TBLOSUM and was determined from multiple sequence alignments of closely related proteins found in the PDB. Using their SCOP classification, 44,234 proteins were grouped into 589 families as defined by their SCOP lineage (a list of these families can be obtained upon request to the authors). Only those families consisting of more than 20 members were considered. These proteins were aligned using CLUSTALW based on their original amino acid sequences using the default BLOSUM62 matrix, a gap opening penalty of 4 and a gap extension penalty of 1. Following the alignment, the geometric alphabet was superimposed upon the sequence. The statistics of geometric alphabet substitutions were determined for alignment blocks. The transitional frequencies for all possible transitions are given as:

These transition frequencies for each letter pair are summed across all blocks for all aligned families. The frequency of each member of the alphabet is obtained by simply summing over the respective transition frequencies. These single alphabet frequencies are used to calculate the expected transition frequencies, e_{XY}, assuming that alphabet pairs, X, Y, occur randomly with the members of a pair being proportional to their respective alphabet frequency. The score for any transition is the negative log-ratio of the observed frequency of the transition to the expected frequency of the transition, derived in the same manner as a BLOSUM substitution matrix:

If the transition between i and j occurs more frequently than random, it is given a negative score. However, if the transition occurs less frequently than random, the transition is assigned a positive value. Figure
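The scoring procedure described above can be sketched as follows. This is a minimal illustration that assumes the standard BLOSUM-style counting of letter pairs within each column of a gap-free block; the exact transition frequency formula and the logarithm base used in this work are not reproduced here, so base 2 is an assumption. The sketch follows the sign convention stated in the text (negative log-ratio) by default; standard BLOSUM matrices use the opposite sign, which `negative=False` reproduces. The names `pair_counts` and `log_odds_scores` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pair_counts(blocks):
    """Count unordered letter pairs in every column of every gap-free block."""
    counts = Counter()
    for block in blocks:            # block: list of equal-length encoded strings
        for column in zip(*block):
            for x, y in combinations(column, 2):
                counts[tuple(sorted((x, y)))] += 1
    return counts

def log_odds_scores(counts, negative=True):
    """Log-ratio of observed to expected pair frequencies for each observed pair."""
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}
    # Single-letter frequencies: a mixed pair contributes half to each letter.
    letter = Counter()
    for (x, y), q in observed.items():
        if x == y:
            letter[x] += q
        else:
            letter[x] += q / 2.0
            letter[y] += q / 2.0
    sign = -1.0 if negative else 1.0
    scores = {}
    for (x, y), q in observed.items():
        expected = letter[x] ** 2 if x == y else 2.0 * letter[x] * letter[y]
        scores[(x, y)] = sign * math.log2(q / expected)  # base 2 is an assumption
    return scores
```

For the toy block `["AAB", "AAB", "ABB"]`, the pair A/A is observed more often than expected and A/B less often, so with the text's convention A/A receives a negative score and A/B a positive one.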

**The BLOCK scoring matrix for the encoded writhing number**. Diagram represents a color coded scoring matrix for an alphabet of 20 letters and a window size of 5.

Calculating RMSD scores for alignments

Pairwise geometric alignments using the local dynamic programming algorithm TLOCAL optimize the alignment score based on the new scoring system (TBLOSUM). As previously noted, this approach to protein alignment is not intended to minimize global RMSD. Rather, it aligns regions of the proteins that show similarity in local topological profiles and does not allow a direct Cartesian rendering of the alignment. To allow for a comparison of our method with other alignment techniques, we sought a simple way to compute an RMSD based on the topological alignment. We used the topological alignment to identify aligned fragment pairs (AFPs). The RMSD is computed for every AFP containing at least four aligned pairs. As an example, we consider the following:

VNLDW--Q-QWTW

TPLDWOPQRRWSY

For the five pairs making up the first AFP and the four pairs in the third AFP, we compute a composite RMSD score, but for the single pair of Qs in the middle, no RMSD can be computed and these are not considered in our AFP RMSD calculation. The RMSD values for the AFPs are calculated using the UCSF Chimera package from the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, which performs rigid translations and rotations to minimize the RMSD between the aligned residues of a pairwise alignment. We calculate the RMSD for each AFP. The RMSD of each AFP is squared and multiplied by its length in aligned pairs. The resulting numbers are summed and divided by the total number of aligned pairs in all the AFPs used. The square root of this number is taken as the RMSD of the alignment. This procedure was applied to the CE alignments, as well as the TLOCAL alignments. One must bear in mind that CE was not designed to minimize the RMSD calculated in this fashion and is not optimized for this scoring function.
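The AFP identification and the composite RMSD described above can be sketched as follows. This is a minimal illustration: the per-AFP RMSD values are assumed to be supplied by an external superposition step (UCSF Chimera in this work), and the function names `aligned_fragment_pairs` and `composite_rmsd` are hypothetical.

```python
import math

def aligned_fragment_pairs(row_a, row_b, min_pairs=4, gap="-"):
    """Return (start_column, length) for each gap-free run of at least min_pairs columns."""
    afps, start = [], None
    n_cols = min(len(row_a), len(row_b))
    for k in range(n_cols + 1):
        aligned = k < n_cols and row_a[k] != gap and row_b[k] != gap
        if aligned and start is None:
            start = k                      # a new gap-free run begins
        elif not aligned and start is not None:
            if k - start >= min_pairs:     # keep only runs that are long enough
                afps.append((start, k - start))
            start = None
    return afps

def composite_rmsd(lengths, rmsds):
    """Length-weighted root mean square of the per-AFP RMSD values."""
    weighted = sum(n * r * r for n, r in zip(lengths, rmsds))
    return math.sqrt(weighted / sum(lengths))
```

Applied to the example alignment above, `aligned_fragment_pairs("VNLDW--Q-QWTW", "TPLDWOPQRRWSY")` returns the five-pair and four-pair fragments and skips the isolated pair of Qs. With per-AFP RMSD values r_1 and r_2 for those two fragments, the alignment RMSD is the square root of (5 r_1^2 + 4 r_2^2) / 9.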

Performing the alignments

All alignments were performed using open source versions of the Smith-Waterman dynamic programming algorithm and CLUSTALW. The run times for these applications do not differ from those found in sequence alignment applications. The computationally intensive aspect of the work is the encoding of the PDB coordinates into a library of geometric sequences. At the time of this work, the library consisted of 52,087 proteins with a database length of 15,072,799 amino acids. For a window size of 4, the CPU run time was 4.71 hours. For a window size of 10, the run time was 98.76 hours. All results were obtained on an IRIX64 server with 16 CPUs, 14 GB of available memory and 128 MB of swap.
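For readers unfamiliar with the local alignment step, a minimal Smith-Waterman sketch operating on encoded geometric sequences is given below. It is not the open source implementation used in this work. It assumes a linear gap penalty for brevity (the alignments reported here used adjustable gap penalties), and the `toy_score` function is a hypothetical stand-in for a TBLOSUM lookup.

```python
def smith_waterman(seq_a, seq_b, score, gap=4):
    """Best local alignment of two strings; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                          H[i - 1][j] - gap,   # gap in seq_b
                          H[i][j - 1] - gap)   # gap in seq_a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup (integer scores)."""
    return 2 if x == y else -1
```

The traceback compares scores for exact equality, which is safe for integer scores such as those of `toy_score`; with a real-valued matrix, storing traceback pointers during the fill step would be the more robust choice.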

Authors' contributions

AR worked on pairwise alignments and comparison to other methods. PC worked on multiple structure alignments. AR and PC worked on coding of writhe numbers and PC worked on libraries. GD provided the original impetus for the project and oversaw the project.

Acknowledgements

This work was supported by NIH grant 1P01GM 63208. The authors thank Dr. Eric Scheeff and Dr. Phil Bourne for valuable discussions and appreciate their sharing of the multiple alignment data. Structural alignments were performed using the UCSF Chimera package from the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco (supported by NIH P41 RR-01081).