Abstract
Background
Many functional proteins have a symmetric structure. Most of these are multimeric complexes, which are made of nonsymmetric monomers arranged in a symmetric manner. However, there are also a large number of proteins that have a symmetric structure in the monomeric state. These internally symmetric proteins are interesting objects from the point of view of their folding, function, and evolution. Most algorithms that detect the internally symmetric proteins depend on finding repeating units of similar structure and do not use the symmetry information.
Results
We describe a new method, called SymD, for detecting symmetric protein structures. The SymD procedure works by comparing the structure to its own copy after the copy is circularly permuted by every possible number of residues. The procedure is relatively insensitive to symmetry-breaking insertions and deletions and amplifies positive signals from symmetry. It finds 70% to 80% of the TIM barrel fold domains in the ASTRAL 40 domain database and 100% of the beta-propellers as symmetric. More globally, 10% to 15% of the proteins in the ASTRAL 40 domain database may be considered symmetric according to this procedure, depending on the precise cutoff value used to measure the degree of perfection of the symmetry. Symmetrical proteins occur in all structural classes and can have a closed, circular structure, a cylindrical barrel-like structure, or an open, helical structure.
Conclusions
SymD is a sensitive procedure for detecting internally symmetric protein structures. Using this procedure, we estimate that 10% to 15% of the known protein domains may be considered symmetric. We also report an initial, overall view of the types of symmetries and symmetric folds that occur in the protein domain structure universe.
Background
Many protein chains are made of repeating units of similar structure, which are often arranged in a beautifully symmetric manner. Some well-known examples are the 8-fold symmetric "TIM" barrel folds, the β-blade propellers, the α/α superhelices, the leucine-rich repeat horseshoe-shaped structures, etc. (See, for example, the review by Andrade et al. [1].)
Occurrence of symmetric structures poses a number of questions: What sequence and energetic features make repeating units fold into a similar structure and cause them to arrange in a symmetric pattern? What is the biological function of such symmetric chains? How are they different from the symmetric structures of multimeric complexes, which are formed by symmetrically assembling nonsymmetric monomers? How many symmetric chains and what types of symmetry exist in the protein universe? What is their evolutionary history?
Symmetric structures also tend to cause problems for automatic domain partition programs, which may recognize a single repeating unit or several such units as a domain for some chains and the whole repeat set with the full symmetry for others of the similar structure. For structures with superhelical symmetry, automatic structure comparison can be a problem also because of the flexibility of the structure between the repeating units. One or a few units in two such structures can be recognized as similar, but the whole structures can often be sufficiently different for routine detection of the similarity.
In order to answer some of the questions posed above, we need to collect as many symmetric protein samples as possible. It is also useful to identify symmetric structures and separate them from the nonsymmetric structures before an automatic structure comparison or domain partition operation. For these and other reasons, a number of procedures have been developed for identifying symmetric structures over the years. These procedures can be broadly classified into two groups.
One class of methods finds repeats by using a structure alignment program that can report not just one optimal alignment but many other independent, suboptimal alignments between a pair of proteins. Repeats are found by running such a program on a protein and its own copy to find non-trivial self-alignments. An early work using this principle is by Kinoshita et al. [2], but more recent works by others [3-6] use essentially the same principle. These methods depend on the ability of the structure comparison programs to find suboptimal, yet still significant, structural alignments. Each individual repeat is found independently of the others, and the fact that the same motif is repeated in a regular pattern is not explicitly used.
Another class of methods explicitly exploits the periodic occurrence of repeats along the primary sequence. The method by Taylor et al. [7], its sequel DAVROS [8] and the method by Murray et al. [9] all start from the N-by-N SAP [10] matrix, where N is the number of residues in the protein. Each element of this matrix gives a measure of the similarity of the structural environments of a pair of residues. Two segments of similar structures, without a gap in the sequence, will appear as two symmetrical diagonal lines of high scores in this matrix. The methods detect periodic occurrence of such lines using different mathematical devices (Fourier and wavelet transforms). The method by Chen et al. [11] is partly similar in that it also makes use of the periodic features of an N-by-N matrix, although their matrix is different. Each matrix element in this case represents a subsequence of given length and position in the primary sequence. The value of the matrix element is one if the structure of the corresponding subsequence is found to be similar to that of at least one other subsequence of the same length, thus requiring an ungapped structural similarity, and zero otherwise. Since the prominent features appear differently in this matrix than in the SAP matrix above, a couple of different methods are used to recognize the repeats. At least one of these (the Pearson correlation) again depends on the periodic occurrence of prominent features along the primary sequence. Generally, this class of methods may depend too much on the regularity of the repeats; insertions or deletions, either within or between the repeats, reduce the signal and make the detection difficult.
Here we describe a different method, SymD (for Symmetry Detection), which makes use of the symmetry of the structure. In this method a protein structure is aligned to itself after circularly permuting the second copy by every possible number of residues. For each circular shift, we keep only one optimal, non-self structural alignment, fully allowing gaps and unaligned loops. We call this process the alignment scan. This has the effect of amplifying the signal for the symmetric transformation of the structure. Suppose the structure is two-fold symmetric. Call the two similar parts A and B. All known procedures, with the possible exception of the methods that use OPAAS [6,12], will report the similarity of A to B and B to A separately, with a score corresponding to matching N/2 residues. However, if the second copy is circularly permuted by N/2 residues, it has the structure BA, and since the structure is symmetric, it will match the original structure AB in its entirety. Therefore, the alignment scan will report a score corresponding to N matched residues at the position N/2 against a background of no significant alignment at all other shift values. From the superposition matrix at the position of the maximum score, we also get the rotation angle (180° in this case), the position and orientation of the symmetry axis, and the translation along the symmetry axis when the symmetry is that of a helix. The boundaries of repeating units, and in fact all the residues that make up each repeating unit, can be obtained from the structure-based sequence alignment between the original and the circularly permuted sequences. We report some individual sample cases and also some statistics on the occurrence of symmetric structures found in the SCOP1.73 ASTRAL 40% set [13] using this procedure.
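As an illustration of how a rotation angle, an axis direction, and a translation along the axis can be read off a superposition, the following minimal Python sketch decomposes a rigid-body transformation x -> Rx + t into screw parameters. It is not the SymD code: the function name `screw_parameters`, the nested-list representation, and the assumption that a proper rotation matrix R and a translation vector t are already available are all choices made for this example. The position of the axis in space is not computed here.

```python
import math


def screw_parameters(R, t):
    """Decompose x -> R x + t into (angle in degrees, unit axis, shift along axis).

    R is a 3x3 proper rotation matrix given as nested lists; t is a 3-vector.
    """
    # trace(R) = 1 + 2 cos(angle)
    cos_a = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    angle = math.degrees(math.acos(cos_a))
    if 1.0 - cos_a < 1e-12:
        raise ValueError("rotation axis is undefined for a pure translation")
    # The symmetric part of R equals cos(a) I + (1 - cos(a)) n n^T, so n n^T can
    # be recovered even at 180 degrees, where the antisymmetric part vanishes.
    nnT = [[((R[i][j] + R[j][i]) / 2.0 - (cos_a if i == j else 0.0)) / (1.0 - cos_a)
            for j in range(3)] for i in range(3)]
    k = max(range(3), key=lambda i: nnT[i][i])
    norm = math.sqrt(nnT[k][k])
    axis = [nnT[i][k] / norm for i in range(3)]
    # The antisymmetric part equals 2 sin(a) n; use it to fix the sign of the axis.
    v = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    if sum(a * b for a, b in zip(axis, v)) < 0:
        axis = [-a for a in axis]
    # Component of the translation parallel to the axis (the helical rise).
    shift = sum(a * b for a, b in zip(axis, t))
    return angle, axis, shift
```

For a quarter turn about the z-axis combined with a rise of 1.3 Å, this returns an angle of 90°, the axis (0, 0, 1), and a shift of 1.3. A pure 2-fold rotation gives 180° with zero shift.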
Results
Z-scores from alignment scans
For a protein with N residues, the SymD procedure performs N-3 structure alignments (the alignment scan). Each alignment starts from an initial alignment that forces the original sequence to align with the same sequence circularly permuted by n residues, where n ranges from 1 to N-3. This initial alignment is then refined by cycles of structure superposition and sequence alignment. (See Methods.)
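The starting point of the scan can be sketched in a few lines of Python. The sketch below builds only the forced initial pairing for each circular shift and scores a toy two-fold "structure" by counting identically labelled pairs. The refinement by superposition and realignment is not reproduced, and the function names, the zero-based residue indices, and the label-matching score are assumptions made for illustration.

```python
def initial_pairs(n_residues, shift):
    """Pair residue i of the original with residue (i + shift) mod N of the copy."""
    return [(i, (i + shift) % n_residues) for i in range(n_residues)]


def scan_shifts(n_residues):
    """Shifts visited by the alignment scan: 1, 2, ..., N-3."""
    return range(1, n_residues - 2)


def toy_scan(labels):
    """Count identically labelled pairs in the initial alignment at each shift."""
    n = len(labels)
    return {s: sum(labels[i] == labels[j] for i, j in initial_pairs(n, s))
            for s in scan_shifts(n)}


# A toy two-fold "structure": the second half (B) repeats the first half (A).
scores = toy_scan(list("abcdabcd"))
print(scores)
```

In this toy case all N = 8 positions match at the shift N/2 = 4 and none match at the other shifts, which is the amplification effect described in the Background.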
Fig. 1 shows the Z-scores based on the T-score (a weighted number of aligned residues, see Methods) of all refined alignments from an alignment scan for six sample structures. The structures are shown in Fig. 2.
Figure 1. Z-scores from alignment scans of 6 sample domains. The Z-score vs. rotation angle scatter plot for all alignments from the alignment scan for (a) d1s99a_, (b) d1wd3a2, (c) d1jofc_, (d) d1vzwa1, and (f) d2j8ka1. The red points are those whose rotation axis is within about 20° (cosθ > 0.95) of that of the point with the highest Z-score. Others are in black. Panel (e), for d1bk5a_, is an exception; here the Z-scores are plotted against the average alignment shift (the average of the residue serial number differences between aligned residues). Note that points with negative Z-scores are not shown. Note also that the self-alignment peak, which would occur at angle 0° or zero average shift, is not visible in these plots. This is because aligned pairs of residues whose serial numbers differ by 3 or fewer were not included in the T-score calculation (see Methods).
Figure 2. Structures of the 6 sample domains. The ribbon rendering of the structure of the proteins of Fig. 1: (a) d1s99a_, a 2-fold symmetric ferredoxin-like fold, (b) d1wd3a2, a 3-fold symmetric beta-trefoil, (c) d1jofc_, a 7-bladed beta-propeller, (d) d1vzwa1, a 2-fold symmetric TIM barrel, (e) d1bk5a_, an alpha/alpha superhelix, and (f) d2j8ka1, a right-handed beta-helix with square cross-section. The ribbons are colored in rainbow colors, starting from blue for the N-terminus and ending with red at the C-terminus. The calculated symmetry axis is shown as a black rod with a ball at the center in this and in all other figures.
D1s99a_ is the product of the B. subtilis YkoF gene, which is involved in the hydroxymethyl pyrimidine (HMP) salvage pathway. The structure (Fig. 2, panel a) is made of 2-fold symmetric, tandem repeats of a ferredoxin-like fold [14]. Panel (a) of Fig. 1 is a scatter plot of the Z-score vs. the rotation angle for all the alignments from the alignment scan. It clearly shows one angle, 180°, at which the Z-scores are much larger than at any other angle.
D1wd3a2 is the arabinose-binding domain of alpha-L-arabinofuranosidase B from Aspergillus kawachii. It has the structure (Fig. 2, panel b) of the 3-fold symmetric beta-trefoil fold [15]. The Z-score versus rotation angle scatter plot (Fig. 1, panel b) shows two peaks, at 120° and 240°, respectively.
D1jofc_ is a muconate lactonizing enzyme from Neurospora crassa. It has the 7-fold symmetric 7-bladed beta-propeller structure (Fig. 2, panel c), in which the C-terminal end of the molecule comes near the N-terminus to complete the first repeating unit [16]. Therefore, when the sequence of the duplicated structure is circularly permuted by one repeating unit, all units align nearly as well as when the second sequence is not shifted at all. The resulting Z-score plot (Fig. 1, panel c) shows 6 peaks of nearly the same height at angles that are nearly multiples of 360°/7 ≈ 51.4°.
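The peak positions expected for an ideal closed k-fold structure follow from simple arithmetic, sketched below in Python. The function name `expected_peak_angles` is hypothetical, and real peaks deviate from these ideal values because the symmetry is only approximate.

```python
def expected_peak_angles(fold):
    """Ideal non-trivial rotation angles (degrees) of a closed k-fold symmetry."""
    return [360.0 * k / fold for k in range(1, fold)]


for fold in (2, 3, 7):
    print(fold, [round(a, 1) for a in expected_peak_angles(fold)])
```

A 2-fold structure gives one angle (180°), a 3-fold structure gives two (120° and 240°), and a 7-fold structure gives six, spaced about 51.4° apart, in line with panels (a) to (c) of Fig. 1.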
D1vzwa1 is a bispecific phosphoribosyl isomerase (PriA) from Streptomyces coelicolor. It has a (β/α)_{8} TIM barrel structure (Fig. 2, panel d) with a 2-fold symmetry [17] and the less perfect 8-fold symmetry of the (β/α)_{8} barrels. The Z-score plot (Fig. 1, panel d) reflects these symmetries: it shows one major peak at 180° and a lower, but clearly recognizable, set of peaks at 45° intervals.
D1bk5a_ is the major portion of yeast Karyopherin alpha, which is a selective nuclear import factor. It has an ARM-type alpha-alpha superhelical repeat structure (Fig. 2, panel e), in which 10 repeating units of 3 alpha-helices each (except for the first unit, which has only 2 helices) are arranged in a superhelical manner [18]. The Z-score plot shows 8 successively decreasing peaks at intervals of about 43 residues (Fig. 1, panel e). A superhelical structure is an open structure in which the N- and C-termini are at opposite ends of the molecule. For such a structure, shifting the sequence of one molecule by a repeating unit reduces the number of matching units by one even when the sequence is circularly permuted. This explains the decrease in peak height at successively larger shifts. Ideally, there would be 9 peaks since the structure contains 10 repeats. However, the last peak is weak and does not rise above the background level.
Notice that the Z-scores are plotted against the average alignment shift rather than the rotation angle for this structure. The average alignment shift [19] is the average of the residue serial number differences between aligned residue pairs after the alignment has been fully refined by RSE. For this structure, the Z-score plot shows a more regular pattern when plotted against the average alignment shift than when plotted against the rotation angle (plot not shown). This implies that the number of residues is more conserved than the relative orientation between the repeats in this structure.
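The average alignment shift is easy to state in code. The sketch below assumes that an alignment is available as a list of (i, j) residue serial number pairs and that the difference is taken as the signed value j - i with no special treatment of pairs that wrap around the circular permutation. Both are assumptions of this example rather than details given in the text.

```python
def average_alignment_shift(aligned_pairs):
    """Mean of (j - i) over aligned residue pairs (i, j)."""
    if not aligned_pairs:
        raise ValueError("alignment contains no residue pairs")
    return sum(j - i for i, j in aligned_pairs) / len(aligned_pairs)


# Hypothetical alignment in which most residues are shifted by one 43-residue repeat.
pairs = [(1, 44), (2, 45), (3, 47)]
print(average_alignment_shift(pairs))
```

Because a loop insertion changes the shift of only a few pairs, the average stays close to the repeat length.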
D2j8ka1 is a protein made by the fusion of two pentapeptide repeat proteins, Np275 and Np276, from Nostoc punctiforme [20]. It has the structure of a parallel beta-helix with a square cross-section (Fig. 2, panel f). Let us label the successive corners of the square cross-section as C_{1}, C_{2}, etc. The basic symmetry of the structure, call it H_{1}, is a helical operation that matches C_{n} to C_{n+1}. The rotation angle of this symmetry is about 90° and the translation is small (1.3 Å, see below). The structure also has the symmetry H_{2} = H_{1}^{2}, which matches C_{n} to C_{n+2}, H_{3} = H_{1}^{3}, which matches C_{n} to C_{n+3}, etc. Among these compounded symmetries, the 4-repeat symmetry, H_{4} = H_{1}^{4}, is special since this operation matches one layer to the next layer. This is nearly a pure translation (of about 4.4 Å, see below), which essentially corresponds to the separation between the two parallel beta-strands. It has a small rotation component of -2° (see below), which represents the slight left-handed twist between successive layers (Fig. 2, panel f).
The Z-score plot that the SymD program gives for this structure is shown in Fig. 1, panel f. Concentrating on only those points for which the Z-score is >10, the plot shows 4 sets of points, at approximately 90°, 180°, 270° and 360°. These sets correspond to the square cross-section of the structure. The point with the highest Z-score in the first set has a rotation angle of 88.6° and a translation of 1.29 Å. This is the H_{1} symmetry. The other points in the same set represent H_{1}*H_{4}, H_{1}*H_{4}^{2}, etc. Similarly, the points in the second set, at around 180°, represent symmetries H_{2}, H_{2}*H_{4}, H_{2}*H_{4}^{2}, etc., and the third set, at around 270°, represents symmetries H_{3}, H_{3}*H_{4}, H_{3}*H_{4}^{2}, etc. The point with the highest Z-score in the last set of points has a rotation angle of -1.9° and a translation of 4.42 Å. This represents the H_{4} symmetry, which essentially translates the whole molecule along the helix axis by one layer, with a small left-handed twist. The other points in the same set at successively larger negative angles represent H_{4}^{2}, H_{4}^{3}, etc.
Number of symmetric structures in the universe of protein structural domains
The symmetries detected here are all pseudo- or approximate symmetries. The imperfections in the symmetry arise both because the repeating units are not all exactly the same and because they are not arranged in a perfectly symmetrical manner. Therefore, the notion of symmetric structure requires using a cutoff value of some variable. We used the Z-score (see Methods) for this purpose.
SymD was run on all 9479 domains in the SCOP1.73 ASTRAL 40% set downloaded from the ASTRAL website http://astral.berkeley.edu/. The results of these runs (the highest Z-score, and the rotation angle and the translation for the alignment with the highest Z-score) for all domains are given in additional file 1: all_domains.xls. Fig. 3 shows the number of proteins that have a Z-score above the value indicated along the X-axis. At a sufficiently low Z-score cutoff value, nearly all proteins have a Z-score higher than the cutoff value. As the Z-score cutoff value is raised, the number of proteins with a Z-score higher than the cutoff value initially decreases rapidly and then more gradually at higher cutoff values. The number changes smoothly, indicating that there are proteins with all degrees of perfection of symmetry.
Figure 3. Number of symmetric domains vs. Z-score cutoff value. The height of each bar gives the number of proteins in the ASTRAL 40 domain dataset that have a Z-score higher than the cutoff value given on the X-axis.
Additional file 1. all_domains. An Excel file that gives the maximum Z-score, and the rotation angle and the translation along the rotation axis corresponding to the alignment with the maximum Z-score, for each of the 9479 ASTRAL 40 domains.
A better feel for the relation between symmetry and the Z-score can be obtained by considering the Z-score distribution of the alignments that are not likely to represent the symmetry of the protein. We selected two alignments among the N-3 refined alignments from the alignment scan for each protein: the 'best' alignment and the best among the 'noise' alignments. The 'best' alignment is the one with the highest Z-score. The 'noise' alignments are those whose rotation axis makes an angle of more than about 20° (cosine of the angle less than 0.95) with that of the 'best' alignment. Fig. 4 shows the Z-score distribution among the 9479 proteins for the 'best' (red) and for the best 'noise' (black) alignments. The black curve shows that there are few 'noise' alignments with a Z-score above 10 and that the number increases sharply when the Z-score falls below 8.
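The 'noise' criterion amounts to a threshold on the cosine of the angle between two rotation axes. A minimal Python sketch, assuming each axis is given as a 3-vector with a consistent orientation convention (the function name `is_noise` and that convention are assumptions of this example):

```python
import math


def is_noise(axis, best_axis, cos_cutoff=0.95):
    """True if the axis deviates from the best axis by more than the cutoff."""
    dot = sum(a * b for a, b in zip(axis, best_axis))
    norm = math.sqrt(sum(a * a for a in axis)) * math.sqrt(sum(b * b for b in best_axis))
    return dot / norm < cos_cutoff


def tilted(deg):
    """Unit vector tilted by the given angle (degrees) away from the z-axis."""
    r = math.radians(deg)
    return [math.sin(r), 0.0, math.cos(r)]


best = [0.0, 0.0, 1.0]
print(is_noise(tilted(10), best), is_noise(tilted(30), best))
```

An axis tilted by 10° is kept with the 'best' alignment, while one tilted by 30° is classed as 'noise'.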
Figure 4. Z-score distributions. Distribution of the Z-scores of the 'best' alignments (red) and the best among the 'noise' alignments (black). The number of 'noise' alignments is small above a Z-score of 10 and rises sharply below a Z-score of 8.
On this basis, we consider a protein to be symmetric when the Z-score of its 'best' alignment is above 8 or 10. The number of symmetric domains in the ASTRAL40 dataset is then 968 (10%) or 1385 (15%) depending on whether the Z-score cutoff value used is 10 or 8, respectively.
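The cutoff-based count and the reported percentages can be checked with a short Python sketch. The function names are hypothetical, and treating "above" as a strict inequality is an assumption of this example. The counts 968 and 1385 and the total of 9479 domains are the values reported in the text.

```python
def count_above(z_scores, cutoff):
    """Number of domains whose best Z-score exceeds the cutoff."""
    return sum(1 for z in z_scores if z > cutoff)


def percent(count, total):
    return 100.0 * count / total


TOTAL_DOMAINS = 9479
reported = {10: 968, 8: 1385}  # cutoff -> number of symmetric domains (from the text)
fractions = {cutoff: percent(n, TOTAL_DOMAINS) for cutoff, n in reported.items()}
print({cutoff: round(f, 1) for cutoff, f in fractions.items()})
```

The fractions come to about 10.2% and 14.6%, which the text rounds to 10% and 15%.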
Symmetric SCOP folds
There are 1081 Folds in the ASTRAL 40% dataset we used. The number of symmetric domains in each of these Folds, using the Z-score cutoff values of 8 and 10, is given in additional file 2: all_folds.xls. Many SCOP Folds contain both domains that are judged to be symmetric and domains that are not. The numbers of SCOP Folds in which at least one, at least 50%, or 100% of the domains are symmetric, using the Z-score cutoff criteria of 10 or 8, are given in Table 1. For example, 8% or 13% of all SCOP Folds contain half or more domains that are judged to be symmetric. But many of these are singletons and other small folds (folds with a small number of domains). If one considers only the 198 Folds that contain at least 10 domains, the number of Folds of which half or more of the domains are judged to be symmetric is 21 (11%) or 31 (16%) depending on the Z-score cutoff value used.
Table 1. Number of SCOP Folds that contain symmetric domains
Additional file 2. all_folds. An Excel file that gives the number of domains with Z-score 8 or higher, the number with Z-score 10 or higher, and the total number of domains in each of the 1081 SCOP Folds.
There are 23 or 33 Folds that contain 10 or more symmetric domains, using the Z-score cutoff value of 10 or 8, respectively. Fig. 5 shows the total number of domains and the number of symmetric domains in the 33 folds. The names of these folds are listed in Table 2.
Table 2. Names of SCOP Folds shown in Fig. 5
Figure 5. SCOP Folds that contain at least 10 symmetric domains. There are 33 SCOP Folds that contain at least 10 domains with a Z-score of 8 or better. The X-axis represents these Folds. For each of these Folds, the Y-axis gives the total number of domains in the Fold (tip of the white bar), the number of domains with a Z-score of 8 or better (tip of the grey bar), and the number of domains with a Z-score of 10 or better (tip of the black bar). The Folds along the X-axis were sorted according to the height of the black bars.
The fold with the most symmetric domains is the (β/α)_{8} TIM barrel fold. SymD finds 223 (69%) to 268 (83%) of the 322 domains of this fold symmetric, depending on the Z-score cutoff value used. Many of the structures with a low Z-score are non-symmetric because of the presence of one or more large extensions outside of the barrel. Others have distorted barrels. Two examples of such structures are shown in Fig. 6, to be compared with the symmetric TIM barrel shown in Fig. 2d.
Figure 6. Two TIM barrel domains with low Z-scores. (a) D1gqia1 with a Z-score of 3.03, which is the lowest Z-score among all TIM barrel domains in the ASTRAL40 dataset. The domain has 561 residues, of which only the N-terminal 320 residues, colored in different shades of blue to green, may be considered to constitute the symmetric core. The remaining 241-residue C-terminal extension, colored white, is not symmetric. (b) D1jfxa_ with a Z-score of 7.20, which is near the lower cutoff value of 8, in rainbow color from blue for the N-terminus to red for the C-terminus. This structure has a distorted TIM barrel fold: the fifth helix (yellow) is extended and barely discernible, the 6th, 7th, and 8th helices are missing, and the 8th strand is in the reverse direction, being antiparallel to its neighboring 7th and 1st strands.
The alpha-alpha superhelix fold (Fig. 2e) has the second largest number of symmetric domains. Many of the domains in this fold that have low Z-scores are either short or contain collections of helical hairpins that do not maintain a superhelical symmetry.
All 41 beta-trefoil domains (Fig. 2b) are found to be symmetric at the Z-score cutoff level of 8 and all but 2 at the level of 10. All the beta-propeller domains (Fig. 2c) are found to be symmetric at the Z-score cutoff level of 10. The 4- to 7-bladed propellers are included in Fig. 5; the six 8-bladed beta-propellers are not shown. All but two of the 20 transmembrane beta-barrels are recognized as symmetric at the Z-score cutoff level of 10 and all but one at the level of 8. It should be noted that the beta-hairpins in these structures are often not all of the same length and that some strands in some structures make excursions into the inside of the barrel.
SymD finds only a relatively small fraction of domains with the Ferredoxin fold symmetric. The basic unit of the Ferredoxin fold is an alpha-helix followed by a beta-hairpin. A typical Ferredoxin fold contains two of these units, which ideally are related by a two-fold symmetry. However, many structures are circularly permuted and have an extension at the N- and/or C-terminus of the molecule. Also, the two helices are often positioned and oriented in an asymmetric manner. The most symmetric ones in this fold actually have four basic units, of which the symmetry relates one pair to the other pair. (See, for example, Fig. 2a.)
The fold with the largest number of domains in the ASTRAL domain dataset is the immunoglobulin-like beta-sandwich fold (fold #32 in Fig. 5). SymD finds most of the domains in this fold to be non-symmetric. Only two out of 369 domains have a Z-score greater than 10. These are shown in Fig. 7 along with another protein with a median Z-score for comparison.
Figure 7. Three domains with the immunoglobulin fold with high and medium Z-scores. (a) D1ex0a2 (Z-score = 10.9, the highest). (b) D3frua1 (Z-score = 10.4, the second highest). (c) D1ogae1 (Z-score = 5.753, near the median). Each structure is colored in rainbow colors from blue for the N-terminus to red for the C-terminus. The viewing direction is down the calculated symmetry axis, indicated by a black rod, which is not visible because of the black ball placed at the center of the axis.
Discussion
Characteristics of the SymD procedure
The simple method of doing the alignment scan using a permuted structure has at least a couple of advantages compared to other reported methods. One is that at each stage, it finds the best alignment allowing gaps and unaligned regions. This is a significant advantage because repeating units in a symmetric structure often contain loops of varying lengths. Other algorithms that use information on the regularity of repeats [7-9,11] assume the absence of long gaps. Obviously, the length of the insertion matters: if the insertion is sufficiently large, the whole structure becomes non-symmetric because the non-symmetric inserted part can no longer be ignored. In such cases, SymD will also declare the structure non-symmetric, not because it does not find the symmetric repeating units but because the symmetric part is too small a part of the whole structure.
Another advantage is that this algorithm enhances the positive signal by making use of the symmetry. For example, suppose a structure has six repeating units arranged in a 6-fold symmetric manner, but one unit is rather different from the rest. Most programs will fail to recognize the 6-fold symmetry because this one unit does not align well with the others. In contrast, SymD will find that 4 of the 6 units match at every 6-fold position and could very well report the protein as symmetric. (Here again, the degree to which the one unit differs from the rest matters. If the one unit is not too different and the protein may still be considered symmetric overall, SymD will correctly declare the protein symmetric. On the other hand, if the one unit is very different, the whole structure is not truly 6-fold symmetric and yet SymD could still declare it symmetric on the strength of the 4 matches out of 6. But we do not expect this to happen very often because, if one unit is so very different, it is likely that the rest of the molecule would also deviate from the 6-fold symmetry and the SymD Z-score would correspondingly deteriorate.)
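The "4 of the 6 units" count can be reproduced with a toy model, sketched below in Python. Each repeating unit is reduced to a flag saying whether it resembles the consensus unit, and a pair of units is taken to match only if both are flagged as good. This all-or-nothing matching and the function name are simplifying assumptions of the example, not part of the SymD procedure.

```python
def matching_units(unit_is_regular, shift):
    """Units that still match when the copy is rotated by `shift` units."""
    n = len(unit_is_regular)
    return sum(1 for i in range(n)
               if unit_is_regular[i] and unit_is_regular[(i + shift) % n])


# Six units in a closed ring; the last one deviates from the rest.
units = [True, True, True, True, True, False]
matches = {shift: matching_units(units, shift) for shift in range(1, 6)}
print(matches)
```

At every non-trivial shift two pairings involve the deviant unit, once as the original and once as the copy, so 4 of the 6 units match at each of the five 6-fold positions.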
These two features probably contribute to the remarkable fact that SymD finds, for example, all 83 beta-propeller structures in the ASTRAL 40% dataset as symmetrical without exception (see Results).
On the other hand, the SymD algorithm is designed to detect the global symmetry of a protein structure. If the protein contains a symmetric part, but is not symmetric as a whole because of the presence of other parts or other domains, SymD will tend to declare the protein non-symmetric and not recognize the symmetric substructure. The algorithm can probably be modified to recognize local symmetry, but this will be the subject of a future study.
Number of symmetric proteins in the ASTRAL40 domain dataset
The number of symmetric proteins in a database depends both on the property one uses to measure the degree of symmetry and on the cutoff value of this property. We used in this study the Z-score of the T-score, which is a weighted number of aligned residues. In order to convert the T-score to the Z-score, we used a background distribution that was obtained from alignments of similar sized proteins whose rotation axis is significantly different from that of the best alignment (see Methods). A better method would probably use information other than just the size of the protein. For example, one could use protein class or secondary structure content information or perhaps a protein-specific background for each protein. We have not explored other quantities or other ways of obtaining the Z-score in this study.
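For orientation, the generic form of such a conversion is sketched below in Python: a score is standardized against the mean and standard deviation of a background sample. How the background T-scores are actually collected and binned by protein size is described in Methods and is not reproduced here. The function name, the use of the population standard deviation, and the sample values are assumptions of this example.

```python
import statistics


def z_score(t_score, background_t_scores):
    """Standardize a T-score against a background sample of T-scores."""
    mean = statistics.fmean(background_t_scores)
    sd = statistics.pstdev(background_t_scores)
    if sd == 0:
        raise ValueError("background sample has zero variance")
    return (t_score - mean) / sd


# Hypothetical numbers, for illustration only.
background = [10.0, 12.0, 14.0]
print(round(z_score(20.0, background), 3))
```

A protein-specific or class-specific background, as suggested above, would change only the sample passed as the second argument.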
Using the Z-scores, we found that 10% or 15% of the proteins in the ASTRAL40 dataset are symmetric depending on whether the Z-score cutoff value of 10 or 8 is used, respectively. This is comparable to the 14% found for proteins that contain duplicated sequence segments in all known protein sequences [21].
The number of symmetric folds is even more difficult to decide because many folds contain both symmetric and non-symmetric domains. But the fraction of symmetric folds falls in roughly the same range by a couple of different measures. For example, the fraction of folds that contain more than 50% of symmetric domains is 11% or 16%, using the two Z-score cutoff values, among the 198 folds that contain 10 or more domains (Table 1).
Comparison with other programs
The number of symmetric proteins found by SymD is compared with that of several other programs in Table 3. The comparison is not completely satisfactory because the datasets used are different and only a small number of numerical results are available in most cases. However, the Table shows that SymD finds a similar or larger number of domains as symmetric, except when compared to GANGSTA+ by Guerler et al. [5], which finds more ferredoxin-like and immunoglobulin-like domains as symmetric than SymD does.
Table 3. Comparison with other methods
Since Guerler et al. used the same dataset as we did and the full results are available on their web site, we made a more detailed comparison for the 8 folds listed in their Table 1. The results of this comparison are given in additional file 3: SymDGangstaData.xls, which gives the numerical values of the symmetry measures (Z-score for SymD and the fraction of sequentially aligned residues for GANGSTA+) by both programs for each domain of each fold; additional file 4: SymDGangsta.ppt, which gives the scatter plots of the symmetry measures; and additional file 5: SymDGangstaTable.xls, which gives a summary table of the numbers of domains that are considered symmetric or non-symmetric by the two programs. The scatter plots show generally good correlation between the two symmetry measures except for the four-helix up-and-down bundle fold. (See SymDGangsta.ppt.) The number of symmetric proteins depends on the cutoff value used for both methods. For the ferredoxin-like and the immunoglobulin-like folds, only small adjustments in the cutoff values would be sufficient to make the numbers of symmetric proteins the same. (See the SymDGangsta.ppt file.) However, the correlation is not perfect and the lists of symmetric proteins would not be the same even after the cutoff values were adjusted to make the numbers of symmetric proteins the same. For the beta-trefoil, 7-bladed beta-propeller, and TIM barrel folds, small adjustments of the cutoff values will not make the numbers of symmetric proteins the same.
Additional file 3. SymDGangstaData. An Excel file that contains 8 sheets in addition to a cover sheet. Each of these 8 sheets is for a fold in Table 1 of Guerler et al. [5]. Each sheet gives FSAR (the fraction of sequentially aligned residues) from GANGSTA+, the Z-score from SymD, and the ASTRAL domain name for each domain of the fold. It also gives the scatter plot between the two symmetry measures and the correlation coefficient between them.
Additional file 4. SymDGangsta. A PowerPoint file that contains 8 slides, each showing the scatter plot of the SymD and GANGSTA+ symmetry measures for one fold.
Additional file 5. SymDGangstaTable. An Excel file that gives a summary table of the number of domains in each fold, grouped into symmetric/nonsymmetric sets by SymD/GANGSTA+.
Types of symmetry
We observed many different types of symmetry by visual inspection of the structures. The pattern of the Z-scores given by the alignment scan procedure contains the symmetry information, as described in the Results section. We are currently working on developing a robust, automatic procedure for determining symmetry types from such data.
Symmetric structures can be open or closed. In a closed structure, the N- and C-termini of the molecule are physically close to each other and the symmetry is purely, or nearly purely, rotational. The amount of rotation is an integer fraction of 360°. Structures with 3- to 8-fold symmetries have been observed. These can be all alpha-helical (e.g. alpha-alpha toroid, Fig. 8a), all beta (e.g. beta-trefoil, Fig. 2b; beta-propellers, Fig. 2c) or a mixture (e.g. TIM barrel, Fig. 2d).
Figure 8. Four more examples of symmetric domains. (a) D1dceb_ is the Rab geranylgeranyltransferase b-chain from rat, exhibiting a 6-fold symmetric alpha/alpha toroid structure [28,29]. Each repeating unit is colored differently to show the 6 repeating units. (b) D1cz5a1 is VAT-N, the N-terminal domain of VAT protein, of the archaebacterium Thermoplasma acidophilum. It has a double-psi beta-barrel structure with 2-fold symmetry [29]. The symmetric core of the protein is colored blue for the N-terminal half and red for the C-terminal half. (c) D1uynx_ is the outer membrane translocator domain of an autotransporter from Neisseria meningitidis [30]. Its structure is a 12-stranded beta-barrel made of 6 up-and-down beta-hairpins. The ribbon representation is colored in rainbow colors, from blue to red for the N- to C-terminal residues. The symmetry with the highest Z-score is a 2-fold axis, but there are numerous other rotation and screw symmetries with different angles and pitches for this structure. (d) Human ribonuclease inhibitor [31], d1z7xw1. It has a leucine-rich repeat (LRR) fold. Coloring is rainbow coloring, blue to red from N- to C-terminus.
The 2-fold symmetric structures can be of two types. A two-fold symmetry occurs when the N-terminal and the C-terminal halves of the molecule attain a similar structure. In many such structures, the two halves fold more or less independently and form two subdomains. The structure shown in Fig. 2a is an example of such a structure. In other cases, however, the two halves interact intimately over most of their length, resulting in an intertwined structure, like the double-psi beta-barrel shown in Fig. 8b.
A special type is the transmembrane beta-barrels, which are made of up-and-down beta-hairpins that twist around the surface of a cylindrical barrel (Fig. 8c). These can have more than 8-fold symmetry, or more than 16 beta-strands, but all are closed structures in which the N- and C-termini come close to each other. Those with long barrels also have screw symmetries with many different rotation angles.
An open structure has a helical symmetry, and the N- and C-termini of the molecule are at opposite ends of the molecule. Such a structure can be all alpha-helical (e.g. alpha-alpha superhelix, Fig. 2e), all beta-strands (e.g. beta-helix, Fig. 2f), or a mixture (e.g. leucine-rich repeats, Fig. 8d). Typically there are a large number of repeating units and the rotation angle is not an integer fraction of 360°.
We have seen only monoaxial symmetries in this work. This is probably because we used the SymD algorithm to find one symmetry axis and ignored other signals that might arise from a second symmetry axis. A possible indication of the existence of a second axis for some structures is the curiously high Z-scores for many 'noise' alignments seen in Fig. 9. (See Methods.) This will be the subject of a future study.
Figure 9. Alignment scores of 'noise' alignments. The alignment score averaged over all non-self 'noise' alignments for each protein is plotted against the size of the protein. All-alpha and all-beta classes are shown in red and green, respectively. These classes include alpha-helix bundles (e.g. d1f68a_) and beta-sandwiches (e.g. d1cida1), which produce high T-scores in the 100- to 200-residue range. The black points are for proteins other than all-alpha and all-beta classes.
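The closed/open distinction described above can be illustrated with a small Python sketch. It is not the automatic classification procedure mentioned earlier as being under development; the function name `classify_symmetry` and both tolerance values are hypothetical choices made only for this illustration. It assumes that a rotation angle (in degrees) and a translation along the axis (in Å) are already available for the symmetry operation.

```python
def classify_symmetry(angle_deg, rise, angle_tol=5.0, rise_tol=1.0):
    """Label a symmetry operation as 'closed' (near-pure n-fold rotation)
    or 'open' (helical). Tolerances are illustrative assumptions."""
    if angle_deg <= 0.0:
        raise ValueError("rotation angle must be positive")
    # Nearest integer order n such that 360/n approximates the angle.
    n = max(2, round(360.0 / angle_deg))
    is_integer_fraction = abs(angle_deg - 360.0 / n) <= angle_tol
    has_no_rise = abs(rise) <= rise_tol
    if is_integer_fraction and has_no_rise:
        return ("closed", n)
    return ("open", None)


print(classify_symmetry(45.0, 0.1))   # 8-fold-like rotation with negligible rise
print(classify_symmetry(27.7, 4.8))   # rotation plus a substantial rise per unit
```

A 2-fold operation would be labelled "closed" by this sketch even though, as noted in the text, 2-fold structures can be either intertwined or made of two independent subdomains, so the label should be read only as "purely rotational".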
Conclusions
We have described the principle of and the initial results obtained from the newly developed program SymD. SymD is a true symmetry detection program, in contrast to many other procedures that are used for the same purpose but actually detect only structural repeats. The procedure is sensitive because (1) it allows detection of symmetry even when the structure contains symmetry-breaking insertions or deletions either within or between the repeating units and (2) it amplifies the symmetry signal. The procedure yields both the sequence and structural alignments after each symmetry operation. The sequence alignment gives information on the residues that make up the repeating units. The structural alignment, or the structure transformation matrix, gives information on the direction and position of the symmetry axis, the rotation angle, and the pitch if the symmetry is that of a helix. The procedure can detect more than one symmetry for a given molecule, as described for a 2-fold symmetric TIM barrel and a beta-helix structure.
A SymD run on the nearly 10,000 domains in the ASTRAL 40% domain database yielded a preliminary overall view of symmetric structures in the known domain structure world. It can be estimated that between 10% and 15% of the domains are symmetric. Many SCOP folds contain both symmetric and not-so-symmetric domains. The symmetries observed are broadly of two types, closed and open. In symmetric closed structures, the N- and C-termini of the molecule come close together and the structure has a purely rotational symmetry. Most of these have 3- to 8-fold rotational symmetries, but the transmembrane beta-barrels can have higher symmetries. In the symmetric open structures, the structure has a helical symmetry and the N- and C-termini are at opposite ends of the molecule. Structures with a 2-fold rotational symmetry do not fit either category well; they can have either a closed (intertwined) or an open structure.
Methods
Alignment Scan procedure
The alignment scan procedure works as follows. First, make a duplicate of the structure of interest. Call the original structure A and the duplicate B. Then generate the k-th initial sequence alignment by circularly permuting the sequence of structure B by k residues. (See Fig. 10.) This aligns residue i of structure B with residue i + k of structure A. If i + k is larger than N, where N is the total number of residues of the protein, residue i is left unaligned in the initial alignment. Then, the RSE program is run with this initial alignment to obtain a refined structure-based sequence alignment. The procedure is repeated for all values of k from 1 to N - s, where s = 3. For each value of k, the T-score (see below) is kept, as well as the transformation matrix for the optimal structural superposition and the refined sequence alignment.
Figure 10. The alignment scan procedure. In this illustration, the protein is made of 5 residues. The first row of 5 boxes represents the original sequence. The second row of 5 boxes represents the circularly permuted sequence in which the last residue is moved to the front. This box is shaded to indicate the permutation. The initial alignment aligns residues 2, 3, 4 and 5 of the first sequence to residues 1, 2, 3 and 4 of the second sequence, respectively. Residue 1 of the first sequence and residue 5 of the second sequence are not considered part of the initial alignment. This alignment is fed to the RSE routine, which refines it by structure superposition-sequence alignment cycles and produces a new, refined alignment output, indicated as n1. This process is repeated after circularly permuting the sequence by one more position each time. The refined alignment outputs are labeled by the number of positions of the initial permutation. This example shows that the initial alignment for the last cycle has just one pair of residues aligned. In a real case, the cycle stops when the number of residue pairs in the initial alignment is less than 3.
The RSE procedure used here has been described [22]. It takes an initial sequence alignment between two structures and refines it based on structural superposition of the two structures. Briefly, the procedure consists of iterating a two-step cycle. In the first step of the cycle, the best structural superposition is obtained for the given sequence alignment by minimizing the distances between Cα atoms of aligned residue pairs using Kabsch's procedure [23,24]. In the second step, an updated sequence alignment is obtained from the superposed structures using the SE algorithm. The iteration stops either when there is no change in the sequence alignment or when a set number of iterations have been made. The final alignment reported is the one that produced the best T-score among all cycles.
The SE procedure has also been described [25]. The procedure finds a sequence alignment from a given pair of superimposed structures. It works by finding seed alignments, extending them, and then selecting a consistent set of the extended seed alignments. The procedure does not use a gap penalty and generates alignments that are generally more accurate than those produced by methods that use the dynamic programming algorithm. Another virtue of SE is its speed: several hundred runs for a typical protein can be completed within a fraction of a second, which is required for the alignment scan procedure.
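The generation of the initial alignments can be sketched in a few lines of Python. This is only an illustration of the shift step described above, not the SymD source code; the function names are hypothetical, residues are numbered from 1 as in the text, and the refinement by RSE is deliberately left out.

```python
def initial_alignment(n_residues, k):
    """Initial alignment for shift k: residue i of copy B is paired with
    residue i + k of the original A; residues with i + k > N stay unaligned.
    Returns a list of (i_in_B, j_in_A) pairs, 1-based."""
    return [(i, i + k) for i in range(1, n_residues - k + 1)]


def scan_shifts(n_residues, s=3):
    """Yield (k, initial alignment) for k = 1 .. N - s, so that every
    initial alignment keeps at least s residue pairs."""
    for k in range(1, n_residues - s + 1):
        yield k, initial_alignment(n_residues, k)


# The 5-residue illustration: k = 1 pairs residues 1-4 of B with 2-5 of A.
for k, pairs in scan_shifts(5):
    print(k, pairs)
```

With N = 5 and s = 3 the scan stops at k = 2, whose initial alignment still has 3 residue pairs, in line with the stopping rule stated in the figure legend.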
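The control flow of this two-step cycle can be sketched as below. The sketch is an assumption-laden outline rather than the RSE implementation: `superpose`, `realign` and `score` are hypothetical callables standing in for the Kabsch superposition, the SE alignment step and the T-score, and `max_cycles` is an arbitrary illustrative limit.

```python
def refine_alignment(initial_alignment, superpose, realign, score, max_cycles=20):
    """Iterate superposition and re-alignment; return the best-scoring
    (alignment, transform, score) seen over all cycles."""
    alignment = initial_alignment
    best = (None, None, float("-inf"))
    for _ in range(max_cycles):
        transform = superpose(alignment)        # step 1: best superposition
        new_alignment = realign(transform)      # step 2: alignment from superposed structures
        t = score(transform, new_alignment)
        if t > best[2]:
            best = (new_alignment, transform, t)
        if new_alignment == alignment:          # converged: alignment unchanged
            break
        alignment = new_alignment
    return best


# Toy stand-ins: "alignments" are integers and the score peaks at 2.
toy_scores = {1: 5.0, 2: 9.0, 3: 7.0}
result = refine_alignment(
    0,
    superpose=lambda a: a,
    realign=lambda tr: min(tr + 1, 3),
    score=lambda tr, al: toy_scores[al],
)
print(result)
```

The toy run shows why the best alignment is tracked separately: the iteration converges at alignment 3, but alignment 2 scored higher along the way and is the one returned.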
We used the T-score, which is similar to the TM-score [26], as a measure of the quality of alignment. This is a weighted number of aligned residues given by the following formula:

$$T = \sum_{(i,j)} \frac{1}{1 + (d_{ij}/d_{o})^{2}}$$

where $d_{ij}$ is the distance between the Cα atoms of two aligned residues i and j, one from structure A and the other from structure B, respectively, $d_{o}$ = 2.0 Å, and the summation is over all aligned residue pairs, except those that are s residues or less apart in the original sequence, where s = 3. This latter condition was introduced in order to discourage self-alignments.
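A minimal Python sketch of such a score follows. The functional form used for each term, 1 / (1 + (d/d_o)^2), is an assumption based on the stated similarity to the TM-score; the function name and the 0-based indexing of the coordinate lists are choices made for this illustration only. The coordinates of B are taken to be already superposed on A.

```python
import math


def t_score(coords_a, coords_b, aligned_pairs, d0=2.0, s=3):
    """Weighted count of aligned residue pairs.

    coords_a, coords_b: lists of (x, y, z) C-alpha coordinates, with B
    already transformed onto A. aligned_pairs: (i, j) index pairs, i into
    A and j into B. Pairs with |i - j| <= s are skipped (near self-alignment).
    """
    total = 0.0
    for i, j in aligned_pairs:
        if abs(i - j) <= s:
            continue
        d = math.dist(coords_a[i], coords_b[j])
        total += 1.0 / (1.0 + (d / d0) ** 2)   # assumed TM-score-like weight
    return total


# Two perfectly superposed pairs count 1 each; the near-self pair is ignored.
a = [(float(i), 0.0, 0.0) for i in range(10)]
b = [(float(i - 4), 0.0, 0.0) for i in range(10)]
print(t_score(a, b, [(0, 4), (1, 5), (2, 3)]))
```

A pair separated by exactly d_o contributes 0.5 under this form, so the score behaves as a soft count of well-superposed residues.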
Finding the axis of rotation from the transformation matrix
For each refined alignment, the transformation matrix that transforms the duplicated structure B to optimally superimpose the original structure A is obtained and saved. This transformation matrix contains information on the position and orientation of the rotation axis, the rotation angle, and any translation along the rotation axis. The mathematical procedure for calculating these properties from the transformation matrix is given in the additional file 6: rotation_axis.pdf.
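For readers without access to the additional file, the following Python sketch shows one standard way to extract these quantities, assuming the transformation has the form x' = R x + t with R a proper rotation matrix. It is a textbook screw-axis decomposition and is not claimed to be the procedure of additional file 6; the function name and return layout are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def screw_parameters(R, t):
    """Return (angle in degrees, unit axis, translation along axis,
    a point on the axis) for the rigid motion x' = R x + t."""
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        raise ValueError("no rotation: the axis is undefined")
    # Symmetric part: (R + R^T)/2 = cos(theta) I + (1 - cos(theta)) u u^T.
    # Using it keeps the axis well defined at 180 degrees (2-fold symmetry).
    m = [[((R[i][j] + R[j][i]) / 2.0 - (cos_theta if i == j else 0.0))
          / (1.0 - cos_theta) for j in range(3)] for i in range(3)]
    k = max(range(3), key=lambda i: m[i][i])
    norm = math.sqrt(m[k][k])
    axis = [m[i][k] / norm for i in range(3)]
    # The antisymmetric part fixes the sign (it vanishes at exactly 180 degrees).
    w = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    if _dot(axis, w) < 0.0:
        axis = [-a for a in axis]
    rise = _dot(t, axis)
    t_perp = [t[i] - rise * axis[i] for i in range(3)]
    c = _cross(axis, t_perp)
    cot_half = 1.0 / math.tan(theta / 2.0)
    point = [0.5 * (t_perp[i] + cot_half * c[i]) for i in range(3)]
    return math.degrees(theta), axis, rise, point


# 90-degree rotation about an axis parallel to z through (1, 0, 0), rise 1.5.
print(screw_parameters([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                       [1.0, -1.0, 1.5]))
```

The returned point is the one on the axis closest to the origin of the coordinate frame; a rise near zero corresponds to a pure rotation and a non-zero rise to a screw (helical) operation.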
Additional file 6. rotation_axis. A PDF file that describes the mathematical procedure used for obtaining the rotation angle, the translation along the rotation axis, and the position of the rotation axis from the transformation matrix for optimal structural superposition.
Format: PDF. Size: 508 KB.
The Zscore calculation
The Z-score of an alignment was calculated by comparing the T-score of the alignment against the T-score expected for a protein that is not symmetrical. In order to obtain an estimate of this latter T-score, we first define the 'best' alignment for a protein as the alignment that produced the highest T-score among the N - 3 refined alignments. Then, a refined alignment k is considered a 'noise' alignment if (1) the cosine of the angle between the rotation axes for the k-th and the 'best' alignments is less than 0.95 and (2) none of the aligned residue pairs are within s = 3 residues of self (i.e. |i - j| > s for all aligned residue pairs i and j). Condition (2) was included in the definition of 'noise' alignment only for the Z-score calculation.
Fig. 9 shows the average T-score of all 'noise' alignments, defined as above, for each protein as a function of the size of the protein (the individual T-scores are too numerous to plot). It turns out that some of the all-α and all-β proteins have high 'noise' T-scores compared to others. We chose to exclude these proteins from the following curve-fitting procedure. With the remaining proteins (represented by the black points in Fig. 9), we computed the average T-score and the standard deviation of all 'noise' alignments within each 11-residue sliding window of protein sizes and then fit an exponential curve, y = a + b(1 - exp(cN)), through them, where N is the number of residues of the protein. This procedure yielded the following size dependence:
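The two conditions can be written as a small predicate. The Python sketch below follows the wording of the text literally (the signed cosine is compared with 0.95); the function name and argument layout are hypothetical, and the axes are assumed to be given as 3-vectors that need not be normalized.

```python
import math


def is_noise_alignment(axis_k, axis_best, aligned_pairs, s=3, cos_cutoff=0.95):
    """True if alignment k counts as 'noise' relative to the best alignment."""
    dot = sum(a * b for a, b in zip(axis_k, axis_best))
    norm = (math.sqrt(sum(a * a for a in axis_k))
            * math.sqrt(sum(b * b for b in axis_best)))
    cos_angle = dot / norm
    # Condition (1): rotation axis differs from that of the best alignment.
    different_axis = cos_angle < cos_cutoff
    # Condition (2): no aligned pair lies within s residues of self.
    no_near_self = all(abs(i - j) > s for i, j in aligned_pairs)
    return different_axis and no_near_self


print(is_noise_alignment((1, 0, 0), (0, 0, 1), [(1, 10), (2, 11)]))  # different axis
print(is_noise_alignment((0, 0, 1), (0, 0, 1), [(1, 10), (2, 11)]))  # same axis
```

Under this literal reading, an axis antiparallel to the best one (cosine near -1) also counts as different; whether SymD treats that case separately is not stated here.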
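where $\langle T(N) \rangle$ and $\sigma(N)$ are the average and the standard deviation of the 'noise' T-scores of proteins of size N residues.
The Z-score of a protein of size N was then obtained by

$$Z = \frac{T - \langle T(N) \rangle}{\sigma(N)}$$

where T is the T-score of the alignment.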
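The Z-score step can be sketched in Python as follows, assuming the usual definition Z = (T - mean) / standard deviation with the size-dependent 'noise' mean and standard deviation supplied by the caller. The fitted coefficients are not reproduced here, so the example uses placeholder functions; `window_stats`, its dictionary input, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def window_stats(noise_scores_by_size, center, width=11):
    """Mean and standard deviation of 'noise' T-scores pooled over proteins
    whose size lies in a sliding window of the given width around `center`.
    noise_scores_by_size maps protein size -> list of noise T-scores."""
    half = width // 2
    pooled = [t for n, scores in noise_scores_by_size.items()
              if abs(n - center) <= half for t in scores]
    return mean(pooled), pstdev(pooled)


def z_score(t_score, n_residues, noise_mean, noise_sigma):
    """Z-score of a T-score given callables for the size-dependent noise
    mean and standard deviation (e.g. fitted curves)."""
    return (t_score - noise_mean(n_residues)) / noise_sigma(n_residues)


# Placeholder data and placeholder noise model, for illustration only.
toy = {100: [1.0, 3.0], 105: [2.0], 106: [50.0]}
print(window_stats(toy, 100))
print(z_score(4.0, 100, noise_mean=lambda n: 2.0, noise_sigma=lambda n: 0.5))
```

In practice the window statistics would be computed for every protein size and the fitted curves passed to `z_score` in place of the constant placeholder functions.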
Program availability
The program SymD will be made available for download from the web site http://lmbbi.nci.nih.gov or by contacting the corresponding author.
Authors' contributions
CK wrote the computer program and obtained most of the results. JB performed some summarizing calculations and prepared most of the figures with structural images. BL conceived of the study, supervised the research, and wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We thank the laboratory members Dr Todd Taylor, Dr Dukka KC, Ms. Chin-Hsien Tai, and Dr B. K. Sathyanarayana for their continued interest and support. Molecular graphics images were produced using the UCSF Chimera [27] package from the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco (supported by NIH P41 RR01081). This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
References

1. Andrade MA, Perez-Iratxeta C, Ponting CP: Protein repeats: structures, functions, and evolution. Journal of structural biology 2001, 134(2-3):117-131.
2. Kinoshita K, Kidera A, Go N: Diversity of functions of proteins with internal symmetry in spatial arrangement of secondary structural elements. Protein Sci 1999, 8(6):1210-1217.
3. Abraham AL, Pothier J, Rocha EP: Alternative to homo-oligomerisation: the creation of local symmetry in proteins by internal amplification. J Mol Biol 2009, 394(3):522-534.
4. Abraham AL, Rocha EP, Pothier J: Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics 2008, 24(13):1536-1537.
5. Guerler A, Wang C, Knapp EW: Symmetric structures in the universe of protein folds. J Chem Inf Model 2009, 49(9):2147-2151.
6. Shih ES, Hwang MJ: Alternative alignments from comparison of protein structures. Proteins: Struc Funct Genet 2004, 56(3):519-527.
7. Taylor WR, Heringa J, Baud F, Flores TP: A Fourier analysis of symmetry in protein structure. Protein Eng 2002, 15(2):79-89.
8. Murray KB, Taylor WR, Thornton JM: Toward the detection and validation of repeats in protein structure. Proteins: Struc Funct Genet 2004, 57(2):365-380.
9. Murray KB, Gorse D, Thornton JM: Wavelet transforms for the characterization and detection of repeating motifs. J Mol Biol 2002, 316(2):341-363.
10. Taylor WR: Protein structure comparison using iterated double dynamic programming. Protein Sci 1999, 8(3):654-665.
11. Chen H, Huang Y, Xiao Y: A simple method of identifying symmetric substructures of proteins. Comput Biol Chem 2009, 33(1):100-107.
12. Shih ES, Gan RC, Hwang MJ: OPAAS: a web server for optimal, permuted, and other alternative alignments of protein structures. Nucleic Acids Res 2006, (34 Web Server):W95-98.
13. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, (32 Database):D189-192.
14. Devedjiev Y, Surendranath Y, Derewenda U, Gabrys A, Cooper DR, Zhang RG, Lezondra L, Joachimiak A, Derewenda ZS: The structure and ligand binding properties of the B. subtilis YkoF gene product, a member of a novel family of thiamin/HMP-binding proteins. J Mol Biol 2004, 343(2):395-406.
15. Miyanaga A, Koseki T, Matsuzawa H, Wakagi T, Shoun H, Fushinobu S: Crystal structure of a family 54 alpha-L-arabinofuranosidase reveals a novel carbohydrate-binding module that can bind arabinose. J Biol Chem 2004, 279(43):44907-44914.
16. Kajander T, Merckel MC, Thompson A, Deacon AM, Mazur P, Kozarich JW, Goldman A: The structure of Neurospora crassa 3-carboxy-cis,cis-muconate lactonizing enzyme, a beta propeller cycloisomerase. Structure 2002, 10(4):483-492.
17. Kuper J, Doenges C, Wilmanns M: Two-fold repeated (betaalpha)4 half-barrels may provide a molecular tool for dual substrate specificity. EMBO Rep 2005, 6(2):134-139.
18. Conti E, Uy M, Leighton L, Blobel G, Kuriyan J: Crystallographic analysis of the recognition of a nuclear localization signal by the nuclear import factor karyopherin alpha. Cell 1998, 94(2):193-204.
19. Kim C, Lee B: Accuracy of structure-based sequence alignment of automatic methods. BMC Bioinformatics 2007, 8:355.
20. Vetting MW, Hegde SS, Hazleton KZ, Blanchard JS: Structural characterization of the fusion of two pentapeptide repeat proteins, Np275 and Np276, from Nostoc punctiforme: resurrection of an ancestral protein. Protein Sci 2007, 16(4):755-760.
21. Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D: A census of protein repeats. J Mol Biol 1999, 293(1):151-160.
22. Kim C, Tai CH, Lee B: Iterative refinement of structure-based sequence alignments by Seed Extension. BMC Bioinformatics 2009, 10:210.
23. Kabsch W: A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A 1976, 32(5):922-923.
24. Kabsch W: A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A 1978, 34(5):827-828.
25. Tai CH, Vincent JJ, Kim C, Lee B: SE: an algorithm for deriving sequence alignment from a pair of superimposed structures. BMC Bioinformatics 2009, 10(Suppl 1):S4.
26. Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005, 33(7):2302-2309.
27. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE: UCSF Chimera - a visualization system for exploratory research and analysis. J Comput Chem 2004, 25(13):1605-1612.
28. Zhang H, Seabra MC, Deisenhofer J: Crystal structure of Rab geranylgeranyltransferase at 2.0 A resolution. Structure 2000, 8(3):241-251.
29. Coles M, Diercks T, Liermann J, Groger A, Rockel B, Baumeister W, Koretke KK, Lupas A, Peters J, Kessler H: The solution structure of VAT-N reveals a 'missing link' in the evolution of complex enzymes from a simple betaalphabetabeta element. Curr Biol 1999, 9(20):1158-1168.
30. Oomen CJ, van Ulsen P, van Gelder P, Feijen M, Tommassen J, Gros P: Structure of the translocator domain of a bacterial autotransporter. EMBO J 2004, 23(6):1257-1266.
31. Johnson RJ, McCoy JG, Bingman CA, Phillips GN, Raines RT: Inhibition of human pancreatic ribonuclease by the human ribonuclease inhibitor protein. J Mol Biol 2007, 368(2):434-449.