Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces

Ikeda, Kazuyoshi; Hirokawa, Takatsugu; Higo, Junichi; Tomii, Kentaro

doi:10.1186/1472-6807-8-37

Research article
Open access
Published: 13 August 2008

BMC Structural Biology volume 8, Article number: 37 (2008)

Abstract

Background

Many studies have examined rules governing two aspects of protein structures: short segments and proteins' structural domains. Nevertheless, the organization and nature of the conformational space of segments with intermediate length between short segments and domains remain unclear. Conformational spaces of intermediate length segments probably differ from those of short segments. We sought to identify and characterize the boundary or boundaries between peptide-like (short segment) and protein-like (long segment) distributions. We generated ensembles of segments 10–50 residues long that are embedded in globular proteins. Using principal component analysis based on the intra-segment C_α-C_α atomic distances, we explored the relationships between the conformational distribution of segments and their lengths, and also their protein structural classes.

Results

Our statistical analyses of segment conformations and length revealed critical dual transitions in their conformational distribution with segments derived from all four structural classes. Dual transitions were identified with the intermediate phase between the short segments and domains. Consequently, protein segment universes were categorized. i) Short segments (10–22 residues) showed a distribution with a high frequency of secondary structure clusters. ii) Medium segments (23–26 residues) showed a distribution corresponding to an intermediate state of transitions. iii) Long segments (27–50 residues) showed a distribution converging on one huge cluster containing compact conformations with a smaller radius of gyration. This distribution reflects the protein structures' organization and protein domains' origin. Three major conformational components (radius of gyration, structural symmetry with respect to the N-terminal and C-terminal halves, and single-turn/two-turn structure) well define most of the segment universes. Furthermore, we identified several conformational components that were unique to each structural class. Those characteristics suggest that protein segment conformation is described by compositions of the three common structural variables with large contributions and specific structural variables with small contributions.

Conclusion

The present results of the analyses of four protein structural classes show the universal role of three major components as segment conformational descriptors. The obtained perspectives of distribution changes related to the segment lengths using the three key components suggest both the adequacy and the possibility of further progress on the prediction strategies used in the recent de novo structure-prediction methods.

Background

Vast amounts of three-dimensional (3D) protein data from structural genomic studies and other individual efforts have been added to our knowledge, thereby enhancing our understanding of protein structures. To date, mainly two extremes of protein structural data have been studied. One extreme includes local features of proteins: those of short protein segments, typically 10 residues long or shorter. The other extreme includes global features of proteins: protein folds or structural domains.

Regarding the short protein segments, abundant research examples exist partly because of the existence of variations of methods to analyze the local features of proteins. Various measures, such as RMSDs after structural superposition [1–3], C_α-C_α atomic distances coupled with the torsion angles [4, 5], dihedral angles [6], and so on have been used to define the conformational similarity of protein segments. Different clustering techniques, such as k-means clustering [7, 8], hierarchical methods [9], competitive learning [6, 10], and other methods [11], have been used to describe the organization of the segments' conformational space. The abundance of research results in this area is also partly attributable to various applications of the clustering results of the short segments. A set of representatives from the resulting clusters are often called structural building blocks (SBBs). Even when using different procedures, clustering resolutions of SBBs can be categorized into only a few levels depending mainly on their respective applications, such as structural modeling, verification, comparison, and prediction [6, 12]. The most dominant cluster of the short segments, which is common in all studies, corresponds to α-helices, whereas the variability of β-strands is observed at the high-resolution clustering. Regarding global features of proteins, understanding of their organization and analysis of the protein-fold (or structural domain) space studies are progressing well.

As reviewed recently [13], both hierarchical and continuous aspects of fold space have been realized. Regarding hierarchical classification, widely used databases such as CATH [14] and SCOP [15] have been constructed. Other databases such as FSSP [16] and VAST [17] have been developed. They are based on continuous measurements of protein structural similarity. Several studies have provided insights into the nature of fold space. Holm and Sander first described the conformational distribution of protein folds in a fold universe with multi-dimensional scaling methods based on an all-on-all comparison using the Dali program [18]. Using the same measurement, Hou et al. [19] showed visual representations of the protein fold universe and identified three major components which characterize the fold space: secondary structure compositions, chain topologies, and the protein domain size.

Compared to these two extremes, limited surveys have been done on the conformational space of medium size segments between protein short segments and folds. Specifically, supersecondary structures such as α-hairpin, βαβ-unit, and β-hairpin are typical structural motifs of medium size; those motifs have been analyzed. For example, Salem et al. reported that most superfolds contain a higher proportion of their α-helical or β-strand residues in one such supersecondary structure [20]. Szustakowski et al. built a dictionary of supersecondary structures [21]. Kurgan and Kedarisetti studied regularity among twilight zone protein structures at the level of the sequence segments that correspond to the secondary structure fragments of varying length [22]. However, the organization and statistical properties of the whole conformational space of medium-to-long segments remain unclear. Statistical and systematic analyses should be done on the 'segment universe' from short to long lengths to bridge this gap.

Our previous study identified structural clusters and visualized the uneven distribution of short segments in the conformational spaces of 6–22 residues, where known and novel secondary-structure motifs are distributed as isolated clusters [23]. The general features of the segment distribution were consistent for these lengths. However, the question we sought to answer is: Do spaces of long segments differ from those of short segments? In this study, we explore the relationships between the conformational distribution of segments and their length: 10–50 residues, thereby providing a global view of a 'segment universe' and showing critical dual changes (i.e. dual transitions) of the distribution shape in the conformational space of short to long segments. The critical changes might reflect changes of the protein structures' organization. Therefore, the present results suggest the adequacy and the possibility of further progress of the hierarchical treatment used in the recent de novo structure prediction methods. Furthermore, by comparing conformational components among structural classes (i.e., all-α, all-β, α/β, and α+β), we demonstrate the specificity and generality of protein fold classes.

Results

Transitions of segment distribution: short, medium, and long segments

The coverage of segments in cluster(s) was calculated as described below. A densely populated region in the 3D principal component analysis (PCA) space was defined as a cluster [23]. Given a density threshold, the segments are classifiable into two groups: those in regions of a density larger than the threshold and those outside the regions. The coverage of segments in clusters is defined as a ratio of the segments in the regions to all the segments.
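The coverage definition above can be sketched in a few lines. This is a minimal illustration, assuming that each segment has already been assigned the normalized density of the region of the 3D PCA space in which it lies; how that density is estimated is not specified in this section, and the function name `coverage` and the example values are hypothetical.

```python
import numpy as np

def coverage(densities, threshold):
    """Ratio of segments lying in regions whose density exceeds the threshold."""
    densities = np.asarray(densities, dtype=float)
    # Boolean mask: True for segments inside sufficiently dense regions.
    in_cluster = densities > threshold
    return float(in_cluster.mean())

# Hypothetical per-segment densities, for illustration only.
example_densities = [0.02, 0.15, 0.40, 0.005, 0.30]
print(coverage(example_densities, 0.1))
```

Sweeping `threshold` over a range of values yields one coverage curve per segment length, which is the kind of curve compared in Fig. 1.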

Figure 1a portrays the coverage of segments versus the density threshold for the conformational spaces of 10, 20, 30, 40, and 50 residue lengths. The coverage curves exhibited a transition from concave shapes for short lengths (10 and 20 residues long) to convex ones for long lengths (30, 40, and 50 residues long). Notably, the differences of coverage at a density of 0.2 or less show a transition between the short and long segments. For instance, at a density of 0.1, the coverage is only 16.3% for 10 residues, although the coverage is greater than 50% for 30 residues. In addition, at a density of 0.01, the coverage for 10 residues is 45.6%, although coverage for 30 residues is 91.9%. These quantitatively indicate that the density gradient of the conformational space changes markedly with segment elongation.

Further analyses of the coverage graphs between the short and long segments helped to locate the boundaries of the distribution changes. Figure 1b shows coverage curves for lengths of 21–30 residues. The dual and critical transitions, with an intermediate phase for segment lengths of 23–26 residues, can be recognized clearly in Fig. 1b. The transitions at intermediate length are also characterized by the distributional alteration of the radius of gyration of segments in the populated region with density of 0.10–0.35 (Fig. 2). To adjust for the effect of different segment lengths, we defined the relative score of the radius of gyration for a segment as $F_{Rg} = (Rg_{i,j} - \min Rg_{j}) / (\max Rg_{j} - \min Rg_{j})$, where $Rg_{i,j}$ denotes the radius of gyration of a segment i with length j, and where $\max Rg_{j}$ and $\min Rg_{j}$ represent the maximal and minimal radius of gyration of the entire segment dataset with length j. Based on these observations, the segment length is categorized into the following three groups: short (10–22 residues), medium (23–26 residues), and long (27–50 residues). Subsequent analyses that visualize the 3D PCA space show that changes in the density gradient are associated with distributional alterations in the segment universe. In fact, the difference in the coverage between lengths of 10 and 30 residues was attributable to the increase in the volume of the most populated region, as discussed below. The typical global images of segment universes from the three categories are depicted in Fig. 3d. The segment universes here were generated by the first three principal components derived from the entire segment dataset: PC^all1, PC^all2, and PC^all3 (see Methods).
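The relative radius-of-gyration score is a min-max normalization within one segment length. The sketch below assumes an unweighted radius of gyration computed from C_α coordinates only, which is an assumption rather than a detail stated here; the function names are hypothetical.

```python
import numpy as np

def radius_of_gyration(coords):
    """Unweighted radius of gyration of an (n, 3) array of C-alpha coordinates."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    # Root mean square distance of the atoms from their centroid.
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

def relative_rg(rg_values):
    """Min-max scaled Rg over all segments of one length (0 = most compact)."""
    rg = np.asarray(rg_values, dtype=float)
    return (rg - rg.min()) / (rg.max() - rg.min())
```

Because the minimum and maximum are taken over the dataset for a single length j, the score must be recomputed separately for each segment length.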

Short length (10–22 residues long)

The conformational space of short segments showed a distribution with an extreme density gradient that originated from secondary structure clusters: α-helix and β-strand clusters were discriminated using a density of 0.01 (shown in orange in Fig. 3a). Between the lengths of 10 and 20 residues, spatial arrangements of the segment distribution, especially for α-helical, β-strand, and β-hairpin clusters, were conserved in short conformational spaces. The highly populated core of the α-helix cluster exhibited a density of 0.1 (shown as magenta in Fig. 3a), consisting of completed α-helical segments. The surrounding area of the central region consisted of various types of helical conformations including helix-capping motifs [12]. The central region of the β-strand cluster consisted of fully extended segments that originated mainly from β-sheets and loop regions. The β-hairpin conformations were separated into several clusters at a density of 0.005. Then they were discriminated using the coordinate c₂ along PC^all2 (see Methods for the definition of c₂). The β-hairpin clusters showed a symmetrical relationship related to the N-terminal and C-terminal halves. They were arranged symmetrically around an edge of segment universes of short length.

Medium length (23–26 residues long)

The segment distribution for medium lengths differed from that for short lengths. The distributional change from short to medium lengths is characterized by a diminishing β-strand cluster and a growing α-helix cluster. The overall distribution was shortened in the direction of PC^all1, and enlarged in the direction of PC^all2 and PC^all3. In the segment universe of 26 residues, the α-helix cluster was discriminated using a density of 0.1 (magenta in Fig. 3b). Interestingly, the shape of the α-helix cluster was a ring (designated as a helix ring cluster). The helix ring cluster that is specific to the medium-length universe consisted not only of the extended α-helices but also of various α-helical conformations, as presented in the inset of Fig. 3b. This cluster included conformations that had originated mainly from all-α, α/β, and α+β proteins (Fig. 4a). The average content of the α-helical residues per segment in the helix ring cluster was about 50% (Fig. 4b); 24.9% of all segments were included within the helix ring cluster. The long α-helical segments, whose conformation was not compact, were located near the origin of the conformational space (red in Fig. 3b). In contrast, the α-hairpin conformations with a small radius of gyration were located on the opposite side of the position on PC^all1. The various α-hairpin conformations with the different turn positions were located symmetrically along PC^all2. For medium lengths, the β-strand clusters were diminished because long extended β-strands are rarely found in proteins. The β-hairpin conformations were located symmetrically along PC^all2, although the cluster separation of β-hairpins was not clear in medium lengths.

Long length (27–50 residues long)

Conformational spaces for the long lengths were further shortened in the direction of PC^all1 and enlarged in that of PC^all3. The segment distribution converged on a large populated region that exhibited a density of 0.1 (magenta in Fig. 3c) in the conformational space. With a length of 30 residues, there were two clusters consisting of compact segments and long α-helical segments, respectively, with densities of 0.35 (red in Fig. 3c) in the populated region. The emergence of the compact-segment cluster was attributable to an increase in various types of segments with a small radius of gyration (see inset of Fig. 3c). Various types of conformations are mixed in the compact-segment cluster. The α-hairpins are derived mainly from all-α proteins. The compact β-sheet structures are derived mainly from all-β proteins. Compact conformations of other types are derived from α/β and α+β proteins (Fig. 4c). About 2% of all segments were included in the compact-segment cluster for 27-residue length. In contrast, long α-helical segments with a large radius of gyration were located on the opposite side of the cluster of the compact segments along the PC^all1 axis. For lengths greater than 30 residues, the proportion of the conformations with a small radius of gyration in the compact-segment cluster increased rapidly to around 14% for 50-residue lengths. Those conformations were derived from various folds (Fig. 4c). The supersecondary structures, such as βαβ units and β-sheets, were included in the compact-segment cluster (Fig. 4d).

Contribution ratios of principal axes

Distributional alterations were observed associated with the changes of segment length. For principal component analyses, the contribution ratios (see Methods for the contribution ratios) of the principal components (i.e. PC axes) to the entire distribution indicate how well the PC axes can cover the variation in the original data. Figure 5 portrays contribution ratios of the first five PC axes (PC^all1 – PC^all5) for segment lengths of 10–50 residues. Even with a length of 43 residues, the cumulative contribution ratio of the first three PC axes, Q₁₂₃ (= Q₁ + Q₂ + Q₃), was greater than 60%, although Q₁₂₃ decreased constantly with increased segment length. Each of Q₄ and Q₅ was always less than 8%. The contribution ratios for higher-order PC axes than PC^all5 did not exceed 5% for the examined segment lengths. Therefore, it is sufficient to use only the first three PC axes (or the first five PC axes occasionally) to explain the original structural variation.
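As an illustration of how contribution ratios behave, the following sketch computes them as the fraction of total variance carried by each eigenvalue of the covariance matrix. The exact definition used in this study is given in the Methods; the eigenvalue-fraction form, the function name, and the synthetic input are assumptions made for this example.

```python
import numpy as np

def contribution_ratios(features):
    """Fraction of total variance explained by each PC axis, largest first.

    features: (n_segments, n_variables) array, e.g. one row of
    intra-segment distances per segment.
    """
    x = np.asarray(features, dtype=float)
    cov = np.cov(x, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse them.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return eigvals / eigvals.sum()

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
q = contribution_ratios(rng.normal(size=(50, 6)))
q123 = q[:3].sum()  # cumulative ratio of the first three axes
```

With real segment data, `q123` corresponds to the cumulative contribution ratio Q₁₂₃ discussed above.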

With respect to the individual contribution ratios (Q₁-Q₃) of the first three PC axes, Q₁ was overwhelmingly higher than those of the other PC axes up to 50-residue length (Fig. 5), which indicates that PC^all1 is a meaningful and fundamental descriptor for segment conformation. Actually, Q₁ decreased rapidly, and Q₂ increased in the short segment lengths (i.e. 10–22 residues). Thereafter, both Q₁ and Q₂ decreased slowly. In addition, Q₃ increased gradually with lengths up to 33 residues, with a maximum value of 11.5%.

Investigation of structural properties of conformational axes

An eigenvector was analyzed for each PC axis with a triangle map to elucidate the physical and conformational meaning of the PC axes of the conformational space of the short to long segments. The eigenvector can be regarded as a collective variable to describe the segment conformation. Figure 6 shows triangle maps of the first five PC axes (PC^all1 – PC^all5) for short (10 residues), medium (26 residues), and long segments (30 residues). The triangle map clearly portrays residue pairs, with large or small deviations of C_α-C_α distances along each PC axis from the average distance <q_i>. In the triangle map, positive (red) and negative (blue) areas correspond to residue pairs with mutually inverse deviations. The patterns of red and blue areas are conserved in the universes of short to long segments, indicating that conformational deviations related to the PC axes are conserved among the universes. Figure 7 depicts the conformational changes along the PC axis using colored arrows.
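A triangle map can be thought of as an eigenvector rearranged onto the residue-pair grid. The sketch below assumes that each segment is represented by all pairwise C_α-C_α distances taken from the upper triangle of its distance matrix; the precise variable set is defined in the Methods, and both function names are hypothetical.

```python
import numpy as np

def distance_vector(coords):
    """Flatten the upper triangle of the C-alpha distance matrix of a segment."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(coords), k=1)
    return dist[i, j]

def triangle_map(eigvec, n):
    """Place eigenvector components back on the (i, j) residue-pair grid."""
    tri = np.zeros((n, n))
    i, j = np.triu_indices(n, k=1)
    tri[i, j] = eigvec  # same pair ordering as distance_vector
    return tri
```

Positive and negative entries of the resulting matrix then mark residue pairs whose distances deviate in opposite directions along the chosen PC axis.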

PC^all1 corresponds to the change of the radius of gyration (Rg). The triangle map for PC^all1 has only one positive area, shown as red in Fig. 6, which is located near the residue pairs at the N-terminal and C-terminal sides. This single area indicates that the distant residue pairs in the sequence have a larger conformational deviation along PC^all1. The correlation coefficient of the conformational deviation along PC^all1 with Rg was greater than 0.9 in segment lengths of 10–50 residues (Fig. 8). The arrows in Fig. 7 point to the center of the segment, which indicates clearly that the conformational changes along PC^all1 involve expansions or compressions of the conformation. For short lengths, PC^all1 also shows a strong correlation with the changes of the segment end-to-end distance (D_end), which is defined as the C_α-C_α distance between the first and last residues of segments. Correlation between PC^all1 and D_end slowly weakened with increased segment length: 0.91 for 10 residues, 0.79 for 26 residues, and 0.77 for 30 residues.
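The relationship between PC^all1 and Rg can be illustrated with a toy dataset. The sketch below builds hypothetical segments that are uniformly scaled copies of one shape, projects their distance vectors onto the first PC axis (obtained here by SVD, which is equivalent to diagonalizing the covariance matrix), and correlates the coordinates with Rg. The coordinates and helper names are invented for illustration and are not data from this study.

```python
import numpy as np

def distance_vector(coords):
    """Upper-triangle C-alpha distances of one segment."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(coords), k=1)
    return dist[i, j]

def radius_of_gyration(coords):
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

def pc1_coordinates(features):
    """Coordinates of each sample along the first principal axis."""
    x = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[0]

# Hypothetical toy segments: one shape at several uniform scales.
base = np.array([[0, 0, 0], [1.5, 0, 0], [1.5, 1.5, 0], [0, 1.5, 1.0]])
segments = [s * base for s in (1.0, 1.5, 2.0, 3.0)]
features = np.array([distance_vector(c) for c in segments])
c1 = pc1_coordinates(features)
rgs = np.array([radius_of_gyration(c) for c in segments])
r = np.corrcoef(c1, rgs)[0, 1]  # sign depends on eigenvector orientation
```

In this idealized case the only variation is overall size, so the magnitude of `r` is essentially 1; real segment ensembles mix several modes of variation, which is why the reported correlations are high but not perfect.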

PC^all2 correlates with the degree of structural symmetry (D_sym) of a segment with respect to the N-terminal and C-terminal halves. D_sym is defined as follows. Given a distance matrix for a segment, element (i, j) is the distance (denoted as $r_{ij}$) between the C_α atoms of residues i and j. The degree of structural symmetry is then defined as the sum of the squared differences of symmetric elements in the distance matrix: $D_{sym} = \sum_{1 \le i < j \le n} (r_{ij} - r_{n-(j-1),\,n-(i-1)})^2$, where n is the segment length. The triangle map for PC^all2 was separated into one positive area (red) and one negative area (blue). The correlation coefficient of the conformational deviation along PC^all2 with the structural symmetry D_sym was greater than 0.90 in the segment lengths of 10–50 residues (Fig. 8). When two conformations were picked from opposite positions along PC^all2, they displayed mirrored symmetry about a plane constructed by PC^all1 and PC^all3. The segment conformations picked up along PC^all2 are shown in Figs. 3a–3c.
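The definition of D_sym translates directly into array operations: element (i, j) of the distance matrix is compared with the element obtained by reversing the residue order. The following is a minimal sketch with a hypothetical function name; by construction it returns 0 for a segment whose distance matrix is identical when read from either terminus.

```python
import numpy as np

def structural_symmetry(coords):
    """D_sym: sum of squared differences between r_ij and its mirrored element."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1))
    # With 0-based indices, (i, j) mirrors to (n-1-j, n-1-i).
    mirrored = r[::-1, ::-1].T
    i, j = np.triu_indices(n, k=1)
    return float(((r[i, j] - mirrored[i, j]) ** 2).sum())
```

For example, three collinear residues at positions 0, 1, and 3 give D_sym = 2, whereas evenly spaced residues give 0.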

PC^all3 correlated with a physical indicator that describes a conformational transition between structures with one turn and those with two turns (PC^all3 in Fig. 6). The conformations picked along PC^all3 indicate that β-hairpin structures are segregated by the conformational changes along PC^all3. We defined a physical indicator ($D_{mn+mc}$) of β-hairpin formation as the norm of the sum of two vectors drawn from the midpoint of the segment to the N-terminal and C-terminal residues: $D_{mn+mc} = | \vec{d_{mn}} + \vec{d_{mc}} |$, where $\vec{d_{mn}}$ and $\vec{d_{mc}}$ respectively denote the vectors from the midpoint to the N-terminal and C-terminal residues of the segment. Good correlation was found between PC^all3 and $D_{mn+mc}$ (Fig. 8). The correlation coefficient was greater than 0.7 for lengths of 10–50 residues. The triangle map of PC^all3 indicated a separation of one positive area (red) and two negative areas (blue). It is noteworthy that the triangle map of PC^all3 for short segments differed slightly from those of medium and long segments. A positive area is visible near the residue pair of the N-terminal and C-terminal in the short map, suggesting that PC^all3 has a (negative) correlation with D_end. For medium and long lengths, the positive area was close to the center of the triangle map. Therefore, the correlation between PC^all3 and $D_{mn+mc}$/D_end was necessarily smaller in medium and long lengths.
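The indicator formula given above, the norm of the vector sum, can be sketched as follows. How the midpoint is chosen is an assumption of this example: the central residue is used for odd lengths and the average of the two central residues for even lengths. The function name is hypothetical.

```python
import numpy as np

def hairpin_indicator(coords):
    """Norm of the sum of the vectors from the segment midpoint to both termini."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    if n % 2 == 1:
        mid = coords[n // 2]
    else:
        # Assumed convention for even lengths: mean of the two central residues.
        mid = 0.5 * (coords[n // 2 - 1] + coords[n // 2])
    d_mn = coords[0] - mid   # midpoint to N-terminal residue
    d_mc = coords[-1] - mid  # midpoint to C-terminal residue
    return float(np.linalg.norm(d_mn + d_mc))
```

A fully extended segment gives a value near zero because the two vectors cancel, whereas a single-turn hairpin, whose termini lie on the same side of the midpoint, gives a large value.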

The triangle map of PC^all4 had one negative area and one positive area. The positive area, located at the map center, suggests that PC^all4 is correlated with the radius of gyration of the middle region of the segment (Rg_mid), that is, the segment without its N-terminal and C-terminal quarter portions, in the medium and long segments. The respective correlation coefficients for the 26 and 30 residue lengths were 0.73 and 0.72. PC^all4 also has a weak (negative) correlation with D_end. The respective correlation coefficients between PC^all4 and D_end for the 26 and 30 residue lengths were -0.45 and -0.42.

We identified no simple physical indicator for conformational changes along PC^all5. However, visual inspection from conformations picked along PC^all5 suggests that PC^all5 is a conformational axis that represents segregated β-sheet structures. Conformations picked up from both ends on PC^all5 are depicted in Fig. 6. In the triangle map for PC^all5, two positive and two negative areas exist along the diagonal line, which might indicate that PC^all5 segregates segment conformations with double turns. The PC^β5 contribution ratio, which was derived from all-β proteins, was higher than that derived from other structural classes, which suggests that PC5 is important for describing the structural variation of β-structures.

Segment universes derived from different structural classes

The segment universes described above are those derived from proteins of the four structural classes. Therefore, decomposition of the universe into four classes is helpful to evaluate the influence of each structural class on the segment universe. To this end, a segment universe was constructed for each structural class separately, and the PC axes derived from each universe were compared with those of all segments (i.e., PC^all1-PC^all3). The first three largest eigenvectors of each structural class were also compared respectively with PC^all1, PC^all2, and PC^all3 to elucidate the structural properties of PC axes derived from each universe.

Figure 9 depicts the contribution ratios of the first three PC axes, PC^x1-PC^x3 (x = α, β, α+β, or α/β), in each structural class. The marks on the curves in Fig. 9 indicate that the correlation coefficient (v^x_i·v^all_i) between PC^x1-PC^x3 and PC^all1-PC^all3 (i.e., i = 1, 2, 3) is greater than 0.7, which was used here as a threshold of conservation of structural properties. The properties of the first two PC axes corresponding to PC^all1 and PC^all2 were highly conserved in all four structural classes. The characteristics of PC^all3 were also conserved in all four structural classes, although exceptions were apparent for the 20-residue-long and 10–16-residue-long all-α and all-β classes. Therefore, it is confirmed that the first three PC axes (Rg, symmetry, and one/two turn(s)) are important in almost all cases to describe the conformation of segments embedded in globular proteins.
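The conservation test above amounts to a dot product between two eigenvectors. The sketch below normalizes both vectors and takes the absolute value, because the sign of an eigenvector is arbitrary; the use of the absolute value and the function names are assumptions of this example, whereas the 0.7 threshold is the one stated in the text.

```python
import numpy as np

def axis_similarity(v_class, v_all):
    """Absolute dot product of two eigenvectors after unit normalization."""
    a = np.asarray(v_class, dtype=float)
    b = np.asarray(v_all, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(abs(a @ b))

def is_conserved(v_class, v_all, threshold=0.7):
    """True when the class-derived axis matches the all-segment axis."""
    return axis_similarity(v_class, v_all) > threshold
```

Both eigenvectors must be defined over the same set of residue pairs, so the comparison is made between universes of the same segment length.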

However, the curves for the contribution ratios of both all-α and all-β classes (see two panels of Fig. 9) differ clearly from those of PC^all1 – PC^all3 (i.e. Q₁ – Q₃ in Fig. 5). The Q^α₁ contribution ratio was always higher than 40%, which indicates that the distribution of the all-α segments has a large deviation with respect to Rg. In contrast, the Q^β₁ contribution ratio decreased rapidly with increasing segment length. The value of Q^α₂ increased moderately with increasing segment length. In contrast, Q^β₂ had a maximum value greater than 20% at a length of 22 residues. This rapid increase of Q^β₂ might reflect a typical feature of β-sheet conformations. For PC3, the curves for the contribution ratios of the all-α and all-β classes also differed from each other. Although Q^β₃ peaked at a length of 35 residues, Q^α₃ peaked at a short length, which indicates that the structural variable based on PC^all3 is important for β-segments longer than 30 residues. In contrast, the behaviors of the contribution ratios for the α+β and α/β classes along with the segment length resembled each other. They were also similar to Q₁-Q₃ in Fig. 5 because those structural classes are mixtures of α-helices and β-sheets.

Subsequently, PC axes that were specific for each structural class were examined. For this analysis, the PC axis was defined as a "class-specific" one when a PC axis from a structural class showed no similarity with the first 20 PC axes from the other three structural classes (see Methods). The first 10 PC axes of each class were investigated for the short (10 residues), medium (26 residues), and long (30 residues) segments. Ten class-specific conformational axes were identified and consisted of one (PC^β10) for the short length, eight for the medium, and one (PC^α8) for the long. The eight class-specific axes for the medium-length segments are PC^α5, PC^α8, and PC^α10 for all-α, PC^β10 for all-β, PC^α+β9 and PC^α+β10 for α+β, and PC^α/β8 and PC^α/β10 for α/β. Four examples out of eight are depicted in Fig. 10. A clear correlation of these PC axes is difficult to discern according to simple physical or structural quantities. Figure 10a shows that the PC^α8 describes a structural change of three (both ends and the middle portion) parts of α-segments. The PC^α/β8 is related to βαβ motifs, which is the most fundamental structural unit for α/β proteins.
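The class-specific criterion can be sketched as a filter over eigenvector similarities. The actual similarity criterion is described in the Methods; this example assumes, purely for illustration, that "no similarity" means an absolute dot product below 0.7 with every one of the compared axes from the other classes. The function name and the default threshold are hypothetical.

```python
import numpy as np

def class_specific_axes(class_axes, other_axes, threshold=0.7):
    """Indices of axes in class_axes that match none of other_axes.

    class_axes: (k, d) eigenvectors from one structural class.
    other_axes: (m, d) eigenvectors pooled from the other classes,
                e.g. the first 20 axes of each of the three other classes.
    """
    a = np.asarray(class_axes, dtype=float)
    b = np.asarray(other_axes, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = np.abs(a @ b.T)  # (k, m) matrix of absolute dot products
    return [k for k in range(len(a)) if sims[k].max() < threshold]
```

An axis is reported only when its best match among all pooled axes stays below the threshold, so adding more comparison axes can only shrink the list.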

Discussion

Investigation of the protein segment universe is an important subject for bioinformatics. Results of this study show that the segment universe can be categorized naturally into three regimes: short, medium, and long. A main finding of this study is that the three regimes are clearly demarcated by critical changes in the shape of the segment distribution in the conformational space. Preceding studies demonstrated that the average length of α-helix is 14 residues [24] and that for β-strand is five residues [25]. Results of the present study show that transitional segment lengths (22 and 26 residues long) do not coincide with these average lengths. Therefore, a single secondary structure element does not characterize the shape of the segment distribution. The appearance of the medium length regime segregates the segment fold universe into three. The combination of secondary-structure elements is important to characterize not only the medium-length segment universe but also the entire segment fold universe.

Meanwhile, loops, which make up 30% of protein structures [26], are also expected to play a larger role in forming unique conformations by connecting secondary-structure elements in the medium- to long-length segment universes than in the short one. The segments in the cluster of the medium- to long-length universe tend to contain more loop regions than those of the short segment universe, as shown in Figs. 4b and 4d, and have a wider variety of origins (Figs. 4a and 4c). For example, the segments in the cluster with density of 0.35–1.0 of the universe of 30 residues length are derived from 461 proteins out of all 600 representatives used for this study (see Additional File 1). Longer loops that possess extended conformations are located on the opposite side of the compact-segment cluster along PC^all1 in the medium to long segment universe (Figs. 3b and 3c). Instead of discrete clusters, they appear to constitute a rather continuous distribution. Some analyses examine short loops with respect to their completeness [27, 28] and elaborate classification [26, 29]. In the analysis of short segments, our method also captured some loop conformation classes, such as joint loops connecting two helices, and exposed and extended loops that participate in protein-protein interactions [23].

A natural boundary was identified, in this study, between the peptide-like and protein-like distributions between the lengths of 23 and 26 residues using actual conformations of protein segments. This observation with respect to the boundary is consistent with the results described by Shen et al. [30], even though they used a sphere-packing model to estimate a minimal domain size of about 20 residues. A recent study by Sawada and Honda [31] also identified a boundary at 10–20 residue length by calculating the structural diversity of segments. They discretized the conformational space using a single-pass clustering method. In contrast, we observed the density distribution to uncover differences of conformational space between short and long segments. The segment conformational space for lengths of 10–22 residues provided a distribution with an extreme density gradient towards the secondary structure, such as the α-helix, β-strand, and β-hairpin clusters, which are expected to belong to the peptide-like conformational regime. This conformational variation suggests that short segments embedded in globular proteins are mainly stabilized by the physicochemical property of the peptide. On the other hand, the segment conformational spaces for lengths of 27 residues or more have a distribution that is dominated by compact segments, which suggests a protein-like distribution (protein-like conformational regime). This distribution arises from the hydrophobic effect imparted by the solvent molecules, which is of great importance for structural stability in long segments derived from globular proteins. If this is the case, our observations support the de novo structure prediction methods, so-called fragment assembling methods, that have been developed recently [32–35]. These approaches are usually based on the prediction of local segment conformations followed by assembly of the segments, and they generally use separate criteria at each step: sequence similarity or secondary structural propensity for the prediction of segment conformations, and non-local energy terms for the assembly step. The strategies used in the de novo prediction methods seem to be consistent with the results shown here. Our analyses clearly show such a hierarchical organization of protein structures, and indicate that preparing segment libraries up to around 20 residues long would be helpful for such methods.

These results indicate that the structural meanings for the conformational axes (i.e., the radius of gyration for PC^all1, structural symmetry related to the N-terminal and C-terminal halves for PC^all2, and a single-turn/two-turn structure for PC^all3) are conserved in the different lengths and structural classes. This fact suggests that these conformational components are key structural variables for protein segments. On the other hand, when conformational axes among the four structural classes were compared, we were able to identify several conformational axes that were specific to each structural class, especially in the medium length range. In fact, a distribution change for medium lengths was observed, involving an increase in compact segments. Those segments included supersecondary structures such as α-hairpins, parts of the β-sheets, and βαβ units. These results might be related to the specificity of the structural class or fold of the contents of supersecondary structures [20]. Typical supersecondary structural motifs, α-hairpin, β-hairpin, and βαβ are, respectively, the basic structural units for the all-α, all-β, and α/β proteins. These motifs are often shared within the structural classes. Therefore, the contribution ratios observed for the class-specific conformational axes were high. Class-specific conformational axes were rarely observed in short and long lengths, probably because short segments are too nonspecific and are often shared over different structural classes; long segments are too specific and have very low contribution ratios for conformational axes that are specific for each structural class.
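
As an illustration of the first of these structural variables, the radius of gyration of a segment can be computed directly from its C_α coordinates. The following is only a minimal sketch, assuming equal weights for all C_α atoms; the function name and the toy coordinates are hypothetical and are not taken from the study (the toy "compact" path is not a physically valid chain).

```python
import math

def radius_of_gyration(ca_coords):
    """Root-mean-square distance of C-alpha atoms from their centroid.

    Assumption: every atom carries the same weight.
    """
    n = len(ca_coords)
    centroid = [sum(p[k] for p in ca_coords) / n for k in range(3)]
    sq_sum = sum(
        sum((p[k] - centroid[k]) ** 2 for k in range(3)) for p in ca_coords
    )
    return math.sqrt(sq_sum / n)

# Toy 10-point examples with 3.8 A spacing (hypothetical coordinates)
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]              # straight line
compact = [(3.8 * (i % 5), 3.8 * (i // 5), 0.0) for i in range(10)]  # 5 x 2 grid
```

In this toy case the straight arrangement yields a larger radius of gyration than the grid arrangement, which mirrors the sense in which this variable separates extended from compact segments.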

The class-specific conformational axes found here provide a hint for resolving a difficulty in classifying diverse sets of protein structures. The α/β and α+β classes are known to show a substantial overlap. In the CATH classification, the α/β and α+β classes are treated as a single structural class, the α-β class. Distinguishing α/β from α+β proteins is sometimes a difficult problem, although several classification [19, 36, 37] and prediction [38, 39] schemes have been proposed. The present study showed that the segment universes of the α/β and α+β classes share similar characteristics while each also has unique ones. For example, our results show that PC^α/β8, whose contribution ratio was 1.4%, was associated only with the βαβ motif. In the α+β class, no axis was strongly correlated with PC^α/β8 (see Additional File 2), which is a clear example of the difference in structural variables between the α+β and α/β classes originating from class-specific supersecondary structures. Consequently, projecting segments onto a conformational subspace using the axis PC^α/β8 could be useful for objectively dividing protein domains of the α-β class into α/β and α+β classes. A considerable localization of segments derived from α/β proteins in a PCA subspace is observed (see Additional Files 3 and 4).

An effective method of conformational sampling must be developed for de novo prediction methods. The structural variables analyzed in this study would be helpful for additional progress in de novo structure prediction. For example, testing the distribution of segments or models in terms of the degree of symmetry using the descriptor (D_sym) might be useful to verify the completeness of sampling of the conformational space. Using a filtering threshold or function (generally used in fragment assembling methods for selecting proper models) that is tolerant of the radius of gyration might be useful for improving the prediction of all-α proteins because the contribution ratio, Q^α₁, of PC^α1, corresponding to the radius of gyration (Rg), is larger than those of the other structural classes in the medium and long segments. Consequently, projecting segments of models onto a conformational subspace constructed by PC^x (where x = α, β, α/β, α+β, or all) axes might be helpful for filtering out models and assigning a protein to a structural class.

Conclusion

In this study, we have shown the dual critical transitions in the protein segment universe from short to long lengths. Our observations relate to the significance of the two-stage treatment in de novo structure prediction. Considering the hierarchical organization of the protein segment universe that we have shown, we suggest that secondary-structure-directed evaluation functions are effective for sampling local structures shorter than 23 residues. We also suggest the suitability of evaluating protein-like features of models using another function (e.g. Rg) for longer segments. Changing the filtering criteria for each structural class will enhance the effectiveness of the conformation sampling process. Through these analyses, we have demonstrated that our clustering methodology is useful for identifying a distinctive distribution shift of conformational space between short and long segments and that the distribution changes depend on structural classes.

Methods

Preparing the segment libraries

One representative from each fold group of the SCOP database (ver. 1.63) [15] was chosen to obtain a segment library without bias in fold usage. The representatives cover the four major structural classes (all-α, all-β, α+β, and α/β) because we are interested in characterizing the nature of segments embedded in globular proteins of usual size. Small proteins of fewer than 50 residues and non-single-chain proteins of fewer than 100 residues were excluded, as were membrane proteins, because those proteins are expected to possess structural properties different from those of usual-size globular proteins and could bias the results. In all, 600 representatives were used for this study (all-α, 150; all-β, 116; α+β, 219; α/β, 115; see Additional File 5). Dividing the protein structures into segments with a window sliding by one residue along the sequence generated a segment library of arbitrary length. We prepared a segment library for each length of 10–50 residues to generate conformational spaces of short-to-long segments. Segments with incomplete coordinate data (e.g., having an unusual covalent-bond length or lacking main-chain atoms) were excluded. Furthermore, to elucidate differences among the conformational spaces derived from the four major structural classes, we generated a segment library for each class.
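
The sliding-window procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes that each chain is given as a list of C_α coordinates in which a residue with missing coordinates is represented by `None`, and it omits the covalent-bond-length check. The function names are hypothetical.

```python
def segment_library(ca_coords, length):
    """Collect all contiguous segments of `length` residues from one chain,
    sliding the window by one residue.

    Assumption: a residue with missing coordinates is stored as None, and
    any window containing such a residue is skipped.
    """
    segments = []
    for start in range(len(ca_coords) - length + 1):
        window = ca_coords[start:start + length]
        if any(c is None for c in window):
            continue  # incomplete coordinate data: exclude this segment
        segments.append(window)
    return segments

def libraries_for_lengths(chains, min_len=10, max_len=50):
    """Build one segment library per length (10-50 residues by default)."""
    return {
        n: [seg for chain in chains for seg in segment_library(chain, n)]
        for n in range(min_len, max_len + 1)
    }
```

A chain of L complete residues contributes L − n + 1 segments of length n, so the libraries for longer segments are somewhat smaller than those for shorter ones.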

Construction and visualization of conformational space

We previously reported a method for constructing and visualizing the conformational space of protein segments using principal component analysis based on intra-segment C_α-C_α atomic distances [23]. Briefly, the atomic distances of all C_α-C_α pairs for each segment in a segment library of an arbitrary length were calculated first. A distance is designated as $q_i$, where $i$ is the index of the C_α-C_α pair, $i = 1, \ldots, n(n-1)/2$, and $n$ is the segment length, expressed as the number of residues in the segment. Subsequently, a set of eigenvectors and eigenvalues was obtained by diagonalizing a variance-covariance matrix, $C$, calculated as

$$
\begin{aligned}
C_{ij} &= \langle (q_i - \langle q_i \rangle)(q_j - \langle q_j \rangle) \rangle \\
&= \langle q_i q_j - q_i \langle q_j \rangle - \langle q_i \rangle q_j + \langle q_i \rangle \langle q_j \rangle \rangle \\
&= \langle q_i q_j \rangle - \langle q_i \rangle \langle q_j \rangle - \langle q_i \rangle \langle q_j \rangle + \langle q_i \rangle \langle q_j \rangle \\
&= \langle q_i q_j \rangle - \langle q_i \rangle \langle q_j \rangle,
\end{aligned}
$$

where the average $\langle \cdots \rangle$ is taken over the segments. Two equations, $C \mathbf{v}_i = \lambda_i \mathbf{v}_i$ and $\mathbf{v}_i \cdot \mathbf{v}_j = \delta_{ij}$, are satisfied. Eigenvectors with larger eigenvalues are more important in the study of the conformational variety of the segments. Eigenvalues are arranged in descending order: $\lambda_i > \lambda_j$ if $i < j$. The contribution ratio of the $i$-th PCA element (i.e. the $i$-th eigenvector) to the whole conformational distribution is given as $Q_i = \lambda_i / \sum_k^{\mathrm{all}} \lambda_k$. The eigenvectors, which are called PC^x1, PC^x2, PC^x3, etc., were used as conformational axes to construct a segment conformational space, a PCA space, in which x indicates a segment dataset (x = α, β, α/β, α+β, or all). The indicator "x = all" is given when the conformational axes are generated from the whole segment dataset. The origin of the PCA space is set at the average C_α-C_α atomic distances: $\langle \mathbf{q} \rangle = [\langle q_1 \rangle, \langle q_2 \rangle, \langle q_3 \rangle, \ldots, \langle q_{n(n-1)/2} \rangle]$. This enables ready comparison of conformational distributions between constructed universes. Any position (i.e. any segment structure) in the PCA space can be expressed using a linear combination of eigenvectors as $c_k = \sum_n^{\mathrm{all}} (\mathbf{q} - \langle \mathbf{q} \rangle) \cdot \mathbf{v}_k \lambda_k^{1/2}$, where $c_k$ is a coordinate (i.e. projection of $\mathbf{q}$) on the PC axis $k$. Using the first three eigenvectors (PC^x1, PC^x2, PC^x3), a three-dimensional (3D) PCA space can be constructed.
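
The construction of the PCA space can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the implementation of [23]: each segment is assumed to be an array of n C_α coordinates, the function names are hypothetical, and the projection returns plain (unscaled) coordinates, omitting the eigenvalue factor that appears in the expression for c_k.

```python
import numpy as np

def distance_vector(ca):
    """Distances of all C-alpha pairs (i < j) of one segment,
    flattened into a vector q of length n(n-1)/2."""
    ca = np.asarray(ca, dtype=float)
    i, j = np.triu_indices(len(ca), k=1)
    return np.linalg.norm(ca[i] - ca[j], axis=1)

def pca_space(segments):
    """Diagonalize the variance-covariance matrix of the distance vectors.

    Returns the mean vector <q>, the eigenvalues in descending order,
    the eigenvectors as columns, and the contribution ratios Q_i.
    """
    q = np.array([distance_vector(s) for s in segments])
    q_mean = q.mean(axis=0)
    # C_ij = <q_i q_j> - <q_i><q_j>, averaged over segments (bias=True divides by N)
    cov = np.cov(q, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()
    return q_mean, eigvals, eigvecs, ratios

def project(segment, q_mean, eigvecs, k=3):
    """Unscaled coordinates of one segment on the first k PC axes
    (assumption: no eigenvalue scaling is applied here)."""
    return (distance_vector(segment) - q_mean) @ eigvecs[:, :k]
```

For segments of n = 10 residues the distance vector has 45 components, so the covariance matrix is 45 × 45; for n = 50 it is 1225 × 1225, which remains easily tractable for a dense symmetric eigensolver.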

We defined a vector, r, to express the position of each segment in the 3D PCA space: r = [c₁, c₂, c₃]. After projection of the segments onto the 3D PCA space, the distribution of segments was visualized using the following procedure. Each axis of the 3D space was divided into N bins (N³ cubes in total). The bin size was defined as (max[c₁] − min[c₁])/N, where N = 36, and max[c₁] and min[c₁] respectively signify the maximum and minimum of the coordinates of the segments along the first principal component axis. The number (i.e. frequency) of segments detected in a cube represents the density (i.e. probability) of finding segments in that cube. The density of each cube, ρ, was normalized by the maximum density among the cubes, ρ_max, so that the maximal value of the normalized density (simply called density in the text) is 1 (refer to eq. (3) in [23]). Four levels of contour surfaces (i.e. iso-density surfaces) were depicted to visualize the 3D PCA space. The density values for those surfaces were set to 0.005, 0.01, 0.1, and 0.35, respectively.
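
The binning and normalization steps can be sketched as follows. This is an illustrative sketch only: the text does not state where the grid is anchored, so the code assumes that it starts at the minimum coordinate on each axis and that any point falling beyond the N-th cube is folded into the last cube. The function names are hypothetical.

```python
import numpy as np

CONTOUR_LEVELS = (0.005, 0.01, 0.1, 0.35)  # iso-density values quoted in the text

def normalized_density(coords, n_bins=36):
    """Count segments per cube and normalize by the most populated cube.

    coords: array of shape (segments, 3) holding [c1, c2, c3] per segment.
    Returns a dict mapping cube index (ix, iy, iz) to normalized density.
    """
    coords = np.asarray(coords, dtype=float)
    # Bin size is defined from the range along the first PC axis only
    bin_size = (coords[:, 0].max() - coords[:, 0].min()) / n_bins
    # Assumption: grid origin at the per-axis minimum
    idx = np.floor((coords - coords.min(axis=0)) / bin_size).astype(int)
    # Assumption: points on or beyond the upper edge go into the last cube
    idx = np.minimum(idx, n_bins - 1)
    cubes, counts = np.unique(idx, axis=0, return_counts=True)
    density = counts / counts.max()
    return {tuple(c): d for c, d in zip(cubes.tolist(), density.tolist())}

def cubes_above(density, level):
    """Cube indices whose normalized density is at least `level`."""
    return [cube for cube, d in density.items() if d >= level]
```

Only occupied cubes are stored, which keeps the representation small because most of the 36³ cubes are expected to be empty.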

We also constructed the universe separately for each of the four structural classes to assess differences among their conformational spaces. For this study, we specifically examined the first 10 PC axes of each structural class because these 10 PC axes are more important than the other axes with respect to capturing the differences in the conformational axes. Although the eigenvectors from the same structural class are mutually uncorrelated (i.e., $\mathbf{v}^x_i \cdot \mathbf{v}^x_j = 0$, where $i \neq j$ and x = α, β, α/β, or α+β), the eigenvectors from different structural classes might have some correlation (i.e., $\mathbf{v}^x_i \cdot \mathbf{v}^y_j \neq 0$, where x ≠ y). A PC axis from a structural class is defined as a conformational component specific to that class when it has no similarity, with a correlation coefficient > 0.8 (i.e. $\mathbf{v}^x_i \cdot \mathbf{v}^y_j > 0.8$), to any of the first 20 PC axes from the other structural classes.
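
The criterion for a class-specific axis can be sketched as follows. This is a hedged illustration, not the authors' code: eigenvectors are assumed to be unit-norm columns computed for the same segment length, and, because the sign of an eigenvector is arbitrary, the sketch compares the absolute value of the dot product with the threshold, which is an assumption beyond the criterion as written. The function name is hypothetical.

```python
import numpy as np

def class_specific_axes(own_vecs, other_vecs_list,
                        n_own=10, n_other=20, threshold=0.8):
    """Return 0-based indices of axes of one class that have no similar axis
    among the first `n_other` axes of any other class.

    own_vecs, other_vecs_list[i]: arrays with unit eigenvectors as columns.
    Assumption: similarity is |v_i . v_j| > threshold (sign ignored).
    """
    specific = []
    for i in range(min(n_own, own_vecs.shape[1])):
        v = own_vecs[:, i]
        has_similar = any(
            np.max(np.abs(other[:, :n_other].T @ v)) > threshold
            for other in other_vecs_list
        )
        if not has_similar:
            specific.append(i)
    return specific
```

With the eigenvectors of the four classes in hand, the function would be called once per class, passing the eigenvectors of the remaining three classes as `other_vecs_list`.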

References

1. Matsuo Y, Kanehisa M: An approach to systematic detection of protein structural motifs. Comput Appl Biosci 1993, 9(2):153–159.
2. Unger R, Sussman JL: The importance of short structural motifs in protein structure analysis. J Comput Aided Mol Des 1993, 7(4):457–472. 10.1007/BF02337561
3. Micheletti C, Seno F, Maritan A: Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins 2000, 40(4):662–674. 10.1002/1097-0134(20000901)40:4<662::AID-PROT90>3.0.CO;2-F
4. Prestrelski SJ, Byler DM, Liebman MN: Generation of a substructure library for the description and classification of protein secondary structure. II. Application to spectra-structure correlations in Fourier transform infrared spectroscopy. Proteins 1992, 14(4):440–450. 10.1002/prot.340140405
5. Rackovsky S: Quantitative organization of the known protein x-ray structures. I. Methods and short-length-scale results. Proteins 1990, 7(4):378–402. 10.1002/prot.340070409
6. de Brevern AG, Etchebest C, Hazout S: Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 2000, 41(3):271–287. 10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z
7. Fetrow JS, Palumbo MJ, Berg G: Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 1997, 27(2):249–271. 10.1002/(SICI)1097-0134(199702)27:2<249::AID-PROT11>3.0.CO;2-M
8. Sander O, Sommer I, Lengauer T: Local protein structure prediction using discriminative models. BMC Bioinformatics 2006, 7: 14. 10.1186/1471-2105-7-14
9. Rooman MJ, Rodriguez J, Wodak SJ: Automatic definition of recurrent local structure motifs in proteins. J Mol Biol 1990, 213(2):327–336. 10.1016/S0022-2836(05)80194-9
10. Schuchhardt J, Schneider G, Reichelt J, Schomburg D, Wrede P: Local structural motifs of protein backbones are classified by self-organizing neural networks. Protein Eng 1996, 9(10):833–842. 10.1093/protein/9.10.833
11. Hunter CG, Subramaniam S: Protein fragment clustering and canonical local shapes. Proteins 2003, 50(4):580–588. 10.1002/prot.10309
12. Tomii K, Kanehisa M: Systematic detection of protein structural motifs. In Pattern discovery in biomolecular data. Edited by: Wang JTL, Shapiro BA, Shasha D. New York: Oxford University Press; 1999:97–110.
13. Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol 2006, 16(3):393–398. 10.1016/j.sbi.2006.04.007
14. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. 10.1016/S0969-2126(97)00260-8
15. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.
16. Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G: A database of protein structure families with common folding motifs. Protein Sci 1992, 1(12):1691–1698.
17. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385. 10.1016/S0959-440X(96)80058-3
18. Holm L, Sander C: Mapping the protein universe. Science 1996, 273(5275):595–603. 10.1126/science.273.5275.595
19. Hou J, Sims GE, Zhang C, Kim SH: A global representation of the protein fold space. Proc Natl Acad Sci USA 2003, 100(5):2386–2390. 10.1073/pnas.2628030100
20. Salem GM, Hutchinson EG, Orengo CA, Thornton JM: Correlation of observed fold frequency with the occurrence of local structural motifs. J Mol Biol 1999, 287(5):969–981. 10.1006/jmbi.1999.2642
21. Szustakowski JD, Kasif S, Weng Z: Less is more: towards an optimal universal description of protein folds. Bioinformatics 2005, 21(Suppl 2):ii66–71. 10.1093/bioinformatics/bti1111
22. Kurgan L, Kedarisetti KD: Sequence representation and prediction of protein secondary structure for structural motifs in twilight zone proteins. Protein J 2006, 25(7–8):463–474. 10.1007/s10930-006-9029-0
23. Ikeda K, Tomii K, Yokomizo T, Mitomo D, Maruyama K, Suzuki S, Higo J: Visualization of conformational distribution of short to medium size segments in globular proteins and identification of local structural motifs. Protein Sci 2005, 14(5):1253–1265. 10.1110/ps.04956305
24. Kumar S, Bansal M: Structural and sequence characteristics of long alpha helices in globular proteins. Biophys J 1996, 71(3):1574–1586.
25. Penel S, Morrison RG, Dobson PD, Mortishire-Smith RJ, Doig AJ: Length preferences and periodicity in beta-strands. Antiparallel edge beta-sheets are more likely to finish in non-hydrogen bonded rings. Protein Eng 2003, 16(12):957–961. 10.1093/protein/gzg147
26. Donate LE, Rufino SD, Canard LH, Blundell TL: Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci 1996, 5(12):2600–2616.
27. Fidelis K, Stern PS, Bacon D, Moult J: Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng 1994, 7(8):953–960. 10.1093/protein/7.8.953
28. Lessel U, Schomburg D: Creation and characterization of a new, non-redundant fragment data bank. Protein Eng 1997, 10(6):659–664. 10.1093/protein/10.6.659
29. Wojcik J, Mornon JP, Chomilier J: New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification. J Mol Biol 1999, 289(5):1469–1490. 10.1006/jmbi.1999.2826
30. Shen M-y, Davis FP, Sali A: The optimal size of a globular protein domain: A simple sphere-packing model. Chemical Physics Letters 2005, 405(1–3):224–228. 10.1016/j.cplett.2005.02.029
31. Sawada Y, Honda S: Structural diversity of protein segments follows a power-law distribution. Biophys J 2006, 91(4):1213–1223. 10.1529/biophysj.105.076661
32. Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D: De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002, 322(1):65–78. 10.1016/S0022-2836(02)00698-8
33. Chikenji G, Fujitsuka Y, Takada S: A reversible fragment assembly method for de novo protein structure prediction. The Journal of Chemical Physics 2003, 119(13):6895–6903. 10.1063/1.1597474
34. Lee J, Kim S-Y, Lee J: Protein structure prediction based on fragment assembly and parameter optimization. Biophysical Chemistry 2005, 115(2–3):209–214. 10.1016/j.bpc.2004.12.046
35. Bujnicki JM: Protein-structure prediction by recombination of fragments. Chembiochem 2006, 7(1):19–27. 10.1002/cbic.200500235
36. Michie AD, Orengo CA, Thornton JM: Analysis of domain structural class using an automated class assignment protocol. J Mol Biol 1996, 262(2):168–185. 10.1006/jmbi.1996.0506
37. Kurgan LA, Zhang T, Zhang H, Shen S, Ruan J: Secondary structure-based assignment of the protein structural classes. Amino Acids 2008.
38. Chou KC: Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr Protein Pept Sci 2005, 6(5):423–436. 10.2174/138920305774329368
39. Kurgan L, Cios K, Chen K: SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformatics 2008, 9: 226. 10.1186/1471-2105-9-226
40. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. 10.1002/bip.360221211

Acknowledgements

KI and JH were partly supported by BIRD of Japan Science and Technology Agency (JST).

Author information

Authors and Affiliations

Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-42 Aomi, Koto-ku, Tokyo, 135-0064, Japan
Kazuyoshi Ikeda, Takatsugu Hirokawa & Kentaro Tomii
School of Life Science, Tokyo University of Pharmacy and Life Science, 1432-1 Horinouchi, Hachioji, Tokyo, 192-0392, Japan
Junichi Higo
Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency, Chiyoda-ku, Tokyo, 102-8666, Japan
Kazuyoshi Ikeda & Junichi Higo
PharmaDesign, Inc., 2-19-8 Hacchobori, Chuo-ku, Tokyo, 104-0032, Japan
Kazuyoshi Ikeda
461 Koshland Hall, University of California, Berkeley, CA, 94720-3102, USA
Kentaro Tomii

Corresponding authors

Correspondence to Junichi Higo or Kentaro Tomii.

Additional information

Authors' contributions

This study was conceived and carried out by KI, who also analyzed the results and drafted the manuscript. HT approved the study and participated in the discussion. JH participated in the design and coordination of the study. He also helped to write the manuscript. KT participated in the design and discussions of the study and wrote the manuscript. KI and JH developed the methodology. All authors read and approved the final manuscript.

Electronic supplementary material

12900_2008_202_MOESM1_ESM.xls

Additional file 1: Origins of segments in the cluster of 30 residue length. Distributions of the origins of segments in the cluster of the universe of 30 residues length are shown. (XLS 18 KB)

12900_2008_202_MOESM2_ESM.pdf

Additional file 2: Correlation with the first 10 PC axes of α/β class of the medium (26 residue) segments. Maximal correlation coefficients between the first 10 PC axes of α/β class and PC axes of the other three structural classes are shown. (PDF 49 KB)

12900_2008_202_MOESM3_ESM.pdf

Additional file 3: Class-specific region for α/β segments on the PC^α/β8-PC^α/β3 plane. Distributions of segments of α/β structural class proteins for the medium length are shown. (PDF 79 KB)

12900_2008_202_MOESM4_ESM.xls

Additional file 4: Discrimination of segments from the α/β structural class in the PC^α/β8-PC^α/β3 plane. Specificity and coverage rates of segments of α/β structural class proteins in the PC^α/β8-PC^α/β3 plane are presented. (XLS 18 KB)

12900_2008_202_MOESM5_ESM.pdf

Additional file 5: List of PDB ids used in this study. The PDB and SCOP IDs of proteins used in this study are listed. (PDF 47 KB)

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Ikeda, K., Hirokawa, T., Higo, J. et al. Protein-segment universe exhibiting transitions at intermediate segment length in conformational subspaces. BMC Struct Biol 8, 37 (2008). https://doi.org/10.1186/1472-6807-8-37

Received: 25 April 2008
Accepted: 13 August 2008
Published: 13 August 2008
DOI: https://doi.org/10.1186/1472-6807-8-37
